Tag Archives: data quality

WorldCat (April) Fools

Written by: Peter McCracken
Published: April 1, 2023
Categories: New Content, Uncategorized
Tags: data quality, oclc

This is the first of a few new blog posts. It’s April 1, April Fools Day, but there is, alas, no foolin’ around here. It’s just bad news, start to finish, with the WorldCat subject entity links that have been in the free ShipIndex database since 2009. Read on, to learn more.

When ShipIndex switched from a personal project to a real company, back in 2009, I put all of the citations that had been in the “project” database into the free database. Anything new was going to go into the subscription database. I had been in contact with researchers at OCLC, the very large library cooperative that ostensibly helps libraries manage their resources and shares those holdings via its publicly available database, WorldCat. I worked with several remarkable people there, who through the years generated a list of all of the “identities” for ships in WorldCat.

This meant we could find books or manuscripts that were by or about ships. A book about a ship is easy enough to imagine – the book The Royal Yacht Britannia: The Official History is clearly about that vessel. Having a specific subject heading about that specific yacht makes it easier to differentiate between vessels with the same name. The list also created links to books by ships, which often meant logbooks or individually-kept personal journals by people who were on board a vessel. It was a great way of uncovering a lot of useful content about ships that wouldn’t be found otherwise.

But the folks at OCLC said this content needed to be in the free database, not in the then-nascent subscription database. That was fine with me; it was worth including that content and keeping it freely available. The file has been updated occasionally over the past few years, and has always been in the completely free database.

Two or three weeks ago, I was doing some searching, and looked at WorldCat records. I saw notices indicating that the OCLC Identities project, on which these links were based, was going away. This past week, all the links to WorldCat failed. OCLC has ended this project, and with it, links to lots of content that used to be in the database. They’ve also removed linking by Library of Congress Control Number. You’re just searching by phrase now – this seems like the total antithesis of the ideals behind Linked Data.

I have figured out a way to make these links mostly work. The links now search by subject headings, rather than by control numbers or identities. As a result, in many cases, they won’t work effectively. In the old file, there was a link to an identity for a ship named “104”, and it went to the entry for a specific ship with that name. Now, the search is for any entry that has both terms “104” and “ship” in a subject heading, so instead of one or two specific results, you get 38 results. Some refer to ‘cruise 104’ of a different vessel. It’s really too bad. Searches for ships like “Mary” are going to be terrible, because they’ll include ships named “Mary Rose”, “Mary Ellen”, “Mary & Frank”, “Mary Smith”, and any other ship that has ‘mary’ as just part of its name – instead of going directly to the ship you’re researching. A search for a single, common word ship name, like “Eagle” or “Union” or “James” or “Monitor” or “Wasp” is going to return any record that has that word anywhere in the list of subject headings, even if the term doesn’t have anything to do with a ship name. Connections we’ve made, between specific vessels represented in WorldCat and other citations for those specific vessels, are probably no longer relevant.
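To make the difference concrete, here is a minimal Python sketch of the two kinds of lookup. Everything in it is an assumption for illustration: the record ids, the subject headings, and the function names are made up, and this is not how WorldCat actually implements its search. It only shows why matching on words in subject headings returns more than a lookup on a specific identifier.

```python
import re

# Hypothetical records; ids and subject headings are invented for illustration.
RECORDS = [
    {"id": "ship-001", "subjects": ["Mary (Ship)"]},
    {"id": "ship-002", "subjects": ["Mary Rose (Ship)"]},
    {"id": "ship-003", "subjects": ["Mary Ellen (Ship)"]},
    {"id": "book-004", "subjects": ["Eagle (Ship)", "Mary, Queen of Scots"]},
    {"id": "ship-005", "subjects": ["Wasp (Ship)"]},
]


def lookup_by_id(records, record_id):
    """Identity-style lookup: return only the record with this exact id."""
    return [r["id"] for r in records if r["id"] == record_id]


def search_by_terms(records, terms):
    """Keyword-style search: return every record whose subject headings,
    taken together, contain all of the given words (case-insensitive)."""
    wanted = {t.lower() for t in terms}
    hits = []
    for r in records:
        words = set(re.findall(r"[a-z0-9]+", " ".join(r["subjects"]).lower()))
        if wanted <= words:
            hits.append(r["id"])
    return hits


exact = lookup_by_id(RECORDS, "ship-001")
fuzzy = search_by_terms(RECORDS, ["Mary", "ship"])
print(exact)  # one record
print(fuzzy)  # four records, including one where "Mary" is not a ship name
```

In this toy data, the keyword search also returns the record about the Eagle, because “mary” and “ship” both appear somewhere in its list of subject headings.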

OCLC did some work in creating Virtual International Authority File (VIAF) records for some ships, as well. Again, this was great for differentiating between ships with the same name. But as far as I can tell, that is also all wiped out.

I’m disappointed and frustrated by this change, as I am with most of what OCLC has done to WorldCat over the past few years.

I’ll leave with this image I collected from WorldCat a few weeks ago, telling me that a copy of a book I wanted was at the State Library of South Australia, but that library is further than the distance to the moon:

My frustration with WorldCat – and OCLC – is ancient news, but it does just keep getting worse. This is really unfortunate. This is NOT a good April Fools joke.

More resource additions, and a few deletions, too

Written by: Peter McCracken
Published: January 16, 2022
Categories: Data Correction
Tags: data quality

I provided a list of new resources added to the ShipIndex.org database, back in November. We’re always adding new content, so I’ll include a list of the new stuff ~~at the bottom of this post~~ in an upcoming post. But I also need to address the fact that we have to remove some stuff, as well.

Monographs, or books, are great as resources, because once they’re added, we know they’re not going anywhere. Those books are in libraries and collections around the world. You may not be able to access them right away, but eventually, you’ll be able to do so. Online resources are great because you can link to them RIGHT NOW. Boom, click, done. Except, when that doesn’t work.

Online resources are great for convenience, but not for reliability. They change and disappear all the time. For some reason, website publishers still don’t realize that if they’re going to change their URLs, they’re going to break access for repeat users. They can include redirects, but rarely do. Too often, website publishers switch from a straightforward linking and searching structure to some fancy search tool that removes prior direct links, and makes new direct links impossible. Tim Berners-Lee, the creator of the World Wide Web, defined five stars for Open Data. One of those is making sure that people can point to your stuff. That is, make sure they can link to it easily. If you are required to do a search to get data, rather than also having a direct link that would get a person to your content, then you’re doing it wrong.

As one example, there’s a brand new “Royal Navy Loss List searchable database” at https://thisismast.org/research/royal-navy-loss-list-search.html. It’s nice that this data is here, and you can do a search for, say, “Indefatigable”, and find a record. But you cannot provide a direct link to the “Indefatigable” results, without going through that search page, which is really annoying, at least for those who care about open data.

Unfortunately, in this case, the MAST Loss List database only meets one of Sir Tim’s five stars toward Open Data. They could — and should — do much better.

But even worse is the total disappearance of online resources. Our data team recently reviewed all online resources in the database, and found quite a few which have disappeared, or are currently offline. We discovered a lot of problems that we’ll need to address. In some cases, the fix is pretty easy because there’s an obvious change to the URLs in the database. This was the case for the Bremen Passenger Lists; we fixed them, and they’re accessible again.

For others, though, we see bigger problems. Take the UK’s “National Small Boat Register”, for instance, which was hosted by the National Maritime Museum in Cornwall. At https://nmmc.co.uk/explore/databases/, you can see that the museum reports in an undated note, “The NSBR is currently offline whilst we create a new and improved website. We will have it up and running again as soon as possible. Please check back for further updates.” WHAT???

I’m all for thoughtful and improved websites, but why take down the old one when you’re building the new one??? Why not just keep it up until the new one is live and working?? The old one worked, didn’t it?? (It obviously did, at one point, when we added it to the database.)

There’s nothing to do but delete the National Small Boat Register contents from the ShipIndex database, and hope we’ll discover the replacement database when — if — it is ever put back online.

The Blue World Web Museum recently disappeared, as did other smaller resources. If you manage a vessel database that you can no longer keep online, please, please, please, contact me at comments (at) shipindex (dot) org, and give me a chance to see if we can save that resource for you.

I think I’ll start a separate blog post that lists the online databases that have disappeared; if you know of new sites for any of these, or contacts for folks who might be willing to offload that work to ShipIndex.org, please do let me know.

This got quite long, so I’ll create a separate post that lists the recently-added new content, in a day or three. (After making the post about lost databases, I suppose.)

Most Popular Vessel Names in the US

Written by: Peter McCracken
Published: July 19, 2020
Categories: New Content, Uncategorized
Tags: data quality

I updated the Merchant Vessels of the United States database today. That’s a big file (~375k entries) and it serves as an interesting collection of personal and merchant vessels.

(There’s a minor error in the import, in that about 10% of the entries – in the Os through Rs – are duplicated. I’m working on correcting that problem. Also, apologies about the layout in this blog post, particularly with the tables. Not sure what the problem is, but I’ll try to correct it.)

Unfortunately, the US Coast Guard has changed their system, and NOAA has dropped their version of the database altogether, so you can no longer link directly to a specific ship. This is very frustrating, but I can’t control other sites’ setups. The URL will take you to the search page, and you can search again for the ship name that you’d found in ShipIndex.

The Coast Guard has also removed tons of personal information about owners of recreational vessels. The remaining information will still be useful to some.

MVUS also creates an interesting opportunity to look at a really large data set, and get a good sense of what vessel names are most appealing to the most people in the US.
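As a rough illustration of that kind of count, here is a minimal Python sketch. The input layout, pairs of official number and vessel name, is an assumption rather than the actual MVUS file format, and the sample rows are invented, not real results. Rows with a repeated official number are skipped so that duplicated import rows, like the ones mentioned above, don't inflate the totals.

```python
from collections import Counter


def most_common_names(entries, top_n=3):
    """Count vessel names from (official_number, vessel_name) pairs.

    The pair layout is a hypothetical one chosen for this sketch.
    """
    seen = set()
    counts = Counter()
    for official_number, name in entries:
        if official_number in seen:
            continue  # skip a row already counted (duplicated import row)
        seen.add(official_number)
        cleaned = " ".join(name.split()).upper()  # fold case and stray spaces
        if cleaned:
            counts[cleaned] += 1
    return counts.most_common(top_n)


# Invented sample data, for illustration only.
SAMPLE = [
    ("1001", "Serenity"),
    ("1002", "serenity "),
    ("1003", "Osprey"),
    ("1003", "Osprey"),  # duplicated row
    ("1004", "Liberty"),
    ("1005", "Osprey"),
    ("1006", "Serenity"),
]

print(most_common_names(SAMPLE, top_n=2))
```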

Data correction work at ShipIndex.org

Written by: Peter McCracken
Published: December 24, 2009
Categories: Data Correction
Tags: data errors, data quality

We’ve completed our initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. So now I’m turning to cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.
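One way to approach the prefix problem, sketched in Python: strip a known prefix before deciding which letter page a vessel belongs on. This is a sketch under assumptions, not the actual ShipIndex.org code. The prefix list holds only the two prefixes mentioned above, the function names are made up, and variants such as “U.S.S.” are not handled.

```python
import re

# Only the two prefixes named in the post; a real list would be longer.
PREFIXES = ("USS", "HMS")

# Match a prefix only when it is a separate leading word followed by a space.
_PREFIX_RE = re.compile(r"^(?:%s)\s+" % "|".join(PREFIXES), re.IGNORECASE)


def sort_key(vessel_name):
    """Return the name without a leading prefix, lowercased for sorting."""
    return _PREFIX_RE.sub("", vessel_name.strip()).casefold()


def index_letter(vessel_name):
    """Return the letter page this vessel should be filed under."""
    return sort_key(vessel_name)[:1].upper()


for name in ("USS Constitution", "HMS Victory", "Ussuri"):
    print(name, "->", index_letter(name))
```

Requiring whitespace after the prefix keeps a name like “Ussuri” on the ‘U’ page, since “Uss” there is part of the name rather than a prefix.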

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)
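A minimal Python sketch of that idea, assuming a hand-built table of variants: each abbreviated spelling is indexed under the full name, and also kept under its original transcription so it stays searchable. The table contains only the spellings quoted above. The function names and the index layout are hypothetical, not the actual ShipIndex.org tooling.

```python
from collections import defaultdict

# Abbreviated spellings quoted in the post, mapped to the full name.
VARIANTS = {
    "D’keof N’wcastle": "Duke of Newcastle",
    "D’keofN’wcastle": "Duke of Newcastle",
    "D’ke of N’wc’stle": "Duke of Newcastle",
    "D’ke of N’woastle": "Duke of Newcastle",  # the transcription error
}


def normalize(text):
    """Fold case and treat curly and straight apostrophes as the same."""
    return text.replace("\u2019", "'").casefold().strip()


def build_index(transcribed_names, variants):
    """Map each normalized search key to the set of original entries."""
    lookup = {normalize(k): v for k, v in variants.items()}
    index = defaultdict(set)
    for original in transcribed_names:
        key = normalize(original)
        index[key].add(original)  # still findable exactly as transcribed
        canonical = lookup.get(key)
        if canonical:
            index[normalize(canonical)].add(original)  # and under the full name
    return index


INDEX = build_index(list(VARIANTS), VARIANTS)
print(sorted(INDEX[normalize("Duke of Newcastle")]))
```

Keeping the original transcription as its own key means nothing from the source is lost; the full name is simply an extra way in.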

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.

Shipindex.org Blog