Tag Archives: data quality

More resource additions, and a few deletions, too

Back in November, I provided a list of new resources added to the ShipIndex.org database. We’re always adding new content, so I’ll include a list of the new stuff in an upcoming post. But I also need to address the fact that we have to remove some stuff, as well.

Monographs, or books, are great as resources, because once they’re added, we know they’re not going anywhere. Those books are in libraries and collections around the world. You may not be able to access them right away, but eventually, you’ll be able to do so. Online resources are great because you can link to them RIGHT NOW. Boom, click, done. Except, when that doesn’t work.

Online resources are great for convenience, but not for reliability. They change and disappear all the time. For some reason, website publishers still don’t realize that if they’re going to change their URLs, they’re going to break access for repeat users. They can include redirects, but rarely do. Too often, website publishers switch from a straightforward linking and searching structure to some fancy search tool that removes prior direct links, and makes new direct links impossible. Tim Berners-Lee, the creator of the World Wide Web, defined five stars for Open Data. One of those is making sure that people can point to your stuff. That is, make sure they can link to it easily. If you are required to do a search to get data, rather than also having a direct link that would get a person to your content, then you’re doing it wrong.

As one example, there’s a brand new “Royal Navy Loss List searchable database” at https://thisismast.org/research/royal-navy-loss-list-search.html. It’s nice that this data is here, and you can do a search for, say, “Indefatigable”, and find a record. But you cannot provide a direct link to the “Indefatigable” results, without going through that search page, which is really annoying, at least for those who care about open data.

Unfortunately, in this case, the MAST Loss List database only meets one of Sir Tim’s five stars toward Open Data. They could — and should — do much better.

But even worse is the total disappearance of online resources. Our data team recently reviewed all online resources in the database, and found quite a few which have disappeared, or are currently offline. We discovered a lot of problems that we’ll need to address. In some cases, the fix is pretty easy because there’s an obvious change to the URLs in the database. This was the case for the Bremen Passenger Lists; we fixed them, and they’re accessible again.
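For readers who maintain their own link collections, here is a minimal sketch of what such a review could look like. It is not the tool our data team used; the function names, the four status labels, and the example.org addresses are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Classify one URL as 'ok', 'moved', 'gone', or 'unreachable'.

    `opener` is injectable so the function can be tried without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            # urlopen follows redirects, so a different final URL means
            # the resource now lives somewhere else.
            return "ok" if response.geturl() == url else "moved"
    except urllib.error.HTTPError as err:
        # 404 and 410 mean the page itself is missing; other codes may be temporary.
        return "gone" if err.code in (404, 410) else "unreachable"
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return "unreachable"


def rewrite_prefix(url, old_prefix, new_prefix):
    """Apply an 'obvious change' fix: swap a known old URL prefix for a new one."""
    if url.startswith(old_prefix):
        return new_prefix + url[len(old_prefix):]
    return url


# Hypothetical addresses, for illustration only.
fixed = rewrite_prefix(
    "http://old.example.org/lists/ship/123",
    "http://old.example.org/lists/",
    "https://www.example.org/passengers/",
)
print(fixed)
```

Links that come back as "moved" are the easy cases, where a prefix rewrite may be all that's needed. The "gone" and "unreachable" ones still need a human to decide whether the resource is really lost.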

For others, though, we see bigger problems. Take the UK’s “National Small Boat Register”, for instance, which was hosted by the National Maritime Museum in Cornwall. At https://nmmc.co.uk/explore/databases/, you can see that the museum reports in an undated note, “The NSBR is currently offline whilst we create a new and improved website. We will have it up and running again as soon as possible. Please check back for further updates.” WHAT???

I’m all for thoughtful and improved websites, but why take down the old one when you’re building the new one??? Why not just keep it up until the new one is live and working?? The old one worked, didn’t it?? (It obviously did, at one point, when we added it to the database.)

There’s nothing to do but delete the National Small Boat Register contents from the ShipIndex database, and hope we’ll discover the replacement database when — if — it is ever put back online.

The Blue World Web Museum recently disappeared, as did other smaller resources. If you manage a vessel database that you can no longer keep online, please, please, please, contact me at comments (at) shipindex (dot) org, and give me a chance to see if we can save that resource for you.

I think I’ll start a separate blog post that lists the online databases that have disappeared; if you know of new sites for any of these, or contacts for folks who might be willing to offload that work to ShipIndex.org, please do let me know.

This got quite long, so I’ll create a separate post that lists the recently added content, in a day or three. (After making the post about lost databases, I suppose.)

Most Popular Vessel Names in the US

I updated the Merchant Vessels of the United States database today. That’s a big file (~375k entries) and it serves as an interesting collection of personal and merchant vessels.

(There’s a minor error in the import, in that about 10% of the entries – in the Os through Rs – are duplicated. I’m working on correcting that problem. Also, apologies about the layout in this blog post, particularly with the tables. Not sure what the problem is, but I’ll try to correct it.)
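For anyone facing a similar duplicated import, a rough sketch of one way to drop repeated rows follows. The field names `name` and `official_number` are assumptions for the example, not the actual MVUS or ShipIndex.org schema, and this is not necessarily how the fix will be made here.

```python
def drop_duplicates(entries):
    """Keep the first copy of each entry, preserving the original order."""
    seen = set()
    unique = []
    for entry in entries:
        # Two rows count as duplicates when name (ignoring case) and
        # official number both match. Both field names are hypothetical.
        key = (entry["name"].casefold(), entry["official_number"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Made-up sample rows, for illustration only.
rows = [
    {"name": "Osprey", "official_number": "1001"},
    {"name": "OSPREY", "official_number": "1001"},  # duplicate of the first row
    {"name": "Osprey", "official_number": "2002"},  # same name, different vessel
]
print(len(drop_duplicates(rows)))
```

Keying on the number as well as the name matters, because many different vessels share the same name.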

Unfortunately, the US Coast Guard has changed their system, and NOAA has dropped their version of the database altogether, so you can no longer link directly to a specific ship. This is very frustrating, but I can’t control other sites’ setups. The URL will take you to the search page, and you can search again for the ship name that you’d found in ShipIndex.

The Coast Guard has also removed tons of personal information about owners of recreational vessels. The remaining information will still be useful to some.

MVUS also creates an interesting opportunity to look at a really large data set, and get a good sense of what vessel names are most appealing to the most people in the US.
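Counting names in a data set like this takes only a few lines. The sketch below assumes the names have already been pulled out of the file into a plain list; the sample names are invented and are not results from MVUS.

```python
from collections import Counter


def most_popular_names(names, top=10):
    """Return the `top` most common names as (name, count) pairs."""
    # Collapse stray whitespace and ignore case so "Serenity" and
    # "SERENITY " are counted as the same name. Blank names are skipped.
    counts = Counter(" ".join(n.split()).upper() for n in names if n.strip())
    return counts.most_common(top)


# Invented sample data, for illustration only.
sample = ["Serenity", "serenity ", "Osprey", "SERENITY", "Osprey", "Liberty"]
print(most_popular_names(sample, top=2))
```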


Data correction work at ShipIndex.org

We’ve completed our initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. So now I’m turning to cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

Many 'USS' listings in ShipIndex.org

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.
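As a hedged illustration of the idea (not the actual ShipIndex.org code), a prefix can be stripped when working out where a name should be filed. The prefix list below holds only the two mentioned above, and the function name `filing_name` is made up for this sketch.

```python
import re

# Only the prefixes named in this post; a real list would be longer.
PREFIXES = ("USS", "HMS")
_PREFIX_RE = re.compile(r"^(?:%s)\s+" % "|".join(PREFIXES), re.IGNORECASE)


def filing_name(listed_name):
    """Return the name to file a vessel under, without a leading prefix."""
    cleaned = listed_name.strip()
    match = _PREFIX_RE.match(cleaned)
    # The prefix must be followed by whitespace, so a name like "Ussuri"
    # is left alone.
    return cleaned[match.end():] if match else cleaned


listings = ["USS Constitution", "Ussuri", "HMS Indefatigable"]
print(sorted(listings, key=filing_name))
```

Sorting by `filing_name` keeps the text exactly as the source printed it, but files each vessel under its own name rather than on the ‘U’ or ‘H’ page.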

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

Abbreviated entries in ShipIndex.org

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)
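One possible way to find the full name behind such an entry, sketched here under stated assumptions: treat each apostrophe as "some letters were left out here", ignore spacing, and compare against a list of known full names. The list of known names and the function names are hypothetical, and this is not a description of the tools we actually use.

```python
import re

APOSTROPHES = "’'"


def _squash(text):
    """Lowercase the text and remove all whitespace."""
    return re.sub(r"\s+", "", text).casefold()


def expand_abbreviation(abbreviated, known_names):
    """Return the known names that an apostrophe-abbreviated entry could stand for."""
    # Each apostrophe marks a spot where zero or more letters were dropped.
    parts = re.split("[%s]" % APOSTROPHES, _squash(abbreviated))
    pattern = re.compile("[a-z]*".join(re.escape(p) for p in parts))
    return [name for name in known_names if pattern.fullmatch(_squash(name))]


# Hypothetical list of full names to match against.
known = ["Duke of Newcastle", "Duke of Sutherland", "Duchess of Kent"]
for entry in ["D’keof N’wcastle", "D’keofN’wcastle", "D’ke of N’wc’stle", "D’ke of N’woastle"]:
    print(entry, "->", expand_abbreviation(entry, known))
```

The first three variants resolve to "Duke of Newcastle". The fourth, with the transcription error, matches nothing and would need a person to look at it. Keeping the original text alongside the matched name means a search for “D’ke” can still find the entry.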

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

Screenshot of the entry in the original source

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)
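Transcription errors can't be fixed by a rule, but software can at least suggest candidates for a person to confirm. Here is a rough sketch using Python's standard `difflib` module. It assumes a list of known full names exists, and it assumes for illustration that the entry quoted above was meant to be "Duke of Sutherland". It is not our actual tooling.

```python
import difflib
import re


def _key(name):
    """Reduce a name to lowercase letters only, for fuzzy comparison."""
    return re.sub(r"[^a-z]", "", name.casefold())


def suggest_corrections(transcribed, known_names, cutoff=0.6):
    """Return up to three known names that resemble a garbled entry."""
    by_key = {}
    for name in known_names:
        by_key.setdefault(_key(name), name)
    hits = difflib.get_close_matches(_key(transcribed), list(by_key), n=3, cutoff=cutoff)
    return [by_key[h] for h in hits]


# Hypothetical list of full names to compare against.
known_names = ["Duke of Newcastle", "Duke of Sutherland"]
print(suggest_corrections("D’le pf Suth’rl’nd", known_names))
print(suggest_corrections("D’ke of N’woastle", known_names))
```

These are suggestions only. A similarity score can easily pick the wrong vessel when two names are close, so each proposed change still has to be checked against the original source.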

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.