The messiest metadata yet…

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re they same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the pdfs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with questions marks – probably as a result of some hinky encoding issues – but many others are just deleted. When I can find them, I put them back — so someone who knows a ship’s name will be able to find it — but I’m afraid that’s not going to happen most of the time.

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913″, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943″.

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss, and will still get introduced to the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

2 thoughts on “The messiest metadata yet…

  1. Do you know if the original scans are available from whatever this was (or available from another source)? Clearly these are OCR errors, I wonder if some of them might be corrected by a) tweaking some of the images so they OCR better. b) some “free” OCR software has a built in limit on accuracy, full versions may be more accurate. Also if this was created using earlier OCR software you might have better luck with the latest versions.

  2. Richard –

    That’s an interesting thought; I hadn’t considered rescanning the resources that are currently available online. One problem, though, is that the site is using URLs based on their scanned ship names to link from the incorrect OCR’d data to the original scans. So, in order to link to the original scans, we need to keep using the URLs that they created. I’ve been correcting the ship name, or the date of launch information, but not the URL – since its inaccurate name doesn’t matter, as long as it gets you to the original scan.

    Peter

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>