I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.
What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.
Here are some examples. When you see a series of ship names like
Aéro-Poatale IV |
Aeropostale I. |
Aéro-Postale I. |
Aéro-Postale II. |
it’s easy to see that the first one is not “Aéro-Poatale”.
Or when you see the following (the second field is the launch date; the third is the tonnage):
Affaric |
1934 |
239 |
Affarie |
1934 |
239 |
you know they’re the same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.
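A check like this could be automated. Here is a minimal sketch, assuming the data has been loaded as (name, launch year, tonnage) tuples. The function name `likely_duplicates` and the 0.8 similarity threshold are my own illustrative choices, not anything from the original resource. It groups entries that share a year and tonnage, then flags pairs of names that are nearly, but not exactly, the same.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(records, threshold=0.8):
    """Return pairs of distinct names that share a launch year and tonnage
    and are similar enough to be worth a human look.

    `threshold` is an assumed cut-off, not a tuned value.
    """
    # Bucket the distinct names by (year, tonnage) so we only compare
    # names inside the same bucket.
    groups = defaultdict(set)
    for name, year, tonnage in records:
        groups[(year, tonnage)].add(name)

    pairs = []
    for names in groups.values():
        for a, b in combinations(sorted(names), 2):
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs


records = [
    ("Affaric", 1934, 239),
    ("Affarie", 1934, 239),
    ("Agnès", 1896, 120),
    ("Alan Seeger", 1943, 7208),
]
print(likely_duplicates(records))  # [('Affaric', 'Affarie')]
```

This only says which entries deserve a second look. Deciding which spelling is right still means checking the scanned page.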
Or just below that, the following series of ship names:
Afghanistan |
1940 |
Afghanistan |
1917 |
Afghantstan |
1905 |
Afghauistan |
1917 |
Afghauistan |
1917 |
(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching for is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.
Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.
Agnés |
1896 |
120 |
Agnès |
1896 |
120 |
(A quick look at the PDFs they link to shows that the second one is correct.)
Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:
Hillev?g |
1885 |
877 |
Hillevåg |
1885 |
877 |
Hillevg |
1885 |
877 |
Many diacritics are replaced with question marks (probably as a result of some hinky encoding issues), but many others are just deleted. When I can find them, I put them back, so that someone who knows a ship’s name will be able to find it, but I’m afraid that’s not going to happen most of the time.
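For the question-mark cases, one possible aid is to treat each “?” as a stand-in for any single character and look for matches among names already known to be good. This is only a sketch: the helper names `wildcard_pattern` and `candidates` are hypothetical, and it assumes a list of trusted names exists to compare against. It does nothing for names where the character was deleted outright.

```python
import re
import unicodedata


def wildcard_pattern(damaged_name):
    """Build a regex in which every '?' matches exactly one character."""
    parts = [re.escape(part) for part in damaged_name.split("?")]
    return re.compile(".".join(parts), re.IGNORECASE)


def candidates(damaged_name, known_names):
    """Return the known names that fit the damaged name's pattern."""
    pattern = wildcard_pattern(damaged_name)
    matches = []
    for name in known_names:
        # Normalize to NFC so an accented letter counts as one character.
        if pattern.fullmatch(unicodedata.normalize("NFC", name)):
            matches.append(name)
    return matches


known = ["Hillevåg", "Hillevg", "Affaric"]
print(candidates("Hillev?g", known))  # ['Hillevåg']
```

A match is a suggestion, not a correction. A name that legitimately contains a question mark, or a pattern that fits several ships, still needs a person to decide.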
Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)
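As a rough illustration of the regular-expression idea, here is a minimal sketch that inserts a space wherever a lowercase letter is immediately followed by an uppercase one. The function name `split_run_together` is my own, and the rule is deliberately naive.

```python
import re

# Zero-width match: the position between a lowercase ASCII letter
# and an uppercase ASCII letter.
RUN_TOGETHER = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_run_together(name):
    """Insert a space at each lowercase-to-uppercase boundary."""
    return RUN_TOGETHER.sub(" ", name)


for name in ["PacificCommerce", "PacificFir", "Pacific Commerce"]:
    print(name, "->", split_run_together(name))
# PacificCommerce -> Pacific Commerce
# PacificFir -> Pacific Fir
# Pacific Commerce -> Pacific Commerce
```

This is exactly where the unintentional errors would creep in: a name such as “McKinley” (a made-up example, not one from the register) would come out as “Mc Kinley”. Writing the proposed changes to a separate column and skimming them before accepting them would catch most of those.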
I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.
Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the register issues that list them were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:
Alan Seeger |
1943 |
7208 |
Alan Seeger |
1913 |
7208 |
A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.
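Both kinds of date problem can be flagged mechanically. The sketch below assumes each entry is a dictionary with `name`, `built`, `tonnage`, and `issue_year` keys. That layout, the function names, the 1945 issue year, and the “Example Ship” row are all hypothetical, made up for illustration. Only the two Alan Seeger build years and tonnage come from the data above.

```python
from collections import defaultdict


def built_after_publication(entries):
    """Entries whose build year is later than the issue that lists them."""
    return [e for e in entries if e["built"] > e["issue_year"]]


def conflicting_build_years(entries):
    """Map (name, tonnage) to its build years when more than one appears."""
    years = defaultdict(set)
    for e in entries:
        years[(e["name"], e["tonnage"])].add(e["built"])
    return {key: sorted(vals) for key, vals in years.items() if len(vals) > 1}


entries = [
    # issue_year values are invented for this example
    {"name": "Alan Seeger", "built": 1943, "tonnage": 7208, "issue_year": 1945},
    {"name": "Alan Seeger", "built": 1913, "tonnage": 7208, "issue_year": 1945},
    {"name": "Example Ship", "built": 1980, "tonnage": 500, "issue_year": 1945},
]

print([e["name"] for e in built_after_publication(entries)])  # ['Example Ship']
print(conflicting_build_years(entries))  # {('Alan Seeger', 7208): [1913, 1943]}
```

Neither check knows which year is right. They only shorten the list of entries a proofer has to open.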
And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss that will still make its way into the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.
I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.
What a shame.