Monthly Archives: January 2010

The messiest metadata yet…

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re the same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching for is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.
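A check like this could be partly automated. The following is only a minimal sketch in Python, assuming the data has been loaded as (name, launch year, tonnage) rows; the row layout, the function names, and the 0.8 similarity threshold are all assumptions made for illustration, not anything the original database provides. It flags pairs of rows whose year and tonnage agree but whose names differ only slightly.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample rows taken from the examples above: (name, launch year, tonnage).
# Tonnage is None where the listing does not show one.
rows = [
    ("Affaric", 1934, 239),
    ("Affarie", 1934, 239),
    ("Afghanistan", 1940, None),
    ("Afghanistan", 1917, None),
    ("Afghantstan", 1905, None),
    ("Afghauistan", 1917, None),
    ("Afghauistan", 1917, None),
]

def likely_same_ship(a, b, threshold=0.8):
    """True when year and tonnage match and the names are close but not identical."""
    name_a, year_a, tons_a = a
    name_b, year_b, tons_b = b
    if year_a != year_b or tons_a != tons_b or name_a == name_b:
        return False
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold

def find_suspects(rows, threshold=0.8):
    """Return pairs of distinct rows that probably describe one vessel."""
    unique_rows = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
    return [
        (a, b)
        for a, b in combinations(unique_rows, 2)
        if likely_same_ship(a, b, threshold)
    ]

for a, b in find_suspects(rows):
    print(a[0], "<->", b[0], a[1])
```

On the sample rows this pairs Affaric with Affarie and the 1917 Afghanistan with Afghauistan. It does not catch the 1905 “Afghantstan”, because no correctly spelled row shares its year; that one would need a comparison against known names regardless of date, and a human to confirm the result.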

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the PDFs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with question marks (probably as a result of some hinky encoding issues), but many others are just deleted. When I can find them, I put them back, so that someone who knows a ship’s name will be able to find it, but I’m afraid that’s not going to happen most of the time.
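One way to make names findable regardless of accents is to compare them on an accent-free key. This is a rough sketch using Python’s standard unicodedata and re modules; the helper names search_key and matches_damaged are made up for this example, and treating “?” as a stand-in for exactly one lost character is an assumption about how the encoding damage happened.

```python
import re
import unicodedata

def search_key(name):
    """Fold a ship name to a lowercase key with combining accents removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Letters with no Unicode decomposition (such as "ø") pass through unchanged.
    return stripped.casefold()

def matches_damaged(damaged, candidate):
    """Treat each '?' in a damaged name as one unknown character (an assumption)."""
    parts = search_key(damaged).split("?")
    pattern = ".".join(re.escape(part) for part in parts)
    return re.fullmatch(pattern, search_key(candidate)) is not None

print(search_key("Agnés") == search_key("Agnès"))
print(matches_damaged("Hillev?g", "Hillevåg"))
```

With this, “Agnés” and “Agnès” share a key, and “Hillev?g” can be matched against “Hillevåg”. The variant where the letter was dropped entirely (“Hillevg”) is not caught this way and would need fuzzy matching instead.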

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)
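As an illustration of the regular-expression approach, here is a small Python sketch that inserts a space wherever a lowercase letter is immediately followed by a capital. The function name and the short list of protected prefixes are assumptions for the example; the list is deliberately incomplete, and it shows exactly the kind of unintended error the approach can introduce (a name like “McKinley” would otherwise become “Mc Kinley”).

```python
import re

# Zero-width match between a lowercase ASCII letter and an uppercase one.
_JOINED = re.compile(r"(?<=[a-z])(?=[A-Z])")

# Prefixes that legitimately carry an internal capital (illustrative, incomplete).
_KEEP_PREFIXES = ("Mc", "Mac")

def split_joined_words(name):
    """Insert missing spaces in run-together names such as 'PacificCommerce'."""
    def fix(word):
        if word.startswith(_KEEP_PREFIXES):
            return word  # leave names like 'McKinley' alone
        return _JOINED.sub(" ", word)
    return " ".join(fix(word) for word in name.split(" "))

for name in ["PacificCommerce", "PacificFir", "Pacific Commerce", "McKinley"]:
    print(name, "->", split_joined_words(name))
```

The pattern only looks at unaccented letters, so a join after an accented lowercase letter would be missed, and any word starting with a protected prefix is skipped entirely. A list of the changed names would still be worth reviewing by eye before loading.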

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.
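Both kinds of date problem lend themselves to a simple automated flag. The sketch below, in Python, assumes (name, build year, tonnage) rows and a known publication year for the register issue; those names and that layout are assumptions for illustration. It only reports candidates; deciding which year is right still means looking at the scanned page.

```python
from collections import defaultdict

def dates_after_publication(rows, published_year):
    """Rows whose build year is later than the issue they supposedly appear in."""
    return [row for row in rows if row[1] > published_year]

def conflicting_build_years(rows):
    """Map (name, tonnage) to its build years when more than one year is recorded."""
    years = defaultdict(set)
    for name, year, tonnage in rows:
        years[(name, tonnage)].add(year)
    return {key: sorted(found) for key, found in years.items() if len(found) > 1}

# The two rows from the example above.
rows = [
    ("Alan Seeger", 1943, 7208),
    ("Alan Seeger", 1913, 7208),
]
print(conflicting_build_years(rows))
```

For the Alan Seeger rows, the second function groups both entries under one name and tonnage and reports the two competing years, 1913 and 1943.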

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss that will still make its way into the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Update on New Content

I’ve added lots of new content to the database in the past few weeks, but I haven’t been good about making a note of that here.

I’ve just finished adding a really significant resource: Warships of the Imperial Japanese Navy, 1869-1945, by Hansgeorg Jentschura, Dieter Jung, and Peter Mickel. This is a 1977 translation of the original work, written in German. It has an enormous amount of information in it, and an extensive index.

There’s also a section titled “Miscellaneous Mercantile Auxiliary Vessels,” which has tons more information, but isn’t included in the index proper. I have, however, added all of the ships mentioned in this section to the ShipIndex.org database. The section has brief information about several thousand vessels, such as the following:

Hinode Maru (Transport): 5256 grt steamer, built 1930; requisitioned 1947; sunk 10 June 1943 north of New Ireland by US submarine Silversides.

In this case, you’ll find the entry Hinode Maru (Transport) in the ShipIndex database.

Working through this index made me curious about the many vessels that were used in attempting to blockade Port Arthur. I didn’t really know anything about Port Arthur, so I did some quick investigating, and found it’s in Manchuria, and the blockade was part of the start of the Russo-Japanese War of 1904-1905. You learn all kinds of things doing this stuff!

As one indication of the value of this index, it has added over 3500 completely new vessels to the database. Resources with an Anglo-American focus tend not to add many new vessels (they usually cite vessels that are already in the index), but this time I’m pleased to be able to extend the coverage quite a bit. To that end, if you know of resources that should be added, especially covering non-US or non-UK subjects, please do let me know and I’ll look forward to having an opportunity to add them.

I also added a resource that’s much more relevant to history closer to home: Fiorello La Guardia’s Maritime History of New York, from 1941. Actually, it looks like La Guardia only wrote the introduction, and “sponsored” the publication – it was written by people employed by the Writers’ Program of the Work Projects Administration for New York City. Whereas nearly 75% of the entries from the Jentschura book, above, are new, unduplicated vessels, in this book it’s more like 5%. (It does add the two privateers United We Stand and Divided We Fall, though, which is neat.)

These two titles above were added today. The following were added in the past ten days:

We’ll do a better job of listing resources when they’re added, and we’ll probably also put a “new” note next to these resources in the Resource list for a month or so after we’ve added them.

Again, if you know of resources that should be added, please let me know!