Category Archives: Data Correction

New content added in past few weeks

Here’s an overview of the new content added in the past few weeks. Two collections are of particular note: the Lloyd’s List for 1812, via 1812Privateers.org, and the Dyal Ship Collection. One man, Michael Dun, has digitized and indexed all of the issues of Lloyd’s List for the entire year of 1812. It’s quite a feat. He’s indexed all of the ships and all of the masters for that time, adding up to nearly 26,000 ship citations in all the issues of Lloyd’s List for 1812. He kindly shared his index with me, so I could include links to his resources. Mr. Dun hosts the pages on his servers, and they are accessible to all via that site. While working through the index of ship names that he provided to me, I was able to identify a number of corrections, and I incorporated those into the file I imported.

Working through this file was also an interesting reminder about the challenges we face in trying to make the most of these primary sources. Clearly, the folks who were putting together each issue of Lloyd’s List (it usually came out twice a week, and was published in London) were trying to get information out as quickly as possible, and weren’t too concerned with absolute accuracy, to say nothing of how researchers two centuries later would like them to present information.

As a few examples, each of the following slight spelling variations by the editors are likely the same ship: Misletoe, Misseltoe, and Missletoe (there’s no Mistletoe listed in this year of Lloyd’s!). Or, Nymph, Nymphe, and Nymphen. Or Powhatan, Powahattan, and Powhatton. Or Zenophon and Zenophen, when the proper spelling is Xenophon. Or Tinmouth Castle, most  likely meaning Teignmouth Castle. Or simple errors, like Hepsa instead of Hespa.

Of course, if you’re reading this at a London coffee shop one morning in 1812, you can easily look over these minor errors, and figure out what the editors’ intent was. But for researchers two centuries later, who are trying to mine large amounts of data to see what they can find, these errors cause a problem. So how do we address them? That’s an issue for an upcoming blog post. But, needless to say, we at ShipIndex.org have a solution…

Another interesting addition is the Dyal Ship Collection, but for very different reasons. This is a collection of images and data compiled by a researcher (in this case, a librarian) and added to his institution’s “institutional repository” (IR). An IR is a site, usually maintained by an academic library, where content generated by the institution’s faculty, staff, and students is made available for free. It is, in a large sense, a reaction to the high cost of many academic journals, where an institution’s researchers spend time and money doing and compiling research, then pay to have that published in a scholarly journal, then the institution pays to buy the results back, through a subscription to the journal. The whole discussion is beyond the scope of this blog post, but the point is that IRs are places where interesting and useful information can be stored — but it’s most often quite hidden, unless there’s some effective way of indexing the content.

So, with the encouragement and assistance of the compiler, we’ve created links into the collection of files and images that are stored in Texas Tech University’s institutional repository. Recently, we’ve heard from others who have data they’d like us to include, and we’re looking at ways of doing that effectively. This is just one example of that.

Other items we’ve added are mostly more standard print or online collections. The total list is as follows:

If you have maritime content that you’d like to get online, or is online but needs broader publicity, please let us know. We’d love to find a way to help.

Is there a better way to present this data?

I’ve been working on a big file that’s going to be very useful to ShipIndex.org subscribers, especially those interested in World War II vessels. H.T. Lenton’s tome, British and Imperial Warships of the Second World War, is an incredible resource. Its 750+ pages are absolutely jam-packed with useful content, but it has presented me with a few challenging issues about how to manage this data. I thought I’d describe some of it here, explain what my plan is, and see if the greater good has any better suggestions. There’s still time to modify how this resource is managed. I’ve probably invested at least 30 full hours in preparing this file – and that doesn’t include a significant amount of work done by another person before me – and I still have a long way to go. But that’s what it takes, sometimes, to get a resource like this one ready to add to the database.

The first part of this remarkable volume looks at larger, named vessels, organized by vessel type and class. As one example, the “Corvettes and Frigates” section is divided into entries on the “Flower” class, the “River” class, the “Kil-” class, and four more classes. (The introduction has several fascinating paragraphs about the peregrinations of naming vessels, and shows how complicated the whole process was. A fair bit of background knowledge is required just to understand this section!) After some commentary on the design and development of the class, Lenton provides tables showing brief history information for every vessel in a class. Information may be quite extensive, or it might consist of as little as an indication of the intended builder and the approximate cancellation date (for example, for vessels ordered but not begun before the war ended).

This works fine for named vessels, but creates a conundrum for unnamed vessels. In the LCM (Landing Craft Mechanised) section, for example, the index notes that “LCM.21-118” appear on pg 490; “LCM.119-220” on pg 491, “LCM.221-334” on pg 492, etc. Of the 100+ ships on each page, though, just two to three dozen have any information at all about the vessel, and that information is slight, at best. For the LCMs, most have no Building or Completion information. Of the ones that have “Fate” information, it usually reads something like “Lost cause unknown Algiers ../11/42.” (Meaning it was lost in November 1942, but the exact date and cause is not known.)

To me, this information might be useful to someone, and I don’t want to not include the entry for that vessel. But for each one like that, there are several where no information at all is included, and I believe that adding an entry to ShipIndex.org should imply that at least SOMETHING is available in the resource. So I’ve decided that what I’ll do is expand entries like “LCM.21-118” to be “LCM.21”, “LCM.22”, “LCM.23”, etc., up to “LCM.118”. Then I’ll compare my list with the book itself. If there’s any information at all about the vessel, I’ll keep the entry. If there is no information beyond its listing on the page – nothing about where it was built, or how it was lost, for instance – then I’ll delete it. My thought is that if the volume offers one piece of information, I’ll include the vessel name in the index.

Still, it’s worth noting that for people who are working on an unlisted LCM, the volume may contain information about the LCM class that might be relevant. And if you’re looking for an image of a specific auxiliary vessel, it may be that an image of a different vessel in the same class will do. It appears that the most common vessel type in which this will apply will be the LCMs, of which several thousand were built, but it will be interesting to see how it actually turns out.

Am I doing the right thing? Should I be handling this in some other way? Is there some other way that I should note the amount of information presented? I’d welcome your comments – if there’s a better way of doing it, now’s the time for me to hear about it.

The messiest metadata yet…

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re they same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the pdfs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with questions marks – probably as a result of some hinky encoding issues – but many others are just deleted. When I can find them, I put them back — so someone who knows a ship’s name will be able to find it — but I’m afraid that’s not going to happen most of the time.

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss, and will still get introduced to the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

Data correction work at ShipIndex.org

We’ve completed our first initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Obviously, having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. And now I’m turning to doing some cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

Many 'USS' listings in ShipIndex.org

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

Abbreviated entries in ShipIndex.org

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

ScreenHunter_03 Dec. 24 10.52

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.