Archive for the ‘Data Correction’ Category

On Naming Ships and Representing them in ShipIndex

Monday, February 14th, 2011

At present, ShipIndex.org has one point of access: the vessel name. You’d think that would be fairly easy, at least in the case of extant vessels: just look at the stern or the bow, and see what’s written there. Alas, it’s not that simple. There are many reasons for this, and a lot of them are completely understandable. Others can lead to surprisingly interesting stories.

While working through the index to the first 50 years of Steamboat Bill, and its successor, PowerShips, I came across many, many mentions of the Queen Elizabeth 2. Most of these are listed under the very common, abbreviated name, “QE2”. In the ShipIndex database, however, one also finds many entries for a different version of the name, “Queen Elizabeth II”. I read a bit about the ship on its Wikipedia page, and learned some interesting stories about how the name came about. According to the contributors, the name of the ship was not announced before the launching. Cunard intended to name the ship “Queen Elizabeth”, but the Queen, when she launched the ship, stated “I name this ship Queen Elizabeth the Second.”

The next day, newspapers announced the name as “Queen Elizabeth II”, though when the ship was delivered its name read “Queen Elizabeth 2”. According to Wikipedia, “From at least 2002 the official Cunard website stated that ‘The new ship is not named after the Queen but is simply the second ship to bear the name – hence the use of the Arabic 2 in her name, rather than the Roman II used by the Queen’, however, in a change in 2007 this information had been removed.”

In addition, there’s confusion about who the ship is named after. Multiple sources provide multiple suggestions. Some feel the ship is named after the current Queen, and that, in fact, she made that change when she announced its name. Others state that it is named after her mother, the wife of King George VI. Others state it’s named after the previous Cunard ship named Queen Elizabeth.

We need to make it possible for people to find ship names however they might be represented, and so we’ve created functionality that allows one to link between variant names for specific ships. So, for example, when you search for “QE2”, you find entries that cite “QE2”, but you also find a link at the top taking you to entries for other variant names for this ship, specifically “Queen Elizabeth 2” and “Queen Elizabeth II”.

We also have the ability to ‘normalize’ ship names, and in that case, one goes directly from a misspelling of a ship name to the correctly spelled entry. So, by rights, we should ‘normalize’ “QE2” and “Queen Elizabeth II” to “Queen Elizabeth 2”. But I think that, in this case, for this very famous ship, it’s worth maintaining the separate entries and linking them together via the “alternate spelling” links. Maybe I’m wrong; should I just normalize them all together? What do you think?
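To make the difference between the two mechanisms concrete, here is a minimal sketch. It is not the actual ShipIndex code: the table names, the `resolve` function, and the misspelling "Queen Elizabth 2" are all hypothetical, chosen only to illustrate that normalizing redirects a name, while variant linking keeps separate entries and cross-references them.

```python
# Hypothetical data, for illustration only.
# NORMALIZE maps a misspelling straight to the correctly spelled entry.
NORMALIZE = {"Queen Elizabth 2": "Queen Elizabeth 2"}

# VARIANT_GROUPS lists names that stay as separate entries but link to each other.
VARIANT_GROUPS = [{"QE2", "Queen Elizabeth 2", "Queen Elizabeth II"}]


def resolve(query):
    """Return (entry_name, linked_variant_names) for a search query."""
    # Normalization silently redirects the query to the canonical entry.
    entry = NORMALIZE.get(query, query)
    # Variant linking leaves the entry alone and only offers links to the others.
    for group in VARIANT_GROUPS:
        if entry in group:
            return entry, sorted(group - {entry})
    return entry, []
```

With this data, `resolve("QE2")` stays on the "QE2" entry and offers links to "Queen Elizabeth 2" and "Queen Elizabeth II", whereas the misspelled query lands directly on "Queen Elizabeth 2".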

We also show links for previous and subsequent names of ships. So, if you search for “Euterpe”, you’ll see a “subsequent name” link to “Star of India.” It is important to remember that if there are multiple ships with the name “Euterpe,” the link appears, but doesn’t apply to all of them. Creating a system that separates out all these ships is a big project, but one that we will tackle.

One great thing about the Steamboat Bill files is that they include many previous and subsequent vessel names. Unfortunately, they don’t exactly indicate the order in which vessel names appeared; you’ll see both “Liberte; a) Brasil; b) Volendam; c) Monarch Sun; d) Volendam; e) Island Sun; g) Canada Star h) Queen of Bermuda” and “Queen of Bermuda; a) Brasil; b) Volendam; c) Monarch Sun; d) Volendam; e) Island Sun; f) Liberte; g) Canada Star”, as well as “Island Sun; a) Volendam”. So, some research is needed to figure out the order in which the ship names appeared. Then, I still have a question about whether I should include all of the previous and subsequent names in each entry. In the above example, if I determine that the actual path of ship name changes was Queen of Bermuda, then Brasil, then Volendam, then Monarch Sun, then Volendam (again), then Island Sun, then Liberte, and finally Canada Star, do I include ‘subsequent name’ links from Brasil to Volendam, Monarch Sun, Island Sun, Liberte, and Canada Star? That creates a lot of links. Or do I just have a link from Queen of Bermuda to Brasil, and on Brasil a link to Volendam?

And if I list all previous or subsequent names for a ship that had the same name twice, then in this case the entry for Brasil (and Queen of Bermuda, and others) will have multiple ‘subsequent name’ links to Volendam. The page for Volendam could conceivably have a link back to itself!
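The two options can be compared with a small sketch. It assumes the name order proposed above (which still needs research to confirm), and the function names are hypothetical. The first function links only to the immediately following name; the second lists every later name, dropping duplicates and any link from a name back to itself.

```python
# Name sequence as proposed above; the real order still needs to be confirmed.
NAMES = ["Queen of Bermuda", "Brasil", "Volendam", "Monarch Sun",
         "Volendam", "Island Sun", "Liberte", "Canada Star"]


def adjacent_links(names, name):
    """Names that immediately follow any occurrence of `name`."""
    return [names[i + 1] for i, n in enumerate(names[:-1]) if n == name]


def all_subsequent(names, name):
    """Every distinct name after the first occurrence of `name`, never `name` itself."""
    if name not in names:
        return []
    seen, result = {name}, []
    for later in names[names.index(name) + 1:]:
        if later not in seen:
            seen.add(later)
            result.append(later)
    return result
```

Under the first option, Volendam gets two ‘subsequent name’ links (Monarch Sun and Island Sun), one per stint under that name. Under the second, Brasil links to Volendam only once, and Volendam never links to itself.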

What do you think? What’s the best way to represent this important data?

How variant editions can screw up Google Books links

Monday, December 20th, 2010

As we’ve mentioned in the blog before, you can link to the full text of many, many resources cited in ShipIndex.org. In fact, with a recent addition of a file containing tens of thousands of online ship images, nearly 90% of the citations provide full-text linking. Much of the linking comes through links to online resources, but others are available via links to books in Google Book Search.

A few weeks ago, several of us at ShipIndex were using some of these links, and found that many links for Sherry Sontag’s book Blind Man’s Bluff didn’t seem to work. While the links took one to the page cited in the index, the vessel mentioned in the index wasn’t listed on the page that we ended up at in Google Books. So today I picked up a copy of Blind Man’s Bluff from my local public library, to see if I’d made a lot of mistakes in working through the index.

I found that, in fact, I hadn’t made any mistakes – the page numbers in ShipIndex were the same as the page numbers listed in the back of the book. So then I re-tried some of the Google Book links we offer. Once again, a link to page 57 took me to page 57, but USS Halibut wasn’t mentioned on page 57 in Google. So I checked the copy I’d gotten from the library. That’s where I discovered the problem.

The copy from my public library, and the copy I’d originally used when creating the file to add to ShipIndex, came from the first edition of the book, published in 1998 by Public Affairs, a division of Perseus Books. But the copy on Google Books is the paperback edition, published by HarperCollins in 1999, and the pagination, layout, and nearly every other aspect differ between the two. The HarperCollins version has 432 pages, while the Perseus version has 352. While the content may be exactly the same, the pagination is obviously different, so linking doesn’t work the way it should.

So now it seems that, in order to make the Google Books linking continue to work, I need to find an index to the HarperCollins edition of the book, and replace the index I’d compiled from the Public Affairs edition. It’s likely not a big deal to get done, but I thought it was an interesting problem that we may come up against more and more in the future.

New content added in past few weeks

Friday, October 8th, 2010

Here’s an overview of the new content added in the past few weeks. Two collections are of particular note: the Lloyd’s List for 1812, via 1812Privateers.org, and the Dyal Ship Collection. One man, Michael Dun, has digitized and indexed all of the issues of Lloyd’s List for the entire year of 1812. It’s quite a feat. He’s indexed all of the ships and all of the masters for that time, adding up to nearly 26,000 ship citations in all the issues of Lloyd’s List for 1812. He kindly shared his index with me, so I could include links to his resources. Mr. Dun hosts the pages on his servers, and they are accessible to all via that site. While working through the index of ship names that he provided to me, I was able to identify a number of corrections, and I incorporated those into the file I imported.

Working through this file was also an interesting reminder about the challenges we face in trying to make the most of these primary sources. Clearly, the folks who were putting together each issue of Lloyd’s List (it usually came out twice a week, and was published in London) were trying to get information out as quickly as possible, and weren’t too concerned with absolute accuracy, to say nothing of how researchers two centuries later would like them to present information.

As a few examples, each of the following sets of slight spelling variations by the editors likely refers to a single ship: Misletoe, Misseltoe, and Missletoe (there’s no Mistletoe listed in this year of Lloyd’s!). Or, Nymph, Nymphe, and Nymphen. Or Powhatan, Powahattan, and Powhatton. Or Zenophon and Zenophen, when the proper spelling is Xenophon. Or Tinmouth Castle, most likely meaning Teignmouth Castle. Or simple errors, like Hepsa instead of Hespa.

Of course, if you’re reading this at a London coffee shop one morning in 1812, you can easily look over these minor errors, and figure out what the editors’ intent was. But for researchers two centuries later, who are trying to mine large amounts of data to see what they can find, these errors cause a problem. So how do we address them? That’s an issue for an upcoming blog post. But, needless to say, we at ShipIndex.org have a solution…
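The approach ShipIndex actually uses is left for that later post. Purely as an illustration of the kind of tooling that helps, here is a rough sketch that groups names sharing a crude "skeleton" key (lowercased, doubled letters collapsed, vowels after the first letter dropped). The key and the function names are assumptions of this sketch, not the site's method, and a human still has to review every group it suggests.

```python
import re
from collections import defaultdict


def skeleton(name):
    """Crude matching key: lowercase, collapse doubled letters, drop later vowels."""
    key = name.lower()
    key = re.sub(r"(.)\1+", r"\1", key)  # "ss" -> "s", "tt" -> "t"
    return key[:1] + re.sub(r"[aeiou]", "", key[1:])


def group_variants(names):
    """Return lists of names that share a skeleton key (candidates for review)."""
    groups = defaultdict(list)
    for name in names:
        groups[skeleton(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

This key puts Misletoe, Misseltoe, and Missletoe in one group, and likewise Powhatan, Powahattan, and Powhatton. It does not catch transposed letters (Hepsa and Hespa get different keys) or a wrong first letter (Zenophon versus Xenophon), so it is a first pass at best.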

Another interesting addition is the Dyal Ship Collection, but for very different reasons. This is a collection of images and data compiled by a researcher (in this case, a librarian) and added to his institution’s “institutional repository” (IR). An IR is a site, usually maintained by an academic library, where content generated by the institution’s faculty, staff, and students is made available for free. It is, in a large sense, a reaction to the high cost of many academic journals, where an institution’s researchers spend time and money doing and compiling research, then pay to have that published in a scholarly journal, then the institution pays to buy the results back, through a subscription to the journal. The whole discussion is beyond the scope of this blog post, but the point is that IRs are places where interesting and useful information can be stored — but it’s most often quite hidden, unless there’s some effective way of indexing the content.

So, with the encouragement and assistance of the compiler, we’ve created links into the collection of files and images that are stored in Texas Tech University’s institutional repository. Recently, we’ve heard from others who have data they’d like us to include, and we’re looking at ways of doing that effectively. This is just one example of that.

Other items we’ve added are mostly more standard print or online collections. The total list is as follows:

If you have maritime content that you’d like to get online, or is online but needs broader publicity, please let us know. We’d love to find a way to help.

Is there a better way to present this data?

Thursday, April 1st, 2010

I’ve been working on a big file that’s going to be very useful to ShipIndex.org subscribers, especially those interested in World War II vessels. H.T. Lenton’s tome, British and Imperial Warships of the Second World War, is an incredible resource. Its 750+ pages are absolutely jam-packed with useful content, but it has presented me with a few challenging issues about how to manage this data. I thought I’d describe some of it here, explain what my plan is, and see if the greater good has any better suggestions. There’s still time to modify how this resource is managed. I’ve probably invested at least 30 full hours in preparing this file – and that doesn’t include a significant amount of work done by another person before me – and I still have a long way to go. But that’s what it takes, sometimes, to get a resource like this one ready to add to the database.

The first part of this remarkable volume looks at larger, named vessels, organized by vessel type and class. As one example, the “Corvettes and Frigates” section is divided into entries on the “Flower” class, the “River” class, the “Kil-” class, and four more classes. (The introduction has several fascinating paragraphs about the peregrinations of naming vessels, and shows how complicated the whole process was. A fair bit of background knowledge is required just to understand this section!) After some commentary on the design and development of the class, Lenton provides tables showing brief history information for every vessel in a class. Information may be quite extensive, or it might consist of as little as an indication of the intended builder and the approximate cancellation date (for example, for vessels ordered but not begun before the war ended).

This works fine for named vessels, but creates a conundrum for unnamed vessels. In the LCM (Landing Craft Mechanised) section, for example, the index notes that “LCM.21-118” appear on pg 490; “LCM.119-220” on pg 491, “LCM.221-334” on pg 492, etc. Of the 100+ ships on each page, though, just two to three dozen have any information at all about the vessel, and that information is slight, at best. For the LCMs, most have no Building or Completion information. Of the ones that have “Fate” information, it usually reads something like “Lost cause unknown Algiers ../11/42.” (Meaning it was lost in November 1942, but the exact date and cause are not known.)

To me, this information might be useful to someone, and I don’t want to leave out the entry for that vessel. But for each one like that, there are several where no information at all is included, and I believe that adding an entry to ShipIndex.org should imply that at least SOMETHING is available in the resource. So I’ve decided that what I’ll do is expand entries like “LCM.21-118” to be “LCM.21”, “LCM.22”, “LCM.23”, etc., up to “LCM.118”. Then I’ll compare my list with the book itself. If there’s any information at all about the vessel, I’ll keep the entry. If there is no information beyond its listing on the page – nothing about where it was built, or how it was lost, for instance – then I’ll delete it. My thought is that if the volume offers one piece of information, I’ll include the vessel name in the index.
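The expand-then-filter plan can be sketched in a few lines. This is only an illustration under assumptions: the function names are hypothetical, and `details` stands in for whatever notes get keyed in from the book for each vessel.

```python
import re


def expand_range(entry):
    """Expand an index entry like 'LCM.21-118' into 'LCM.21' ... 'LCM.118'."""
    match = re.fullmatch(r"(.+?\.)(\d+)-(\d+)", entry)
    if not match:
        return [entry]  # not a numbered range; keep the entry unchanged
    prefix = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(start, end + 1)]


def keep_documented(names, details):
    """Keep only vessels for which at least one piece of information is recorded."""
    return [name for name in names if details.get(name)]
```

Expanding “LCM.21-118” gives 98 candidate entries; passing those through `keep_documented` with a dictionary of the vessels that actually have Building, Completion, or Fate notes leaves only the entries worth indexing.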

Still, it’s worth noting that for people who are working on an unlisted LCM, the volume may contain information about the LCM class that might be relevant. And if you’re looking for an image of a specific auxiliary vessel, it may be that an image of a different vessel in the same class will do. It appears that the most common vessel type to which this will apply will be the LCMs, of which several thousand were built, but it will be interesting to see how it actually turns out.

Am I doing the right thing? Should I be handling this in some other way? Is there some other way that I should note the amount of information presented? I’d welcome your comments – if there’s a better way of doing it, now’s the time for me to hear about it.

The messiest metadata yet…

Sunday, January 24th, 2010

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re the same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching for is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.
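Pairs like these are easy to flag mechanically, even if a person still has to decide which spelling is right. Here is a minimal sketch, assuming each record is a (name, launch year) pair; the function names are hypothetical. It reports records that share a year and whose names differ by exactly one substituted character.

```python
from itertools import combinations


def one_substitution_apart(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def suspect_pairs(records):
    """Pairs of (name, year) records with the same year and near-identical names."""
    return [
        (first, second)
        for first, second in combinations(records, 2)
        if first[1] == second[1] and one_substitution_apart(first[0], second[0])
    ]
```

Run over the listings above, this pairs Affaric with Affarie and the 1917 Afghanistan with Afghauistan. It does not flag the 1905 “Afghantstan”, because no correctly spelled record shares its year; catching that one needs a comparison against known names instead.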

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the pdfs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with question marks – probably as a result of some hinky encoding issues – but many others are just deleted. When I can find them, I put them back – so someone who knows a ship’s name will be able to find it – but I’m afraid that’s not going to happen most of the time.
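Two small helpers show how this kind of repair can be supported. This is a sketch under assumptions, not the tooling actually in use: `fold` builds an accent-insensitive search key with Python's standard `unicodedata` module, and `candidates_for` treats each question mark in a damaged name as one unknown character and looks for matches in a list of known names (both function names are hypothetical).

```python
import re
import unicodedata


def fold(name):
    """Accent-insensitive search key, e.g. 'Hillevåg' -> 'hillevag'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def candidates_for(damaged, known_names):
    """Known names that fit a damaged name, reading each '?' as one character."""
    pattern = re.compile(".".join(re.escape(part) for part in damaged.split("?")))
    return [name for name in known_names if pattern.fullmatch(name)]
```

With a folded key, a search for “Agnes” can find both “Agnés” and “Agnès”, and “Hillev?g” can be matched back to “Hillevåg”. The version with the letter deleted outright (“Hillevg”) is not recovered by either helper, and letters such as ø that have no decomposed form pass through `fold` unchanged.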

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)
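As a rough idea of what such a regular expression might look like, the sketch below inserts a space wherever a lowercase letter is immediately followed by an uppercase one. The function names are hypothetical, and the pattern is deliberately simple; it only proposes changes so they can be reviewed before anything is overwritten.

```python
import re

# Zero-width position between a lowercase letter and the uppercase letter after it.
MISSING_SPACE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def add_missing_spaces(name):
    """Insert a space at each lowercase-to-uppercase boundary."""
    return MISSING_SPACE.sub(" ", name)


def propose_fixes(names):
    """Return {original: proposed} for names that would change, for human review."""
    return {
        name: add_missing_spaces(name)
        for name in names
        if add_missing_spaces(name) != name
    }
```

This turns “PacificCommerce” into “Pacific Commerce” and “PacificFir” into “Pacific Fir”. It would also split a name like “McKinley” (a made-up example here) into “Mc Kinley”, which is exactly the sort of unintentional error to watch for when reviewing the proposals.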

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.
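Both kinds of date problem can be surfaced with simple checks. The sketch below assumes each record is a (name, build year, tonnage) tuple and that the publication year of the register issue is known; the function names are hypothetical. It only flags suspects; deciding which year is right still means looking at the scanned page.

```python
from collections import defaultdict


def impossible_dates(records, published):
    """Records whose build year is later than the issue's publication year."""
    return [record for record in records if record[1] > published]


def conflicting_years(records):
    """Map (name, tonnage) to its sorted build years when more than one appears."""
    years = defaultdict(set)
    for name, year, tonnage in records:
        years[(name, tonnage)].add(year)
    return {key: sorted(found) for key, found in years.items() if len(found) > 1}
```

For the two Alan Seeger records, `conflicting_years` reports one vessel of 7208 tons with build years 1913 and 1943, which is the cue to check the original.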

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss, and will still get introduced to the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Data correction work at ShipIndex.org

Thursday, December 24th, 2009

We’ve completed our initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. And now I’m turning to doing some cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

Many 'USS' listings in ShipIndex.org

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.
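One way to approach the prefix problem, sketched here under assumptions rather than as a description of the actual fix, is to file each name under a key with the prefix removed. The prefix list holds just the two mentioned above, and the function name is hypothetical.

```python
import re

# Only the two prefixes mentioned above; a real list would be longer.
PREFIX = re.compile(r"^(USS|HMS)\s+")


def filing_name(name):
    """Name to file a vessel under, with a leading 'USS' or 'HMS' removed."""
    return PREFIX.sub("", name)
```

So “USS Halibut” would be filed under “Halibut”, while a name that merely begins with the same letters but has no space after them is left alone.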

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

Abbreviated entries in ShipIndex.org

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)
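Matching abbreviated entries to full names can be partly automated. The sketch below is an illustration only: it ignores spaces and treats each apostrophe as standing for one or more omitted lowercase letters, then checks an abbreviation against a list of candidate full names. The function names and the candidate list are assumptions, and names with a genuine apostrophe would need separate handling.

```python
import re


def abbreviation_pattern(abbrev):
    """Regex in which each apostrophe stands for one or more omitted letters."""
    compact = abbrev.replace(" ", "")
    parts = re.split(r"['\u2019]", compact)  # straight or curly apostrophe
    return re.compile("[a-z]+".join(re.escape(part) for part in parts))


def expand(abbrev, full_names):
    """Candidate full names that the abbreviated entry could stand for."""
    pattern = abbreviation_pattern(abbrev)
    return [name for name in full_names if pattern.fullmatch(name.replace(" ", ""))]
```

Against a candidate list containing “Duke of Newcastle”, all three printed forms above resolve to that name, while the mistranscribed “D’ke of N’woastle” matches nothing and so gets set aside for a manual look.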

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

Screenshot of the entry in the original source

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.