Category Archives: Data Correction

Updated OCLC WorldCat data – 20% more, and more accurate

I’ve updated an important resource, adding 20% to its contents, and improving the accuracy of all of the data in it. When we converted ShipIndex.org from a hobby to a business, we worked with OCLC to get a file of books by or about ships. For more about how these records are used, see the first of two posts about WorldCat records, here.

In any case, we agreed with OCLC that these records would remain in the free database, rather than the newly-created subscription database. There were about 40,000 records in that file. Last month, I had the opportunity to visit OCLC’s headquarters, in Dublin, Ohio. While there, I received an updated version of this file, which now contains over 50,000 authority records for ships.

I worked through the file, doing cleanup and corrections, and made a few attempts at loading the file into the ShipIndex.org database. It wasn’t as easy as other files, because the OCLC records are fully Unicode compliant. The database likes UTF-8, but Unicode is a bit beyond its abilities. (Actually, not in its abilities to display vessel names, but in its abilities to store them.) I replaced vessel names in Cyrillic, Japanese, Chinese, etc., with their transliterated names, and also removed a lot of the Unicode characters that were causing problems.

I also fixed a lot of names that I hadn’t fixed the first time around. Most of these were ship names with prefixes attached, like “USS Daffodil” or “HMS Daffodil” or “S/S Daffodil”. It’s always best to search without those prefixes. I have cleanup still to do on those leftover ship names, but the new records are live and I can do the cleanup later.
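As a rough illustration of the prefix cleanup described above, here is a minimal Python sketch. It is not the code ShipIndex.org uses; the prefix list is an assumption limited to the three prefixes named in this post, and `strip_prefix` is a hypothetical helper name.

```python
import re

# Assumed prefix list: only the three prefixes mentioned in the post.
PREFIXES = ("USS", "HMS", "S/S")

# Match a prefix only at the start of the name and only when it is
# followed by whitespace, so a name like "Ussuri" is left alone.
_PREFIX_RE = re.compile(
    r"^(?:" + "|".join(re.escape(p) for p in PREFIXES) + r")\s+",
    re.IGNORECASE,
)

def strip_prefix(name: str) -> str:
    """Return the vessel name without a leading prefix such as 'USS '."""
    return _PREFIX_RE.sub("", name.strip())

for raw in ("USS Daffodil", "HMS Daffodil", "S/S Daffodil", "Daffodil"):
    print(raw, "->", strip_prefix(raw))
```

Requiring whitespace after the prefix is a deliberate choice: it keeps the rule from chewing into names that merely begin with the same letters.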

So now, as a result, the OCLC WorldCat resource has grown from about 40,000 to about 50,000 citations, and the metadata is much improved. All of these citations are in the free database. This is a big improvement all around. Thanks again to OCLC for creating this file for me!

ShipIndex as a Vessel Name Authority File

[This entry was written long ago, but not posted, because I was having problems with uploading images. As you'll see, images are a critical part of this post! Now that I've gotten that problem resolved, I will add a few more posts soon. PMc]

Last May, I finally completed one very large file for import. This file was incredibly tough to process, but I learned a lot about how one can use the database, and I thought I’d share that information here.

The database is Mariners and Ships in Australian Waters, and it is a collection of transcribed passenger lists for thousands of voyages to Australia, primarily in the second half of the 19th century. Because most records were handwritten, and then transcribed by volunteers, many, many errors crept into the database.

The database has 58,311 records in it. (I believe more are always being added to the website itself, as transcribers complete their work.) One major difference between this and every other resource is that each voyage has a separate entry. In the Ellis Island Database, a user searches by ship name, then goes in deeper by voyage date. In this case, the collection is organized by arrival year, then arrival month, then ship name – so I had to create a separate entry for each voyage, to be able to link to each transcription.

I quickly realized that there were many, many, many errors in the transcription of vessel names. Just looking over the ship names as they appeared in the spreadsheet, it was easy to spot typos – especially with the additional information I had about masters and tonnage, which helped connect a misspelling to a correct spelling.

After correcting numerous such misspellings, I did a test import of the file and found 1707 new ship names would be added to the database. I started to investigate each of those, and found that many were not actually new ship names – they were simply additional mistranscriptions of the passenger lists. As the ShipIndex.org database grows, it’s important to try to minimize the introduction of incorrect ship names.

For example, I saw this entry, which the transcriber recorded as “Maealsar”. The master’s name had been transcribed as “C M de Boer”, and the vessel size as 305 tons.

authblog1
I thought it looked a bit like “Macassar”, but there were no other “Macassar”s in that file. I did a search in ShipIndex.org for Macassar (http://www.shipindex.org/ships/macassar), and found an entry from the American Lloyd’s Register of American and Foreign Shipping for the same year. It listed a Macassar with a captain C. M. De Boor, and tonnage of 306. Obviously, these are the same ship.

authblog2
I corrected the vessel name, but kept the mis-transcription, too, just in case I was wrong. So the entry now looks like this: “Macassar (corrected; listed as “Maealsar”) (of Amsterdam, C M de Boer, Master, 305 tons, from the port of Balaves to Sydney, New South Wales, 23 Mar 1861)”.

Another example was this name, which had been transcribed as “Magport”:

authblog-3

I thought it looked like it started with an “N”, but found no “Nagport” already in the database. However, a search for “nagp*” turned up “Nagpore”, among others, and a link to the entry of Record of American and Foreign Shipping for the same year returned these two ships:

authblog-4

One has the same master and tonnage as the one in the transcription. It then becomes clear that there’s an “e” hiding behind the bar on the page, rather than a “t”.

 

I felt like it became a combination of genealogy and authority record work. I tried to find sufficient documentation to prove that my analysis was more accurate than the original. And because I had both the entire set of metadata from the source, and the 2.3 million citations already in the ShipIndex.org database, I could more easily determine that various transcriptions were incorrect.

I recognized that ShipIndex.org is beginning to serve as an authority file for vessels. It is certainly my goal to improve the database along those lines, and I will use another blog post to discuss this further.

 

I did this sort of research many times, and while it took a very long time, it was actually quite fun to nail down a correction. Some were surprising – I guess I can see why one might read this as “Princess of Water”:

authblog-5

But why in the world would you not recognize that “Princess of Wales” makes infinitely more sense for a ship name?

 

I’ll provide two last examples here. This first one shows how I used the existing metadata for the resource itself to determine the correct ship name.

The beautiful handwriting on this one made it easy to read, and it’s not surprising that it was transcribed as “Oasby”. But there was only one entry in the entire file for “Oasby”, and none in the existing ShipIndex.org database, so it made me wonder.

authblog-6

A search through the metadata for the captain’s name, however, found 17 entries with Kennedy as captain (as had been noted in the transcription for this entry), for ship “Easby”, and the full resource has at least 70 other entries for “Easby”. Tonnage data is the same, and after learning of the existence of “Easby”, it’s easy to see that that’s what the ship name was; the top of the dramatic ‘E’ was lost in the digitizing process.

This made the next new ship name, “Oaton Hall”, easy to resolve to “Eaton Hall”.

Finally, I dealt with this challenging entry by using the existing ShipIndex.org database:

authblog-7

I tried searching for “waurego”, but that returned no ships. By searching for “*rego”, I found all the citations that had a word in the ship name that ends in “rego”. I could easily locate “Warrego”, and confirm that’s the right ship.

There’s other searching that could be done here, too. If I change the search to “*rego$” it returns only the ship names that actually end in “rego”, deleting several, like “Trego Renneger” or “Effrego Ventus”, from the result list.
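To make the difference between “*rego” and “*rego$” concrete, here is a small Python sketch that imitates the behavior described above with regular expressions. The translation rules are my assumption about how such a search could work, not a description of the actual ShipIndex.org search code, and `pattern_to_regex` and `search` are hypothetical names.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a search pattern into a regex (assumed semantics).

    '*' stands for any run of word characters inside a single word;
    a trailing '$' means the match must end the whole ship name.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with "any word characters".
    body = r"\w*".join(re.escape(part) for part in pattern.split("*"))
    tail = r"$" if anchored else r"\b"
    return re.compile(r"\b" + body + tail, re.IGNORECASE)

def search(pattern, names):
    """Return the names that contain a match for the pattern."""
    rx = pattern_to_regex(pattern)
    return [n for n in names if rx.search(n)]

names = ["Warrego", "Trego Renneger", "Effrego Ventus", "Macassar"]
print(search("waurego", names))   # the misread name finds nothing
print(search("*rego", names))     # any word ending in "rego"
print(search("*rego$", names))    # only names that end in "rego"
```

With the sample names from this post, the unanchored pattern keeps “Trego Renneger” and “Effrego Ventus”, while the anchored one leaves only “Warrego”.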

I’ll put together another post in the next few weeks with more examples of changes and corrections I was able to make, along with a discussion of the importance of authority data for ship names.

 

Deleting data – sometimes it must be done

I had to delete content from the database this morning. I’ve delayed doing it for a long time, but it had to be done. The “Property Management & Archive Record System” database, created by the US Department of Transportation’s Maritime Administration, was actually a very useful database, but was removed temporarily – and then permanently – so I really had no choice but to remove its contents from the ShipIndex.org database.

I had written the following description of the database:

This resource, called “PMARS”, is the official repository of records about vessels that are or were part of the US Maritime Administration’s National Defense Reserve Fleet (NDRF). As a result, it focuses on ships from World War II to the present. Only a few hundred vessels are still in the NDRF, but PMARS contains information about nearly all ships (over 7000) that were included in the NDRF at some point.

While the database contains “basic ship data” about each vessel, the “Custody Cards” and “Disposal Cards” are of particular interest. These are images of the printed, typed, or handwritten notes regarding disposition of each vessel.

I had a great experience at a library conference once, using the PMARS database. A special collections librarian from Occidental College, in California, wanted to learn more about a Victory ship called “Occidental Victory”, named after her institution. (Victory ships were slightly larger and more powerful than Liberty ships; both were quickly-built cargo ships used extensively during World War II, and critical to Allied success in the war.) We looked up “Occidental Victory” in the ShipIndex.org database, and found a record from PMARS. It included digitized images of the ship’s Disposal Card, which showed the history of the ship and its final outcome.

The database also showed that the Maritime Administration still owned the binnacle for the ship, and was willing to loan it to museums and libraries for exhibits! She was thrilled to discover this, and said she wanted to create an exhibit about the ship, and of course borrow the binnacle for the exhibit. I don’t know that this ever happened, but to discover the binnacle was available was, I thought, really neat.

The digitized Disposal Cards and Custody Cards were great items, too, and it’s such a shame that these things are no longer available online. One might think that in our digital environment, such items wouldn’t be lost or taken off-line. But when it happens (and it happens more often than one might think), the data is lost for good, because it wasn’t backed up elsewhere, such as in the form of multiple physical copies in many different libraries.

For a while, the PMARS links redirected you to a page that said something to the effect of, “for more information, contact ____.” So I did. A little over a year ago I contacted people at the US Maritime Administration to ask what had happened to PMARS, and if it was coming back. I got a nice, quick response, and was told that PMARS had been taken off-line “due to security concerns”, that great bugaboo of meaninglessness. It was expected to return in mid-2012, in the form of two different databases, but that didn’t happen.

Now, the links are simply dead, and take you nowhere. If PMARS does come back, in whatever form, I’ll quickly return it to the ShipIndex.org database. Until then, I feel the proper thing to do is to remove the content from the database.

But I do anticipate adding a lot of new content in the very near future; I have a project going on that should, if all goes well, add lots of great new content in the next ten days. It won’t replace the content lost from the loss of the PMARS database, but perhaps that will, in fact, come back some day.

On Naming Ships and Representing them in ShipIndex

At present, ShipIndex.org has one point of access: the vessel name. You’d think that would be fairly easy, at least in the case of extant vessels: just look at the stern or the bow, and see what’s written there. Alas, it’s not that simple. There are many reasons for this, and a lot of them are completely understandable. Others can lead to surprisingly interesting stories.

While working through the index to the first 50 years of Steamboat Bill, and its successor, PowerShips, I came across many, many mentions of the Queen Elizabeth 2. Most of these are listed under the very common, abbreviated name, “QE2”. In the ShipIndex database, however, one also finds many entries for a different version of the name, “Queen Elizabeth II”. I read a bit about the ship on its Wikipedia page, and learned some interesting stories about how the name came about. According to the contributors, the name of the ship was not announced before the launching. Cunard intended to name the ship “Queen Elizabeth”, but the Queen, when she launched the ship, stated “I name this ship Queen Elizabeth the Second.”

The next day, newspapers announced the name as “Queen Elizabeth II”, though when the ship was delivered its name read “Queen Elizabeth 2”. According to Wikipedia, “From at least 2002 the official Cunard website stated that ‘The new ship is not named after the Queen but is simply the second ship to bear the name – hence the use of the Arabic 2 in her name, rather than the Roman II used by the Queen’, however, in a change in 2007 this information had been removed.”

In addition, there’s confusion about who the ship is named after. Multiple sources provide multiple suggestions. Some feel the ship is named after the current Queen, and that, in fact, she made that change when she announced its name. Others state that it is named after her mother, the wife of King George VI. Others state it’s named after the previous Cunard ship named Queen Elizabeth.

We need to make it possible for people to find ship names however they might be represented, and so we’ve created functionality that allows one to link between variant names for specific ships. So, for example, when you search for “QE2”, you find entries that cite “QE2”, but you also find a link at the top taking you to entries for other variant names for this ship, specifically “Queen Elizabeth 2” and “Queen Elizabeth II”.

We also have the ability to ‘normalize’ ship names, and in that case, one goes directly from a misspelling of a ship name to the correctly spelled entry. So, by rights, we should ‘normalize’ “QE2” and “Queen Elizabeth II” to “Queen Elizabeth 2”. But I think that, in this case, for this very famous ship, it’s worth maintaining the separate entries and linking them together via the “alternate spelling” links. Maybe I’m wrong; should I just normalize them all together? What do you think?

We also show links for previous and subsequent names of ships. So, if you search for “Euterpe”, you’ll see a “subsequent name” link to “Star of India.” It is important to remember that if there are multiple ships with the name “Euterpe,” the link appears, but doesn’t apply to all of them. Creating a system that separates out all these ships is a big project, but one that we will tackle.

One great thing about the Steamboat Bill files is that they include many previous and subsequent vessel names. Unfortunately, they don’t exactly indicate the order in which vessel names appeared; you’ll see both “Liberte; a) Brasil; b) Volendam; c) Monarch Sun; d) Volendam; e) Island Sun; g) Canada Star h) Queen of Bermuda” and “Queen of Bermuda; a) Brasil; b) Volendam; c) Monarch Sun; d) Volendam; e) Island Sun; f) Liberte; g) Canada Star”, as well as “Island Sun; a) Volendam”. So, some research is needed to figure out the order in which the ship names appeared. Then, I still have a question about whether I should include all of the previous and subsequent names in each entry. In the above example, if I determine that the actual path of ship name changes was Queen of Bermuda, then Brasil, then Volendam, then Monarch Sun, then Volendam (again), then Island Sun, then Liberte and finally Canada Star, do I include ‘subsequent name’ links from Brasil to Volendam, Monarch Sun, Island Sun, Liberte, and Canada Star? That creates a lot of links. Or do I just have a link from Queen of Bermuda to Brasil, and on Brasil a link to Volendam?

And if I list all previous or subsequent names for a ship that had the same name twice, then in this case the entry for Brasil (and Queen of Bermuda, and others) will have multiple ‘subsequent name’ links to Volendam. The page for Volendam could conceivably have a link back to itself!
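The two options can be compared with a short Python sketch. It assumes the name history is stored as one ordered list per hull, using the hypothetical order from the example above (which still needs research to confirm). `adjacent_links` and `all_subsequent` are names made up for this illustration, not part of ShipIndex.org.

```python
# Hypothetical order of names for one hull, taken from the example above.
history = ["Queen of Bermuda", "Brasil", "Volendam", "Monarch Sun",
           "Volendam", "Island Sun", "Liberte", "Canada Star"]

def adjacent_links(history):
    """Option 1: one 'subsequent name' link from each name to the next."""
    return list(zip(history, history[1:]))

def all_subsequent(history, name):
    """Option 2: every distinct later name, counted from the first use
    of `name`, with repeated names collapsed into a single link."""
    start = history.index(name)
    seen, later = set(), []
    for n in history[start + 1:]:
        if n not in seen:
            seen.add(n)
            later.append(n)
    return later

print(adjacent_links(history))
print(all_subsequent(history, "Brasil"))
print(all_subsequent(history, "Volendam"))  # includes "Volendam" itself
```

Under option 1 the chain needs seven links in total. Under option 2, Brasil alone gets five links, and the Volendam entry ends up pointing back to itself unless that case is filtered out.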

What do you think? What’s the best way to represent this important data?

How variant editions can screw up Google Books links

As we’ve mentioned in the blog before, you can link to the full text of many, many resources cited in ShipIndex.org. In fact, with a recent addition of a file containing tens of thousands of online ship images, nearly 90% of the citations provide full-text linking. Much of the linking comes through links to online resources, but others are available via links to books in Google Book Search.

A few weeks ago, several of us at ShipIndex were using some of these links, and found that many links for Sherry Sontag’s book Blind Man’s Bluff didn’t seem to work. While the links took one to the page cited in the index, the vessel mentioned in the index wasn’t listed on the page that we ended up at in Google Books. So today I picked up a copy of Blind Man’s Bluff from my local public library, to see if I’d made a lot of mistakes in working through the index.

I found that, in fact, I hadn’t made any mistakes – the page numbers in ShipIndex were the same as the page numbers listed in the back of the book. So then I re-tried some of the Google Book links we offer. Once again, a link to page 57 took me to page 57, but USS Halibut wasn’t mentioned on page 57 in Google. So I checked the copy I’d gotten from the library. That’s where I discovered the problem.

The copy from my public library, and the copy I’d originally used when creating the file to add to ShipIndex, came from the first publication of the book, by Public Affairs, a division of Perseus Books, and first published in 1998. But the copy on Google Books is the paperback edition, published by HarperCollins, in 1999, and the pagination, layout, and nearly every other aspect is completely different between the two. The HarperCollins version has 432 pages, while the Perseus version has 352. While the content may be exactly the same, the pagination is obviously different, so linking doesn’t work the way it should.

So now it seems that, in order to make the Google Books linking continue to work, I need to find an index to the HarperCollins edition of the book, and replace the index I’d compiled from the Public Affairs edition. It’s likely not a big deal to get done, but I thought it was an interesting problem that we may come up against more and more in the future.

New content added in past few weeks

Here’s an overview of the new content added in the past few weeks. Two collections are of particular note: the Lloyd’s List for 1812, via 1812Privateers.org, and the Dyal Ship Collection. One man, Michael Dun, has digitized and indexed all of the issues of Lloyd’s List for the entire year of 1812. It’s quite a feat. He’s indexed all of the ships and all of the masters for that time, adding up to nearly 26,000 ship citations in all the issues of Lloyd’s List for 1812. He kindly shared his index with me, so I could include links to his resources. Mr. Dun hosts the pages on his servers, and they are accessible to all via that site. While working through the index of ship names that he provided to me, I was able to identify a number of corrections, and I incorporated those into the file I imported.

Working through this file was also an interesting reminder about the challenges we face in trying to make the most of these primary sources. Clearly, the folks who were putting together each issue of Lloyd’s List (it usually came out twice a week, and was published in London) were trying to get information out as quickly as possible, and weren’t too concerned with absolute accuracy, to say nothing of how researchers two centuries later would like them to present information.

As a few examples, the slight spelling variations in each of the following groups likely refer to the same ship: Misletoe, Misseltoe, and Missletoe (there’s no Mistletoe listed in this year of Lloyd’s!). Or, Nymph, Nymphe, and Nymphen. Or Powhatan, Powahattan, and Powhatton. Or Zenophon and Zenophen, when the proper spelling is Xenophon. Or Tinmouth Castle, most likely meaning Teignmouth Castle. Or simple errors, like Hepsa instead of Hespa.

Of course, if you’re reading this at a London coffee shop one morning in 1812, you can easily look past these minor errors, and figure out what the editors’ intent was. But for researchers two centuries later, who are trying to mine large amounts of data to see what they can find, these errors cause a problem. So how do we address them? That’s an issue for an upcoming blog post. But, needless to say, we at ShipIndex.org have a solution…
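As an aside, one generic way to surface candidate variants like these is fuzzy string matching. The Python sketch below uses the standard library’s `difflib`; it is only an illustration and is not the ShipIndex.org solution mentioned above. The 0.8 threshold and the greedy grouping are assumptions chosen for this small sample.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumed cut-off; a real file would need tuning

def similar(a: str, b: str) -> bool:
    """True when two names are close enough to deserve a second look."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= THRESHOLD

def cluster(names):
    """Greedy grouping: each name joins the first group whose first
    member it resembles, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similar(group[0], name):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

names = ["Misletoe", "Misseltoe", "Missletoe", "Nymph", "Nymphe",
         "Nymphen", "Powhatan", "Powahattan", "Powhatton",
         "Zenophon", "Zenophen"]
for group in cluster(names):
    print(group)
```

A grouping like this only proposes candidates. It cannot tell that Zenophon should be Xenophon, and on a larger list it would also pair up genuinely different ships with similar names, so a person still has to confirm each group.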

Another interesting addition is the Dyal Ship Collection, but for very different reasons. This is a collection of images and data compiled by a researcher (in this case, a librarian) and added to his institution’s “institutional repository” (IR). An IR is a site, usually maintained by an academic library, where content generated by the institution’s faculty, staff, and students is made available for free. It is, in a large sense, a reaction to the high cost of many academic journals, where an institution’s researchers spend time and money doing and compiling research, then pay to have that published in a scholarly journal, then the institution pays to buy the results back, through a subscription to the journal. The whole discussion is beyond the scope of this blog post, but the point is that IRs are places where interesting and useful information can be stored — but it’s most often quite hidden, unless there’s some effective way of indexing the content.

So, with the encouragement and assistance of the compiler, we’ve created links into the collection of files and images that are stored in Texas Tech University’s institutional repository. Recently, we’ve heard from others who have data they’d like us to include, and we’re looking at ways of doing that effectively. This is just one example of that.

Other items we’ve added are mostly more standard print or online collections. The total list is as follows:

If you have maritime content that you’d like to get online, or is online but needs broader publicity, please let us know. We’d love to find a way to help.

Is there a better way to present this data?

I’ve been working on a big file that’s going to be very useful to ShipIndex.org subscribers, especially those interested in World War II vessels. H.T. Lenton’s tome, British and Imperial Warships of the Second World War, is an incredible resource. Its 750+ pages are absolutely jam-packed with useful content, but it has presented me with a few challenging issues about how to manage this data. I thought I’d describe some of it here, explain what my plan is, and see if the greater good has any better suggestions. There’s still time to modify how this resource is managed. I’ve probably invested at least 30 full hours in preparing this file – and that doesn’t include a significant amount of work done by another person before me – and I still have a long way to go. But that’s what it takes, sometimes, to get a resource like this one ready to add to the database.

The first part of this remarkable volume looks at larger, named vessels, organized by vessel type and class. As one example, the “Corvettes and Frigates” section is divided into entries on the “Flower” class, the “River” class, the “Kil-” class, and four more classes. (The introduction has several fascinating paragraphs about the peregrinations of naming vessels, and shows how complicated the whole process was. A fair bit of background knowledge is required just to understand this section!) After some commentary on the design and development of the class, Lenton provides tables showing brief history information for every vessel in a class. Information may be quite extensive, or it might consist of as little as an indication of the intended builder and the approximate cancellation date (for example, for vessels ordered but not begun before the war ended).

This works fine for named vessels, but creates a conundrum for unnamed vessels. In the LCM (Landing Craft Mechanised) section, for example, the index notes that “LCM.21-118” appear on pg 490; “LCM.119-220” on pg 491, “LCM.221-334” on pg 492, etc. Of the 100+ ships on each page, though, just two to three dozen have any information at all about the vessel, and that information is slight, at best. For the LCMs, most have no Building or Completion information. Of the ones that have “Fate” information, it usually reads something like “Lost cause unknown Algiers ../11/42.” (Meaning it was lost in November 1942, but the exact date and cause are not known.)

To me, this information might be useful to someone, and I don’t want to leave out the entry for that vessel. But for each one like that, there are several where no information at all is included, and I believe that adding an entry to ShipIndex.org should imply that at least SOMETHING is available in the resource. So I’ve decided that what I’ll do is expand entries like “LCM.21-118” to be “LCM.21”, “LCM.22”, “LCM.23”, etc., up to “LCM.118”. Then I’ll compare my list with the book itself. If there’s any information at all about the vessel, I’ll keep the entry. If there is no information beyond its listing on the page – nothing about where it was built, or how it was lost, for instance – then I’ll delete it. My thought is that if the volume offers one piece of information, I’ll include the vessel name in the index.
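The expand-then-filter plan can be sketched in a few lines of Python. This is an illustration under assumptions: the range format is inferred from the “LCM.21-118” style index entries, the `details` mapping stands in for the manual comparison with the book, and the vessel number attached to the sample note is invented.

```python
import re

# Matches index entries such as "LCM.21-118" (format assumed from the post).
RANGE_RE = re.compile(r"^(?P<prefix>[A-Z]+)\.(?P<first>\d+)-(?P<last>\d+)$")

def expand_range(entry):
    """Turn 'LCM.21-118' into ['LCM.21', 'LCM.22', ..., 'LCM.118'].
    Entries that are not ranges come back unchanged, as a one-item list."""
    m = RANGE_RE.match(entry)
    if m is None:
        return [entry]
    first, last = int(m["first"]), int(m["last"])
    return [f"{m['prefix']}.{n}" for n in range(first, last + 1)]

def keep_documented(names, details):
    """Keep only vessels for which the book records at least one fact."""
    return [n for n in names if details.get(n, "").strip()]

# Invented sample: which LCM the note belongs to is not stated in the post.
details = {"LCM.25": "Lost cause unknown Algiers ../11/42"}

expanded = expand_range("LCM.21-118")
print(len(expanded), expanded[:3], expanded[-1])
print(keep_documented(expanded, details))
```

The filtering step is where the real work is, since `details` has to be filled in by reading each page of the book.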

Still, it’s worth noting that for people who are working on an unlisted LCM, the volume may contain information about the LCM class that might be relevant. And if you’re looking for an image of a specific auxiliary vessel, it may be that an image of a different vessel in the same class will do. It appears that the most common vessel type in which this will apply will be the LCMs, of which several thousand were built, but it will be interesting to see how it actually turns out.

Am I doing the right thing? Should I be handling this in some other way? Is there some other way that I should note the amount of information presented? I’d welcome your comments – if there’s a better way of doing it, now’s the time for me to hear about it.

The messiest metadata yet…

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re the same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.
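A check like this one can be partly automated. The Python sketch below groups rows by launch date and tonnage and reports any group that carries more than one spelling. The rows come from examples in this post; the three-column layout and the name `suspect_groups` are assumptions for illustration.

```python
from collections import defaultdict

# (name, launch year, tonnage) rows taken from examples in this post.
rows = [
    ("Affaric", 1934, 239),
    ("Affarie", 1934, 239),
    ("Hillevåg", 1885, 877),
]

def suspect_groups(rows):
    """Return {(year, tonnage): [names]} wherever one year and tonnage
    pair appears under more than one spelling."""
    by_key = defaultdict(set)
    for name, year, tons in rows:
        by_key[(year, tons)].add(name)
    return {key: sorted(names) for key, names in by_key.items()
            if len(names) > 1}

print(suspect_groups(rows))
```

Two different ships can share a launch year and tonnage, so each reported group is only a prompt to look at the original page, not proof of a typo.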

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the pdfs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with question marks – probably as a result of some hinky encoding issues – but many others are just deleted. When I can find them, I put them back – so someone who knows a ship’s name will be able to find it – but I’m afraid that’s not going to happen most of the time.
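For the question-mark cases, one possible aid is to treat each “?” as a stand-in for exactly one unknown character and look for an intact spelling among entries with the same date and tonnage. The Python sketch below shows the idea; it is an assumption about how one might do it, not how the file was actually corrected, and `damaged_to_regex` and `restore` are hypothetical names.

```python
import re

def damaged_to_regex(damaged: str) -> re.Pattern:
    """Build a regex in which every '?' matches any single character
    and every other character must match literally."""
    parts = ("." if ch == "?" else re.escape(ch) for ch in damaged)
    return re.compile("^" + "".join(parts) + "$")

def restore(damaged, candidates):
    """Return candidate spellings that could be the undamaged name."""
    rx = damaged_to_regex(damaged)
    return [c for c in candidates if c != damaged and rx.match(c)]

# The three versions listed above, all dated 1885 with tonnage 877.
candidates = ["Hillev?g", "Hillev\u00e5g", "Hillevg"]
print(restore("Hillev?g", candidates))
```

This only helps when an intact spelling exists somewhere in the file. The version with the letter deleted outright (“Hillevg”) gives no hint about where the missing character was, which is why those are so much harder to repair.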

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)
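Here is roughly what such a regular expression could look like in Python, along with an example of the kind of unintentional error it can introduce. The exception list is hypothetical, and “McKinley” is an invented test name rather than one taken from the register.

```python
import re

# Zero-width match between a lowercase letter and an uppercase letter.
_MISSING_SPACE_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")

def add_missing_spaces(name, keep_as_is=frozenset()):
    """Insert a space wherever a lowercase letter is directly followed
    by an uppercase one, unless the name is on the exception list."""
    if name in keep_as_is:
        return name
    return _MISSING_SPACE_RE.sub(" ", name)

print(add_missing_spaces("PacificCommerce"))   # Pacific Commerce
print(add_missing_spaces("PacificFir"))        # Pacific Fir
print(add_missing_spaces("McKinley"))          # Mc Kinley (unwanted)
print(add_missing_spaces("McKinley", keep_as_is={"McKinley"}))
```

Names that legitimately contain an internal capital are the main risk, so it would be wise to review the list of changed names before importing rather than trusting the substitution blindly.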

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.
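Both date checks are easy to express in code. The Python sketch below is an illustration only: the issue year is a made-up value, the third row is invented to show the “built after publication” case, and the function names are hypothetical.

```python
from collections import defaultdict

ISSUE_YEAR = 1945  # hypothetical publication year of the register issue

rows = [
    ("Alan Seeger", 1943, 7208),
    ("Alan Seeger", 1913, 7208),
    ("Invented Example", 1980, 500),  # invented row, not from the register
]

def built_after_issue(rows, issue_year):
    """Rows whose build date is later than the issue they appear in."""
    return [row for row in rows if row[1] > issue_year]

def conflicting_years(rows):
    """Same name and tonnage but different build dates: flag for proofing."""
    years = defaultdict(set)
    for name, year, tons in rows:
        years[(name, tons)].add(year)
    return {key: sorted(found) for key, found in years.items()
            if len(found) > 1}

print(built_after_issue(rows, ISSUE_YEAR))
print(conflicting_years(rows))
```

The first check finds dates that cannot be right. The second only finds disagreements, and someone still has to look at the scanned page to decide which date is the correct one.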

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss, and will still get introduced to the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Data correction work at ShipIndex.org

We’ve completed our initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. And now I’m turning to doing some cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

Many 'USS' listings in ShipIndex.org

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

Abbreviated entries in ShipIndex.org

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)
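One simple way to test whether an abbreviated entry could stand for a given full name is to ask whether its letters appear, in order, within the full name. The Python sketch below illustrates that idea; it is an assumption for illustration, not a description of the tools the ShipIndex.org technology team built, and `letters` and `is_abbreviation_of` are hypothetical names.

```python
def letters(name: str) -> str:
    """Lowercase the name and drop spaces, apostrophes, and punctuation."""
    return "".join(ch.lower() for ch in name if ch.isalpha())

def is_abbreviation_of(short: str, full: str) -> bool:
    """True when the letters of `short` occur in order within `full`."""
    remaining = iter(letters(full))
    # `ch in remaining` advances the iterator past the matched letter,
    # so each letter must be found after the previous one.
    return all(ch in remaining for ch in letters(short))

full = "Duke of Newcastle"
for entry in ("D’keof N’wcastle", "D’keofN’wcastle",
              "D’ke of N’wc’stle", "D’ke of N’woastle"):
    print(entry, "->", is_abbreviation_of(entry, full))
```

The first three variants pass, while “D’ke of N’woastle” fails because of the stray “o”, which is exactly the transcription error noted above. A failed test is a useful flag that an entry needs more than a simple expansion.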

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

ScreenHunter_03 Dec. 24 10.52

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.