New Content: Am Nep index, 1991-95

I just uploaded content from a five-year index to American Neptune, covering 1991 through 1995, to the premium database. Adding this sort of content is likely the most useful for the most users; it’s great to have one place where you can locate content online from numerous sources, but locating print-only content is a lot trickier.

I have a number of other journal indexes to add, and I want to get them done as quickly as I can. I’m working on it…

The ships mentioned in the 50-year index to American Neptune remain in the freely-available database; they’ll stay there permanently. Newly added content, however, will be going into the premium database.

As always, let me know of content you think should be added.

Lots of new content…

I just finished uploading 5000+ additional citations today, which reminds me that it’s about time for an update on what’s been added in the past few weeks. With the most recent import, I’ve added content from three of the five books in Paul Silverstone’s “U.S. Navy Warship Series”. The series covers the history of the US Navy from 1775 to 2007, in a series of five attractive and comprehensive books, published by the Naval Institute Press and Routledge.

I’ve added Civil War Navies, 1855-1883; The Navy of World War II, 1922-1947; and The Navy of the Nuclear Age, 1947-2007. I still need to work through and add index content from The Sailing Navy, 1775-1854 and The New Navy, 1883-1922. Of the ones I’ve added so far, the WWII volume (added today) and the Nuclear Age volume each have over 5,000 entries in their indexes. The Civil War volume has many fewer, and unfortunately doesn’t include any merchant vessels in the index, which is certainly a shame.

Anyway, here’s a list of most of what I’ve added since the last listing of newly-added content, nearly a month(!) ago:

That’s a pile of stuff! Multiple Navy Records Society volumes, which are particularly valuable for those studying British naval history; the Silverstone volumes and the PMARS database for those working on US naval history; Early South Carolina Newspapers Database for those interested in Southern US colonial history; several resources for steamship buffs (especially the steamship postcards available in Newman’s online collection); Mains’l Haul for Western and general history, and some random things, as well. In the past month, it looks like I’ve added content from two journal indexes, two online resources, and a pile of books.

You can always see new content added to the database on the resources page. Any content added in the past 45 days will have a “NEW!” icon next to it. As you can see from that page, that adds up to a lot of new stuff.

In addition, I’ve reimported most (but not all) of the freely-available files, so that they’ll show the illustration icon when they’ve got one. Those files were added to the database before we had the illustration and “main entry” icons. You could still tell that an entry had an illustration (usually because the page number was in italics), but the entry didn’t show the icon. Now that I’ve processed and reimported those files, the icons are appearing. I’m still working on one big file, but I’ve covered a lot of the others. That’s some of what’s going on at ShipIndex world headquarters.

As always, let me know if there’s content you’d like to see added (more NRS volumes are on the way, as are a couple of important journal indexes), or if you have any other items to share.

Cool new enhancements!

Well, we’ve done a ton of stuff since coming back from Boston. While in Boston at the ALA Midwinter conference, Mike and I met with about fifteen different people to get feedback on how to improve the site. Each meeting was about 45 minutes long, and the whole experience was really fantastic. We met with academic reference librarians, public librarians, electronic resources librarians, genealogy librarians, authors, content providers, folks with library services businesses that we admire, and tons more. We came away with pages and pages and pages of modifications to make.

Some of these changes are/were easy, and some will be a lot tougher. On Saturday, Mike put new code up on the site, and many of the changes are now visible there. Since we do a lot of iterative releases, we don’t use ‘release numbers,’ but if we did, all the enhanced functionality that has just gone live would definitely deserve a ‘dot version’ – like, say, from 2.1 to 2.2. And, in fact, it probably would deserve an upgrade from version 2.x to 3.0, because of the new institutional access that I’ll get to later. (That doesn’t have much front-end visibility, but it has been a huge change on the back end.)

Here are a few of the changes you’ll see:

  • A “new” icon next to any item added in the last 45 days.
  • Better layout on the results pages
  • Better diacritics management
  • Links to resources open in new windows
  • More, and updated, information on the webpage, especially regarding individual subscriptions
  • A completely new “librarians” tab, with information for librarians, regarding our new institutional service

In addition, he created a number of tools that will help us better identify and proactively correct data issues.

With the new importing tools, I’ve imported several new files in the last few days, and have also started to go back to improve and reimport some of the older files. There are a number of files in the freely-accessible collection that have illustrations but don’t indicate that on the results pages. I’ve already corrected a few of those, and more will be corrected soon. Those don’t count as “new” resources, and they remain freely-accessible.

The biggest deal, though, is INSTITUTIONAL ACCESS! We can now offer subscriptions via IP-authentication, for institution-wide access. Check out our librarians page for more information about this. If you’re interested in setting up a trial for your institution, please drop us a line at sales (at) shipindex (dot) org. Or recommend us to your local librarian! We can provide access for academic, public, special, and other libraries. And, to top it off, we’re offering “plankowner” discounts for institutions that join us before June. Contact us soon for more information.

This release is a big deal all around for us, and it’ll lead to a lot more content being added (two completely new resources have already been added today, and four have been improved and updated over the past two days). Results will be easier to use, and of course institutions can now subscribe, as well.

We’ve got more improvements and enhancements in the works, so let us know about any changes you’d like to see.

Last night’s dream

So, I don’t usually remember my dreams. It’s just the way I am. When I do, though, I try to pay attention.

Last night, I dreamt that I was visiting a library, and meeting with librarians there. Not too unusual, except for a few things. First, there was a freeway running through the library. Well, not running through it — I think the library and freeway were built at the same time, so really, they were part of each other. You could say the freeway had a library built around it. It did mean, though, that there were some pretty weird twists and turns to the building.

Anyway, while meeting with the librarians, one showed me an index I’d always hoped existed, but had never actually seen. She thought I’d be interested in it, and I certainly was. It was a spiral-bound index to the New York Times, on various special subjects. It was an annual volume, so presumably there were many, many others — hopefully one for every year since 1851, or maybe a bit more recent.  There were tabs to different subjects covered by the index, and one of them, about two-thirds of the way through, was an index to — wait for it — wait for it — ships, mentioned in the NYT. Ah… love at first sight. Truly.

I had looked for such a thing in the past. Well, not really, actually — I’d looked for ships listed in the annual volumes of the NYT Index, but I’d never looked for a separate, supplemental index to the NYT. Could such a thing exist? Sure it could. It’s the NYT, after all. So I was absolutely thrilled to find this. I wrote down as much bibliographic information as I could, so I could find a library that owned such a thing once I got home, and then review every single volume of it, to collect citations for every vessel mentioned in the New York Times.

When I woke up, there was, of course, no such piece of paper next to my bed. So, alas, I still don’t have an index to ships mentioned in the NYT. But if it existed in my dreams, it seems there might be a very, very small chance that it exists in real life, right? If you know of such an index, please, please, please let me know. I’ll be forever in your debt…

The messiest metadata yet…

I’m used to messy metadata (that is, data about data – so in this case, data that describes the contents of the ship register), but today I’ve really hit a snag. I found a very nice resource online that has digitized many years of a useful ship register. But the data describing the data in that register is so bad that I wonder if it’s worth adding to the ShipIndex database at all. In some instances, every second entry is obviously wrong. At this point, I’m up to the “Ae”s (after working through the ship names that started with question marks), and I’ve got a long, long way to go. It didn’t take that long to collect this data, but it’ll take forever to correct it.

What I find so frustrating is that if the compilers of the data had spent, say, just three solid days of work going through this file, they could have corrected tens of thousands of errors before they ever sent it out into the world.

Here are some examples. When you see a series of ship names like

Aéro-Poatale IV
Aeropostale I.
Aéro-Postale I.
Aéro-Postale II.

it’s easy to see that the first one is not “Aéro-Poatale”.

Or when you see the following (the second field is the launch date; the third is the tonnage):

Affaric 1934 239
Affarie 1934 239

you know they’re the same ship, and it’s easy enough to determine that the ship name is Affaric, not Affarie.

Or just below that, the following series of ship names:

Afghanistan 1940
Afghanistan 1917
Afghantstan 1905
Afghauistan 1917
Afghauistan 1917

(The second, fourth, and fifth all describe the same vessel.) If the vessel you’re searching for is the 1905 one, and you use the term “Afghanistan”, you won’t find it via the native interface.
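As a rough illustration of how pairs like these could be surfaced automatically, here is a minimal Python sketch. It groups records that share a year and tonnage and flags names that are nearly identical. The function name, the record layout, and the 0.85 threshold are all assumptions made for this example; this is not how the register's compilers or ShipIndex actually process the file.

```python
from difflib import SequenceMatcher
from itertools import combinations

def likely_duplicates(records, threshold=0.85):
    """Return name pairs that share (year, tonnage) and are nearly identical.

    `records` is a list of (name, year, tonnage) tuples; this layout and the
    threshold are assumptions for illustration only.
    """
    names_by_key = {}
    for name, year, tonnage in records:
        names_by_key.setdefault((year, tonnage), set()).add(name)

    pairs = []
    for names in names_by_key.values():
        for a, b in combinations(sorted(names), 2):
            # ratio() is 1.0 for identical strings and falls as they diverge
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

records = [
    ("Affaric", 1934, 239),
    ("Affarie", 1934, 239),
    ("Afghanistan", 1940, None),  # tonnage is not shown in the listing above
    ("Afghanistan", 1917, None),
    ("Afghantstan", 1905, None),
    ("Afghauistan", 1917, None),
    ("Afghauistan", 1917, None),
]

for pair in likely_duplicates(records):
    print(pair)
```

A check like this only compares entries with the same year and tonnage, so the lone 1905 “Afghantstan” would still need a human eye. It produces a review list, not an automatic correction.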

Here’s another good one a bit further down. Apparently they weren’t sure which way the accent should go.

Agnés 1896 120
Agnès 1896 120

(A quick look at the PDFs they link to shows that the second one is correct.)

Speaking of diacritics, who knows how many ship names are inaccurately represented here because the compilers decided to just ditch the diacritics? Here are three different versions of the same vessel:

Hillev?g 1885 877
Hillevåg 1885 877
Hillevg 1885 877

Many diacritics are replaced with question marks – probably as a result of some hinky encoding issues – but many others are just deleted. When I can find them, I put them back, so that someone who knows a ship’s name will be able to find it, but I’m afraid that’s not going to happen most of the time.
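One common way to make accented names findable regardless of which accent was typed is to compare names on an accent-folded search key. The sketch below uses Python's standard `unicodedata` module; the function name `search_key` is an assumption for this example, and nothing here is meant to describe how either site actually handles diacritics.

```python
import unicodedata

def search_key(name):
    """Fold a name to a lowercase, accent-free key for matching.

    NFKD normalization splits an accented letter into a base letter plus
    combining marks; dropping the combining marks leaves the base letter.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

for name in ["Agnés", "Agnès", "Hillevåg", "Hillevg", "Hillev?g"]:
    print(name, "->", search_key(name))
```

Folding makes “Agnés” and “Agnès” match each other, but it cannot restore a letter that was deleted or turned into a question mark. “Hillevg” and “Hillev?g” still have to be repaired by hand.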

Also, numerous blank spaces are missing from ship names. While this reflects how the data appeared in the original resource, it doesn’t consider how people use the database they’ve created. If I’m searching for the 1922 vessel Pacific Commerce, which appears in the database, how do I know that I should also search for “PacificCommerce”, which will also return a result for the vessel I’m seeking? If I don’t fix entries such as these, they’ll create “ships” in the ShipIndex database with names like “PacificCommerce” or “PacificFir” – and later, I’d have to go back and fix them all. And, of course, the fix is not that difficult – just put in the spaces in the appropriate locations. I may use regular expressions to simplify this work, though that does raise the possibility of adding unintentional errors. (But it’d be worth it; it’d fix far, far, far more errors than it’d introduce.)
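To give a sense of what such a regular expression might look like, here is a small Python sketch that inserts a space wherever a lowercase letter runs straight into an uppercase one. The function name and the “Mc” exception are assumptions for illustration, and the sample name “McKinley” is hypothetical, not taken from the register.

```python
import re

# Zero-width match between a lowercase letter and an uppercase letter.
# The (?<!\bMc) guard skips names that begin a word with "Mc", one obvious
# source of the unintentional errors mentioned above.
JOINED_WORDS = re.compile(r"(?<=[a-z])(?<!\bMc)(?=[A-Z])")

def split_joined_words(name):
    """Insert a space at each lowercase-to-uppercase boundary."""
    return JOINED_WORDS.sub(" ", name)

for name in ["PacificCommerce", "PacificFir", "Pacific Commerce", "McKinley"]:
    print(name, "->", split_joined_words(name))
```

The pattern only knows about unaccented a-z and A-Z, and other prefixes (such as “Mac”) would need their own exceptions, so the output is best reviewed before it is loaded.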

I certainly wouldn’t expect total accuracy in a project like this. In some cases, the originals that were OCR’d were very poor quality microfilm. But what frustrates me is that a quick pass over the spreadsheet, as I’m doing, would identify tens of thousands of these errors.

Problems are not limited to ship names. There are more than four hundred entries whose build dates are well after the issues were published; all of those are clearly wrong. In other cases, when one build date is 1980 and another one, for a ship with the same name and size, is 1930, it’s easy to know that the latter is correct and the former is wrong. Here’s an example:

Alan Seeger 1943 7208
Alan Seeger 1913 7208

A quick look at the entry for the second one confirms that, while one can see why the OCR software thought it said “1913”, a proofer could easily identify the error (as I did, for instance), and correct it to read “1943”.
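Both kinds of date problem lend themselves to simple automated checks. The Python sketch below flags build years that fall after the register issue was published, and ships with the same name and tonnage but different build years. The function names, the tuple layout, the “Example Ship” record, and the issue year of 1945 are all hypothetical, chosen only to make the example runnable.

```python
from collections import defaultdict

def built_after_publication(entries, issue_year):
    """Return entries whose build year is later than the register issue year."""
    return [entry for entry in entries if entry[1] > issue_year]

def conflicting_build_years(entries):
    """Map (name, tonnage) to its build years when more than one is recorded."""
    years = defaultdict(set)
    for name, built, tonnage in entries:
        years[(name, tonnage)].add(built)
    return {key: sorted(vals) for key, vals in years.items() if len(vals) > 1}

entries = [
    ("Alan Seeger", 1943, 7208),
    ("Alan Seeger", 1913, 7208),
    ("Example Ship", 1980, 1200),  # hypothetical record, not from the register
]
ASSUMED_ISSUE_YEAR = 1945  # hypothetical; the real issue year comes from the source

print(built_after_publication(entries, ASSUMED_ISSUE_YEAR))
print(conflicting_build_years(entries))
```

Neither check can say which date is right. They only narrow the file down to the entries a proofer should look at against the scanned page.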

And what concerns me is that if I don’t clean up most of this data now, then it’ll get into the ShipIndex.org database, and make a mess that I’ll have to clean up eventually. But I think it will take me many, many hours to go through this and correct it all – and who knows what I’ll miss that will still make its way into the database. If I import the data, warts and all, then try to go back and correct it later, there will be that much more to clean up.

I’m quite frustrated by this, because it’s so clear to me how much positive impact cleanup would have had on the original database itself. As it is, I’m making it more reliable to search this database through the ShipIndex interface than through its native interface (for example, the person searching for the 1905 Afghanistan would find it through ShipIndex.org, but not through the original site), but it’ll take a long time before I can get the file done and ready to load.

What a shame.

Update on New Content

I’ve added lots of new content to the database in the past few weeks, but I haven’t been good about making a note of that here.

I’ve just finished adding a really significant resource: Warships of the Imperial Japanese Navy, 1869-1945, by Hansgeorg Jentschura, Dieter Jung, and Peter Mickel. This is a 1977 translation of the original work, written in German. It has an enormous amount of information in it, and an extensive index.

There’s also a section titled “Miscellaneous Mercantile Auxiliary Vessels,” which has tons more information, but isn’t included in the index proper. I have, however, added all of the ships mentioned in this section to the ShipIndex.org database. The section has brief information about several thousand vessels, such as the following:

Hinode Maru (Transport): 5256 grt steamer, built 1930; requisitioned 1947; sunk 10 June 1943 north of New Ireland by US submarine Silversides.

In this case, you’ll find the entry Hinode Maru (Transport) in the ShipIndex database.

Working through this index made me curious about the many vessels that were used in attempting to blockade Port Arthur. I didn’t really know anything about Port Arthur, so I did some quick investigating, and found it’s in Manchuria, and the blockade was part of the start of the Russo-Japanese War of 1904-05. You learn all kinds of things doing this stuff!

As one indication of the value of this index, it has added over 3500 completely new vessels to the index. Resources with an Anglo-American focus tend to not add too many new vessels to the index — they usually cite vessels that are already in the index, but this time I’m pleased to be able to extend the coverage of the index quite a bit. To that end, if you know of resources that should be added, especially covering non-US or -UK subjects, please do let me know and I’ll look forward to having an opportunity to add them.

I also added a resource that’s much more relevant to history closer to home: Fiorello La Guardia’s Maritime History of New York, from 1941. Actually, it looks like La Guardia only wrote the introduction, and “sponsored” the publication – it was written by people employed by the Writers’ Program of the Work Projects Administration for New York City. Whereas nearly 75% of the entries from the Jentschura book, above, are new, unduplicated vessels, in this book it’s more like 5%. (It does add the two privateers United We Stand and Divided We Fall, though, which is neat.)

These two titles above were added today. The following were added in the past ten days:

We’ll do a better job of listing resources when they’re added, and we’ll probably also put a “new” note next to these resources in the Resource list for a month or so after we’ve added them.

Again, if you know of resources that should be added, please let me know!

Data correction work at ShipIndex.org

We’ve completed our initial load of a large pile of content into the premium ShipIndex.org database, and now have 1,231,909 references in the database. That’s a lot of content. We do have tons more to add, and it’ll keep coming in over time. Now, however, we’ll turn to the process of cleaning up some of this data.

Having all this data in one place is, I believe, a huge benefit, and well worth the subscription price for the premium database. We make it possible to search through well over a million references, from about 125 resources, in less than a second. The quality of much of this data, however, often leaves something to be desired. And now I’m turning to doing some cleanup, which I believe will be an equally valuable benefit provided by our site.

Data problems come from lots of different sources; some resources include prefixes, such as “USS” or “HMS” in front of vessel names, so many American naval vessels are currently listed on the ‘U’ page, as in this screen shot:

[Screenshot: Many 'USS' listings in ShipIndex.org]

They’re also listed in the proper location, under the name of the vessel, so it means there are several places to look. That’s no good, and we’ll fix that.
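One simple way to approach a fix like this is to strip known prefixes before filing a vessel under its name. The Python sketch below is only an illustration: the function name, the two-item prefix list, and the sample ship names are assumptions, and the real cleanup may well work differently.

```python
import re

# Prefixes to drop when deciding where a vessel is filed. Only the two
# mentioned above are listed; a real list would be longer.
PREFIX = re.compile(r"^(?:USS|HMS)\s+")

def filing_name(listed_name):
    """Return the vessel name without a leading 'USS' or 'HMS'."""
    return PREFIX.sub("", listed_name)

# Illustrative names only, not taken from the screenshot.
for name in ["USS Constitution", "HMS Victory", "Ussuri"]:
    print(name, "->", filing_name(name))
```

Requiring whitespace after the prefix keeps a name such as “Ussuri” from being cut down by mistake.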

Another problem is attempts to save space in 19th century printed directories. One will find many entries with apostrophes in them, like the following:

[Screenshot: Abbreviated entries in ShipIndex.org]

As a subscriber to premium content, you can follow any of these particular links to find that most of these transcriptions accurately reflect what was written in the original publications. But that was done to save space in the printed directory; the stern of the ship certainly read “Duke of Newcastle”, not “D’keof N’wcastle” one year, “D’keofN’wcastle” another year, and “D’ke of N’wc’stle” a third year. (And there is a transcription error, as well: one reads “D’ke of N’woastle” rather than “D’ke of N’wcastle”.) Since no researcher would reasonably think to search for “D’ke”, we’ll work to change all of these to be searchable under “Duke of Newcastle”. (Don’t worry, if you do want to search for “D’ke”, you still can.)
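A space-saving abbreviation like these keeps the letters of the full name in order and simply drops some of them, which suggests a quick test for confirming a proposed expansion. The Python sketch below checks whether the letters of an abbreviated entry appear, in order, within a candidate full name. The function names are assumptions for this example, and this is not a description of the tools ShipIndex actually uses.

```python
def letters(name):
    """Lowercase letters of a name, ignoring spaces and apostrophes."""
    return [ch.lower() for ch in name if ch.isalpha()]

def is_abbreviation_of(abbreviated, full):
    """True if the letters of `abbreviated` occur in order within `full`."""
    remaining = iter(letters(full))
    # `ch in remaining` advances the iterator, so order is enforced
    return all(ch in remaining for ch in letters(abbreviated))

full_name = "Duke of Newcastle"
for entry in ["D’keof N’wcastle", "D’keofN’wcastle",
              "D’ke of N’wc’stle", "D’ke of N’woastle"]:
    print(entry, "->", is_abbreviation_of(entry, full_name))
```

The first three entries pass and the mistranscribed “N’woastle” fails, which is useful: a failure points to a transcription error rather than an abbreviation. The test is permissive, though, so it can confirm a candidate expansion but cannot pick one on its own.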

A third problem is simple transcription errors, and there are many of those, from lots of different sources. In addition to the one noted above, several errors appear in the vessel named “D’le pf Suth’rl’nd”. The original source appears as:

[Screenshot of the entry as it appears in the original source]

So, part of our value-add is correcting these errors. The data quality team at ShipIndex.org has lots of experience with this, since we’ve been doing something similar for the past ten years with magazine titles. (Trust me, they are far more complicated than ship names.)

Of course, with 1.23 million entries, it’ll take us a while to get through the entire database. It’s a fairly slow and meticulous process – though the technology team at ShipIndex has done a great job creating a panoply of tools to simplify the process and speed it up. (The technology team spent much of the past ten years building the tools that the data quality team used when working on magazine titles, so we’ve got it all pretty well covered.) It’ll take time to work through everything, and we’ll definitely be adding more data before we finish this process – meaning it’ll take that much longer – but it will happen. And if you see an error you especially want corrected, please don’t hesitate to let us know.

Thanks for your interest, and have a great holiday season.

A Page A Day – Moby-Dick

I somehow stumbled across an interesting site today, called “One Drawing for Every Page of Moby-Dick”, in which an amateur artist is creating a drawing based on the text of each page of Melville’s Moby-Dick. The overview shows sets of the pages that have been done so far, and the blog provides info on the more recent pages. Each work is done on “found paper” (discarded books, actually) with whatever type of materials the artist chooses. He does about 20-25 pages per month.

Interesting.

Hot Snot! ShipIndex is back in business!

As Doc Hudson says when he takes over as Lightning McQueen’s crew chief in the Piston Cup tie-breaking race, “Hot Snot! We are back in business!”

Over the past ten days or so, the crew at ShipIndex.org had some technical issues that we had to address, but we worked on ‘em, and we solved ‘em. Over the course of today, you’ll see a dramatic increase in the number of references in the index; assuming nothing else goes haywire, there should be over ONE MILLION references in the index by the end of tomorrow. We’re adding content from one major resource, and will be adding content from many other resources, as well, through the course of the next two days.

Keep an eye on the number of entries in the premium database through the course of the day. At the moment, it’s at 713,476, but it’ll be growing rapidly.

Reimporting data over the next few days

We’re doing some more tweaking to the content in ShipIndex.org, and will need to do some reimporting of some data — OK, a lot of data. Initially, a pile of premium data will disappear, but worry not — we’ll add it all, and much, much more, in the next few days. It’ll go in just as quickly as the machine will allow, but there is a huge pile of data. No free data will disappear. Stick with us!

Thanks, Peter