Monthly Archives: October 2011

Why indexing matters

I’m a huge fan of indexes, especially to magazines (aka serials, or journals), and it frustrates me quite a bit when I find useful journals that don’t have indexes to them. Here’s why.

The most important reason, by far, is that an index makes old issues of a magazine useful and accessible. Generally, a person receives and (hopefully) reads a particular issue. After that, the issue is stored, and eventually recycled.

(Or, perhaps, left at the local public library, if it’s not too old. I’m writing this in my local public library, and I have several recent issues of magazines to drop off in the ‘magazine exchange’ area. But the library has an understandable rule that no magazine left here be more than six months old. If that rule weren’t in place, the magazine area would be overrun with decade-old copies of magazines that no one wants, and the library would be left with the work of sorting through and recycling them all.)

When a library receives a magazine, the issue is stored on shelves for a while. In niche areas like maritime history, it will likely be sent to an off-site storage facility eventually, as well. If there’s no guide to what’s in a given issue, then there’s basically no chance of finding anything in it. Consider a library catalog’s entry for, say, American Heritage magazine. The magazine has been published for over 60 years, yet its subject coverage is represented in bibliographic data by about a dozen words, of which a third are in French, and two thirds of the remaining ones are duplicated. The only unique English words are “United States History Civilization Periodicals”. But with hundreds of thousands of pages in those 60 years, there’s an enormous wealth of information, which is why the magazine publishes its own index. Now, all those hundreds of thousands of pages are accessible to anyone with access to the index.

Maritime history publications would do well to make note of this, and to consider how their data is accessed when it’s more than a few issues old. Organizations that publish quality indexes to their resources, and then make that information as available as possible, are to be commended. As one specific example, consider the San Diego Maritime Museum’s publication, Mains’l Haul. Not only do they publish a current index to their journal, they make that publication freely available online. This is vitally important, and should be aggressively emulated by every maritime history organization, regardless of size.

People will be seeking articles from the entire run of Mains’l Haul for decades to come, because the museum takes the time to make an index available to all. It may cost money to do this (though some institutions are able to take advantage of volunteer indexers), but I think it’s easy to see how that money will be returned in spades, and for decades to come. As people discover that past articles mention something of interest to them, publishers of such works can offer reprint services for those articles at reasonable fees, essentially indefinitely.

If a researcher doesn’t know that a person or a vessel is mentioned in a past article, they will not put that publication to use, and that’s a loss to the publisher, to the article’s author – whose work would be useful but won’t be found – and to history in general.

I’d like to make two additional comments:

First, don’t rely on a commercial abstracting and indexing service to do this for you. While it’s great to get one’s content indexed in large databases, such services will provide, at best, only a cursory summary of each article. That will not be sufficient for someone seeking a person, ship, or location that’s mentioned in, but not central to, a given article.

Second, a listing of the articles in an issue is NOT an index. (I’m looking at you.) It’s a list of article titles, and nothing more. While I suppose it’s better than nothing, it misses infinite opportunities to guide researchers to the incredible wealth of information that’s contained in a quality scholarly publication.

Please, magazine publishers: index, Index, INDEX! And if you’re really forward-thinking, make the index available for free, to anyone. Put it online as a PDF, as a searchable database, and as a text file that anyone can download and use elsewhere. What you lose in the cost of creating and distributing the index, you’ll more than make up in revenue from providing reprints and back issues, and (perhaps more importantly) in promoting and displaying the importance, value, and reputation of the journal in question.

The death of the semantic web

I came across some interesting notes while going through old emails the other day. A message from NISO, the National Information Standards Organization, reported that the semantic web is dead, citing a post on semantico. The semantic web is a concept of presenting data in a structured format, usually as ‘triples’ (I am, absolutely, not an expert – or even that knowledgeable – on this stuff, so don’t quote me too far), so a computer can better understand what each term means.

For example, when a computer sees the word “Magellan”, it just sees a word. It doesn’t know if the word refers to an explorer, to a spacecraft, to a mutual fund, a “progressive metal/rock” band, or something else. By defining, through triples, what one means, the computer can realize that one page is talking about the explorer while another is talking about a mutual fund company.
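To make the idea of triples a little more concrete, here is a minimal Python sketch of my own. It is not a real RDF toolkit, and every identifier in it (the `ex:` names, the `types_for_name` function) is invented for illustration. Each fact is a (subject, predicate, object) statement, and the bare word “Magellan” becomes distinguishable once each subject is also given a type.

```python
# Each fact is one (subject, predicate, object) triple.
# All identifiers below are made up for illustration only.
TRIPLES = [
    ("ex:Magellan_explorer", "rdf:type", "ex:Person"),
    ("ex:Magellan_explorer", "ex:name", "Magellan"),
    ("ex:Magellan_fund", "rdf:type", "ex:MutualFund"),
    ("ex:Magellan_fund", "ex:name", "Magellan"),
    ("ex:Magellan_probe", "rdf:type", "ex:Spacecraft"),
    ("ex:Magellan_probe", "ex:name", "Magellan"),
]


def types_for_name(name, triples=TRIPLES):
    """Return {subject: type} for every subject that carries the given name."""
    # First find every subject whose name matches the bare word.
    subjects = {s for s, p, o in triples if p == "ex:name" and o == name}
    # Then look up what kind of thing each of those subjects is.
    return {s: o for s, p, o in triples if p == "rdf:type" and s in subjects}


if __name__ == "__main__":
    for subject, kind in sorted(types_for_name("Magellan").items()):
        print(subject, "is a", kind)
```

The same word maps to three different subjects, and the type statement is what tells them apart.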

Such semantic definitions have been used extensively in some subject areas, but not at all in most. One of the great challenges has been reconciling them at the level of the “upper ontology”, that is, the layer that connects concepts in zoology with concepts in art history with concepts in electrical engineering with concepts in maritime history, etc. One field may work hard to define its ontology, but if that schema doesn’t mesh with other ontologies, then the systems aren’t really connected.

So I was interested to read of the effective death of the semantic web, and its replacement by schema.org. Schema.org is a nascent project being put together by representatives from the search teams at Google, Yahoo, and Microsoft’s Bing. It uses HTML microdata attributes (itemscope, itemtype, and itemprop), added to a page’s markup, to define what something is. This is done for the benefit of search engines – so a “Magellan” that is marked with the tags

  <div itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Ferdinand Magellan</span>
  </div>

is clearly a person, while the Magellan that’s tagged

  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Fidelity Magellan Fund</span>
  </div>

is something you can buy. (Note the differences in the end of each first line; the first is “/Person”, and the second is “/Product”.)

(Also: I defined the Magellan Fund as a ‘product’, because one can buy a share of it, but it might more appropriately be an ‘organization’, since there is a ticker symbol associated with it, and schema.org currently has a “tickerSymbol” attribute for Organizations.)
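As a rough illustration of how a program might read this kind of markup, here is a small Python sketch using only the standard library’s html.parser module. The class and function names (`MicrodataNames`, `extract`) are my own, and it handles only flat, non-nested items like the two above. It is certainly not how a real search engine processes pages.

```python
from html.parser import HTMLParser


class MicrodataNames(HTMLParser):
    """Collect (itemtype, name) pairs from simple, non-nested microdata."""

    def __init__(self):
        super().__init__()
        self.current_type = None  # itemtype of the most recent itemscope
        self.in_name = False      # True while inside an itemprop="name" element
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A valueless attribute such as itemscope appears with value None.
        if "itemscope" in attrs:
            self.current_type = attrs.get("itemtype")
        if attrs.get("itemprop") == "name":
            self.in_name = True

    def handle_data(self, data):
        if self.in_name and data.strip():
            self.items.append((self.current_type, data.strip()))

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current name.
        self.in_name = False


def extract(html):
    """Return a list of (itemtype, name) pairs found in the HTML string."""
    parser = MicrodataNames()
    parser.feed(html)
    parser.close()
    return parser.items


page = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Ferdinand Magellan</span>
</div>
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Fidelity Magellan Fund</span>
</div>
"""

if __name__ == "__main__":
    for item_type, name in extract(page):
        print(name, "->", item_type)
```

The point is only that the type travels with the name, so software no longer has to guess which Magellan is meant.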

The current schema.org structure is quite limited, and focuses primarily on people, organizations (especially local businesses), creative works, events, and locations. But it’s certainly extensible, and – if it’s generally adopted, as triples were not – it will clearly expand to other fields.

I’d love to take on extending it to vessels. It’d be pretty easy for us to modify our HTML to include this microdata markup, and if that helps people find the information they’re seeking, then all the better for all involved. But I’m not sure what the proper levels should be. One doesn’t want to have too many levels in a structure like this, but I think that going straight from “Thing” to “Vessel” might be a bit of a jump. I imagine an intermediate step of, perhaps, “Vehicle”, would be appropriate. Then those with interest in cars, trains, airplanes, bicycles, scooters and lots more, would build out their schemas, while we could start a layout of sailing vessels.

It seems simple, but immediately becomes fairly complex. You could, for instance, split up “Vessel” entries into “HumanPowered”, “WindPowered”, and “MechanicallyPowered”, perhaps, then divide by vessel type – canoe, kayak, paddleboat; sloop, ketch, yawl, schooner, brig, brigantine, barkentine, ship, bark, hermaphrodite brig; paddlewheel steamer, ferryboat, fishing boat, battleship, ocean liner; etc., etc. Is that too much differentiation? How do you define a vessel that’s been re-rigged, from a ship to a bark, for example? How, even, do you make it clear that when you’re talking about a ‘ship,’ you’re talking about a three-masted vessel with square sails on the aftermost mast, rather than something that floats and is bigger than a boat?
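Here is one way the layering might be sketched in Python, purely as a thought experiment. None of these type names exist in schema.org; they are the hypothetical ones floated above. The vessel history at the end is invented as well. For the re-rigging question, the sketch assumes the rig is recorded as a dated property of the vessel rather than baked into its type, which is only one of several possible answers.

```python
# Hypothetical hierarchy, child -> parent. None of these are real schema.org types.
PARENT = {
    "Vehicle": "Thing",
    "Vessel": "Vehicle",
    "HumanPowered": "Vessel",
    "WindPowered": "Vessel",
    "MechanicallyPowered": "Vessel",
    "Canoe": "HumanPowered",
    "Ship": "WindPowered",
    "Bark": "WindPowered",
    "Schooner": "WindPowered",
    "PaddlewheelSteamer": "MechanicallyPowered",
}


def ancestors(type_name):
    """Walk from a type up to the root of the hierarchy."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def rig_in_year(rig_history, year):
    """Return the rig in effect in a given year.

    rig_history is a list of (first_year, rig) pairs sorted by year.
    Returns None if the year is earlier than the first entry.
    """
    current = None
    for first_year, rig in rig_history:
        if first_year <= year:
            current = rig
    return current


# An invented vessel: launched as a ship, later re-rigged as a bark.
example_history = [(1880, "Ship"), (1895, "Bark")]

if __name__ == "__main__":
    print(" > ".join(reversed(ancestors("Bark"))))
    print(rig_in_year(example_history, 1890))
```

Even this toy version shows the trade-off: four levels sit between “Thing” and “Bark”, and the rig still has to be tracked over time.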

Lots of other terms could be added or defined over time. When the computer can understand what the term means, rather than just presenting the term to the world, it will make it much easier for individuals to draw understanding and make connections from within large bodies of marked-up data.

It would appear that this system, because it’s fairly easily applied, has a much better chance of success than did the original ‘triples’ approach. I look forward to watching it with interest.