The Semantic Web

August 2009: How Google beat Amazon and Ebay to the Semantic Web

A work of fiction. A Semantic Web scenario. A short feature from a business magazine published in 2009.
A Response to Clay Shirky's “The Semantic Web, Syllogism, and Worldview”

Clay Shirky, a well-regarded thinker on the social and economic effects of Internet technologies, has published an essay called “The Semantic Web, Syllogism, and Worldview,” a critical appraisal of the Semantic Web which claims, in essence, that the Semantic Web is a technological pipe dream: an over-specified solution in search of a problem. As someone who has spent long hours attempting to fathom the standards which define the Semantic Web, I can empathize with Shirky's frustration, particularly his frustration with the loftier of the Semantic Web evangelists' claims. That said, I believe that there is much of value in the Semantic Web framework which can be applied to real-world problems, and I find many of Shirky's arguments to be misguided attacks against Semantic Web straw men.

After summarizing several claims made concerning the Semantic Web by various parties, Shirky defines the Semantic Web as “a machine for creating syllogisms.” This is an over-simplification. The Semantic Web cannot “create,” any more than the current Web can create. Humans create data, and computer programs may process that data in order to create new data, but to assign agency to the Semantic Web is a mistake. Neither is the Semantic Web associated with any pre-defined process, so it is false to call it a “machine.” Most notably, the means whereby nearly all automated reasoning is accomplished on the Semantic Web is not syllogistic reasoning, which has hardly been used since Descartes. In the case of languages based on first-order predicate logic, the method used is usually resolution reasoning (as in Prolog), and in the case of description logics, like OWL, the means is tableau reasoning.

In his opening paragraph, Shirky links to, and subsequently dismisses, the World Wide Web Consortium's definition of the Semantic Web. That definition, in full, is:

The Semantic Web is the representation of data on the World Wide Web. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

I think the phrase “representation of data on the World Wide Web” is confusing, and the focus on XML puts the emphasis on specific syntax, which is misleading. A simpler way to say it might be: the Semantic Web is a framework that rigidly defines a means for creating statements of the form “Subject, Predicate, Object,” or “triples,” in a machine-readable format, where each of Subject, Predicate, and Object is a URI. The means of “rigid definition” is a series of standards published by the World Wide Web Consortium, namely RDF, RDF Schema, and OWL.

Shirky writes, “Despite their appealing simplicity, syllogisms don't work well in the real world, because most of the data we use is not amenable to such effortless recombination. As a result, the Semantic Web will not be very useful either.” It is true that very few of us, before we kiss our lovers, or mothers, say things like “I only like to kiss living women; this woman is alive; therefore I shall kiss her.” Much of life is lived in a place that is not easily captured by first-order predicate logic.
But logical reasoning does work well in the real world—it's just not identified as such, because it often appears in mundane places, like library card catalogs and book indices, and because we've been trained to draw certain conclusions automatically from signifiers which do not much resemble the (S,P,O) form. Let's say you're reading the book Defenders of the Truth, sociologist Ullica Segerstråle's intellectual history of the debate over sociobiology. Interested in finding what the book has to say about Steven Jay Gould, you turn to the index, and find:

Gould, S.J.
  and adaptationism 117-18
  and Darwinian Fundamentalists 328
  and Dawkins 129-31
  Ever Since Darwin (1978) 118
  and IQ testing 229-31
  Marxism 195, 226
  unit of selection dispute 129

(p. 482, example much abridged)

If you are an experienced reader with some knowledge of the field of sociobiology, you can make a variety of deductions using the index. Take the third item, “and Dawkins 129-31.” Looking at this statement, and drawing on your memory, you could deduce:

Subject | Predicate | Object
Dawkins | Is a synonym for | Richard Dawkins
Steven Jay Gould | Interacted in some way with | Richard Dawkins
Information on Steven Jay Gould's interactions with Richard Dawkins | Can be found on | Pages 129, 130, and 131
Pages 129, 130, and 131 | Are found in | The book Defenders of the Truth

And so forth. Internally, you wouldn't go to so much effort; if you had to think using predicate logic, it'd be hard to get out of the house in the mornings. As Shirky writes:

When we have to make a decision based on this information, we guess, extrapolate, intuit, we do what we did last time, we do what we think our friends would do or what Jesus or Joan Jett would have done, we do all of those things and more, but we almost never use actual deductive logic.

And this is true, if you say, as Shirky seems to, that “deductive logic” is the conscious explication of logical facts which lead, via syllogistic reasoning, to a logically valid conclusion. But come back to our book index: good indexes are the product of quite a bit of craft and expertise, and the result of quite a bit of logical thinking. Professional indexers take a long block of narrative text—a book—identify the subjects the text describes (Steven Jay Gould, Richard Dawkins), formalize the names for those topics (Gould, S.J.; Dawkins, R.), and then specify narrower subjects which relate to each subject, cross-indexing items where they feel it will be valuable to the inquisitive reader. As such a reader, the process whereby I decide which page to turn to is a very logical one. When I look up Steven Jay Gould, the index returns a list of sub-topics related to that subject. I then choose none, one, or several of those sub-topics, seeking those which are relevant to my interests, and turn to the page which corresponds to the given sub-topic. The book presents data using a formal semantics—in this case the semantics of index structure and typography, much of which relies on alphabetization. I establish a goal, search within the index's data set, refine my goal based on my preliminary search results, and then obtain my result in the form of a page number. I perform these acts in a given sequence, a sequence to which I've become accustomed over time. Indexes only work because human beings are comfortable with the logical conventions of a book. We are taught how to use books in elementary school, learning about alphabetization, indices, and how to find words in a dictionary.
As we advance as readers, we come to understand concepts like sections, headers, page numbers, and topics. None of these things has any meaning unto itself; rather, we learn to interpret their semantics. The large, bold type that appears after a blank page means we have come into a new chapter. The little number after a line corresponds to a footnote. The text bounded by squiggles is a quotation. Indexes are a kind of taxonomy, a classification of the ideas in the book into topics and subtopics, and indexes work because readers can be counted on to understand their layout and function, to interpret the symbols contained in an index to mean “if I am interested in Steven Jay Gould's ideas regarding adaptationism, I should turn to page 117.” In order to function, they are very dependent on a human being's ability to perform acts of reasoning. They lend themselves to deduction: from reading the index above, I can deduce that Steven Jay Gould had an opinion on adaptationism, IQ testing, the unit of selection dispute, and Marxism, had some kind of relationship with Dawkins and/or Dawkins's ideas regarding sociobiology, and is somehow involved with the book Ever Since Darwin, which was published in 1978.

Imagine that 30 or 40 books about sociobiology are available on the web. Since these books cover similar topic matter, it would be ideal if they could all be indexed together: that is, rather than have multiple indices spread across dozens of books, why not have a single index for all of them? For a serious researcher, this would be extremely useful. And in fact, indices of periodical literature, a staple of the library orientation required of most college freshmen, perform this exact function: they take a large number of the journals and magazines published over time and create a master index, so that, were you looking for information on Steven Jay Gould, you could look up his name and find all of the articles that discuss him or his work that were published during the span of time covered by the index. Those multi-volume meta-indices have real precedents: prior to the age of the CD-ROM and the Internet, the best way to distribute such indices of periodical literature was to issue annual volumes, and the serious scholar would go year by year, looking up her topic. Occasionally, expensive volumes appear with annotated bibliographies covering a single topic—I once helped a librarian at my alma mater format his bibliography on Iceland, which described hundreds of books on that country, organized into chapters that covered history, geography, political systems, and so on. But now, given the ubiquity of computers and the ease with which large databases can be created, such works are increasingly being created and distributed digitally. No longer is the index of periodical literature divided, by the constraints of book technology, by year; rather, a search returns results for perhaps 100 years, organized by date.

To put this data into a computer, developers had to translate the conventions of typography into a set of semantic boundaries that corresponded to the meaning of the datum in question. It's not enough to simply italicize a book title and assume that the computer can make sense of it. A computer is an unsubtle beast, and needs help knowing how to sort data so that when it is asked a question, it can make sense of it and answer in a meaningful way. So you create a BOOK database with fields like TITLE, AUTHOR, PUBLICATION_DATE, SUBJECT, and so forth, and put your book data in there.
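To make that concrete, here is a minimal sketch of such a BOOK database, using Python's built-in sqlite3 module; the table, columns, and rows are invented for illustration, and the query at the end is the sort of question described in the next paragraph.

import sqlite3

# A toy BOOK table along the lines described above; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE book (
        title            TEXT,
        author           TEXT,
        publication_date TEXT,
        subject          TEXT
    )
""")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?, ?)",
    [
        ("An Annotated Bibliography of Iceland", "John Smith", "1993", "Iceland"),
        ("Ulysses", "James Joyce", "1922", "Fiction"),
    ],
)

# "Show me all the books that have the subject of Iceland."
for (title,) in conn.execute("SELECT title FROM book WHERE subject = ?", ("Iceland",)):
    print(title)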
Now you can state a question, specified in a database query language such as SQL, that asks, in effect: “show me all the books that have the subject of ‘Iceland,’” and the computer will produce a list of matches as a result. What do you put in the AUTHOR field, though? If you have an author named “John Smith,” you'll run into a problem: there is likely to be more than one author named John Smith. So instead, you make a totally different table called “AUTHOR” and you put John Smith in there, and the system creates a unique identifier for him, usually a number like “103.” Now, in the AUTHOR field of the BOOK database, you insert that number instead of his name. Then, when you have another John Smith to add, you create a new record for him, receive another number, and use that in the BOOK database. Now, when you're looking at the record for the book about Iceland written by John Smith, and you say “show me all the books by the author who wrote this book,” the system doesn't simply go out and get any book written by any John Smith—it only gets those written by the John Smith who wrote the book on Iceland. While you're at it, you should create a table called SUBJECT, and give Iceland a special ID, because if suddenly there's a pop group called Iceland, and books are written about that pop group, you'll have the same problem you had with John Smith. This is, more or less, how relational databases work, and if you work out from what I've described, adding many layers of complexity, you'll arrive at web sites like Amazon.com or Ebay.com, which store all of their information this way.

These database-backed sites operate on the principle that human beings are capable of logical reasoning, and they use basic web links to express logical relationships between different resources (namely, web pages). When you go to Amazon, you are presented with a search box. Enter the word Ulysses into the search box, and you will see 8590 results. Too many. However, you know that Ulysses is a book, so you narrow your search to “Books,” then search for Ulysses, and see a much more manageable list of three items. If you click on the first result, you're taken to a web page that tells you that Ulysses is by James Joyce. Clicking on the link to Joyce gives you another list, of the books written by him. In this sequence of events, you've made quite a few assumptions which are logical in nature:

Searching for “Ulysses” returns 8590 results, which is too many, and a limited search will return fewer results:
  Amazon allows me to limit my searches to Books.
  Ulysses is a book.
  Therefore, I should limit my search to books.

I would like to know more about James Joyce:
  The Ulysses page contains the text “by James Joyce.”
  Links to authors on Amazon provide a list of the works written by that author.
  Therefore, if I click on the link I will see a list of the works written by James Joyce.

I had to learn these processes when I first started using Amazon, just as I had to learn how to use book indices. I trusted that Amazon was arranged according to some logical principles, and learned the semantics of its different links. If I clicked on James Joyce and was taken to a page on NASCAR racing, it would be surprising, and illogical. Taking this a little further, I think that many links, search boxes, and other interface elements on the Web have semantics—that is, the text of the link and the context in which it appears indicate the sort of resource to which it links.
The semantics of these links are quite arbitrary and vary from site to site. One link to James Joyce on a site might show me a list of the books he wrote, but the same link on a different site might show me his biography. On eBay the same link will take me to James Joyce-related items up for auction. As an Amazon user, I have come to understand that a link to an author shows me a listing of that author's works. That's fine if I want to find a list of works by James Joyce. Amazon sells books, and I understand that. I would be surprised if it offered me blow-up James Joyce dolls when I clicked on a James Joyce author link. But even in the domain of books, this is narrow: authors can always be subjects, so if you wanted to find the biography of James Joyce, you'd have to back up and search in a different way. In the system defined by Amazon, subjects exist in a different database table from authors, and the twain do not meet. But there's no reason why this should be. It's simply not part of Amazon's design, but it's totally feasible if you stop thinking in terms of relational databases. Instead of just a listing of books, the James Joyce page could present a list of:

  Books by James Joyce
  Books about James Joyce
  Books about the books of James Joyce

If it went deeper, I might see:

  Books that were influenced by James Joyce
  Writers who worked with or knew James Joyce
  Books that cover the same subject matter as James Joyce

It doesn't do this because the James Joyce ID is defined purely in terms of the Author—in the model of the world that Amazon used to build its database:

Subject | Predicate | Object
James Joyce | Is an | Author
James Joyce | Is a | Subject

But both of those James Joyces are completely different things, as far as the computer is concerned. If you've built relational databases, you'll understand how this happens: in general, unique identifiers are only unique within their own table. Every subject is unique, and every author is unique. Shirky, addressing this problem space, says that the Semantic Web does not offer an answer:

Is your “Person Name = John Smith” the same person as my “Name = John Q. Smith”? Who knows? Not the Semantic Web. The processor could “think” about this til the silicon smokes without arriving at an answer.

But the Semantic Web is designed to address this exact issue. So how would you make a complete, interlinked, data-rich James Joyce page like the one I described above? The answer lies in creating a unique, independent identifier for James Joyce. You can't use numerical unique IDs, because my #205 might be a lampshade, and yours might be Genghis Khan. So you use URIs, the addresses that allow us to point to different resources on the Web. URIs give us namespaces—they give us a way to be very specific so that our chickens and our Genghis Khans don't get mixed up. For Joyce, you can create a URI like so:

  http://amazon.com/authors#JamesJoyce

And if there were another James Joyce, you might call him:

  http://amazon.com/authors#JamesJoyce_02

That URI doesn't mean anything unto itself. Like our numeric IDs, it's just a convenient way to say “this is a unique thing, even if it is described by the same words as another thing.” James Joyce is no longer a single datum inside of a database of authors, and another datum in a database of subjects; rather, he is a free-floating, unique identifier that exists outside of any specific database, called “http://amazon.com/authors#JamesJoyce” (let's call that #JamesJoyce for short).
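Here's a minimal sketch of that idea, assuming Python and the rdflib library; the namespaces and the hasAuthor/hasSubject property names are invented for the example and have nothing to do with anything Amazon actually publishes:

from rdflib import Graph, Namespace

# Made-up namespaces echoing the example URIs above; none of these are real Amazon vocabularies.
AUTHORS = Namespace("http://amazon.com/authors#")
BOOKS = Namespace("http://amazon.com/books#")
REL = Namespace("http://amazon.com/relations#")

g = Graph()

# One identifier for Joyce, usable whether he is playing the role of author or of subject.
joyce = AUTHORS.JamesJoyce

g.add((BOOKS.Ulysses, REL.hasAuthor, joyce))      # a book by James Joyce
g.add((BOOKS.JamesJoyce, REL.hasSubject, joyce))  # a book about James Joyce

# Everything we know about #JamesJoyce, in whatever role.
for s, p, o in g:
    if joyce in (s, o):
        print(s, p, o)

The point isn't the particular library; it's that the same URI can sit in any position of any statement, so the author-Joyce and the subject-Joyce are no longer strangers to each other.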
Now take that a step further, and let's say that Ulysses has the unique identifier:

  http://amazon.com/books#Ulysses

and Richard Ellmann's biography of Joyce, James Joyce, is:

  http://amazon.com/books#JamesJoyce

Now our database, instead of a table, is a set of logical statements like so:

Subject | Predicate | Object
/books#Ulysses | Has author | /authors#JamesJoyce
/books#Dubliners | Has author | /authors#JamesJoyce
/books#JamesJoyce | Has subject | /authors#JamesJoyce
/books#JamesJoyce | Has author | /authors#RichardEllmann

And so on, for quite a while. Now, when we want to build our James Joyce page, instead of saying “show me all the books written by James Joyce,” our query is something like:

  Make a list of all the triples where James Joyce is an object or a subject, and sort them by predicate. Then, taking each predicate in turn, perform an operation that displays something useful to the user. If you need to, go back to the database and get information on the subjects or objects, as the case may be, in the triple.

That's a lot, but it's a lot easier to ask that question of the computer in practice, using a standard RDF query language. And in practice, it can lead to some pleasant results. Take, for example, the page on this site about the Chinese Room thought experiment. Scroll down a bit, to where it says “Links Related To The Chinese Room Thought Experiment,” and take a look at everything after that. None of that content is actually part of the piece. It is culled from the small database of facts that is automatically derived from Ftrain. The links come from different parts of the site, and are automatically pulled in and sorted by date. The text at the end of the piece is created by traversing another set of triples. And after that, the list of semantic relationships is created from the same set of triples: the source, author, related subjects, place in hierarchy, and so forth, are all pulled out of a Subject-Predicate-Object database. If, on that page, under the link “The Turing Test Doesn't Work,” you click on the link to Nova Spivack, you'll be taken to the Nova Spivack page. As you can see, the text under “The Turing Test Doesn't Work” is now highlighted, because that was the link you clicked to get to this page. If you scroll down a bit more, you'll see that Nova Spivack is a human being. Clicking on that link brings you to a list of all the human beings on the site, organized by subcategory. I've become very fond of that sort of inter-linking. I'm still figuring out what to do with it, but I think it's worth pursuing.

So when Shirky quotes a particularly perplexing syllogism and says:

[This syllogism] illustrates the kind of world we would have to live in for this form of reasoning to work, a world where language is merely math done with words.

I disagree entirely. I am a writer by avocation and trade, and I am finding real pleasure in using Semantic Web technologies to mark up my ideas, creating pages that link together. What I do is not math done with words. It's links done with semantics, and it forces me to think in new ways about the things I'm writing.

Let me give you another example, so that you might get a better sense of the power of the system. Let's say that every week, I published a summary of the week's news. But instead of just writing the news up, sentence after sentence, I marked up the news as a series of events, with the times they occurred, and linked the different events to the topics they discussed. Here's a faked-up paragraph, from November 14, 2000 (or so):
The results of the election between George W. Bush and Al Gore remained uncertain. It appeared that the presence of Ralph Nader in the electoral race cost Al Gore a clear majority. George W. Bush began to refer to Laura Bush as “First Lady Bush.”

Now, you can't see this, but each one of those three sentences is marked up as an “Event.” If you click on the link to Al Gore, it will show you a timeline-sorted list of the two events that relate to Gore, and if you click on the link to George Bush, it'll do the same. Working out from here, it's easy to imagine how you could take all of your weekly reports, build one master database, and publish it like “The Timelines of History.” You could issue queries like “show me all the events that involve ‘George Bush’ and ‘Iraq.’” Doing this the old-fashioned way, with a relational database, is a true pain. Rolling the database yourself, like I did, is very difficult and no fun at all, and writing it in a language like XSLT, which is also something I did, is about as dumb as it gets. Nope, having tried everything, I'm increasingly of the opinion that the right way to do it, as far as I can see, is to turn your events into Semantic Web-friendly RDF statements, store them in an RDF database, and query them there. When you've got a big pile of semantically tagged, interlinked data as a nail, the Semantic Web framework is the best hammer around.

Just having a bunch of linked events isn't the answer. Those links need some place to point to. So we create a sort of meta-index, which doesn't just contain subjects and sub-topics, but also defines relationships between them (which you can do using triples). As opposed to the events with their links, this meta-index, or ontology, is far more formally specified. It has relationships like:

Subject | Predicate | Object
George W. Bush | Is president of | the United States (from 2001-?)
The Middle East | Contains | Iraq, Iran, Israel, etc.

You might also add facts about Bill Clinton, George Bush, Sr., and other presidents, and you also need to tell the system that the “Contains” predicate means that if something relates to the thing that is contained, then it also relates to the container—for instance, if there is war in Israel, then there is war in the Middle East. Given a database of such facts, you might want to ask a question like: “show me all the events that involve a president of the United States and the Middle East.” Because you have an ontology, your system can reason something like what follows:

  The set of things contained by the Middle East includes Iraq, Iran, Israel, and so forth.
  The set of presidents includes George W. Bush, Bill Clinton, and so forth.
  Therefore every event which refers to at least one of the set of presidents, and to at least one country contained by the Middle East, can be said to answer our question.

The system then goes ahead and finds all of those matches, via the dark arts of resolution reasoning, applied graph theory, and set theory, and returns a big list of events as a result. What you do with these events is up to you: you might organize them in a timeline, for instance. As above, doing this with relational databases is a pain. A storage layer that understands logic is far preferable.
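To make that reasoning concrete, here is a minimal sketch, again assuming Python and the rdflib library; every URI, property name, and event in it is invented for the example, and a real system would lean on a reasoner rather than writing the containment into the query:

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

# Made-up vocabulary and topics for the example; none of these URIs exist anywhere real.
EX = Namespace("http://example.org/terms#")
T = Namespace("http://example.org/topics#")
E = Namespace("http://example.org/events#")

g = Graph()

# A tiny ontology: who counts as a president, and what the Middle East contains.
g.add((T.GeorgeWBush, RDF.type, EX.President))
g.add((T.BillClinton, RDF.type, EX.President))
g.add((T.MiddleEast, EX.contains, T.Iraq))
g.add((T.MiddleEast, EX.contains, T.Israel))

# Two marked-up events, each linked to the topics it involves.
g.add((E.e1, EX.involves, T.GeorgeWBush))
g.add((E.e1, EX.involves, T.Iraq))
g.add((E.e1, EX.description, Literal("The president addressed the nation on Iraq.")))
g.add((E.e2, EX.involves, T.GeorgeWBush))
g.add((E.e2, EX.involves, T.LauraBush))
g.add((E.e2, EX.description, Literal("Bush began to refer to Laura Bush as First Lady Bush.")))

# "Show me all the events that involve a president of the United States and the Middle East."
query = """
SELECT ?desc WHERE {
    ?event ex:involves ?person .
    ?person a ex:President .
    ?event ex:involves ?place .
    topics:MiddleEast ex:contains ?place .
    ?event ex:description ?desc .
}
"""
for row in g.query(query, initNs={"ex": EX, "topics": T}):
    print(row.desc)  # only the Iraq event matches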
Some of this sort of linking and programming is easy, but a good bit of it is mind-bending. I find it hard to code up my own ontologies. A geography ontology is a good example: I don't want to encode every region, country, state, prefecture, and so forth. I'd much rather take someone else's ontology of all the countries, states, and so forth (like the one my pal Jack Rusher is using on his web site). So what I'll do instead, when I want that view of the world, is get it from Jack, or from where Jack got it, in RDF format, drop it into my triple store, and address my links to the unique URIs specified within that small geographic ontology. What's good about that is that now, if we want to, Jack and I can take all of our web pages, spit them out in RDF, and merge them together. All the attributions stay the same, but if we've both written something about Italy, the Italy page will contain links to each of our pieces. It can do this because we shared the ontology of nations.

So when Shirky says:

No one who has ever dealt with merging databases would use the word 'simply'. If making a thesaurus of field names were all there was to it, there would be no need for the Semantic Web; this process would work today.

It's hard to see where he's coming from. That is the point of the Semantic Web, and merging RDF databases is not as easy as, say, drinking chilled white wine in the summertime, but it's definitely not as hard as unifying multiple relational databases. The OWL language, which allows you to define ontologies, has all manner of trickery for saying which URIs are synonyms of each other, and how they relate. So what you can do, if you choose, is merge your databases, and then write up a series of OWL statements explaining how the different databases relate. Then an OWL-aware system (of which there are admittedly remarkably few, but more are on the way) glues the databases together for you. What the Semantic Web framework does is admit that it is really hard to unify databases, and give you a language for unifying them that doesn't require you to muck around too much in the details. You can focus on the semantics, not the actual formatting of the data, and approach the problem quite strategically. This situation is much more friendly than the one Shirky describes:

But [the meta-data we generate] is being designed a bit at a time, out of self-interest and without regard for global ontology. It is also being adopted piecemeal, and it will bring with it all the incompatibilities and complexities that implies.

This assumes that people would prefer to endlessly re-create ontologies that describe well-known subjects, like geography, or authors, or books. Personally, I like my way better: if there's a nice-looking ontology about geography, and I can get my hands on it, I'll just plug that thing into my site and start using it.
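Here is a minimal sketch of that kind of merge, assuming Python and rdflib; the two namespaces, the mentionedIn property, and Jack's hypothetical Italia URI are all invented for the example, and an actually OWL-aware store would follow the sameAs link on its own:

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL

# Two made-up vocabularies standing in for two sites' geographic ontologies.
MINE = Namespace("http://example.org/paul/places#")
JACK = Namespace("http://example.org/jack/places#")
EX = Namespace("http://example.org/terms#")

mine = Graph()
mine.add((MINE.Italy, EX.mentionedIn, Literal("A piece about Rome")))

jacks = Graph()
jacks.add((JACK.Italia, EX.mentionedIn, Literal("A piece about Venice")))

# Merging RDF databases is just pouring the triples into one graph...
merged = Graph()
merged += mine
merged += jacks

# ...and OWL gives us a way to state that two URIs name the same thing.
merged.add((MINE.Italy, OWL.sameAs, JACK.Italia))

# An OWL-aware store would now treat the two Italys as one node. Lacking one,
# we can follow the sameAs link by hand to build the combined Italy page.
italys = {MINE.Italy, JACK.Italia}
for s, p, o in merged.triples((None, EX.mentionedIn, None)):
    if s in italys:
        print(o)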
There are many other points in Shirky's essay that I disagree with, and I originally set out to refute them point by point, but essentially, I disagree with every one of his major conclusions, and find them to be based on an incomplete understanding of what the Semantic Web is and how its researchers work. If you search Citeseer for papers on RDF, the Semantic Web, and related technologies, you'll find a wide variety of prior art that addresses many of the issues he discusses, and you'll also find that the Semantic Web community is nowhere near as ignorant of the problems he describes as he suggests. Quite a bit of work has been done on trust metrics, semantic disambiguation, ontology exchange, triple storage, and query semantics. Some of it is doubtless going down the wrong path, but some is equally likely to prove worthwhile.

Fifty years of AI research have not given us a computer that thinks, but it hasn't been wasted time, either. Neural nets, Bayesian algorithms, and the other fancy stuff that is trapping spam, girding up search engines, and performing other useful tasks are a direct result of the long years of research into AI. The Semantic Web is a classic AI project, but on a much larger, less predictable scale than ever before. By sneering at a few researchers, Shirky maligns the patient, methodical work of hundreds of others. For every quote he presents that shows the Semantic Web community as glassy-eyed, out-of-touch individuals suffering from “cluelessness,” I could give a list of many other individuals doing work that is relevant to real-world issues, who have pinned their successful careers on the concepts of the Semantic Web, sometimes because they feel it is going to be the next big thing, but also because of sheer intellectual excitement. The work being done at the UMD Mindswap lab, which employs two friends of mine, Bijan Parsia (well, Bijan is more of a likeable nemesis, but anyway) and Kendall Clark, whom I can personally vouch for as down-to-earth individuals keenly aware of the limits of computing, is definitely worth noting. Companies like Radar Networks are working on a truly usable Semantic Web platform. Sensible individuals see the Semantic Web as an enabling technology for all manner of applications. Individuals like Edd Dumbill of XML.com (which publishes me from time to time), Dave Beckett of Semantic Web Advanced Development Europe, who has put together the promising Workshop on Semantic Web Storage and Retrieval, happening this week in Amsterdam, and many, many others are pragmatic technologists who share information freely and believe strongly that building a Semantic Web is a worthwhile pursuit. My money's on them. They know what they're talking about, and aren't afraid to admit what they don't know.

Postscript: on December 1, on this site, I'll describe a site I've built for a major national magazine of literature, politics, and culture. The site is built entirely on a primitive, but useful, Semantic Web framework, and I'll explain why using this framework was in the best interests of both the magazine and the readers, and how its code base allows it to re-use content in hundreds of interesting ways.
A New Website for Harper's Magazine

On December 1, 2003, a new website for Harper's Magazine launched at Harpers.org. This site was conceptualized, programmed, and designed by me, under the management of Harper's senior editor Roger D. Hodge. I also wrote some copy for the site, and have been editing the Archive of pre-1900 articles. I desperately need a nap, but I thought I'd tell you a bit about the site first. The site looks like this, but larger. It's been noted that Harpers.org looks like Ftrain. It's actually the other way around: Ftrain looks like Harpers.org. I've been using you, the Ftrain reader, as a guinea pig for about 5 months, testing ideas I developed for Harper's, finding out what JavaScript worked in which browser, which interface ideas were too baffling to include, and seeing how you dealt with different sorts of links. Thanks for that. Because Ftrain readers are free with both praise and criticism, this turned out to be a good way to craft a site that was accessible, worked in most browsers, and was enjoyable to use (with some practice). Now that Harpers.org is up and its design is stable, Ftrain can change according to my whim, and I can begin to break things here in new ways. Now, I am going to blow my own horn. Trumpet provided by Leslie Harpold. Ms. Harpold is Ftrain's preferred visual archivist and art director.

Features of Harpers.org

The regular list of new-age website tomfoolery applies: XHTML/CSS/QXNYTLRPK, accessible for the people, JavaScript zip-zap, validating RSS hoo-ha, etc. The framework is solid XML and XSLT 2.0, and plays nice with others. But also:

Remixing Narrative

Harper's is built upon a Semantic Web framework—albeit a primitive one. I've written before about what the Semantic Web is and why it matters, if you're curious, so I won't rehash that here. Using this framework, Harper's is divided into two parts: narrative content, like the Features and the Weekly Review, and a taxonomy (or ontology, depending on your preferred term), called Connections. The taxonomy is a big list of interconnected topics—examples are Dolly the Sheep, Monkeys, and Satan. The Weekly Review, which is narrative content, is a description of the events of the past week, published every Tuesday (see an example). We cut up the Weekly Review into individual events (6000 of them, going back to the year 2000), and tagged them by date, using XML and a bit of programming. We did the same with the Harper's Index, except instead of events, we marked things up as “facts.” Then we added links inside the events and facts to items in the taxonomy. Magic occurred: on the Satan page, for instance, is a list of all the events and facts related to Satan, sorted by time. Where do these facts come from? From the Weekly Review and the Index. On the opposite side, as you read the Weekly Review in its narrative form, all of the links in the site's content take you to timelines. Take a look at a recent Harper's Index and click around a bit—you'll see what I mean. The best way to think about this is as a remix: the taxonomy is an automated remix of the narrative content on the site, except instead of chopping up a ballad to turn it into house music, we're turning narrative content into an annotated timeline. The content doesn't change, just the way it's presented.
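To give a rough sense of what that cutting-up and re-sorting looks like, here is a minimal sketch in Python; the element names, attributes, dates, and topics are invented for illustration, not the actual Harper's markup (which was XML processed with XSLT):

import xml.etree.ElementTree as ET
from collections import defaultdict

# A faked-up fragment of a marked-up Weekly Review; the markup here is illustrative only.
weekly_review = """
<weekly-review date="2003-11-25">
  <event date="2003-11-20">
    <text>A <topic ref="dolly-the-sheep">cloned sheep</topic> was exhibited.</text>
  </event>
  <event date="2003-11-22">
    <text><topic ref="satan">Satan</topic> was blamed for a string of misfortunes.</text>
  </event>
</weekly-review>
"""

# The "remix": walk the narrative markup and re-sort every event under the
# taxonomy topics it links to, newest first, the way a topic page would.
timelines = defaultdict(list)
for event in ET.fromstring(weekly_review).iter("event"):
    date = event.get("date")
    text = "".join(event.find("text").itertext()).strip()
    for topic in event.iter("topic"):
        timelines[topic.get("ref")].append((date, text))

for topic, events in sorted(timelines.items()):
    print(topic)
    for date, text in sorted(events, reverse=True):
        print(" ", date, "-", text)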
Everything is in the Taxonomy

Harpers.org makes almost no distinction between data and metadata. Any block of text can have multiple blocks of text living inside of it (as when the Weekly Review contains events), and these blocks in turn can contain multiple blocks, and so forth. What this means in practice is that in addition to events and facts, I can add any arbitrary kind of data to the site: Links, Litigation, Questions, Answers, Lies, Photos, Crimes, any sort of boundary you can think of. By linking from inside of these boundaries to pages in the taxonomy, the taxonomy pages know to automatically list and sort them, whatever they are. How you display them is up to the XSLT code, and to the way the ontology is structured. Let's skip over that part. Another example: the Bookstore is just another part of the site, and the ads for books are automatically generated from the bookstore. Advertising and editorial are produced with the same system using the same linking mechanism. In theory, this would allow by-topic sponsorships similar to keyword-based advertising on search engines: “I'd like this ad to run next to the religion category and on all pages related to religion.” There's other stuff under the hood, and I have many plans (dynamically generated maps! queryable content! news-trackers!), but actions speak louder, etc.

No Banner Ads

Banner ads are terrible for both readers and advertisers. We got rid of them for Harper's, put non-blinking ads to the side, and made them half image, half copy, flexibly sized. This is, I believe, good for brand-building—the advertiser's message is prominently displayed and persistently there as the user reads. Because this message is not obtrusive or animated, it need not be ignored. Because it is graphical and bold, and lives in its own place to the right, it can be seen as content unto itself, not simply tacked on to make a few spare bucks. The ads are an integral part of the page. (I wouldn't mind animated ads, but they should only animate when the user mouses over them and shows interest. “The reader's freedom is a holy thing,” says William Gass, and I agree.)

Constructing Harpers.org

This was originally going to be a case study of how Harper's came into being. But only I care about that, and my pomposity has some limits. So I'll give you the entire thing in 4 bullets.

- Harper's has been very patient.
- Roger is an editor who knows XML and how to program. His kind are slightly more rare than talking dogs. He took on a great deal of complicated work in order to make this site happen, without even a shrug.
- 3,000 facts, 6,000 events, 12,000 links, 500 topics, and over 939 separate HTML pages. 300,000 words.
- I finished coding the first draft of the site by annotating printouts of XSLT code with a pencil, by propane light, in a 100-year-old log cabin in West Virginia, while muttering.

Next

Now that everything is working fairly well, it's time to tear the guts out of the code and start again. A small team of Java coders and I are planning to take the work done on Harper's, and in other places like Rhetorical Device, and create an open-sourced content management system based on RDF storage. This will allow for much larger content bases (the current system will start to get gimpy at around 30 megs of XML content—fine for Harper's, but not for larger sites), and for different kinds of content to be merged. When this will be completed is open to discussion, of course. But it seems like the right next step, if we can just figure out how to find the time to get it done. More later.

Talk to Me

Now that I've done this, I'd love to talk about it.
If you're in publishing or a related industry and want to talk to me about any of the work that went into the Sitekit code, for whatever reason, with the idea that I might help you think about your own content, please contact me, arrange a meeting, take me to lunch, throw paper airplanes over the East River from Manhattan in the hope that they will come in through the hole in my screen window, and so on. It's time for a career change.
Learning to Fear the Semantic Web

Zotero is an open-sourced bibliography-management tool that runs inside Firefox-based browsers (see screencast). It helps you keep track of your research. I've enjoyed using it as I work on writing projects. From the about page:

Zotero is a production of the Center for History and New Media at George Mason University. It is generously funded by the United States Institute of Museum and Library Services, the Andrew W. Mellon Foundation, and the Alfred P. Sloan Foundation.

Nice! Except today, a good bit after the fact, I learned of a peculiar lawsuit that information and news giant Thomson Reuters Inc. filed last month against the makers of Zotero. From the website of The Chronicle of Higher Education, October 3, 2008, by Jeffrey R. Young (links added):

Thomson Reuters Inc. sued George Mason University in a Virginia court this month, arguing that a free software tool made by the university makes improper use of the company’s EndNote citation software.... Thomson Reuters argues that the latest release of George Mason’s software, which can import files created by EndNote and turn them into files that can be used and shared online using Zotero, “is willfully and intentionally destroying Thomson’s customer base for the EndNote software.” The company seeks $10-million in damages for each year the university has offered the software and to stop the university from distributing versions of Zotero that can convert EndNote files.

One person who commented on the lawsuit is Michael Feldstein, who writes a blog about online learning. He posted the following on October 5:

Apparently, the Zotero team did create their own style format and is crowd-sourcing the creation of import styles. As you can see from this Zotero developer discussion thread, the developers considered and explicitly rejected supporting the redistribution of Thomson-supplied EndNote conversion files. In fact, while Zotero can read EndNote style files, it specifically does not convert them into Zotero’s own format, in large part to discourage the redistribution (deliberately or accidentally) of Thomson-created files. What the import feature does facilitate is (a) users who have already licensed EndNote and want to migrate to Zotero can use the EndNote styles that they have already paid for, and (b) Zotero users can take advantage of the EndNote import styles that individual journal publishers (as opposed to Thomson itself) make available for the convenience of their subscribers. These uses strike me as totally within bounds.

(More is available from the Disruptive Library Technology Jester blog.)

Given my biases, this lawsuit seems like an anachronistic, ham-fisted attempt to block competition. While as a programmer I love being able to adapt open-source software to my particular needs, I use a mix of closed-source and open-source software without many qualms. That said, non-standard, closed-source document formats are awful stuff that block competition between software vendors and, worse, waste god-awful amounts of my time. If you wish to dispute me on this, then come to my office tomorrow to help me, over the course of several hours, yank a magazine's worth of text out of Quark XPress, using a mix of applications and balky emacs macros. (Imagine if you could take back all the time spent wrangling closed, proprietary document formats. You could finish Perl 6; you could probably write it in Arc.)
I'm not an EndNote user and I don't like to borrow trouble (which is why I've been avoiding this blog; blogging is a great way to borrow trouble). But not only does this lawsuit invoke the dread specter of legally-enforced proprietary data formats, it raises questions about Thomson Reuters's legal attitude towards the data produced by its other software offerings—including, in this case, a piece of software called OpenCalais. OpenCalais is a web-based application that consumes text and returns special Semantic Web-style metadata that you can use to do interesting, Semantic Web-style things, like creating topic pages, improving search, or enhancing local taxonomies. It has a Facebook group, and its website features both video of straight-talking bearded coders and a creatively borrowed terms of service statement:

We based these Terms of Service under those released by Automattic under a Creative Commons Sharealike license. Thanks to Automattic and WordPress.com for sharing.

I have a quarter-million-page corpus at work and I'm looking for simple, inexpensive ways to enhance it, so I've followed the development of their platform for some time—joining the Facebook group, signing up for an account, and using their free endpoint for testing (go ahead and give it a spin). My grand, entirely unrealized plan was to include a direct hook to OpenCalais in our content management system. The OpenCalais team seems trustworthy, progressive, smart, and committed to openness. But, at least for now, the lawsuit against Zotero has scared me off using the product. This despite the following statement from the OpenCalais folk, pointed out by the Panlibus blog at Talis in a post on OpenCalais as it relates to the Zotero lawsuit:

We want to make all the world’s content more accessible, interoperable and valuable. Some call it Web 2.0, Web 3.0, the Semantic Web or the Giant Global Graph—we call our piece of it Calais.

So why am I overreacting? Well, that “our piece of it” bit is a little tricky, but I think I get what they mean, and the EndNote people and the OpenCalais people are in different parts of a very large organization, working on different projects with different goals. But the parent company is the same, and, professionally, I feel required to overreact, because in every situation—as editor, coder, designer, and so forth—I must, to my great regret, always concern myself with liability. I hate that part of my job. From worrying about copyright and fair use, to questioning whether we can reuse art or prose from our own archives, to sending out cease-and-desists—it all fills me with gloom and despair, the sense of being a culpable cog in a lumbering legal machine. It's the opposite of creative, interesting work, but if you get something wrong the consequences can be dire, so worrying about getting sued is something that has to be done, every day, even on the subway. I'm worried about getting sued right now, sitting here, typing this. If you've had someone threaten you with a lawsuit, you know the sort of fear and second-guessing it engenders. Even if I am certain that I have followed every ethical and legal guideline, it's an instant panic attack to see the words “contacting a lawyer” or “liable for damages” in an email; it leads to second-guessing, and I know there will be phone calls, meetings, and several months of follow-ups to comply with the needs of insurers.
If I can see the shadow of a lawsuit anywhere, I am obligated to shine a light upon it and freak out at least a little; otherwise I'm not doing my job. And that's what's going on here. This recent lawsuit against George Mason/Zotero immediately brought to mind a scenario: Thomson Reuters maintains control over the taxonomy, the thesaurus, of terms used in OpenCalais, and they do the indexing of content to associate that content with terms. The use pattern I was considering was as follows:

1. Create text within a content management system;
2. Send that text to OpenCalais;
3. Store the metadata it returns;
4. Over time, use aggregated metadata, integrated with our existing ~80,000 subjects, to create a local taxonomy for faceted search and automatically compiled topic pages, along with other interesting interfaces;
5. Share as much of the taxonomy as possible as downloadable RDF;
6. Make sure to provide links back to OpenCalais wherever possible, on their terms, as defined in their Terms of Service (TOS) document.

That's probably not a big deal. I doubt anyone would even notice. But... is it at all possible, conceivable, even a tiny bit, that at some point in the future Thomson Reuters could claim that we were misusing their data in step (4), above? From the TOS:

If you syndicate, publish or otherwise transmit any content containing, enhanced by or derived from Calais-generated metadata you will use your best efforts to incorporate the correct Calais-provided Globally Unique Identifier [GUID] in that content.

It seems straightforward, but that “best efforts....” The truth is, I don't really know exactly what they mean there. Also from the TOS:

You will not use any metadata or GUIDs produced by Calais to create a metadata retrieval service similar to Calais.

And could they claim that we were somehow creating a derivative work without permission and distributing it in step (5)? I would say, based on my far-from-authoritative reading of the TOS, and given that the suit against George Mason University now provides a precedent, that it is within the realm of possibility that if I passed thousands of web pages through OpenCalais and decided to adapt the resultant format for my own use in a way that Thomson Reuters disliked, I could get a fat letter from some lawyer someday demanding damages, accusing me of creating a derivative work based on their proprietary taxonomy, in violation of their terms. I'm not saying it's likely; I'm not saying I'm right; I'm not even saying that Thomson Reuters would be legally or ethically wrong to sue for damages. I would bet $10,000 right now against my fears coming to pass. But IANAL, which is exactly my problem here. And this is not a call to boycott anything, nor an attempt to get personalized service out of OpenCalais, where the developers are doing some very fine Semantic Web-bootstrapping work. I know Thomson Reuters could give a damn about me, and in that they are justified—I'm just another API key hash in their database, and even if I upgraded to their for-pay service I'd never represent more than a balance-sheet rounding error. My only purpose in writing today is to point out how a lawsuit can have unintended chilling effects, at least for me. We're in a remarkable downturn, and people are being told to “get real or go home.” One way corporations get “real” is to sue the living shit out of everything that blinks.
It's probably a good time to review the terms of service for all of your critical software to make sure you're in compliance; I wonder if a lot of Web 2.0 mashup decentralized goodwill is going to go to good-faith heaven as companies under financial strain start to look closely at their patent portfolios and vendor agreements, and decide that printing out lawsuits is even cheaper than deploying to EC2. And now that the “Semantic Web,” or “Web 3.0,” or the “Linked Data Web,” or the “Web of Really, That's How to Query Over an rdf:Bag?” or whatever they're calling it, is viable enough that you can't shrug off legal worries—now that the Semantic Web is no longer just a research project, if someone owns the taxonomy you're using and changes it up on you, what rights do you have in the matter? Who owns the GUIDs? Your Honor, I just wanted to build a hierarchy of topic pages. I never meant to hurt nobody. And so forth. To summarize: working in web publishing, I have a healthy fear of lawsuits bordering on the insanely paranoid; and I wish it were not so, but that is now part of the job, as the web of ideas has given way to the web of pricks; and finally, actions speak louder than Creative Commons-licensed terms of service. You can still get handed a subpoena while you're riding the Cluetrain.

Now that I got the fear, do I want to go to the effort to (1) educate a few people in management, none of whom would have great interest in the subject except as a soporific, about the far-fetched risks of using externally-generated taxonomies to organize our content; and do I (2) want to spend a number of hours in the near future educating myself over the completely nebulous rights issues connected to taxonomies, linking, and file formats, thus taking even more time away from code and prose to give it to the law; and do I possibly even (3) want to allocate the budget to work with a lawyer on taxonomy-related issues? All the while knowing that I'm overreacting and that this is probably pointless? Not really. I'd rather let other people do that and read the judges' opinions. Let deeper pockets set the precedent; what I do want to do is to port the CMS to Django, an open-sourced web framework published by a foundation, get the search into Solr, also published by a foundation, and introduce hierarchy to the 80,000 subjects we already have indexed. I'm just going to put OpenCalais away for a while and start looking at DBpedia again, then see how that whole Zotero suit works out over the next few months or decades.

In one way, this is all great because I love the Semantic Web to the point of stupidity—to the point of building a custom content management system entirely based on alpha-level technology using RDF for storage, creating a framework even slower than Rails. So I'm grateful to Zotero for taking the brunt of the lawsuit, because it gave me reason to take off my rose-tinted Linked Data goggles, and made me aware that all of my planned Semantic Web taxonomy-sharing fun could come crashing down if I don't carefully track the provenance of every one of my triples, erring always on the side of raving terror. Know what else is great? Now, finally, ten years on, I know that the Semantic Web is real and viable, because I'm afraid I'll get sued for using it. That's the true measure of a maturing technology—eat it, Gartner hype cycle. I believe, as in don't-get-him-started, that taxonomy-driven interactive editorial is essential to the future of the web, and thus to storytelling and narrative in general.
Clearly a great deal of money is being spent by major companies in pursuit of the golden triple: it appears the AP is working on taxonomy tools, and Rupert Murdoch's Dow Jones has Synaptica and publishes a cute taxonomy cookbook. A number of other companies are out there, building massive thesauri and indexing tools, hacking parsers and coding semantic disambiguators like mad, banging their heads against pronouns. There will be many, many competitors seeking to add their own structure to our increasingly Web-content-driven reality, and we will, if we use their services, find ourselves beholden to their methods of indexing, with all manner of legal compliance and copyright issues as yet untested in the courts. Creating good, broad, world-describing taxonomies is extraordinarily expensive, because reality is large, and these companies will need to strike a balance between sharing their work and protecting it, so I imagine this will be a subject I'll revisit, professionally, many times over the next few decades (barring complete societal breakdown, or a personal spiritual awakening that allows me to stop thinking about this sort of thing).

Such questions could keep a librarian up at night, staring at the wall, petting his or her sleek gray cat Otlet and wondering what, for instance, a political campaign looks like when all of the news and columns are automatically classified before being published. Competition, he or she might conclude, must be encouraged between these platforms; there must be a free, and yet somehow regulated (perhaps by the W3C, or preferably by an organization with a more attractive website), market of taxonomies—you can't have people claiming to own concepts conjoined to unique identifiers, can you? Can you? You probably can? Oh.

But there's likely no reason to worry; and I am just borrowing trouble; and maybe the Semantic Web won't matter that much after all. Even if taxonomies do become increasingly important in our web of linked data, thank God we live in a society with an enlightened understanding of intellectual property, and that we can trust the tiny handful of organizations that control the world's supply of news, as they become software providers as well as content providers, to do the right thing when it comes to serving the needs of a wider populace, in a culture that would rather foster dialogue, discussion, and mutually beneficial resolutions than use the ugly, blunt tool of potentially profitable lawsuits. I'm sure—really, I am—that mine is an overreaction. And onward, to progress.