Thursday, December 7, 2000
Etymologizer
By Paul Ford
A mass-etymological breakdown tool that I'd create if I could, but I can't.
As a writer, what I would love most of all, more than freshly baked raisin bread prepared each morning and served to me on a bedside table with real butter from well-loved hill-dwelling tinkling-bell cows, is a mass etymological breakdown tool. You could feed it something you've written, a few pages' worth, and it could tell you the origins of all your words - how many Anglo-Saxon-rooted words, how many Latinate terms, how much French and how much Spanish - until you had a clear sense of the historical patterns of speech you'd been picking up on your linguistic antennae. You'd learn about your personal set of language influences. Hemingway, for instance, eschewed Latinate words as much as possible; a huge proportion of the words in his works are raw, hard Anglo-Saxon. It would be great, I think, to compare Hemingway's percentage of Anglo-Saxon to Fitzgerald's.
Let's say you're working on an advertisement for a power drill. You feed it into the Etymologizer and find that you're using, say, 60% Latinate verbs and nouns. You'd know you had a problem, since the main power-drill market isn't all that Latinate. That's a little forced, but suppose you were writing a speech for a doctor in a screenplay, or for a scholar from Samuel Johnson's era, and historical accuracy mattered: you might want to influence his speech with Latin, and if the tool were connected to an etymologically informed semantic thesaurus, it could suggest other, more era-and-culture-appropriate terms, including euphemisms. Other users might be Civil War re-enactors and fantasy gamers of a historical bent who want to fit their language exactly to the period of the game.
A smart, smart system could analyze all the works of Jane Austen, put them into a randomly accessible "lookup hash," then go through your text, find individual words, and replace them with Austenite synonyms. Thus, the Austenizer. If you could write sentences in the rhythm of Austen, the system could suggest replacement words that Austen actually used, and you'd end up with truly Austenish prose. It would be fun to run Austen's Emma through the Eliotizer and see what the process of Middlemarchization does to Emma's romantic travails. It wouldn't work, exactly, but it would do something interesting, like annoy literary types.
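Here's a sketch in Perl of the lookup-hash idea - the %austen_synonym table below is invented, a stand-in for one actually distilled from her collected works:

    use strict;
    use warnings;

    # A stand-in Austen lookup hash: modern word -> a word Austen
    # actually favored. A real table would be distilled from her
    # complete works; these entries are invented for illustration.
    my %austen_synonym = (
        boyfriend => 'suitor',
        annoyed   => 'vexed',
        guess     => 'suppose',
        nice      => 'agreeable',
    );

    # Swap in an Austenite synonym for every whole word that has one.
    sub austenize {
        my ($text) = @_;
        $text =~ s/\b(\w+)\b/ $austen_synonym{lc $1} || $1 /ge;
        return $text;
    }

    print austenize("I guess my boyfriend is nice but annoyed."), "\n";
    # prints: I suppose my suitor is agreeable but vexed.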
I'm just playing (although I did spend four long days trying to build a prototype of such a tool a month ago, with no success), and there are many problems with the ideas above, but I keep wondering when we're going to begin using computers to truly process the amazing wealth of knowledge we have, to truly synthesize. When will we go native, and begin to use all this processing power? The problem is that data and information, like source code, are extremely proprietary and difficult to encode.
Now, the entire text of the Oxford English Dictionary is encoded in SGML. This makes it, in essence, a giant database, ready for searching, manipulating, sorting, and otherwise fiddling with. But people - libraries, consumers, you, and until a few weeks ago, me (but not necessarily the OED people themselves, who are exceptionally forward-thinking) - still perceive it as a book, or a set of books, even in its electronic form. It's not.
It would be great to get my hands on that data; it would make the Etymologizer much simpler to create, or at least feasible. The problem is that the value of the OED database is too great to just let people fool with it as they will, and licenses cost hundreds or thousands of dollars just to look at the thing, without any access to the "source code" of the dictionary. Plus, as far as I know, they won't sell "slices" of the dictionary; you can't buy the Etymology slice of all 9 trillion words. Some information-must-be-free types might argue that the OED, and other extremely valuable cultural documents, should go "open source" and be freely available, so that their usefulness can be magnified amongst the peoples of the English-speaking earth. But why? What have you and I done for the OED?
(I often wonder what could happen if the community around Open Source - think Slashdot - advocated decent health care for the poor instead of complaining about Microsoft. Pushed for campaign finance reform. Tried to make direct change in international foreign policy. Went after the FCC for its absolute sell-out to corporate interests. What a difference they could make.)
So let's write off using the OED as a pipe dream; the OED's sacred trove of SGML-encoded word-ideas is as far from my hands as [insert metaphor about something far-away here]. The closest "open-source" equivalent of the OED comes from The DICT Development Group.
As has been relentlessly beaten into our heads by role-playing-game Libertarian types, the advantage of Open-Source software, typified by Linux circa 1995, before the Linux penguin put on make-up and a cheap nylon dress and went down to the docks of Corporationville and whored itself, is that you can go in and muck with things; you don't have to conform to the standards of software, if you're willing to learn a huge mess of arcane nonsense. I can make my application windows look any way I want; they can look like cubes, or Fiat automobiles; I can have my computer greet me by saying "You fuckwit! Get to work!" and build custom applications to make things function in an interesting, engaging fashion according to my own principles and beliefs, as long as I can find a manual.
Most software is still proprietary, but a significant amount isn't, because of the open source movement - enough to put together a working system, enough to build Ftrain. However, nearly all current formalized data, or information, or knowledge, what have you, comes pre-packaged with a set interface, even on the Web, because that data is proprietary and the companies feel they have more to gain by setting up walls than by sharing. Often, they're right. Thus, electronic dictionaries and encyclopedias, e-books, and so forth all have their own encoding and database formats and secret methods of access, and if you want to re-use the data, you have to pay a large licensing fee, if they'll let you use it at all. Usually, they will, if you don't compete with them directly; licensing is free money. However, for the freely-available-on-the-Web Etymologizer, that's impossible. There's no money to spend past, say, the $100-$200 I could dig up in quarters from under the bed. So I'm caught trying to massage an etymological dictionary out of a poorly encoded source.
If I'm running into this bind, as an amateur programmer and half-assed Web writer, then others will too. The answer is to create new ways to share information, which has been the goal of the XML "community," but what they're doing is closed off to the commonfolk because it's confusing, and no one at the W3C has yet come down off the mountaintop to make it clear what they're actually up to over there:
PEF: "I don't understand how all this XML/XHTML/XLink/XPointer/XPath/XSL/SVG/FO stuff is going to work together, what the goals are, where the vision is. I mean, it's all great, don't get me wrong. I use it to build Ftrain.com."
W3C: "Just look at the standard and all will be manifest."
PEF: "But it's 9000 pages, and is filled with Backus-Naur grammar statements. I'm a human, not a computer! What are you guys really trying to do? What vision are you trying to promote?"
W3C: "We're trying to build <bigbrightlights>The Semantic Web</bigbrightlights>"
PEF: "But what is it? Can Ftrain.com be part of <bigbrightlights>The Semantic Web</bigbrightlights>?"
W3C: "Whether you wish to or not, all must belong to <bigbrightlights>The Semantic Web</bigbrightlights>."
PEF: "You're transforming into a giant terrifying aluminum robot!"
W3C: "<loud>Must...<louder> have... <loudest>corporate...<loudest-yet> funding... </loudest-yet> </loudest> </louder> </loud>"
W3C: "Just look at the standard and all will be manifest."
PEF: "But it's 9000 pages, and is filled with Backus-Naur grammar statements. I'm a human, not a computer! What are you guys really trying to do? What vision are you trying to promote?"
W3C: "We're trying to build <bigbrightlights>The Semantic Web</bigbrightlights>"
PEF: "But what is it? Can Ftrain.com be part of <bigbrightlights>The Semantic Web</bigbrightlights>?"
W3C: "Whether you wish to or not, all must belong to <bigbrightlights>The Semantic Web</bigbrightlights>."
PEF: "You're transforming into a giant terrifying aluminum robot!"
W3C: "<loud>Must...<louder> have... <loudest>corporate...<loudest-yet> funding... </loudest-yet> </loudest> </louder> </loud>"
The W3C wants to connect all data through semantic pathways. It's not enough, they feel, to put a site up on the Web with proprietary content; you need to find ways to make that content into objects that can fit into other people's objects, and vice versa. Good luck to 'em. I think you'd need to change the culture, first; do Americans really want to share? Does anyone want to share with the Americans?
So, in any case, back to the Etymologizer: an etymological analyzer with a complete database of words and their histories, like the one in the Oxford English Dictionary, connected to a large semantic network like WordNet that could pluck out synonyms by traversing a variety of linguistic trees, could tell you whether your speakers were anachronistic. It could catch a speaker in England saying "trunk" for "boot" - the tiny, niggling things that keep writers in horror before the screen.
Here's how it could work:
First, take text and break it into sentences, using familiar routines. The Perl computer language, for instance, makes this feasible.
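A sketch of that first step - the regular expression here is naive (abbreviations like "Mr." will fool it), but it shows the idea:

    use strict;
    use warnings;

    # Naive sentence splitter: break after . ! or ? when followed by
    # whitespace and a capital letter or quote. Real prose needs an
    # exception list for abbreviations, initials, and so on.
    sub split_sentences {
        my ($text) = @_;
        $text =~ s/\s+/ /g;    # collapse runs of whitespace
        return split /(?<=[.!?])\s+(?=["'A-Z])/, $text;
    }

    my @sentences = split_sentences(
        "Is the owl's cloaca properly lubricated? I applied proper lubrication. We rest."
    );
    print "$_\n" for @sentences;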
Then, scan the sentences with a link grammar parser or similar technology. This identifies the parts of speech of the sentences - nouns, verbs, adjectives, etc.
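The link grammar parser itself won't fit in a few lines, but a CPAN part-of-speech tagger - Lingua::EN::Tagger, say - can stand in for it in a sketch, assuming that module is installed:

    use strict;
    use warnings;
    use Lingua::EN::Tagger;    # CPAN stand-in for the link grammar parser

    my $tagger = Lingua::EN::Tagger->new;

    # add_tags wraps each word in a part-of-speech marker, e.g.
    # <nn>lubrication</nn> for a noun, <vbd>applied</vbd> for a
    # past-tense verb.
    print $tagger->add_tags("I applied proper lubrication to the owl's cloaca."), "\n";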
Now, use an etymological dictionary to look up each word and code it according to its origins. There is a large range of problems here that need to be solved.
First, as I've stated above, there's the lack of a manipulable electronic dictionary. The best candidate is the 1913 Webster's dictionary, from Project Gutenberg, which has been encoded into a somewhat clean HTML form by the GNU Dictionary project. The problem here is that only root words include etymology, so "anger" might include etymological records while "angry" won't. The best approach I can come up with is to either use some sort of word-stemming technology - I think there's a Perl module for this - or, when that doesn't work, simply keep cutting a word back until something matches. So when I come across "relentless," which has no etymological information, I check to see if I have information for "relentles" (no), "relentle" (no), "relentl" (no), and "relent" (yes), and cross-reference the definition for "relent" to the word "relentless." Okay, but I'm screwed for fury/furious; it'll think that the root of "furious" is the same as the root of "fur."
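The cutting-back fallback, as a Perl sketch - the %etymology hash is a tiny invented stand-in for the real dictionary:

    use strict;
    use warnings;

    # Invented stand-in for the etymological dictionary: root -> origin.
    my %etymology = (
        relent => 'Latin, via Old French',
        anger  => 'Old Norse',
        fur    => 'Old French',
        fury   => 'Latin',
    );

    # Look the word up directly; failing that, chop letters off the
    # end until something matches. Note how "furious" lands on "fur" -
    # exactly the failure described above. ("angry" would fail outright,
    # since chopping never reaches "anger"; that's where a real stemmer
    # comes in.)
    sub guess_origin {
        my ($word) = @_;
        for (my $stem = lc $word; length $stem; chop $stem) {
            return ($stem, $etymology{$stem}) if exists $etymology{$stem};
        }
        return ('?', 'unknown');
    }

    for my $word (qw(relentless angered furious)) {
        my ($root, $origin) = guess_origin($word);
        print "$word -> root '$root', origin: $origin\n";
    }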
Now, of course, the link grammar parser isn't perfect, but it does a fairly good job of guessing which words are which. Notice how, in the two sentences "Is the owl's cloaca properly lubricated?" and "I applied proper lubrication to the owl's cloaca," the parser can tell the difference in part of speech between "lubricated" (verb) and "lubrication" (noun):
+-----------------------------------Xp------------------------------------+
|          +-----------------------Pv-----------------------+             |
|          +----------SIs----------+                        |             |
+----Qd----+     +-Ds-+--YS--+-D*u-+                        |             |
|          |     |    |      |     |                        |             |
LEFT-WALL  is.v  the  owl.n  's.p  cloaca[?].n  [properly]  lubricated.v  ?

+-------------------------------------------Xp--------------------------------------------+
|                  +----------------MVp----------------+                                  |
|                  +---------Os---------+              +---------Jp----------+            |
+----Wd----+--Ss---+          +----A----+              |   +-Ds-+--YS--+-D*u-+            |
|          |       |          |         |              |   |    |      |     |            |
LEFT-WALL  i[?].n  applied.v  proper.a  lubrication.n  to  the  owl.n  's.p  cloaca[?].n  .
Pretty neat, eh? Now, assuming (big assumption) that all works, we have to put it together. And here's the real problem - etymology, word history, is a range of values. Is a word with a Latin root but French inflection French or Latin? Words have traveled through millennia to get to us, along some awkward paths; at the root is a sort of Indo-European metalanguage from which all water flows. There's almost no way to look at a term and point out a definitive year and place for its usage.
My take, since we're analyzing texts and tracing back the habits of the writer, would be a large corpus analysis across all the languages that influenced English: find out when certain words were used most, build a kind of frequency table, and chart word chains and common roots in a writer's prose. Once you know when words were most common, you can track the language of a text back to certain eras. Run Shakespeare through and find out how much Latin and Greek really affected him (quite a bit, we already know). Run translations of Horace or Ovid or Homer through to find out how much the translators rely on native-language-inflected words. Hours of fun - beats the hell out of, say, cribbage.
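And once every word has been traced to an origin, the tally at the end is simple - a sketch, with invented sample pairs standing in for the output of the lookup stage:

    use strict;
    use warnings;

    # Invented (word, origin) pairs, standing in for the lookup stage.
    my @analyzed = (
        [ applied     => 'Latin'       ],
        [ proper      => 'Latin'       ],
        [ lubrication => 'Latin'       ],
        [ owl         => 'Anglo-Saxon' ],
        [ the         => 'Anglo-Saxon' ],
        [ to          => 'Anglo-Saxon' ],
    );

    # Count words per origin and print the percentage breakdown -
    # the "60% Latinate" number from the power-drill example above.
    my %count;
    $count{ $_->[1] }++ for @analyzed;

    for my $origin (sort { $count{$b} <=> $count{$a} } keys %count) {
        printf "%-12s %5.1f%%\n", $origin, 100 * $count{$origin} / @analyzed;
    }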
My overall point, even though I didn't set out with this as my overall point, isn't that the world needs an Etymologizer, although it desperately does. It's that computers can provide tools to do things we never thought we needed, not just things that we did in the "real world," with other technologies. Too much computing is like Photoshop, based originally on the real-world process of photolab work and bound into that metaphor forever after, making an awful lot of magazine illustrations less interesting in the process, or MSWord, which confuses printing with writing. The metaphor of the desktop, the file folder, the garbage can, the relentless insistence that things be familiar and intuitive in ways that make the most immediate sense to bored administrative assistants - these all ultimately bind us to some of the blandest aspects of the "real world," aspects of sorting, manipulating, processing, key-stroking drudgery, that computers are supposed to eliminate. Such interfaces lack soul just as a filing cabinet lacks soul, and if there is a place where people are changing this - not just putting a new face on it, but looking to build a whole new set of abstract tools for dealing with knowledge and ideas - I wish I could find it, and go there, and sit and drink coffee.