Unnatural Selection

January 28, 2009

Alas, the SS Vaio may have been lost with all hands aboard.  Why do these things always happen when you’re traveling?  Of course, I got it at half price from eBay (“refurbished”), and figured out pretty quickly that there was a big section of the hard drive that sounded like someone chewing granola whenever you put too large a file on it.  So I got an external hard drive (also refurbished, hmm, I should invest in a new one…) and all is well as far as data goes, at least.  And with Gmail and WordPress, my work is safe – hooray for the cloud.  Thankfully I still have a job and therefore a work laptop I can use for the basics.

Before the crash, I did manage to start on my reading from Computational Linguistics.  It’s times like this that you think to yourself, why couldn’t I have picked something easier to write about?  I’m reading Geraldine Brooks’ People of the Book, a novel about the history of the Sarajevo Haggadah – all she had to do was learn the minutiae of book restoration and the comprehensive history of Judaism in Europe, the lucky girl.  I’d picked some “easy” pieces to read first – lifetime achievement awards and obituaries, the forms that inherently push their authors into “overview” mode.  Digging into the award acceptance by Yorick Wilks, I found myself in over my head pretty quickly, and not in the way I thought I would be.  It wasn’t the computer aspect of the work that threw me; I’m a computer power user, and I’ve dabbled in enough SQL and Basic programming to get the concepts about how programs are organized.  It was the linguistics part that made me realize, holy crap, if I am serious about this I have sooo much to learn… and I’m doubtful about my chances of mastering an advanced academic discipline in my free time.  Take for instance this paragraph about an experimental computer analysis of some text:

It consisted of an analysis of five metaphysical texts (by Wittgenstein, Spinoza, Descartes, Kant, and Leibniz) along with five randomly chosen passages from editorials in the London Times, as some sort of control texts.  The vocabulary was only about 500 words, but this was many years before Boguraev declared the average size of vocabularies in working NLP systems to be 36 words. The semantic structures derived—via what we would now call chunk parsing—consisted of tree structures of primitives (from a set of about 80), one tree for each participating word sense in the text chunk, that fitted into preformed triples called templates. These templates were subject–predicate–object triples that defined well-formed sequences of the triples of trees (i.e., the first tree for the sense of the subject, the second for the action and so on), whose tree-heads had to fit those of the template’s three primitive items in order. The overall system selected the word senses that fitted into these structures by means of a notion of “semantic preference” (see subsequent discussion), and then declared those to be the appropriate senses for the words, thus doing a primitive kind of WSD.

So now I need to find out what chunk parsing, tree structures, preformed triples, tree-heads, and semantic preference are…and that’s just from one paragraph.  Why couldn’t I be writing a novel about an idealistic, square-jawed young man in some kind of suit-wearing job who finds himself caught in a web of intrigue?
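For what it’s worth, here is a toy sketch of how the template idea in that paragraph might work, written as a guess from the quoted description alone.  Everything in it is an assumption made up for illustration: the three-word lexicon, the primitive names (MAN, THING, ASK), the single template, and the function name `disambiguate` are hypothetical, and each word sense is reduced to one “head” primitive instead of a full tree.  It is not the system the article describes, just the bare shape of “pick the senses whose heads fit a subject–predicate–object template.”

```python
from itertools import product

# Hypothetical lexicon: word -> {sense name: head primitive}.
# The real system used a tree of primitives per sense; only the head is kept here.
LEXICON = {
    "policeman": {"policeman": "MAN"},
    "interrogated": {"interrogate": "ASK"},
    "crook": {"crook/criminal": "MAN", "crook/staff": "THING"},
}

# Hypothetical templates: (subject head, action head, object head).
TEMPLATES = {("MAN", "ASK", "MAN")}


def disambiguate(subject, verb, obj):
    """Return every combination of senses whose heads match a template."""
    matches = []
    candidates = product(
        LEXICON[subject].items(),
        LEXICON[verb].items(),
        LEXICON[obj].items(),
    )
    for senses in candidates:
        heads = tuple(head for _, head in senses)
        if heads in TEMPLATES:
            matches.append(tuple(name for name, _ in senses))
    return matches


print(disambiguate("policeman", "interrogated", "crook"))
```

With this made-up data, only the “criminal” sense of “crook” survives, because the “staff” sense has the head THING and no template accepts a THING as the object of ASK.  That filtering step is the “primitive kind of WSD” in miniature, minus the scoring that “semantic preference” presumably adds.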

I’ve had two big obstacles to success in my life – laziness and depression.  The laziness has stemmed from the fact that I was always naturally “good enough” at enough things to get by, to be applauded at work and to get novels written that were suitable for publication – but I’d never work hard enough to advance, to write something that would receive critical acclaim, for instance.  The depression bit is real; people who don’t believe in it and talk about bootstraps all the time can’t be convinced otherwise, and I won’t bother trying.  But it’s that “black dog,” as Churchill called it, who says, forget it, you’re not good enough anyway, you’ll do all that work and it’ll all end in tears.  My midlife crisis consisted of realizing that writing the same f’in gay novel over and over again for the rest of my life was not going to satisfy me anymore, so here I am, facing the monolith.  I have to keep reminding myself to break things down into doable parts – yeah, so it’ll take years to write the book, probably.  Well, the idea has been sitting around waiting for me to get to work on it for years, so at least it’s moving forward now.  And I’m in the fictional version of the position Darwin was in when he got the ms. from Wallace – holy crap, I’ve got to move on this before someone else does, or, in my case, before reality overtakes fiction.

I did learn something immediately from the above, that (at least at one time) the “average size of vocabularies in working NLP systems [is] 36 words.”  I’m not sure what it means in context, but it sounds to me like it means a “working” system has either 36 keywords it can use to access its phrase bank of replies, or that it has 36 non-ambiguously defined words it can work with without confusion, or it has rule sets that it knows how to apply to 36 words to make sentences, or…hey, I’m guessing, but to me, that’s when learning is fun, when you know something, and you guess what comes after, and you look it up, and you’re right.  (The article goes on to say – I think – that the program analyzed the texts well enough to prove that “God” as inserted into a sentence by Spinoza could be replaced with the word “Nature” and retain the same meaning – I’m presuming that Wittgenstein, Descartes, Kant, and Leibniz had lots to say on that subject to provide the context, since I can’t imagine many definitions of nature came from Times of London editorials.)

Time for work – I’ll finish writing about this article in the next post.

