Back to my reading from Computational Linguistics. This morning’s text is the introductory article to an issue on “Semantic Role Labeling”:
The sentence-level semantic analysis of text is concerned with the characterization of events, such as determining “who” did “what” to “whom,” “where,” “when,” and “how.” The predicate of a clause (typically a verb) establishes “what” took place, and other sentence constituents express the participants in the event (such as “who” and “where”), as well as further event properties (such as “when” and “how”). The primary task of semantic role labeling (SRL) is to indicate exactly what semantic relations hold among a predicate and its associated participants and properties, with these relations drawn from a pre-specified list of possible semantic roles for that predicate (or class of predicates). In order to accomplish this, the role-bearing constituents in a clause must be identified and their correct semantic role labels assigned, as in:
[The girl on the swing]Agent [whispered]Pred to [the boy beside her]Recipient
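That labeled example is easy to picture as plain data – here’s a minimal sketch of my own (the predicate and role names come from the example above; the dictionary layout and the `describe` helper are just my framing, not anything from the article):

```python
# A role-labeled clause as plain data: the predicate plus its
# role-bearing constituents, per the example above.
labeled_clause = {
    "predicate": "whispered",
    "roles": {
        "Agent": "The girl on the swing",
        "Recipient": "the boy beside her",
    },
}

def describe(clause):
    """Answer 'who did what to whom' by reading the role labels."""
    who = clause["roles"].get("Agent", "someone")
    whom = clause["roles"].get("Recipient", "someone")
    return f"{who} {clause['predicate']} to {whom}"

print(describe(labeled_clause))
# → The girl on the swing whispered to the boy beside her
```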
The item that really caught my interest was the FrameNet project:
In the FrameNet project (Fillmore, Ruppenhofer, and Baker 2004), lexicographers define a frame to capture some semantic situation (e.g., Arrest), identify lexical items as belonging to the frame (e.g., apprehend and bust), and devise appropriate roles for the frame (e.g., Suspect, Authorities, Offense). They then select and annotate example sentences from the British National Corpus and other sources to illustrate the range of possible assignments of roles to sentence constituents for each lexical item (at present, over 141,000 sentences have been annotated).
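A frame, as the article describes one, also sketches nicely as data. The frame name, lexical items, and role names below come straight from the quote; the record structure and the annotated sentence are my own invention for illustration – real FrameNet data is much richer than this:

```python
# Minimal sketch of a frame record. Real FrameNet distributes XML
# with frame elements, frame relations, and full-text annotation.
arrest_frame = {
    "name": "Arrest",
    "lexical_units": ["apprehend", "bust"],  # words that evoke the frame
    "roles": ["Suspect", "Authorities", "Offense"],
}

# A hypothetical annotated sentence, mapping constituents to roles.
annotation = {
    "sentence": "The police busted the dealer for possession.",
    "target": "busted",
    "role_spans": {
        "Authorities": "The police",
        "Suspect": "the dealer",
        "Offense": "possession",
    },
}

# Every labeled span should use a role the frame defines.
assert set(annotation["role_spans"]) <= set(arrest_frame["roles"])
```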
The Corpus, according to the website, is “a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.” It lets you type in a single word or a more complex query and get back 50 sample sentences showing the word in various contexts. The sentences are drawn 90% from newspapers, magazines, journals and other print media, and 10% from “conversational” contexts, through transcriptions of meetings, radio talk shows, and conversations held by volunteers.
The Corpus, in other words, would be data mining heaven for an AI searching for context in a conversation. I typed the word “assembly” into the search, since it’s a word with numerous context-specific meanings (a gathering, a manufacturing process, a computer programming term, etc.). Heavy weight in the 50 returned samples (out of 5,348 total results) was given to “assembly” in the context of a governmental organization, i.e., lots of references to various “National Assemblies,” the UN General Assembly, etc. It returned a few other contexts, like “the tension assembly” and “some assembly work,” but clearly the sources used are more interested in politics.
All the same, a “smart” AI could use the Corpus – take the user’s keyboard input, such as “It’s a bear to debug, it’s written in Assembly,” and search for “bear*debug / debug*assembly / bear*assembly” (regrettably, the authors have chosen to call the query language “Corpus Query Language, pronounced ‘sequel’” – guess they didn’t check the Corpus to see if that usage was taken). Debug*assembly comes up with eight returns, and assembly*debug returns six more. Of course, the AI then has to go to the next level of programming and weight the rest of the words in the results to select which one provides its best “return statement.” But once it adds up the number of instances where the word “compiler” appears in the results, it can use “compiler” in its response – a more reasonable and interesting statement than it would otherwise make, because it injects a new, relevant word into the conversation instead of just “reflecting.”
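That weighting step could be sketched like this – count the candidate words across the returned sentences and pick the most frequent one that the user didn’t already say. The corpus hits below are stand-ins I made up; a real system would pull them from an actual BNC query:

```python
from collections import Counter
import re

# Small ad-hoc stopword list, just for the sketch.
STOPWORDS = {"the", "a", "to", "in", "it", "is", "and", "of", "that", "you"}

def pick_injection_word(user_input, corpus_hits):
    """Return the most frequent content word in the corpus results
    that doesn't already appear in the user's input."""
    seen = set(re.findall(r"[a-z]+", user_input.lower()))
    counts = Counter()
    for sentence in corpus_hits:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in STOPWORDS and word not in seen:
                counts[word] += 1
    return counts.most_common(1)[0][0] if counts else None

# Stand-in results for a "debug*assembly"-style query.
hits = [
    "You have to debug the assembly output of the compiler.",
    "The compiler emits assembly that is hard to debug.",
    "Some compiler bugs only show up in the assembly.",
]

print(pick_injection_word("It's a bear to debug, it's written in Assembly", hits))
# → compiler
```

With these stand-in hits, “compiler” wins because it shows up in all three sentences while everything else appears once – exactly the “add up the instances” idea above.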
The AI could also “learn” to exclude certain sources – many of the 50 returned results for “assembly” came from “Keesing’s World News Archive,” so the machine could learn to exclude Keesing’s results whenever the overall frame of the conversation has been tagged as “computer science” or “industry” rather than “politics.”
The BNC isn’t free, unfortunately – 75 pounds ($105 today) for a single-user copy, which would provide the full returned results (i.e., 5,000+ for “assembly” vs. the 50 samples in the online version). It’s from Oxford University, which is a private institution, so I don’t know how much tax money went into the project. Still, it’s another resource for Christopher in the creation of Alex, and as I go, I’m seeing just how much of a thief (“pirate” being the polite term these days) Christopher could be forced into being. There are so many resources out there, so much information to adapt and combine, and if a substantial amount of it isn’t free, that could definitely cramp the style of an adventurous programmer unaffiliated with a large, deep-pocketed corporation or university.