Lexicography from Scratch: Quantifying meaning descriptions with feature engineering
Orion Montoya
Friday, October 20 at 3pm
Volen 101
When computational linguistics wishes to engage with the meaning of words, it asks the experts: lexicographers, who analyze evidence of usage and then record their judgments in dictionaries, in the form of definitions. A definition is a finely wrought piece of natural language, whose nuances are as elusive to computational processes as any other unstructured data. Computational linguists nevertheless squeeze as much utility as they can out of dictionaries of every stripe, from Webster’s 1913 to WordNet. None of these resources had computational analysis of lexical meaning in mind when they were conceived or created. Despite the immense human cognitive effort that went into making them, most lexical resources constrain their computational users to a few simplistic lookup tasks.
If a lexical resource were designed, from its origins, to serve all the diverse human and computational applications for which dictionaries have been repurposed in the digital era, it might yield significant improvements both theoretical and practical. But who wants to make a dictionary from scratch? The theme of the 2017 Electronic Lexicography conference (Leiden, September 19-21: http://elex.link/elex2017/) was “Lexicography From Scratch”. This talk assembles a number of isolated recent innovations in lexicographical practice — often corpus-driven retrofits onto existing dictionary data — and attempts to map out a lexicographical process that would connect them all.
Such a process would yield meaning descriptions that are quantified, linked to corpus data, decomposable into individual semantic factors, and conducive to insightful comparison of lexicalized concepts in pairs and in groups. We describe a cluster-analysis framework that shows promise for automating the fussier parts of this process by reducing the cognitive load on the lexical analyst. If aspects of lexical analysis can be automated through feature engineering, we may produce computational models of lexical meaning that are more useful for NLP tasks and more maintainable by lexicographers.
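To make the idea of decomposable, comparable meaning descriptions concrete, here is a minimal sketch in Python. The feature names and weights are invented toy values, not data from the talk; in the framework described above they would come from corpus analysis and clustering. The sketch represents each word as a weighted vector of semantic factors and surfaces the factors that distinguish one near-synonym from another:

```python
# Toy sketch: quantified meaning descriptions as feature vectors.
# All feature names and weights below are invented for illustration;
# a real system would derive them from corpus evidence.
vectors = {
    "slender": {"person": 0.6, "object": 0.2, "approving": 0.7, "disapproving": 0.1},
    "skinny":  {"person": 0.7, "object": 0.1, "approving": 0.1, "disapproving": 0.6},
}

def distinguishing_factors(word_a, word_b, vectors, threshold=0.3):
    """Return semantic factors whose weights differ by more than `threshold`.

    Positive values favor word_a, negative values favor word_b.
    """
    va, vb = vectors[word_a], vectors[word_b]
    features = set(va) | set(vb)
    diffs = {f: va.get(f, 0.0) - vb.get(f, 0.0) for f in features}
    return {f: d for f, d in diffs.items() if abs(d) > threshold}

for factor, diff in sorted(distinguishing_factors("slender", "skinny", vectors).items()):
    print(factor, round(diff, 2))
```

On this toy data, the shared factors (human referent, physical thinness) cancel out, and only the evaluative factors survive the threshold: "slender" leans approving, "skinny" disapproving — the kind of pairwise contrast the proposed meaning descriptions are meant to make computable.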
Bio: Orion Montoya graduated from the Brandeis CL MA program in 2017, with the thesis “Lexicography as feature engineering: automatic discovery of distinguishing semantic factors for synonyms”. Before coming to Brandeis, he spent fifteen years in and around the lexicography industry, computing with lexical data in all of its manifestations: digitizing old print dictionaries, managing lexicographical corpora, and linking old lexical data to new corpus data. He also has a BA in Classics from the University of Chicago.