Brandeis University has received a major grant to expand the LAPPS Grid Project, which seamlessly connects open-source computer programs so that researchers can quickly analyze huge amounts of language from diverse sources and genres.
Brandeis University has been awarded a two-year, $390,000 grant from the Andrew W. Mellon Foundation to lead an international collaboration to link the two major American and European infrastructures for the computational analysis of natural language. The resulting meta-framework has the potential to transform scholarship and development across multiple disciplines in the sciences, language and social sciences, and digital humanities by enabling scholars in Europe, the US, and Asia to work seamlessly across a massive range of software tools and data resources developed separately by the American and European efforts. Led by James Pustejovsky, the TJX/Feldberg Professor of Computer Science at Brandeis, the project team includes Nancy Ide (Vassar College), Erhard Hinrichs (University of Tübingen), and Jan Hajic (Charles University, Prague).
The Language Applications (LAPPS) Grid Project—a collaborative, NSF-funded effort among Vassar, Brandeis, Carnegie Mellon University, and the Linguistic Data Consortium at the University of Pennsylvania—and the European Common Language Resources and Technology Infrastructure (CLARIN) are both frameworks (“grids”) that create and provide access to a broad range of computational resources for analyzing vast bodies of natural language data: digital language data collections, digital tools to work with them, and expertise for researchers to use them. Within each framework, members adhere to common standards and protocols, so that tools and data from different projects are “interoperable”: users can access, combine, and chain data from different repositories and tools from different sources to perform complex operations on a single platform with a single sign-on.
But the LAPPS Grid and CLARIN are not themselves interoperable. Researchers using data and tools in one framework cannot easily access or add data and tools from the other. LAPPS Grid users cannot access CLARIN’s multi-lingual services for digital humanities, social sciences, and language technology research and development, like Prague’s tools for search of oral history archives (developed to support their hosting the USC Shoah Archive), or Tübingen’s WebLicht services for data mining political and social science documents. CLARIN users don’t have access to the LAPPS Grid’s state-of-the-art tools for English and, through the LAPPS Grid’s federation with five Asian grids, to services providing a broad spectrum of capabilities for work in Asian languages. Scholars manually annotating a text corpus with CLARIN’s WebAnno (developed at TU-Darmstadt) would love to feed their work through iterative machine learning and evaluation facilities in the LAPPS Grid—but can’t.
The new Mellon Foundation funding will enable the project team to make the two grids interoperable on three levels:
- Infrastructural: While the LAPPS Grid and CLARIN are both committed to open data and software, they also provide secure access to licensed resources, including the vast majority of the language data available over the web. The team will create a “trust network” between the two services, enabling single-authentication sign-on;
- Technical: The LAPPS Grid and CLARIN have different underlying architectures and data exchange formats. The team will map these architectures and formats onto one another, enabling communication between the two frameworks over the web;
- Semantic: To combine differently curated datasets, the data must not only share, or be converted into, a common format, but must also share a vocabulary for describing basic linguistic structures (a common language ontology) that tells computers how to combine the data into meaningful statements. The project team will extend the common exchange vocabulary developed by the LAPPS Grid to the web services of both frameworks and implement a set of conversion services.
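To make the semantic level concrete, here is a minimal Python sketch of what a conversion service might do in principle: translate each framework's own labels into one shared vocabulary so that the results can be combined. The tag sets, mapping tables, and function name are all hypothetical illustrations and are not taken from the LAPPS Grid or CLARIN.

```python
# Hypothetical shared vocabulary and per-framework label mappings.
# None of these names come from the LAPPS Grid or CLARIN.
SHARED_VOCAB = {"Noun", "Verb", "Determiner"}

GRID_A_TO_SHARED = {"NN": "Noun", "VB": "Verb", "DT": "Determiner"}
GRID_B_TO_SHARED = {"N": "Noun", "V": "Verb", "ART": "Determiner"}


def to_shared(annotations, mapping):
    """Convert (token, source_label) pairs into (token, shared_label) pairs."""
    converted = []
    for token, label in annotations:
        if label not in mapping:
            # Fail loudly rather than silently dropping an unmapped label.
            raise KeyError(f"no shared-vocabulary entry for label {label!r}")
        converted.append((token, mapping[label]))
    return converted


# Output from two imagined tools, each using its own label set.
grid_a_output = [("the", "DT"), ("dog", "NN"), ("barks", "VB")]
grid_b_output = [("der", "ART"), ("Hund", "N"), ("bellt", "V")]

# Once both are expressed in the shared vocabulary, they can be combined.
combined = (
    to_shared(grid_a_output, GRID_A_TO_SHARED)
    + to_shared(grid_b_output, GRID_B_TO_SHARED)
)
```

After conversion, a downstream tool would need to understand only the shared labels, regardless of which framework produced the annotations.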
The project will dramatically extend the power and reach of both the European and American frameworks and put their combined resources at the direct disposal of scholars from a broad range of fields in the humanities and social sciences, without requiring them to be computer programmers. “It will effectively create an ‘internet of language applications’ for the everyday computer user,” explained Dr. Pustejovsky. “We’re going to give every scholar access to a toolkit that’s now only available to the largest corporations.”