Getting Started

This summer, I’m working at the PanLex project, a non-profit group under the Long Now Foundation. Our goal is to preserve linguistic diversity and to increase linguistic knowledge, especially of declining and understudied languages. While there are around 7,000 human languages, globalization has focused the world’s attention on only the leading languages of industry and academia. People concentrate on these top languages, which they believe will open up a better future for them. This increasingly skews the ratio of speakers among languages and leads to language extinction, because speakers see too little benefit in using their heritage language. To counteract this, PanLex is building a database of symmetrical dictionaries between languages. These parallel dictionaries serve two purposes: to preserve languages that are dying or extinct so that they could later be reconstructed, and to increase the information available so that translation programs and devices could allow conversation between people who speak different languages without having to prioritize one language over the other.


Because of our project director’s connection with the University of California, Berkeley, we are currently housed in the Berkeley language labs in Dwinelle Hall on campus. Instead of sitting in typical office cubicles, the interns share a large table with televisions connected to it, which facilitates discussion and collaboration. Here, we hook up our computers to show our work and ideas on the televisions during troubleshooting sessions and meetings, write code to extract and standardize linguistic data, and debate classifications and properties.


The first week was filled with an overview of the different tracks that we could focus on during our time here. I chose to join the assimilation team, which discovers, corrects, interprets, assembles, and standardizes lexical translations in attested sources. Our database holds thousands of sources that have first been vetted by our acquisition team. From there, we are free to choose any source to work on, which lets us pursue the languages that interest us. Currently I’m working on sources in two languages: Carib, a language spoken by the Kalina people of South America (specifically, the variety spoken in Suriname), and Wemba Wemba, a language spoken by an indigenous group in the state of Victoria, Australia. Carib is a threatened language, and Wemba Wemba is an extinct language, which means that it no longer has any L1 speakers, or native speakers of the language. Because these languages are dying or dead, it’s important that PanLex have a record of them in its database for preservation before there is no data left.

This summer, I hope to become better at writing code that parses panlexical data and standardizes the information effectively. What I’m looking forward to most, however, is learning more about the languages I work with as I research how to classify their words and morphological makeup.

Sooyoung Jeong ’18