Wrapping Up the Summer

In these last couple of weeks, I made so many new friends and really got to explore the character of San Francisco. Now that my internship, along with the summer, has come to an end, I’m so grateful for the time that I got to spend there. At times it was hard and tedious to be scripting inside when I knew that the weather outside was so nice, but the sense of accomplishment that came with finishing a project was more than enough to fuel my progress.


I would say that I’ve met my learning goals. I have learned a great deal about information extraction from working with sources in all sorts of formats and languages, and about source analysis, especially since the projects that I was working on were part of a much larger collective project to collect and document linguistic information. Unfortunately, because of the time constraint, I didn’t get to learn as much as I would have liked about the translational algorithms that we use, but it was still interesting to debate and research semantic ambiguity and sense disambiguation in order to provide the best translations through our database. Still, I think that I learned the most by absorbing information from the collective experiences of the wonderful staff that I worked with.

I think that this summer has made it clear that I am capable of data extraction work, but I also learned that if the sources are too similar to each other, the work eventually becomes tedious because at that point, you aren’t writing code but rather changing variables and conditional statements. I tried to combat that by switching the types of sources that I was working on, as well as the languages that I was processing, so that the challenges I faced would be different. This internship has shown me that I am still very interested in how a computer understands languages, but I would rather process information that is not as regular as the dictionaries, webinaries, and sources that I have been working on over the summer. I’ve also learned that I really enjoy researching different ways to tackle a problem and debating with someone the pros and cons of implementing them within a system.

My advice to anyone who would be interested in working at PanLex is to be genuinely interested in the work that PanLex does, and to take the initiative to research projects that you would like to do and bring them up with the staff. The staff is very open to different views and ideas as long as you can explain why your approach would be more beneficial than the current one. Furthermore, take advantage of all the resources and opportunities that come with working for a branch of a larger parent organization, and of the fact that you are in San Francisco. I went to talks held by the Long Now Foundation, including one on quantum computing and one on the Rosetta Project, and attended different conferences, such as IMUG, with PanLex. As for the field, at times the work will be tedious, and at others you will be trying to debug a problem for hours without making progress. Take it one step at a time, and try to set mini goals for yourself. Don’t be afraid to ask questions or ask someone to look over your code, and most of all, don’t be afraid to take breaks. Sometimes, it’s a matter of being in a different mindset and looking at the problem with fresh eyes.

I think that the projects that I’m most proud of are the ones that focused on lesser-known, endangered, or extinct languages because I feel that by adding them to our database, we are doing our part in the fight against language death and providing a resource for languages that usually don’t get funding for translational programs such as Google Translate. My favorite moments included when our database could translate something that Google rendered as question marks, and when I added linguistic data to our database for a language that was not supported by Google.

Sooyoung ’18

Getting Started

This summer, I’m working at the PanLex project, which is a non-profit group under the Long Now Foundation. The goal of our organization is to preserve linguistic diversity and to increase linguistic knowledge, especially of diminishing and unstudied languages. While there are around 7,000 human languages, globalization has caused our world to focus on only the leading languages within industry and academia. This drives people to focus intensely on these top 10 languages, which they believe will open up a better future for them, and increasingly skews the ratios of how many people speak each language. The result is language extinction, because speakers see too little benefit in using their heritage language. In order to counteract this issue, PanLex is building a database of symmetrical dictionaries between languages. These parallel dictionaries serve to preserve languages that are dying or extinct so that reconstruction of the language could be possible, and to increase the information available so that translational programs and devices could allow conversation between people who speak different languages without having to prioritize one language over the other.


Because of our project director’s connection with the University of California, Berkeley, we are currently housed in the Berkeley language labs at Dwinelle Hall on campus. Rather than working in typical office cubicles, the interns here sit around a large table with televisions connected to it in order to facilitate discussion and collaboration. Here, we hook up our computers to show our work and ideas on the televisions during troubleshooting periods and meetings; write code to extract and standardize linguistic data; and debate classifications and properties.


The first week was filled with an overview of the different tracks that we could focus on during our time here. I chose to be a part of the assimilation team, which discovers, corrects, interprets, assembles, and standardizes lexical translations in attested sources. In our database, we have thousands of sources available to us that have first been vetted by our acquisition team. From there, we are allowed to choose any source to work on, which gives us the personal freedom to pursue languages that we are interested in. Currently I’m working on sources in Carib, which is a language spoken by the Kalina people of South America, more specifically a variety spoken in Suriname; and Wemba Wemba, which is a language spoken by an indigenous group within the state of Victoria in Australia. Carib is a threatened language, and Wemba Wemba is an extinct language, which means that it no longer has any L1 speakers, or native speakers of the language. Because these languages are dying or dead, it’s important that PanLex have a record of them within its database for preservation before there is no data left.

Throughout this summer, I hope to become more skilled at writing code that can parse panlexical data in a way that standardizes information effectively. However, I think that the thing I’m looking forward to the most is learning more about the languages that I work with as I research how best to classify their words and morphological makeup.

Sooyoung Jeong ’18