Don’t all AI methods need a bunch of data to work? How could AI help document and revitalize endangered languages with “human-in-the-loop” or “active learning” methods? Sarah Moeller from the University of Florida joins us to discuss those and other related questions. She also shares many of her personal experiences working with languages in low resource settings.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing very well, Daniel. It’s a beautiful spring day here in the Atlanta area. We’re going to get to talk about some pretty cool stuff today, aren’t we?
Beautiful as in like crushing heat, or…? That’s what I think of when I think of Atlanta.
Let me put it this way - it will get much worse. It’s not so bad right now. So give it July, August, early September, and it’s a lot hotter at that point.
So right now we’re very optimistic, okay?
Okay. Good. Good. We have a close neighbor to you joining today for a really interesting discussion. Today we have with us Sarah Moeller, who’s an assistant professor in the department of linguistics at the University of Florida. Welcome, Sarah.
Thank you. It’s great to be here.
We’ve known each other, not that well, but got to know each other a little bit better recently in a couple of meetings. I was really fascinated, I think, by partly your story and how you got into both linguistics and fieldwork through teaching English as a foreign language, but now you’re kind of dabbling in the machine learning - and more than dabbling, doing a lot of amazing work there. So maybe you could just start out by telling us a little bit about your background and how you got into teaching English as a foreign language, and how that led, eventually, to where you’re at now.
Okay. Well, to do that, I have to go back even a little bit further than my job teaching English, which I started doing - good grief… As a teenager, actually. And I started doing that because my family– I’d grown up in the American Midwest, so Iowa, and then mostly in St. Louis, Missouri, which is where most of my family is from, Missouri, Arkansas area. And then when I was 15, my parents moved us to Russia; so we moved to Moscow, Russia and had that whole very interesting experience of growing up overseas and learning a new language. And that’s where I started teaching English, and then just kind of falling in love with the country and the culture and all my friends were there, and so went back and taught English.
And I kind of had this idea that I would teach English, and maybe, eventually, someday I’d come back to America and teach Russian. So I decided to go get some education to teach languages, and I went to Moscow State University and started a master’s degree, which I wasn’t able to finish, but that was where I got introduced to linguistics, was at Moscow State University. And a few years later after teaching English I realized I don’t really like teaching languages. I was always bringing in this linguistic, which is like the theoretical, or the structure about the language and probably boring my students to death, because I was like, “Isn’t this just fascinating how this all connects, and all the different parts?” And so I realized, “I should probably go into linguistics. I like languages. Maybe I should check that out.”
[04:17] So I thought about that for a little while, and then as I was kind of making that decision, I also ended up working two summers in Siberia as a translator, in some English camps… Well, some camps for underprivileged children that would have Americans come over. So English, which is a good job skill for people pretty much all over the world… And that was my first time that I encountered speakers of minority languages out there in Siberia, and some of the issues that they deal with socially because their language is disappearing. And when your language disappears, often not by your choice, so it’s not the same as immigrating to a country and deciding to adapt to the country; you live in there, and Russia is really a big empire, it was conquered, and there’s a lot of big history there, and they’ve maintained some of it, but there’s this–
So when a loss of language happens fairly quickly, when it’s not your choice, it is highly correlated with high rates of suicide, depression, substance abuse. And that’s what I discovered that summer, as well as just meeting the people. And so I found out there was a thing called language documentation, language description, which is going and documenting and analyzing languages that have never really been studied scientifically, maybe barely anyone even speaks it outside of the native speakers, and describing it and getting enough data documented so that if it does disappear - so this happened for a lot of Native American languages in the US - there’s enough description, enough data, enough recordings that have been analyzed well enough so that the community can revitalize the language if they want. Yeah.
Can I ask you a quick question there?
You started as you were talking about that experience in Siberia, and then as you moved into the Native American thing - I’m wondering, as that process of a language disappearing is happening, like it’s in process, what does that do for the folks who are living that, who maybe started in that language and their language is disappearing? I would guess it dramatically impacts identity. And I’m just someone who is not familiar with that. I’m curious what that is like for a real person experiencing that in real-time.
Yeah, it can be very traumatic. There’s so much variation of it. It can happen gradually, it can happen by choice… Usually, when we talk about endangered languages, kind of the prototypical situation is where, like I said, the language is disappearing not because the people have chosen, or they’ve chosen to give it up, but it often what happens is, “Well, we’re not going to pass it on to our children.” So the official definition of an endangered language is it’s not being passed to the next generation, it’s not being taught to the next generation of children. And they might say, “Well, because they can’t get a job in this country with that. They need to learn the major language or the national/international language.”
Is it always either/or though? I mean, it’s not like “Learn this because it’s what our family is, but learn this because that’s what the job–” or is that just too hard to do?
Some communities are really good at doing that, actually. This is something that’s interesting, because often people say, “Well, there’s not very many speakers, so it’s endangered”, and then scientists and linguists will come and discover that actually that community has maintained a very strong sense of identity, and their heritage is very strongly connected to their language. So there are communities that do a really good job at preserving the language, and there’s others that don’t have a culture that ties their cultural identity strongly to the language; then it’s like – as linguists who might come in and want to say, “Well, we’ve seen this happen through history with other communities that were like ‘Whatever’, but then their grandchildren were upset. They’re like, ‘Wait a second. We want to connect with our history.’” That’s less traumatic.
The thing that really gets you is the situations where they might have wanted to preserve the language, like Native American communities or First Nations in Canada. At a point of history when it was critical, it was decided by those in power, those who had the say about the education, that it was – basically, bilingualism wasn’t a good thing, and everybody needed to learn English. And the best way they could succeed in life and just all the other factors is to take them away from their families, put them in a boarding school, and then punish them every time they speak the language that they know from their parents, with all that emotional connection they might have with their family, so that then they have a very negative–
It is. Those stories just tear you apart. And those are the stories that are very common in North America, I think also in Australia… So those are really, really hard. And some have preserved a little bit, some, all they have is some scientists at some point recorded a dictionary, and now they’re trying to revitalize it. And it’s interesting - with revitalization… So the other part of that experience is– so there also are these same issues in native communities in North America with substance abuse and health problems. So one exciting thing is that someone actually went and measured and found that at least in one community - I think it was in Canada - a community that had a language learning or language revitalization program actually had a correlating increase in physical health in the community; like lower rates of diabetes, which is a huge problem. So there’s an impact in documenting languages and studying them and preserving them that goes beyond just scientific interest.
Just speculating, I’m just curious - like, the diabetes thing… Because that’s not obviously connected, I think, to most people. At least it would not– prior to this conversation, it would not have been obviously connected to me. Is that, once again, tied possibly to a sense of identity, and if you don’t have one or if yours is diminishing, then you take less good care of yourself because you don’t have a sense of identity?
I think so.
I’m speculating here, but am I on the right track?
I really can only speculate too, because I haven’t studied that, and there’s only, I think, that one study that I’m familiar with. But it makes sense just with so many things, that sense of identity. And you think about it, if it happens quickly, and it’s not just, “Well, don’t speak this language”, but also everybody around you is saying, as will happen in other parts of the world, probably more common than maybe the boarding school situation - what will happen is you don’t even speak a language. Like, I was doing fieldwork and we were working with a couple communities, preparing some literacy material, which by constitution in their country, they should be able to learn two hours a week in their school. So this is a language that’s been studied, it exists. It’s known to exist. It’s unique. It’s been around for thousands of years; it even has some written history, a little bit, from very, very long ago. And prepared this primer material, a little reader and some writing exercises to the local ministry of education, and they said, “This isn’t even a language. It doesn’t exist. Why are you even bothering us with this?” So that sense of, “You don’t matter. Who you are doesn’t matter.” Yeah, it has to connect with that, like why should I take care of myself? Why am I important?
Another thing I connect with too is a study that was done among children from different socioeconomic backgrounds who’d gone through trauma, and it was a study of what helps them come out and be healthy emotionally and socially. And they’ve found that the one factor that was correlated with being healthy and handling the trauma well later on in life was not whether they were rich or poor, or highly educated or poorly educated. The one factor that was common for all of those ones that were successful, if you will, was knowing the stories of their family; it was connecting with their history of their family and hearing the stories. And so this idea that, “I belong to a bigger story. The people, my grandparents and my parents have had stories, and I’m a part of that, and they’ve gone through difficult things. And so I’m sitting and I’m listening to that, and that’s going to help me survive in life.” But if you can’t even speak the same language that your grandparents speak, that whole connection is lost, which is very important to our– yeah.
It makes me realize how incredibly lucky that I am, as I’m listening to this.
Yeah, indeed. So that’s what I got into. I got into the whole documenting languages, and just - yeah, being able to study something that maybe nobody else has ever studied is pretty exciting.
[12:04] Yeah. And maybe you could describe a little bit, too… People are probably not familiar with what are the - sort of like leaving machine learning and AI aside for a second… What’s in the toolkit for people who are doing language documentation? What are some of the practices that people do? What’s been shown to work and maybe not work over time? What are the task-related things that go on?
Okay. So if you start with kind of the prototypical situation, which would be a language that is spoken by a few people, it has never been studied or written down before, then what you would do after you get a grant and some background knowledge of at least related languages, is you buy yourself some nice recording equipment and you go to the village or wherever they are, probably a remote place, and you basically stick a microphone in front of someone and say, “Tell me a story.” [Laughs] I mean, the whole technique of how to get people to speak the language, to be comfortable, to make it natural - that’s part of what you have to learn. But basically, you’re just getting people to speak the language, and you record it. And so stories, speeches, it might be ceremonial things… You might just get a list of– there’s a list of words that linguists use, that are considered, “These concepts probably exist in all cultures”, like hands, and feet, and mother, and father, and translate them. You might say, “Okay, I’m going to give you a word. We could try it… So if I give you a word, ‘jump’, what are all the possible forms of the word that you could think of? Does this word change at all? And if so, what are those forms?”
So as someone is coming into a community and also building relationships with community members, what is that experience like? Because you don’t know the language, right? You can’t just buy the book, and yet you’re trying to connect with people on this very deep level that means a lot to them. So what is that experience like?
It varies so much, but building relationships is a really huge thing. I mean, that’s something that people who teach people how to do this try to emphasize a lot, because it’s something we’ve heard also from a lot of the community members who have had linguists come and study. They don’t want to feel like they’re being used. They actually feel like they’re building a personal relationship, and that’s important to them. That’s what they want to keep out of this, sometimes even more than the language. So when I did fieldwork, there were people already there that I knew, that I connected with, that helped me and introduced me.
And I feel like, for me, it’s a hard thing. I tend to be, “Let’s just sit and read and let’s do technical things and let’s nerd out.” So one of the best things for me was working with a team member. So we came up with some survey questions we wanted to ask these community members. Basically, what part of the language was disappearing, but also what part of their cultures did they feel like were disappearing, so it could kind of help us focus on what work we needed to do. And so we worked together to create it. We called people up, we said, “Can we come by and visit?” We bought the tickets on the bus and planned out our trip, and we’d go into the people– kind of let them know… We’d go, they’d feed us tea, we’d sit and talk… And then usually, the oldest male in the house would say, “Okay, we hear you have some questions. So okay, these people are here and I’ve invited this school teacher. So now we’ll sit and talk and we’ll do business.”
So my friend would ask the questions and he would just have a conversation, just kind of what we’re doing right now, and I would note down what they were saying. And then we’d finish up, and that usually would lead to all sorts of conversations. So we’d just have a conversation in that case, make it really natural, and then wrap up and say our goodbyes and head on to the next village.
Okay. So I find all of this really fascinating. And I’m wondering, at what point in this process – like, you’re involved with language documentation with the people on the ground, you’re having these experiences, and thinking about the process… At what point do computer-assisted methods occur to you? And maybe even before that, how is computing – like, maybe even outside of the machine learning stuff, how does that impact the language documentation process?
Well, if you think about the actual documentation, it’s getting that microphone in front of someone and letting them speak the language. But to make it useful data for linguistic analysis and for maybe their descendants someday who want to learn that language, it has to get out of the audio. So getting it from the audio to a transcription of some sort - so it’s written down; that’s the first step. And then after that, then you start breaking it down. You break down the words into their smallest, meaningful parts. So jumping has jump, which means to hop, and it has -ing, which tells you kind of the aspect of movement that – how you’re doing the jumping, or over what course of time you’re doing it. And so that’s called morphology. Morphology is how words are built. So you break down the morphemes, which are the parts of words. You do morphological analysis, you figure out what those parts are, you figure out what they mean… You usually do like a rough translation of each sentence… And that’s the minimum that you’re going to do. If you can do that minimum, then someone else can come along, even 100 years from now, and do something useful, create language learning material. So that’s the minimum that you really need to do. Because if it’s just someone speaking and the language disappears and no one speaks it, it’s like we have it, but it’s like Egyptian hieroglyphics before they found the Rosetta Stone. We can’t make very much sense out of it.
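To make that minimum concrete, here is a small sketch (an illustration, not something shown in the conversation) of how one annotated sentence might be stored: a transcription, each word split into its parts with a label for each part, and a rough translation. The class names, the `PROG` label for "-ing", and the one-word example are all assumptions chosen for illustration; real projects use their own conventions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    form: str   # the word part as transcribed, e.g. "jump" or "ing"
    gloss: str  # a short label for its meaning or function (hypothetical labels)


@dataclass
class AnnotatedWord:
    morphemes: List[Morpheme]

    def segmented(self) -> str:
        # Join the parts with hyphens so the boundaries stay visible.
        return "-".join(m.form for m in self.morphemes)

    def gloss_line(self) -> str:
        # One label per part, aligned with the segmented form.
        return "-".join(m.gloss for m in self.morphemes)


@dataclass
class AnnotatedSentence:
    transcription: str      # what was written down from the recording
    words: List[AnnotatedWord]
    free_translation: str   # the rough sentence-level translation

    def interlinear(self) -> str:
        # Four lines: transcription, segmentation, glosses, translation.
        return "\n".join([
            self.transcription,
            " ".join(w.segmented() for w in self.words),
            " ".join(w.gloss_line() for w in self.words),
            f"'{self.free_translation}'",
        ])


# The example from the conversation: "jumping" = "jump" (to hop) + "-ing".
jumping = AnnotatedWord([Morpheme("jump", "hop"), Morpheme("ing", "PROG")])
sentence = AnnotatedSentence(
    transcription="jumping",
    words=[jumping],
    free_translation="hopping",
)
print(sentence.interlinear())
```

Keeping the three layers together in one record is what lets someone who has never heard the language line up each word part with its meaning later on.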
So it’s those steps right there - transcribing, doing basic word analysis, morphological analysis, and a rough translation of each sentence are very important steps, and you can do all of that by hand. There is software that has been used to help people, at least keep it all in one database. There’s some software… There’s the one that SIL has built. There’s another one called ELAN, where it’s built to help you do this morphological analysis. And the nice thing is if you do analyze a word, the computer will see– if it sees a word that’s exactly the same, then it will present that analysis that you had before, and you can check and, “Oh, yes, that’s correct.” It may change. Sometimes it looks the same, but it’s a different meaning.
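The "exactly the same word" behavior described here amounts to a lookup table of analyses the linguist has already approved. The sketch below is a hypothetical illustration of that idea only; it is not how the SIL software or ELAN is implemented, and the class and method names are made up. The English word "saw" stands in for a form that looks the same but can mean different things.

```python
from collections import defaultdict
from typing import Dict, List


class AnalysisMemory:
    """Remembers approved analyses, keyed by the exact written word form."""

    def __init__(self) -> None:
        self._seen: Dict[str, List[str]] = defaultdict(list)

    def confirm(self, word: str, analysis: str) -> None:
        # Store an analysis the linguist has checked; skip exact duplicates.
        if analysis not in self._seen[word]:
            self._seen[word].append(analysis)

    def suggest(self, word: str) -> List[str]:
        # Only an identical form matches; anything new returns an empty list.
        return list(self._seen.get(word, []))


memory = AnalysisMemory()
memory.confirm("jumping", "jump-PROG")
memory.confirm("saw", "see.PST")   # "I saw it"
memory.confirm("saw", "saw(N)")    # the tool: same form, different meaning

print(memory.suggest("jumping"))   # one earlier analysis to check
print(memory.suggest("saw"))       # two candidates; a person has to pick
print(memory.suggest("jumped"))    # never seen: nothing to suggest
```

Because the lookup needs an exact match, every new form still has to be analyzed by hand, which is the limitation the conversation turns to next.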
So that’s the tools that have existed for linguists, I would think, since the ’80s or early ’90s, but you’re still pretty much doing it all by hand. There’s some help there. You can set up some rules once you figure out the rules of the language and get a little bit more automated things, but you have to do it – you have to put all the words, all the parts in there, you have to enter it by hand. There’s no machine learning in those tools that I’ve seen so far.
Yeah. And what sort of like– just to give a sense… Day one, you’re coming into the community, you’re putting up your microphones, and how long of a process might this be in terms of like from that point of starting to gather recordings and building relationships in the community and building a community, to having something, like you were saying, that for future generations is useful in terms of the documentation of the language? What are we talking about here in terms of timeframe and the labor that’s put into this?
Well, let me put it this way… I have a linguist friend who just finished doing all the analysis and translations of her data from a language in the Solomon Islands, where she had approximately, I want to say, 30,000 words in that corpus. And so she just finished analyzing and translating all of it, so that she could retire, because she’s been working on that for 20 to 30 years now. Yeah.
[20:04] Wow. I’ve got to ask, if you’re thinking about that level of investment, how many target languages that are in that state might need that level of investment to get to that point? What are we talking about? What’s the multiple look like?
Think about there’s approximately 7,000 languages in the world. There’s probably more, because we still don’t know how many sign languages are in the world; but roughly 7,000. And it’s estimated that of those 7,000, anywhere from 30% to 90% of them are in danger of disappearing by the end of the century. I think probably the best estimate is around 40%.
So we need more linguists, it sounds like.
We need a lot more linguists, yeah.
We need them really fast. The point is, okay, it took decades to get all of that data processed. However, the other side of that is to get enough data where you can understand the language and write a big thick grammar that describes how it works and pretty much figure out 90% of it - that might only take you a couple years. And you could use like 10 texts, maybe 3,000 words to be able to get to that point, plus some extra stuff; you might need to make an extra check back after a couple of years to get the final little pieces together. But that means for the communities and for future research further deeper in the language, you may have collected in a three-month trip or three-week trip even, depending on how fast and how fast people speak, 30,000 words. Getting 2,000-3,000 might take you two years, but getting that last bit is a lot more, and that’s a huge bottleneck.
Let me ask you another question, before we’re really getting into the machine learning stuff and all… If you’re looking at different languages and you have linguists that are going through this process, and they’re building that body of understanding around each language, if they go – is there a point where even if you were to go to the next language, and let’s say that that language was not terribly similar to some of the ones you’ve done before, is there anything that you can carry through to make that more efficient in terms of that timeline? Assuming that certain sounds are not the same and things like that, so it kind of has its own characteristics. Is there enough alike in the process of doing that, where you can look at a dissimilar language and kind of accelerate the understanding of that? Is that possible, or is that not?
Oh, yeah. If you’ve studied a similar language and if it’s quite similar - in some languages, that line between dialect and language is quite fuzzy, so it could be very similar - you’re going to be a lot faster. You probably know, instead of maybe having to take a three-week trip and realize, “Wow, I only got 50% of what I needed because I didn’t know what I was looking for. I’m going to have to plan another trip”, you could probably get absolutely everything you need, say, maybe in a three-week trip, because you would know, “Okay, it’s similar enough. I know what the difficulty is, I’m going to know what’s complicated about this language.” Yeah. You can get to figure out how the language works and write a grammar. I think that definitely would speed you up, but it still doesn’t help you do the actual annotation of the data of people talking, which might–
So there’s this thing where you can create a linguistic description, which traditionally is what linguists did, and then they realized that languages are disappearing, and “Wow, this isn’t enough. We need to really focus on this primary data. We need to annotate and have all the information so that it’s there kind of in the 100 years view. If it disappears and there’s no speaker, do we have enough to keep studying it and let the community revitalize it?”
Going back and recognizing the trust factor in building those relationships, but recognizing you also gave us a deadline a moment ago in terms of, by the end of the century, X number of languages are lost forever and there’s only so many linguists doing this… So is there any opportunity for automatic recording of the life of people in a particular area that are speaking a language, where you basically put in the technology, it has a microphone…? It may not record the specific words that you were looking for, but it records enough of day-to-day life to where at least it’s getting those words into a library of some sort.
[24:12] Yeah. Actually, there’s a linguist - Steven Bird has worked on that; he’s built a mobile app, actually, called Aikuma, with this idea of people being able to speak into it on phones, and it’s cheap phones, where they don’t have electricity, but they can charge the phones and record, and even do translations of it and do everything orally. So just get natural speech, and then someone else can come along, you train them and say, “Okay, now take that very natural speech” - like, I’m kind of talking pretty fast right now… “And you respeak it very slowly, very carefully.” That helps us figure out what are the actual sounds in the language. And then translate it into English, or whatever language is a language of wider communication.
So yeah, there are the people that are working on that. So going back to that trust and relationship factor, one of the interesting things, one of the things I love about working with linguistics that you don’t get with technology per se, is that people are unpredictable, and cultures are unpredictable. So in one culture it may be, “Oh, yeah, phones. We get that. Okay. Yeah. Oh, and yeah, our language is disappearing. That’s important to us. Okay. Yeah. Give it to us and we’ll do it.” They’re motivated. Other cultures are like, “Technology - hmm, kind of scary to put my talk into a microphone”, or “Yeah, technology I’m okay with, but why do I care about my language? It’s just not important.”
And then there’s factors like communities in Australia, in the Southwest of the US, where there’s like religious and cultural taboos about listening to someone or hearing their voice or seeing their picture after they’re dead. So then you have all this data, but then what do you do with it? How do you deal with that? So the fun thing and the challenging thing is you can come up with some solutions, but you can’t necessarily predict, unless you have those relationships and you understand the culture, how or whether they’re going to work and be accepted by the community.
And maybe that leads naturally into the computer-assisted type of methods that we’ve kind of been alluding to… But how in your journey, while you’re kind of learning about this, you’re participating in language documentation, you’re involved in academia - at what point do you sort of start to think more about the way in which a computer might augment or assist this process? And when does the light bulb turn on that maybe more is possible, I guess?
So I came back to the States to get a master’s degree and I focused on these very things we were talking about - documenting and describing endangered languages. And then after I finished my master’s degree, lo and behold, I had debt to pay off, so I needed a job. And it’s not that easy to get a job as a linguist. There really aren’t that many jobs, unless you’re going to teach English, or work in the military for the government.
Work for certain agencies. Yeah. [laughs]
Yes, yes, exactly. Except that a friend of mine found a job and then kind of posted to me a job posting for a company called Language Computer Corporation. So it was an NLP company, and they needed annotators to– I don’t even remember what we were doing. We were creating some sort of grammatical templates. And then that turned into, actually, because I spoke Russian - because they happened to be getting money from certain agencies that are interested in languages like Russian and Korean and Chinese - that they hired me because I spoke Russian. So I went back and worked and learned. That’s where I got introduced to computational linguistics and AI; just kind of danced around the domain and discovered that this was something I’d never been aware of, I’d never heard of, and was really cool. And I don’t know why I came to this conclusion, because we were working also with Persian, or Farsi, which there’s not a lot of resources available for that language… But I just had it in my mind that this is really cool stuff. But you can’t do it with minority languages, because you need a lot of data to work on it.
[27:54] “So this is really neat, glad I had this job, but I’m going to go back”, and that’s when I did my fieldwork in the Caucasus Mountains, and did documentation, and I was like, “Well, I’ll just leave that behind. It’s really cool. I wish I could bring it together, but probably not possible.” And it was when I was on the field, working with the software to– well, different things that we were working on, transcribing a little bit, translating a little bit, doing some linguistic analysis, like word analysis a little bit, working with the metadata of this data, and I’m working with the software that was available, and those two and a half years I’d worked in that company just kept on coming back to me like, “There’s something we’re missing here. We’ve got to be able to do this faster. The computers can do something here. There’s a gap that’s missing.” And that’s what pushed me to doing computational linguistics.
There is an untapped gap that linguists just don’t seem to be aware of, and I know I wasn’t aware of, between documentary descriptive linguistics or general linguistics and computational linguistics or NLP. It’s a huge gap. It’s like a chasm, and we need to build bridges. And I kept on trying to convince my friends who were thinking about going back to graduate school, who’d done linguistics, like “You should do this. You should do this. You know some programming. You should do this.”
I couldn’t convince anybody, until one day I was thinking about it and I was like, “Argh, this needs to be done. Someone’s got to do it.” And it was like a voice said, “Well, Sarah, why don’t you do it?” [laughs] That’s when I was like – I’d already gotten down to my list of schools I was going to apply to for a Ph.D, to do documentary linguistics. I was going to write the grammar of one of these languages… I had to throw it all out and start over again and find schools that actually had a Ph.D. program that was interested in endangered languages and computational linguistics, which at that time, in particular, was actually not easy to find.
Okay, so we’ve sort of made our way to the intersection of computational linguistics, NLP, machine learning and language documentation… You’ve mentioned a few things that maybe listeners are aware of as like, as NLP or AI tasks like transcription or speech-to-text, or like maybe people are familiar with named entity recognition, or finding certain entities within text. People might not be as familiar with morphological analysis, but you’ve kind of talked through this process of transcribing and translating text. As someone interested in this intersection, where do you begin? At the time that you started, where was the real opportunity in terms of like next steps towards computers assisting in this process? And what are the main tasks maybe that you see as greenfield now?
Well, the first thing for me was to learn how to program. So the problem that was getting at me was this bottleneck of “We have all this data recorded of people speaking”, and what so often happens because you can get a certain amount done, you’re motivated, linguists are motivated to get a certain amount done, transcribed, annotated, translated to do their dissertation, or to describe the language, so it’s there. But then you have 3,000 words that are done, and you have another 30,000 that haven’t been done, and it’s just sitting there. Maybe it’s transcribed at most, maybe it’s translated. Those things can be relatively quick compared to doing linguistic analysis, but they’re just there.
And so all this data - first of all, it’s not available for the communities for that purpose of revitalization, for language learning, or having their stories recorded for their grandchildren, even if they don’t speak the language… But also this need for data in machine learning, right? And machine learning and deep learning is supervised. So you need annotation, and there’s only so much you can do with unsupervised.
[32:03] So there’s two things. If we can annotate this data, if we can get this basic analysis done, basic translation done, then we can train models on these languages to do more. And the morphological analysis is important, that breaking down into word parts. The reason you don’t hear about it as much in NLP is because, mostly, work is done in English, European languages, Mandarin Chinese, maybe… And Mandarin Chinese and English, in particular, are what we call morphologically simple languages. Words are very short, and you don’t add much to it. You add a plural to it, you add a past tense, not much more. I mean, you can have these long words, like transubstantiation, and those mostly come from Latin, or are borrowed from another language. But a lot of languages in the world are much more complex than that, and a lot of the endangered languages in the world are much more complex.
Arapaho, which is a Native American language that I’ve worked on - basically, almost every word is equivalent to an English sentence, and it’s all combined together in a complicated way, so each one is kind of – you can’t necessarily… Well, there are meaningful parts, but making the division between those meaningful parts - it’s just combined in such a complicated way that you really kind of have to view it as one word. So morphological analysis becomes pretty important.
So I wanted to break that bottleneck, because that’s a huge bottleneck - doing that all by hand, transcribing, morphological analysis, translation… And I was particularly interested in the morphological analysis. So that was what I wanted to get into. Can we teach the computer for these complicated languages to figure out what are the parts of a word, and do it automatically? First of all, to break that bottleneck, to finish annotating these words so that then we could have more training data for NLP. But there’s a second part there of discovery. Because if it can learn from what the human has done from the annotated data, then it’s going to run into something that it hasn’t seen before and is very different, and it’ll get very confused. And so those are the things then that we could help linguists focus on. It’s a discovery process, “Wow, the computer hasn’t seen this.”
So the really interesting things about a language are usually things that you don’t see very often, and they’re not repeated. You’re not going to get those statistical models to really see them enough to learn them, so they get surprised, as it were, the models are surprised, and so then we can direct the linguist’s attention to those things that might take them a really long time to discover if they’re doing it by hand; and they have that ability to discover more interesting, unique things about human language.
So is this really an intersection of this whole field of active learning is probably a term that people might use, or human-in-the-loop type of process, where yes, we are going to need linguists to really jump in, but we need them more for certain samples versus other samples? And so you can focus their attention in a loop on the harder things, as you’re building up a dataset… Is that a strategy that’s employed?
That is the dream.
That is the dream for me, yes. Because at least it’s some interesting conversation, because if you say, “Oh, we can automate this” to a linguist, to a native speaker who’s working on their language, they often get nervous.
I’ve fallen into that pitfall a few times. [laughs]
Yeah. Because it’s like, “Wait, no. This is really good, me going through and doing this by hand, I’m learning about the language. A computer can’t understand things.” I’m like, “That’s true.” So first of all, the human has to come in and learn it, they have to understand it, but there’s some point where language is statistical. That’s why NLP works. We repeat a lot of things. You don’t want to have to do all that by hand. So at some point, let’s let the computer take what you’ve done, and learn. So it’s not automated, but it’s automatic assistance, I would call it.
Yeah, or augmentation.
Yeah, augmentation. Yeah, to be able to say it that way. And then the active learning is so important, because you don’t– unlike English, where you maybe just go and web-scrape and get more data, it’s expensive. It’s hard. You may not have the technology.
[36:08] I mean, that’s an issue that comes up with NLP and supervised learning. It’s always been an issue. It’s expensive to have an expert in it, like the Wall Street Journal, and annotate all that; that took a long time and a lot of money. So in some ways, really the problem maybe isn’t how do we get machine learning models to do this, it’s how do we speed up what the humans are doing? So the active learning, that part.
Can you give us an example of what a typical workflow there, in a very practical sense, would look like, if you’re going to do that? Because we’re kind of talking conceptually, but also I’m trying to tie that into how you’re using the technology to achieve this. And so will you take us through a generic workflow of what you might do on a given day to achieve that, just to get very hands-on?
Yeah. So this is all abstract and theoretical because it hasn’t been done yet, and this is where I want to go. This is what my next–
…big grant is, what I’ve been talking with people here in the University of Colorado, who also worked on this, and were my mentors… About this opportunity to make it a human-in-the-loop for linguists. So there’s two parts here I see. One is the fact that you need to get the NLP working. You need machine learning models that are relatively good at learning from field data from the linguist. So that’s something that I worked on a lot, because there’s a whole issue there with working with field data, of just - it’s noisy. And how a linguist creates data is different than how people create gold standard data for a machine learning model. It’s noisy. I’ll just leave it there. So it’s a challenge, and finding the right models that work for multiple languages and how much data should we recommend to the linguist that they bring. So then on the other side is then how do we bring this together in a way that linguists can access it who don’t have any computational background? So that’s going to require an interface.
So we’re talking about creating a third– so there’s these software that exists, that I’ve talked about, that don’t have machine learning, and they’re popular, they’re great tools. So we’re talking about creating the third, kind of, not to replace them, but to be there where it does machine learning; so they could work a little bit, export, and then have this interface where in the background you’re doing NLP, and people are doing experiments and trying different ways, but the linguist has to see the computational stuff going on, like “What’s the best way to do this?” So you run, they’ve worked by hand in their tool, they’ve annotated the data by hand, bring it into this interface, train a machine learning model, say, to do morphological analysis, break the words into their meaningful parts, and let’s say… So I’ve seen languages, depending on how you approach it, if you have– and there’s all sorts of different languages with different kinds of words, but I’ve seen a language with 3,000 words and we got like 30% accuracy at breaking a word into its meaningful parts and giving the meaning of those parts based on what the human had done. 30% - not really great. And there was a lot more data that the human hasn’t done.
So the machine learning model is learning, based on the tests, about 30% accuracy. And then we have it predicted over, let’s say, another 10,000 words that haven’t been annotated. And we know that 70% of them are probably wrong. What we don’t know, because no human has looked at it, is what 70% of it is wrong. So then you want to figure out what’s most likely incorrect that the model predicted, and not only that, but if of those incorrect ones, which of those incorrect words, incorrect analyzed words, if they were corrected, would be most helpful for the model to learn the patterns and become better, like reach a higher accuracy more quickly. So instead of having to do 1,000 words, can we find 100 words that if the human corrected those, the model would suddenly jump from 30%, let’s say, to 40%, or maybe even 50%, if we did a really good job. So 3,000 words to get to 30%, but only 100 to, say, get to 40%.
So that challenge right there is some things I’m looking at right now. You can choose randomly, you can choose– a colleague of mine figured out that if you look at the probability confidence score on the predictions of the machine learning model, that’s better than choosing randomly.
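To make the confidence-score idea concrete, here is a minimal sketch of picking the predictions the model is least sure about, so a linguist can correct those first. The function name, the `(word, confidence)` tuple layout, and the example scores are all assumptions made for illustration; they are not taken from any tool mentioned in the conversation.

```python
def select_least_confident(predictions, k):
    """Return the k predictions with the lowest confidence scores.

    predictions: list of (word, confidence) pairs, where confidence is the
    model's probability for its own analysis of that word (hypothetical format).
    """
    # Sort ascending by confidence so the most doubtful predictions come first.
    return sorted(predictions, key=lambda pair: pair[1])[:k]


# Made-up scores for a handful of English words.
predictions = [("walked", 0.95), ("singing", 0.40), ("sing", 0.15), ("talked", 0.80)]
to_review = select_least_confident(predictions, 2)
```

Here `to_review` holds the two words a human would be asked to check first, in place of a random sample.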
[40:10] So one thing I’m trying to do is incorporate some linguistic knowledge into that. So, okay, low confidence, but if you have a bunch of narratives, for example, you probably, at least in English, you’d end up with a lot of past tense verbs. So you’d see a lot of that, and you’d learn pretty quickly probably that -ed needs to be separated out and labeled past. What you might not see as much of is -ing verbs, just to choose English, which is hard to come up with examples for, because English is really simple. There’s not that many complicated things going on in the word.
But let’s say you don’t have very many -ing examples in those stories. So one thing we’re trying to figure out - okay, we’re hypothesizing if you don’t have a lot of -ing examples, then probably it’s not going to do as good at analyzing words with -ing and recognizing that sometimes -ing is part of a verb. Sometimes it’s a word like ‘sing’, and you don’t want to separate it out. So if that turns out to be true, which is a hypothesis, then we could combine these things. We could choose out of those 70%, out of those 10,000 words, the lowest confidence, 100 low confidence scored words, but also make sure that they’re distributed, so that you have more -ing examples, for example, so that the model is learning from these incorrect examples, but it’s learning in a way that’s going to strengthen its weaknesses more effectively.
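The combined strategy described here is still a hypothesis, but one way it could look is sketched below: group the predictions by the category the model assigned, then take the lowest-confidence items from each group in turn, starting with the categories that were rarest in the training data. The function name, the category labels `PAST` and `ING`, the counts, and the scores are illustrative assumptions only.

```python
from collections import defaultdict


def select_balanced_low_confidence(predictions, train_counts, k):
    """Pick k low-confidence predictions, spread across predicted categories.

    predictions: list of (word, predicted_category, confidence) tuples.
    train_counts: how many training examples each category had (hypothetical).
    """
    by_category = defaultdict(list)
    for word, category, confidence in predictions:
        by_category[category].append((confidence, word))
    for items in by_category.values():
        items.sort()  # lowest confidence first within each category

    # Visit categories that were rare in training before common ones.
    categories = sorted(by_category, key=lambda c: train_counts.get(c, 0))

    chosen = []
    while len(chosen) < k and any(by_category[c] for c in categories):
        for category in categories:
            if by_category[category] and len(chosen) < k:
                confidence, word = by_category[category].pop(0)
                chosen.append((word, category, confidence))
    return chosen


# Made-up predictions: many past-tense examples in training, few -ing ones.
predictions = [
    ("walked", "PAST", 0.9),
    ("jumped", "PAST", 0.2),
    ("talked", "PAST", 0.3),
    ("running", "ING", 0.6),
    ("sing", "ING", 0.4),  # the model wrongly treats "sing" as verb + -ing
]
train_counts = {"PAST": 50, "ING": 2}
batch = select_balanced_low_confidence(predictions, train_counts, 3)
batch_words = [word for word, _, _ in batch]
```

With these numbers, plain lowest-confidence selection would return two past-tense words and one -ing word, while the balanced version returns two -ing words and one past-tense word, which is the kind of redistribution being hypothesized.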
So then you choose it smartly, the human does just, say, 100 words… It would be wonderful to see it jump up to like 40%. Run it again, now you’ve got another– you got 60%. Okay, pick out another 100, jump it up another 10% accuracy… I don’t know. We still have to find out how that’s going to work and how quickly that will go, but that would be like the general workflow, with some abstract things thrown in the background as well.
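As a rough sketch of that general workflow, the loop below trains, scores the unannotated words, hands the least confident batch to a human, and repeats. Since the interface described here has not been built yet, everything in it is hypothetical: the `train`, `score`, and `annotate` callables stand in for a real model and a real linguist, and the toy "model" simply remembers which suffixes it has seen.

```python
def active_learning_rounds(labeled, pool, train, score, annotate, batch_size, rounds):
    """Repeatedly train, select the least confident words, and have them annotated."""
    labeled = dict(labeled)
    pool = list(pool)
    for _ in range(rounds):
        model = train(labeled)
        # Lowest-confidence words first; sorted() is stable, so ties keep pool order.
        ranked = sorted(pool, key=lambda word: score(model, word))
        batch, pool = ranked[:batch_size], ranked[batch_size:]
        for word in batch:
            labeled[word] = annotate(word)  # the human-in-the-loop step
    return labeled, pool


# Toy stand-ins (assumptions for illustration only).
gold = {"walked": "walk-ed", "talked": "talk-ed", "singing": "sing-ing",
        "sing": "sing", "jumping": "jump-ing", "ring": "ring"}


def train(labeled):
    # "Model" = the set of suffixes seen in the annotated analyses.
    return {a.rsplit("-", 1)[1] for a in labeled.values() if "-" in a}


def score(model, word):
    # Confident (1.0) only if the word ends in a suffix the model has seen.
    return 1.0 if any(word.endswith(suffix) for suffix in model) else 0.0


def annotate(word):
    # Stands in for a linguist supplying the correct analysis.
    return gold[word]


final_labeled, remaining = active_learning_rounds(
    {"walked": "walk-ed"}, ["talked", "singing", "sing", "jumping", "ring"],
    train, score, annotate, batch_size=2, rounds=2)
```

In this toy run the first batch goes to the -ing words the model has never seen, and "talked" is left for last because the model is already confident about -ed. How quickly accuracy would actually climb per batch is exactly the open question raised above.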
Yeah. I think it’s really interesting that you’re working at this intersection where NLP, or modern NLP methods are impacting traditional linguistic processes, but then the more traditional linguistic analysis can impact things the other way. I don’t know if you have any perspective on that, because that’s not– I think it’s a really, really valuable perspective, but it’s not one that we hear a lot. We more hear, like you were talking about, “Scrape the internet, build your model, don’t worry about the language. It’ll figure it out”, narrative.
Oh, yeah. I think it’s a virtuous cycle. I mean, obviously, first of all, if you have more data that’s supervised data, so it’s been marked up by a linguist, you have more data to train models for these minority languages, which you can then build tools that allow them to be on the internet and doing cool stuff and having phones that speak their language, which will encourage children to speak the language and preserve the language and identify with the language, which will help their mental and emotional and physical health. There’s all of that. There’s all sorts of things where this is a virtuous cycle.
So providing more data so that you have enough training data to train some models - say, I don’t know, Google Translate, say, for example. I don’t know if they use one model or multiple models, but let’s say it’s one model; the more languages you have, I feel like that’s going to be more helpful for knowing what is a good model that can learn from multiple languages, so a multilingual model. You’re getting languages that are very different than your typical languages like English or German or Mandarin Chinese. These are very different languages, so you are going to be challenged and your models are going to be challenged in a new way.
The other thing is that maybe even outside of the virtuous cycle, kind of a separate thing is these techniques that we’re looking at of how do you improve NLP models and techniques for low resource languages can be applicable to any low-resource context. So even like with English - sure, we’ve got a lot of data, but if you switch to a certain genre… I remember hearing a researcher who was at Columbia University, who was working on Twitter posts by identified gang members, and trying to do NLP for law enforcement to figure out what they were talking about; for good reasons, we hope… And trying to figure out their slang, their use of emoticons. That’s a low resource context. So these techniques that we’re figuring out for low resource languages can be used to apply even for high resource languages, but in certain genres or contexts where there’s not a lot of data. It’s very specialized or– yeah.
[44:19] I just want to say, because my wife will shoot me if I don’t bring this up… Since you mentioned the gang thing - she is English, and for fun, we will have Cockney rhyming slang in here, and so we need to figure that one out as…
We need to figure that one out. Well, there you go. There you go.
If we’re going to do the gang slang, then we’ve got to do Cockney rhyming as well. If I didn’t say that, she’d say, “Why didn’t you bring that up?” and I’d be in trouble.
That’s great. I think you could probably teach an NLP model to create its own Cockney rhyming, and then–
That would be pretty cool.
…trick the speakers.
Yeah. I mean, I guess as you’re kind of looking towards the future, and you are interacting at this intersection between linguistics and AI, what are those things that make you really excited about that future in terms of the possibilities, and maybe you see momentum that’s starting to build now that you hope will continue? What are those things that really make you excited as you look towards the future?
I’m excited about what I do, and I’m excited about the linguistics, but I also get excited on things that aren’t directly connected to what I do, but are hopefully consequences or by-products. I get excited seeing humanities and linguistic students who now have programming skills, which puts them in a much better position in the job market. I know what a difference it made for me and my prospects.
I get excited for people out there, hobbyists, community members, probably people at hackathons - I hope there’s more - who are building useful technology for minority communities, for them to use their language, to learn their language, or just things that are in their language, just to build that sense of identity and pride and value that they have, I believe, just because they are people God created. Yeah, seeing other people take some of what we’re doing or building off of it and building tools that– because what I’m doing is kind of theoretical and scientific. You have to make several jumps to be things that people who aren’t interested in this specific area think are cool, but language learning apps, or spell checkers and Microsoft Word. Those sort of things are exciting to see people build.
And for the momentum - when I first started in this area, there wasn’t very much work being done on computational morphology. It was starting, because most of the work had been done in English, and again, Mandarin or European languages, which are not complex in morphology either; they’re fairly simple. So that’s grown a lot. And then I think specifically in this intersection of documenting and describing endangered languages with NLP is this realization that we have to do a human-in-the-loop. We can’t expect to just get more data. We also can’t expect to– there’s a lot of cool techniques about how to augment data, how to “hallucinate” data in interesting ways that does actually help the models, but they only go so far.
So this realization that we do have to build some way to bring the linguist, hopefully the native speakers back into the loop - that that’s the best way to go, and to bring it to a place where they are involved, and there are interfaces. So the flip side of that is it becomes accessible to them. Those tools, the NLP stuff just doesn’t remain in the computer science departments in the big companies, but it becomes something that’s accessible to speakers and workers in minority communities. That’s really exciting for me to see. That’s a trend. That’s kind of what I recognize, and a lot more people are realizing, kind of confirming what I thought, that this has to happen.
That’s awesome. I love that perspective. I think that’s a really great thing to leave in people’s mind as they move forward. And I think you’ve probably convinced Chris to become a linguist at this point, so we’ll see if–
I’m right there. I’m excited about this.
But yeah, thank you so much for joining us, Sarah. We really appreciate your perspective and your work.
Thank you so much for having me and letting me nerd out a little bit on languages and computers.