Practical AI – Episode #78
NLP for the world's 7000+ languages
with Dan Jeffries, chief technology evangelist at Pachyderm
Expanding AI technology to the local languages of emerging markets presents huge challenges. Good data is scarce or non-existent. Users often have bandwidth or connectivity issues. Existing platforms target only a small number of high-resource languages.
Our own Daniel Whitenack (data scientist at SIL International) and Dan Jeffries (from Pachyderm) discuss how these and related problems will only be solved when AI technology and resources from industry are combined with linguistic expertise from those on the ground working with local language communities. They illustrate this approach through their ongoing work pushing voice technology into emerging markets.
Transcript
Welcome to another episode of the Practical AI Podcast. My name is Chris Benson, and I’m a principal AI strategist at Lockheed Martin. With me, as always, is my co-host, Daniel Whitenack, who is a data scientist with SIL International. How’s it going today, Daniel?
It’s going pretty good. It’s been a busy day of recordings. We were just talking about this before the episode started - I just shoved a bunch of cashews in my mouth as a snack, and I think you had some breakfast troubles in between episodes…
Yeah, I had not had breakfast, so in the last few minutes I went to get breakfast, and as I was putting it together - I have multiple dogs - one of my dogs did something; I think it was an organized thing… One dog was trying to pull me to the side while the other dogs went and got my breakfast.
Nice. So you’re hungry.
So I’m gonna sit here and we’re gonna do this podcast hungry. That’s right. You know what - we persevere, no matter what. We handle it.
Exactly.
Okay.
These are the sacrifices we make for Practical AI.
That’s right, that’s right. We’re hardcore podcasters. It’s interesting… I’ve gotta say, I know you live up North, but I’m about to head out today for the Denver, Colorado area, Littleton, where Lockheed Martin has a part of its space division… And I looked at the weather for packing, and it’s like, it’s going down to like zero degrees Fahrenheit.
Nice.
For listeners who don’t know, I’m from Georgia. I’m a Southern boy, used to warm weather. I’m quite frightened to get on this plane and go to this place with such frigid temperatures coming.
Good luck…
Yeah, yeah. Now, I’m being tough in two ways - both with the dogs and with the weather. I’m just saying.
You’re leveled up.
There you go. I’m ready to go. Okay, well we have a pretty interesting episode coming up here; it’s gonna be a little bit different…
Hopefully.
I think so… From what our listeners usually hear. Typically, we’ll have either a guest on to talk about what the guest is involved in, or you and I will do what we call our Fully Connected episodes, where we talk about a topic between the two of us… And we’re doing a little bit of a blend of those today. We’re going to address AI with local languages, and today, instead of strictly being the host, you’re here representing SIL International, which is a non-profit focused on local languages.
I’m allowed to speak more than just questions this episode.
You’re allowed to speak more than just questions on this episode… And also with us today we have Dan Jeffries, who is the chief technical evangelist at Pachyderm. Welcome, Dan. How’s it going?
It’s going wonderfully. Thanks for having me on the show.
No worries. And for listeners upfront, I’m gonna try to go with – for Daniel Whitenack we’re gonna say Dan W. and for Daniel Jeffries we’re gonna say Dan J. So it’ll be a little bit different, since we’re broken out with Dans today.
Yeah. I suggested that Dan J. go with Pachyderm Dan, but he informed me that he is not completely defined by his employer…
Oh, my gosh…
…which I guess I understand.
[04:13] Oh, and you’ve now put that out on this recording. I can’t believe that.
[laughs] Well, Dan W. wanted to be fully defined by his employer, and says that he has no other outside interest whatsoever, so…
I don’t know if that’s a direct quote, but I’ll let it slide through… [laughter]
Okay, boys. Okay. Let’s get back in our corners now; we have a conversation to dive into here… I’m gonna actually start for a second with Daniel Whitenack, whom our listeners probably know pretty well… They know you mostly on the Practical AI host side, and I’d like you to take a moment and talk about SIL International and what you do there, and then we’re gonna flip over to Dan J. in a moment.
Sure. Definitely. As you know, I introduced myself as a data scientist with SIL International - we sometimes joke that SIL International is everywhere, but no one really knows about it. It’s an international non-profit; we actually have people from over 80 countries working in 90 countries, and the vision and mission of SIL is to see people flourish in communities with the languages they value most.
We do everything associated with language work, which involves a lot of things: multilingual education, literacy work, language development, even language survey and mapping, and other things… But it’s also technology-related. One of our products is called Keyman, which is a keyboard for devices like phones and tablets, and it supports over 1,000 languages; the next biggest keyboard solution doesn’t support nearly that many.
We also have a product called the Ethnologue, where we track what languages are being spoken where, by how many people, how many languages there are in the world… We’re also involved in the ISO standard process for the ISO standards for languages, the little codes that represent languages… But I personally work on things related to AI and language.
SIL has actually gathered a lot of data related to languages; we’ve worked in over 2,000 languages… So part of my responsibility is to help SIL develop programs and do experiments and research to push AI tasks like translation, sentiment analysis, speech-to-text, text-to-speech (these sorts of things) beyond the languages that are currently supported, into the longer tail of languages in the world - those local languages spoken around the world where there currently isn’t support for these things. So yeah, that’s what I get the privilege of doing.
Thank you for that unusual introduction for this podcast, and kind of bringing people in. I’ve gotta say, as you finish, having worked with you now all this time, on all these episodes, for listeners who don’t already know, you are truly a natural language processing expert in AI, and I have learned a lot from you in the time we’ve been working together.
It’s been fun.
Thank you. As we turn toward Pachyderm, I wanted to start off by noting that we have previously had an episode on Pachyderm, entitled Pachyderm’s Kubernetes-based infrastructure for AI. It was episode number 23, and our guest representing Pachyderm on that was Joe Doliner, who everyone calls JD. He’s the CEO of Pachyderm. But with us today, as we kind of dive into this story about local languages - Dan J, can you tell us a little bit about yourself, and a little bit about how you arrived at Pachyderm?
Sure. I arrived at Pachyderm through a circuitous path. I’ve been a technologist for 20 years; I had an IT consulting company for a decade, that I sold, and then I spent nine years at Red Hat. I designed some of their early artificial intelligence strategy before anybody was really thinking about it. I’m also a science fiction author with four novels, professional blogger… And one of the things that started to get me very interested in artificial intelligence early was a series of articles that I wrote called “Learning AI if you suck at math.”
[08:15] I was taking it from an engineering perspective, trying to look at it from – I’m not a data scientist, but I was trying to look at it from someone who’d spent many years setting up huge SaaS infrastructures, gigantic web farms, Office pack infrastructure… And trying to figure out whether this stuff was learnable by someone who hadn’t studied statistics and enjoyed all those things in school.
That series of articles proved very popular; its seven parts were read by over two million people… And I was essentially teaching myself many of the concepts as I was going along. At that point I was starting to give talks around the world, on both artificial intelligence and other future technologies.
I realized at a certain point in time, after my beloved Red Hat was purchased and it was starting to change in terms of its structure, that I wanted to go back to someplace that was very innovative and that was doing fresh things in the industry. That’s when I came upon Pachyderm, which falls very much into the MLOps side of the house. It does essentially version control for data science. So when your models, your data and your code are all changing simultaneously, how do you keep track of all those things and create reproducibility? Because if you run a bunch of tests on a set of a million images, and then an administrator comes in and crunches them all down on the back-end to a smaller size, it’s going to be nearly impossible to recreate that earlier experiment.
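As a minimal sketch of that idea - assuming the python_pachyderm client, with hypothetical repo and file names - the point is that each commit of data is immutable, so an earlier experiment’s exact inputs stay recoverable:

```python
# Minimal sketch of versioned data in Pachyderm (hypothetical names).
import python_pachyderm

client = python_pachyderm.Client()  # connects to a running pachd

client.create_repo("images")  # a versioned data repository

# Commit the original full-resolution images. Commits are immutable.
with client.commit("images", "master") as commit:
    with open("cat.png", "rb") as f:
        client.put_file_bytes(commit, "/cat.png", f.read())

# Later, someone "crunches down" the images in a *new* commit...
with client.commit("images", "master") as commit:
    with open("cat_small.png", "rb") as f:
        client.put_file_bytes(commit, "/cat.png", f.read())

# ...but the first commit still exists, so the earlier test run can be
# reproduced against exactly the data it originally saw.
```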
So I’ve been very excited to be with those folks, and everybody there is fantastic. It’s exciting to be in this amazing industry now, and get to work with incredible people like you two folks.
Fantastic. Well, thank you for that. Starting off – and I know I’ve learned a lot from Dan W. about local languages myself, just from being associated with him; we did a keynote together recently at Project Voice… And he has been talking for a while about the partnership between SIL and Pachyderm on this. But now that you two have been partnering on this problem for a while, could you kick us off by telling us a bit about what local languages are and why they should matter to the rest of us?
Yeah, I can definitely jump in and give my perspective. Something I didn’t know when I started working with SIL was just the language situation around the world, which is maybe where we should start. A lot of people don’t realize – actually, we track this very closely; right now there are 7,111 languages spoken around the world.
Wow.
So these are living languages, not dialects and that sort of thing - recognized, living languages that are spoken around the world. And certain countries - say, India or Indonesia - have hundreds of these languages being spoken. Indonesia, for example, has about 700. So what happens as a result is that some languages are spoken by a lot of the world. Over half of the world speaks just 23 of those languages, which is a very small number, right?
Compared to 7,000, it sure is.
Yeah. The other half of the world speaks the rest of them. So what happens is that these local language communities, which maybe don’t speak one of these higher-resource languages, are often marginalized in some way. These marginalized language communities value communication in their local languages, but they aren’t supported in a variety of ways. And this has implications for a lot of things.
[12:04] One of the ways that we think about this is in terms of the United Nations sustainable development goals. The United Nations developed these 17 goals about sustainable development, and language impacts basically every one of those.
If you think about education, or humanitarian assistance - thinking about something like HIV/AIDS, or Ebola - how are you going to make progress in those areas where there’s extreme language diversity if you can’t get out materials about HIV or other things in a language that people understand and value, one they don’t consider foreign? This goes across the board.
In education there have been studies that have shown if someone starts their education by going to school – so if they’re at home and their family speaks a certain language, and then their mother speaks it, their father speaks it, their whole village speaks it, and then they go to school and the first thing the teacher says is “Oh, it’s great that you speak that, but you’re not gonna speak that here. We’re gonna learn in this other language.” Well, immediately right off the bat they form an association with education as something terrible and hard, and it actually stunts their educational development. Whereas if they start education in their mother tongue, they actually have the same benefits as others in terms of their views of education and their forward momentum.
So language impacts everything, and that’s why we care about these local languages… Because they actually make a difference for people’s quality of life.
I think this falls into that AI for Good category as well. That’s something that is very close to my heart. I started the Practical AI Ethics Alliance - that’s practical-ai-ethics.org, if anyone is interested in checking that out… But the basic concept behind it is that artificial intelligence is a dual-use technology; it reflects everything that is good and bad about humanity… So when we’re looking at a problem like languages - and I think maybe the most impactful slide that I saw at your presentation, Dan W, was the one showing that we were using maybe 100 languages for the vast majority of applications, whether that’s translation, or speech-to-text… And there were 7,000 other languages that half the population was speaking, and we weren’t doing anything with those. That’s a surefire way to continue to marginalize people, or to ensure, like you said, that people are going to find education difficult, or that even finding basic services is incredibly difficult…
So I think it’s wonderful that you’re working on a side of the house that allows you to make a difference and have an impact in this part of the world… Because I feel like so much of the research sometimes gets poured into getting people to click on ads, or all the things that make us money. Those things are certainly important, economics are incredibly important, but it’s also amazing to realize that artificial intelligence can make the world a better place, in some ways. It almost sounds cliché, or it sounds a little high-minded, but it is actually true that certain types of things would never be able to be done without it.
I remember seeing a translation effort for very old Japanese, which maybe only 100 scholars in the world can speak now… And there are tens of thousands, or millions, of texts in a form of Japanese that’s really not used anymore, and machine learning is able to augment the ability of those translators to scale what they’re doing, so that those texts don’t die out into the pages of history simply because there’s no longer anybody interested in, and able to, translate them. I think that’s where this technology can make a massive impact.
[16:02] Yeah. And you’ve kind of gotten into the idea of the importance of applying AI, and the possibilities… Do either of you have any comments on expanding that a little bit, in terms of why apply AI to this long tail of languages? You’ve just identified one reason… Any others that come to mind?
Well, I think we’re on the verge of new possibilities in terms of AI and language. Over the past couple of years you’ve definitely seen what some people might refer to as an inflection point, with a lot of new techniques, a lot of emphasis on transfer learning, a lot of emphasis on usage of monolingual data, and things that really impact the languages in this long tail - the languages outside of those supported by the major tech platforms… So one thing to note is that whereas [unintelligible 00:16:57.22] we might have been stuck on some of these things like machine translation because of lack of data, there are brand new possibilities now: “What if we could translate these HIV materials or this educational material into all of these different languages? What if we could enable people to be part of the global conversation in their mother tongue?”
That’s a very interesting one for me - oftentimes I think we feel like “Oh, how can we get our great content as Westerners into the languages that people care about?” Which isn’t a bad thing; it’s great to try to get those educational materials, scientific materials, entertainment media into local languages… But these people in these local language communities - they have so much to contribute to our understanding of the world, to scientific research, to all sorts of different areas.
So one of the things that I think AI could enable - with things like speech-to-speech translation, predictive text, and other technologies like that - is for those local language communities to be part of a global conversation; not just to be consumers, but to actually contribute, in a back-and-forth sort of way, to global conversations around the things that actually impact their lives and where they have something to offer - politics, and education, and all of those things.
I think it’s interesting that you’re talking about it from a two-way street; I think that’s an amazing way to frame it, because - what is it that we can learn, not just what is it that we can translate from our own content into what other people can consume? What is it that we’re missing, on the other side of the equation?
Historically, if you look at how language has been used, sometimes it’s been used as a way to dominate other cultures, or as a way to socialize. It’s almost been used as a weapon. The farther a language can spread, the more people think in your own way. If you think about something like the Etruscans and the Roman empire - we know pretty much nothing about the Etruscans, primarily because they were completely consumed by the Roman empire, in no small part due to language… And we’ve seen this in other parts of the world. But with that sort of slash-and-burn mentality, a lot of things are lost… In the same way that when we destroy a huge part of the forest, what medicines get lost that we would never have been able to find? What compounds were hidden in the species of plants that were wiped out? In the same way, different languages allow us to think in different ways.
In fact, we just had the wonderful Super Bowl the other day, and there was a commercial that stopped me for a second, where they said that there are four words in Greek for “love”. I went and looked up each of the words, and they were fascinating, in that each one conveyed something very different about the nature of love… From eros, which is more of a passionate kind of love, to agape, which is more of a selfless kind of love, or a love for country. Each of those words conveys something very different. So when we lose these languages, or we assume that we already have the words that perfectly convey things, we lose a lot of nuance and meaning, and we lose people’s ability to connect with us in a fluid way… So I think it’s amazing that you’ve framed it as a two-way conversation. That’s very important.
Dan W, what is SIL doing in terms of AI for local languages these days, and what kinds of problems and issues is SIL interested in tackling - what’s on your roadmap, specifically?
Yeah, definitely. In the longer run, like I said, we want to see this sort of two-way street that we’ve been talking about in terms of local languages being part of a global conversation… And I think maybe a natural place to start with that is these sorts of AI technology that have already been ground-breaking in our everyday life maybe as English speakers, or other high-resource language speakers.
If you do a Google search now, that’s hitting an AI model - Google has now integrated BERT into search. If you’re writing an email, you’ve got predictive text helping you along. If you’re dealing with a chatbot, you have things like sentiment analysis and entity recognition. If you’re talking to an assistant or a smart speaker, you’re using speech-to-text, maybe those same assistant capabilities, and you might be using text-to-speech… So these building-block AI technologies are what we’re thinking about a lot right now: how could we take those building blocks, which now only support high-resource languages - maybe up to 100, but as Dan J. mentioned, that’s kind of a drop in the bucket - and push them into the longer tail of languages?
What really excites us is “How could we not just do that language by language?” Like, “Oh, we add the next language, and then we add the next language, and then we add the next language.” How could we knock out 40 languages at a time? I think those are the things that get us really excited.
Some of the things that contribute to those sorts of advances, I think, are first of all multilingual models. So we’ve seen this shift recently into massively multilingual models that support things like Google Translate, where one model actually can process multiple different language pairs… And something people may not know is I think – no one’s been able to challenge me on this, but I think SIL and its partners have access to the most massively multilingual corpus that there is.
We’ve done work in over 2,000 languages, and we have some type of data, whether that be text or audio, for - depending on what partners we gather together - maybe 1,200 languages… So there is a lot of data there, and part of what we’re excited about is: what happens if we take the largest multilingual model that there is now - I think the most multilingual covers around 103 languages - and push that to something like 300? How does that affect adding the next language? Does it make it easier? How should we structure these types of models around language families and other things? So we’re exploring those questions.
[24:25] At the same time, we’re exploring a lot of the low-resource machine translation technology that’s been developed around transfer learning and fine-tuning, iterative back-translation… There’s just a lot of different techniques out there now that allow you to maybe take a high-resource language and adapt it to a lower-resource language, or even make use of multilingual data… So those are all the things that we’re interested in exploring, first in terms of experiment and research, and then in terms of making strategic partnerships with tech companies, but also local institutions and governments to pilot out some of these possibilities and actually get them used.
That really begs the question for both of you - as you talk about these partnerships, what specifically brought SIL and Pachyderm together to tackle these kinds of problems that you’ve just addressed here, and from each of your perspectives, why did that partnership make sense?
Sure. I’m very pleased that Pachyderm wanted to work with us. I’m really happy about that. Thank you to Dan J. and the team who wanted to work on this… But I think that despite SIL having a ton of data, and a very multilingual corpus, and an amazing amount of language information and linguistic expertise, we’re not a tech company, we’re a non-profit that has done language-related work for a long time, but isn’t really an AI company per se, and isn’t operating a ton of computational infrastructure. So whereas we have a lot of this sort of data and language information, that side of the equation, part of what we want to do is partner with people that have a lot of expertise on the infrastructure side, on the AI methodology and practical AI training side, and Pachyderm definitely fits into that component.
So from my perspective, that’s what I was excited about in working with Pachyderm - actually building something useful that we can use over time, repeatably, and scale up… Because this is a large-scale problem, right? 7,000 languages… We need something that’s gonna scale and something that’s gonna work… So that’s what originally got me thinking of a partnership with Pachyderm. Dan J. can speak from Pachyderm’s perspective, but I hope that they were excited about these sorts of AI for good problems.
We definitely were excited about these AI for good problems, and frankly, we’ve been looking for a number of these types of things in the field. So if folks are out there, interested in doing those types of things, we wanna talk. We’re certainly not DeepMind, or OpenAI, or have infinitely deep pockets to be able to throw at some of these things, but we do feel it’s of tremendous importance for us to help enable projects like this. And frankly, Pachyderm is more of the infrastructure side of the house.
We recently launched the Pachyderm Hub product, which runs on Google Cloud and lets people automatically spin up clusters, add GPUs to them, and parallelize their resources… Pachyderm is one of those solutions that people don’t realize they need until they start doing data science at scale… And I think we’re seeing the development of a canonical stack, probably over the next 2-5 years, where the tools become codified in a way that allows data scientists like Dan W. to do their job more easily.
And if you think about the history of how these things have worked, a lot of times data scientists were just passing around a text file between them, or maybe FTP-ing something somewhere and cobbling together infrastructure… That’s not really going to work as you get this technology out of the hands of the unicorns that have a billion dollars to just throw at things and create the infrastructure on the fly.
[28:21] If you think about a company like Google, or some of the research foundations doing their own work - they’re all building their own pipeline tools, training visualization tools, explainer tools, all these types of things… And they’re experimenting with lots of different frameworks and libraries. But over time we’re gonna start to see more and more standardization. And a problem like being able to version-control your data and understand the entire data lineage - how things got from point A to point B to point C - is incredibly important for being able to reproduce experiments.
I read in VentureBeat the other day that something like 87% of data science projects never make it to production. That’s a massive number, considering that we’ve spent upwards of 60 or 70 billion dollars on this, and we’ll be spending hundreds of millions more in the coming years… That means we’ve wasted that much if we don’t improve that number. And if we don’t improve it, then we’re in serious trouble. One of the ways to improve it is by having that level of reproducibility, and being able to work across a diverse team. So getting our tools into the hands of people who are doing amazing things is definitely a way to get our name out there, but it also really makes a difference in the world. I think both goals are incredibly important.
Yeah. And on a practical level, since I’m always interested in keeping this podcast practical, I’ll walk through our internal workflow and thought process on this. Let’s just take machine translation, for example, which is one of the things that we’re working on… Well, I can spin up a Colab notebook and pull the data together from a source that we have access to inside of SIL, do some pre-processing on that, do the training, do the testing, get the inference bit worked out… But now, if we’re really serious about our goal of pushing this sort of thing into many languages at once, I have to think about other sorts of problems. And one of those, on the data side, is that the data SIL has access to and uses is a big mix of data.
So it’s partly internal data, most of the time formatted in non-standard formats as far as AI people are concerned. It might be data from partners, it might be a mix of public data… So we have all of these sorts of data that we may want to bring together in unique ways, and the combinations of those data - let’s say if we’re targeting 40 languages - might be different for the different languages. So there’s this complicated issue of “How do I combine all these things together in a sane sort of way, with a bunch of pre-processing?”
And then I’ve got the problem of “Okay, I need to standardize those.” Those datasets might be updating at certain times… And then I’ve got to connect all of those data sources with the correct pre-processing, like I said, but then training; that training needs to happen on GPUs, where maybe the pre-processing is happening on CPUs… And then I need to connect those output models, I need to actually export them and optimize them in a way where they can be exported to a certain place where they can actually be used in a product.
[31:50] All of those things, for even a few languages, or 40 languages, or whatever we’re looking at - that gets complicated fast. So the ability to track all of that very rigorously, but also be able to scale it as we might want, and do it in a sort of way that isn’t – you know, there’s only so many technical people at SIL… So we can write small bits of code to do these various things, but we’re not gonna write the whole infrastructure and logic around this… So we needed something that was able to handle those sort of data elements, scaling, pre-processing across lots of data sets, and also scaling our training while utilizing certain GPUs.
The Pachyderm project, with its pipelining and data management capabilities, allowed us to do those sorts of things… And it really comes down to the fact that we want to scale this out, we want to push it to many languages. To do that, we’re gonna have to do it reproducibly, we’re gonna have to do it over and over, and maybe scale it out horizontally as well. So there’s a lot that goes into that, and I’m very thankful to get help on that front.
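To make that concrete, here is a minimal sketch - not SIL’s actual pipeline - of wiring a CPU-bound pre-processing stage to a GPU training stage with the python_pachyderm client; the repo names, container images, and scripts are hypothetical stand-ins:

```python
# Hypothetical two-stage Pachyderm pipeline: preprocess -> train.
import python_pachyderm

client = python_pachyderm.Client()

# Versioned input repo holding the raw, mixed-format language data.
client.create_repo("raw_text")

# Stage 1: pre-processing. The "/*" glob splits the repo into datums,
# so each file can be cleaned independently (and in parallel) on CPU nodes.
client.create_pipeline(
    "preprocess",
    transform=python_pachyderm.Transform(
        cmd=["python3", "/clean.py"],
        image="example/preprocess:latest",  # hypothetical image
    ),
    input=python_pachyderm.Input(
        pfs=python_pachyderm.PFSInput(glob="/*", repo="raw_text")
    ),
)

# Stage 2: training, consuming everything the preprocess stage emitted.
# In the full pipeline spec you would also request GPU resources here,
# so that only this stage gets scheduled onto GPU nodes.
client.create_pipeline(
    "train",
    transform=python_pachyderm.Transform(
        cmd=["python3", "/train.py"],
        image="example/train-gpu:latest",  # hypothetical image
    ),
    input=python_pachyderm.Input(
        pfs=python_pachyderm.PFSInput(glob="/", repo="preprocess")
    ),
)
```

Because each stage’s inputs and outputs live in versioned repos, re-running for a new language or an updated dataset means committing new data and letting the same pipeline re-trigger, rather than rebuilding the wiring by hand.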
I know, Daniel W., when we were doing our keynote together at Project Voice, you had a really interesting example that SIL and Pachyderm worked on together around these sorts of problems. It kind of outlined how both sides approached it and what you were able to do, including the benefits of doing the work on Pachyderm. Could you go through that example? I’d also invite Dan J. to pipe in, so you two can relate it together a bit.
Yeah, sure. The problem we were looking at was text-to-speech, or speech synthesis - specifically, adapting an existing text-to-speech model to a local language, a local dialect, or a local accent. You could think of examples like the whole range of vernacular Arabics that are spoken around the world, or the many world Englishes that differ in certain ways.
We took, for example, Singlish, which is a dialect that’s spoken in Singapore. It’s actually a mix of English and some other languages spoken together. This is an interesting problem, because as a dialect it has elements from at least four or more languages, but it also has various standard accents that go along with it - there’s an Indian accent, there’s a Chinese accent… So it’s a nice proving ground for some of this adaptation to accents, and other things.
We wanted to create some text-to-speech models for Singlish because, for one, these don’t exist, and for two, we were able to access data for Singlish through one of our partners in Singapore and utilize it to test out these methods. The downside is that there was a lot of processing that had to happen here. Our partners were able to get us - I think it was about 800 GB of data; this was between our partner [unintelligible 00:35:10.14] Singapore and the government institution there, IMDA, which has gathered a lot of this data.
All of that data is formatted in specific ways. Some of it is kind of noisy. It corresponds to a lot of different speakers (2,000 different speakers), so there was a lot of like “How are we going to pre-process this?” and then “How are we going to make this efficient, so we’re not running these models for weeks on end without progress?” That’s where we consulted a lot with Pachyderm, and they were able to guide us through “Well, here’s how maybe similar people have set up their pipelines in the past, and the type of infrastructure that they’ve used, and how they’ve scaled it.”
Dan J, I know we worked through some issues with data, like “How do we upload that much data? How do we pick out–” Like, you don’t wanna load 800 GB into memory, so how do you access some of that data, but not all of it, to figure out what you need? From my understanding, these are problems that other people are facing, and Pachyderm was able to help us solve some of them. I remember that the problem of accessing some of the data, but not all of it, was kind of key to the whole thing.
[36:30] Yeah, being able to split up the data… And we rely a lot on – I think we made an intelligent choice in going with Kubernetes and Docker early… So we can leverage a lot of the scaling that happens now. If you really think about the history of containers, Google was running billions of these containers even before Docker existed… So the industry was moving in this direction, and Google originally had an internal service called Borg, which was then shifted over into the Kubernetes open source project. That really took off and allowed people to build these massive infrastructures that scale much faster than the virtualization infrastructures, where you ended up having an entire operating system built into this little box that you were processing things in. The virtualization there was very effective.
But once you need lots of ephemeral machines - being able to quickly spin up 1,000 different nodes to split up data and process it in little chunks, so that you’re not trying to load everything into a massive virtual machine and saturating the memory - that’s where containers won out…
We spoke with one customer recently - there’s a case study coming out - where they were doing a lot of language processing, and they’d built their pre-processing tools… And they were taking about 8-10 weeks on the biggest possible node that they could spin up in Google or AWS or Azure… So they were basically grabbing the most expensive node possible to try to fit everything into memory and stack the GPUs in there, and it was taking them about ten weeks. They were able to parallelize it with Pachyderm and get it down to about six or seven days.
That is a massive improvement, and it’s basically because they were able to split the work across multiple nodes without really having to worry about precisely how it was split up. Pachyderm does a lot of the heavy lifting for folks on the back-end and allows the work to spread across multiple nodes, as opposed to having to figure that out within your own code… Because you’re already worrying about [unintelligible 00:38:44.18] Dan W. is already trying to worry about “How do I solve a problem like transfer learning, or a noisy dataset - how do I clean that up?” Or “I’ve got a number of different formats. How do I either use all of those formats, or standardize on a different format before I can even do any of my work?”
The last thing you need to be worrying about is figuring out how to also be an infrastructure engineer - auto-scaling a lot of different nodes yourself, spinning them down, and then all of a sudden you forget about them and your company gets an AWS bill for a million dollars at the end of it. So that’s where we really make a big difference for folks doing this kind of work.
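For a sense of what that splitting looks like from inside one worker, here is a hedged sketch built on Pachyderm’s documented convention that a worker sees only its own datums under /pfs/&lt;repo&gt; and writes results to /pfs/out; the repo name and the normalize() step are illustrative assumptions:

```python
# Hypothetical per-datum worker script. With a "/*" glob, each worker
# receives only a slice of the data, so nothing close to the full
# 800 GB is ever loaded into one machine's memory at once.
import pathlib

IN_DIR = pathlib.Path("/pfs/raw_audio")  # this worker's datum(s)
OUT_DIR = pathlib.Path("/pfs/out")       # collected into the output repo

def normalize(wav_bytes: bytes) -> bytes:
    # Placeholder for real pre-processing (resampling, trimming, etc.).
    return wav_bytes

for path in IN_DIR.rglob("*.wav"):
    out_path = OUT_DIR / path.relative_to(IN_DIR)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(normalize(path.read_bytes()))
```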
Gotcha. I have a follow-up question to the example itself that you guys have worked on. If there’s one thing that has become clear to me, as the person not involved in this, it’s what a target-rich environment this is to work in - 7,000+ languages, and with as much work as you’re doing at this point, you’re still only hitting a fairly small fraction, at least for the moment. You’ve addressed this first example that you’ve relayed to us - why that one? Where can this particular approach lead next? What do you envision as the next steps for extending this?
[40:08] Well, I think one thing to emphasize here - and another reason why we were really interested in this partnership - is that Pachyderm is part of the open source community and part of the Kubernetes community… So anyone can run a Pachyderm pipeline and anyone can spin up a Kubernetes cluster in the cloud, or on premises, or wherever. So we’re thinking about this sort of work as a kind of template. With the help of the Pachyderm team, I created this template for training speech models. Using this pipeline, you could plug in any sort of speech dataset you want, assuming you could pre-process it into the right format…
So the idea is we showed this for, let’s say, one accent and one speaker of this dataset that we worked with - and we’re working on others as well… But I could publish that pipeline on GitHub - which I have - and anyone could pull it down; anyone could access the IMDA data if they knew where to look (we put the links in)… And anyone could generate their own speech model and run the same Pachyderm pipeline on their own Kubernetes cluster, because everything is portable and everything is built on this great open source community… And people could collectively work on this for a greater impact than any one person could have.
Like you said, this is a target-rich environment, so the only way that we’re going to make progress here is if we make these reproducible templates and enable people to run them for their own context and their own data, and scale things up that way. So that was another really appealing thing to me about setting this stuff up. Internally, at SIL, we could re-run this pipeline for any language where we have audio data to train our own text-to-speech, but we’re not gonna get to them all at once. Other people could run this pipeline with their own speech data, on their own cluster, to create their models… So by creating this sort of reproducible template, it’s actually enabling a different sort of scaling.
And I’d touch on the essential role of open source… I’m admittedly a true believer in open source. I was, like I said, at Red Hat for nine years, and saw the early days there, when recruiters told me “Why are you going into this Linux thing? Solaris is where it’s at, and where all the money is.”
Solaris isn’t where it’s at?
No. And I told them, “It might not exist in ten years”, and they thought “What are you talking about?” We spent a lot of time in the early days going in and explaining what Linux is, and why it won’t fall over - why would I bet my future on something like this… And it’s been amazing to watch open source, over the course of those years, become the default model for how things are done.
It used to be that things would happen in a proprietary world, and then open source would come along and build a commoditized, almost good-enough version… But nowadays everything, including most artificial intelligence work, starts in open source… And there’s a huge advantage, I think, in something like Pachyderm being completely agnostic to the tools that are built on top of it, especially versus some of the pure cloud services that have to (because of limited resources) take a fully opinionated stance on every project - they have to support it for it to run - as opposed to us… Which allows data scientists to bring whatever tools they need to the project, and then publish anything that they create. So we don’t just have to explicitly support something like PyTorch…
[44:02] We spoke with another group – the one I mentioned earlier was using a more obscure speech recognition toolkit called [unintelligible 00:44:08.04] that they had heavily modified themselves… And the chances of something like that being supported among one of the cloud providers’ choices are slim, given their limited resources; even if they’re a billion-dollar unicorn with 1,000 programmers, they still only have so many resources.
Something like scikit-learn, or PyTorch, or TensorFlow and 50 different Python libraries are going to get supported, whereas something like [unintelligible 00:44:32.22] is not going to get supported - in the same way that most of the world’s languages are not going to get supported, because of resources.
So allowing people to do things with open source and bring whatever they want to the party I think allows this kind of collaborative creativity to happen, and allows a kind of scaling that wouldn’t be allowed to happen with smaller projects, or being able to move towards languages that might not be represented. I think those two concepts are intricately interwoven.
I remember, Chris, when we were talking about our talk at Project Voice, one of the things that kept coming up was this idea of collaboration for collective impact… And the problems that I have are actually not – like, I can get data for languages, I can get information for languages, I can get linguistic information for languages, I can get dictionaries, I can get grammars, I can get all of these sorts of resources, but I am limited in certain areas related to some of these more infrastructure-related things; I’m limited resource-wise… I work in an organization that’s primarily a language-related organization - we’re doing a lot with AI now, but we’re still figuring out a lot of those things - whereas you at Lockheed Martin have a lot of resources in terms of computation, you have a lot of AI knowledge, but you might not have those language-related things that I have easy access to.
So a great way for us to make an impact in the area of language is in this sort of collaborative way, for collective impact, where we’re not just kind of siloed in our own world; like Dan J. was saying, we’re not limited to our own implementations, but we can work together, we can open source things, we can bring all of our resources together to solve larger, harder problems.
Yeah. I think the collaboration we did, especially for Project Voice - and really all along, as I’ve been learning from you over time about this - really brought home what local languages are and how important they are… I know for me personally - I’ve mentioned before working on humanitarian assistance and disaster relief initiatives at Lockheed Martin, and we’ve talked quite a lot about educational impacts… It’s so critical that this kind of work your two teams are involved in gets done, so that we don’t leave behind enormous populations of people on the planet and grow that digital divide.
I think it’s a little bit of a stealth issue to many of us, because it’s not something we necessarily think about all the time… But if there’s one thing that I’ve come to realize in my own introduction to this has been just how incredibly important this is to everybody going forward. You can’t, in a humanitarian assistance or disaster relief scenario, go in if you can’t communicate effectively in the languages of the people that you are trying to work with… And same with education, as you pointed out.
So I think if there’s anything I’m coming away with, it’s that we’re at a very special time for the integration of artificial intelligence and language, and that there are so many possibilities that could now be realized… So I guess I wanna finish up by asking you both: what does the future look like for local languages and AI from this point forward? It’s a remarkable moment we’re in, at this turning point, but what do you see ahead? As an example, how do you think SIL would work with Pachyderm and other organizations out there to enable all the possibilities that we have before us?
[48:09] I have so many ideas… My problem is not lack of ideas. [laughter] I’m so excited, and I’m sure, Dan J, you probably have a lot of ideas. Maybe you can go first… I don’t wanna always be hopping in first. What are you excited about?
Dive into it, Dan J.
Yeah. Steal my thunder.
You know, when we had dinner together the other night at the conference, we talked about a larger misconception in artificial intelligence, which is that it seems like everybody’s pouring resources into this concept of generalized artificial intelligence… And that’s a noble goal, and I think we’ll get there eventually at some point - maybe even in our lifetimes, maybe not; maybe it proves to be a lot more intractable than we imagine… But you said that you were thinking more in terms of augmentation. And the way that I tend to think about artificial intelligence these days is very much in that augmentation model as well, or that [unintelligible 00:49:09.16] model, where the artificial intelligence is helping humans scale their abilities and use their higher-order learning and understanding and intuition… The types of things that are still intractable for machines.
We see some of this behavior that we might be able to call intuition in something like AlphaGo, as we combine three or four different algorithms - and maybe when we’ve got 20 algorithms working together we could mimic it even more… But in the short term this is really about scaling, about augmentation, about allowing people to do more. And if you think about something like language - especially when you’re working with a language where there aren’t as many speakers, or there aren’t as many experts in that field, or there isn’t as much data - you absolutely have to have augmentation; you have to be able to scale what those folks are doing… And that creates more leverage. If you think about trying to lift a giant rock by yourself, you can only do so much. But if you get a really long pole and put it under that rock, you’ve got a better chance of lifting it. That’s the way I think about artificial intelligence now on the speech side of the house.
[50:26] We need to be able to help all the folks out there who’ve only got 1,000 experts in a particular field. If the work that you’re doing is able to combine a lot of different datasets and look across 1,000 languages, or 100 languages, or 50 languages, and find the similarities - making the lack of data for any one language largely irrelevant - you can still create a very robust translation model. That could then make it easier for 1,000 different texts to be translated, with a human going over each of them quickly to enhance them… Versus that person trying to scale themselves to do 1,000 different translations and quickly getting burned out. I think that’s really the wonderful part of this.
And from the Pachyderm side of the house - being able to scale the infrastructure automatically on the back-end helps data scientists do this kind of work, so they don’t also have to be specialists in machine learning operations, trying to figure out how to slice a dataset up across 1,000 nodes so they can do pre-processing and training… We’re very happy to partner, and to make a difference and have an impact in what you’re doing, as well.
Yeah, I think you hit the nail on the head. I think what you got to there at the end, around augmentation, but also this idea of leveraging the language information that we have access to, and things that are already out there, to grease the wheels and get things moving for local languages… I really am fascinated by the concept that – like, we have all these pre-trained models out there for certain languages and for certain language families, we have these open datasets, plus we have datasets internal to SIL…
We also have this Ethnologue resource, which is information about all of the languages of the world and how they’re related… I’m really fascinated by the idea that we could have all of that information together - what languages exist, what their populations are, but also which languages currently have data and which currently have pre-trained models - so that when we have a Pachyderm pipeline (let’s say) and we wanna train a new text-to-speech model or a new machine translation model, we could put in on the front-end, “Oh, I wanna train a model for this language, Kimbundu” - an Angolan language that doesn’t currently have any support - and it would find the closest-related language that has either a pre-trained model or a lot more data, and use tools like AutoML and some of this automation to pull those resources in and augment the development of that next language. Those are the things that really excite and fascinate me, and that I’m excited to dig more into.
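As a toy sketch of that last idea - with entirely hypothetical metadata and scoring standing in for what Ethnologue-style information could provide - picking a transfer source for Kimbundu might look like:

```python
# Toy transfer-source selection. All metadata here is made up for
# illustration; a real system would query Ethnologue-style data.
# Keys are ISO 639-3 codes: kmb=Kimbundu, umb=Umbundu, swh=Swahili, eng=English.
LANGS = {
    "kmb": {"family": ["Niger-Congo", "Bantu"], "model": None, "hours": 0},
    "umb": {"family": ["Niger-Congo", "Bantu"], "model": None, "hours": 2},
    "swh": {"family": ["Niger-Congo", "Bantu"], "model": "tts-swh", "hours": 300},
    "eng": {"family": ["Indo-European", "Germanic"], "model": "tts-eng", "hours": 10000},
}

def shared_depth(a: list, b: list) -> int:
    """How many levels of family classification two languages share."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth

def best_transfer_source(target: str) -> str:
    """Pick the language closest to the target that already has a
    pre-trained model, preferring deeper shared family, then more data."""
    t = LANGS[target]
    candidates = [c for c, info in LANGS.items()
                  if c != target and info["model"] is not None]
    return max(candidates,
               key=lambda c: (shared_depth(LANGS[c]["family"], t["family"]),
                              LANGS[c]["hours"]))

print(best_transfer_source("kmb"))  # -> "swh": same family, has a model
```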
Well, this has been a truly fascinating conversation. I know I learned a lot. Thank you both for diving deeply into local languages and how AI can impact and move that right along for the benefit of all. Truly an AI for good initiative. I’m pretty excited about it, so… Daniel Whitenack, Daniel Jeffries, thank you both for coming on the show.
Thank you.
Thanks for having us. I really appreciate it.