Practical AI – Episode #184

Cloning voices with Coqui

featuring Josh Meyer, co-founder of Coqui


Coqui is a speech technology startup that’s making huge waves in terms of their contributions to open source speech technology, open access models and data, and compelling voice cloning functionality. Josh Meyer from Coqui joins us in this episode to discuss cloning voices that have emotion, fostering open source, and how creators are using AI tech.



Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host Chris Benson, who’s a tech strategist at Lockheed Martin. How are you doing, Chris?

I’m doing well, Daniel. You sound a little funny today. Actually, so do I. What’s going on today, Daniel?

[laughs] Nice, Chris… We have clones.

And we might have frogs in our throat…

That’s a good one… You know, for listeners who might have been with us a while and know our voice well, those are our voice, but not quite our voice. So today, we’re gonna be talking a lot about synthesized voices. And we’ve got Josh Meyer with us, who’s the co-founder of Coqui. Welcome, Josh.

Yeah. Thanks for having me, Daniel. Good to be back.

Yeah. So just to – because we have to say what that was first, and then we can launch into other things. So you’re a co-founder of Coqui. A few – I don’t know, it was a few weeks ago or whenever it was, you came out with sort of a live, I guess, demo or prototype of some of the functionality that your company supports, which is text-to-speech. But it’s a sort of voice cloning thing, where – what Chris and I did to create those was just upload… I think mine ended up being like 11 seconds. It was like 11 seconds of my voice, and then I was able to synthesize that intro bit, and Chris did the same. So that was pretty cool. Maybe before we launch into all the AI stuff and what you’re doing and the company, the projects, what’s the general reception been to this kind of voice cloning thing that you’re doing?

Yeah, it’s been honestly a very positive, a very interesting reception. Most people that we end up showing it to, they’re like “Wow, that was fast, and that sounds like me.”

We’ve been working on this tech for a while. We published at ICML kind of the core technique that we’re using for this, and it’s been there for a while; the tech’s been there for a while, but you really needed to be a coder, you needed to know your way around the command line, you needed to know how to navigate through GitHub, and download the models, and all this stuff. And what’s different now, the reason that we’re getting this kind of like wow reaction is you don’t need any of that. You can send the link for the website, you can send it to anybody you want, they just use the web app, whatever browser they’re used to…

[04:21] And you don’t even need to actually upload audio technically, in terms of, you don’t need to go find a file on your computer, or you don’t need to pre-record… You just hit the microphone button, and say a few sentences, a few seconds. I think right now we have it capped at 30 seconds, just to kind of be reasonable with server fees, because more audio takes a little more server fees… But in general, we’ve found that five seconds is good enough; five seconds is enough to get a nice voice print, or however you want to call it. So it’s been fun. And I think the reception, especially from non-technical people, has been the most fun.

I really like working on this, because I get to interact with people who don’t know AI, who don’t know machine learning or deep learning. They think it’s cool. They maybe read some articles in the New York Times about it, but they’re not in the weeds like all of us are. So getting it in a place that we can show it to them, and they’re “Wow, this is cool”, it’s fun, it’s refreshing, if that makes sense.

Yeah, I could definitely see that. I would not hesitate to kind of bring my wife in, who definitely does not – her perception of my work is I always have a screen up, and it’s a really dark screen with a lot of text on it, and that’s like my life. But when I pull this up, it’s very welcoming and easy; just like Record Your Voice button… Yeah, it’s cool. I think it’s a similar – we recently interviewed Abubakar from Gradio, at Hugging Face. It’s similar, like - as soon as you put a demo in front of people, it’s like a light bulb moment. This is how this thing works, or this is what I could expect as output, right?

You know, it kind of feels a little bit like we’ve been waiting for this moment for a while, because we’ve been talking through the show, up until now, mostly to kind of technical tooling and technical use cases where people are putting together amazing things… But now, as you point out, you can bring people in who have no technical capability and get something really, really interesting out of it. And we’ve been kind of waiting for AI to take this turn. So it kind of feels like maybe this conversation is kind of the beginning of starting to turn toward really broad usage by people that would not otherwise have access.

And I think that there’s – kind of generally speaking, outside of speech, outside of language technology even, there’s this… In the last few months in particular, maybe just last month - I mean, with all the image generation stuff coming out, with DALL•E, and Parti, and Craiyon, and all of this stuff…


Yeah, yeah, yeah. You’re seeing what creative people can do… Like creators, or creative types; I don’t know how to – but people outside of the coding community; you give them the tools, you give them more tools in their toolbox to do creative things, and they’re coming up with awesome stuff. I mean, there was – I remember there was that one week where it was just like Twitter was full of koalas riding unicycles. [laughter]

It was a bit overwhelming. It was awesome, yeah.

[07:46] What I’m really optimistic for is that kind of creative use for these voice tools, because there’s tons of applications I have in my mind of where I see these tools being useful, and helping the creative process. But also, I know that the people who came up with all this image generation stuff didn’t think people were going to be doing everything that they’re doing now with it, and it’s even more interesting than what the creators could have come up with. So that’s something that I’m really excited to see in the coming weeks/months, is to see what cool stuff people are doing with the voice tech.

Maybe that brings me to something that’s kind of been on my mind in this discussion, which is – so a while back, we had you on the show as part of a discussion about Mozilla Common Voice, and we were talking about speech tech, and also like open data, and that sort of thing; open speech data. I think it’s episode 104 if you’re looking for that, so take a look there. But I’ve always kind of – even at that point, I think, had in my mind this perception of like speech tech, and Alexa or Siri or whatever it is, almost like a kind of novelty type of thing, from my perspective. Like coming to a computer, where my first computer - I interacted with it with keyboard and mouse, that’s like my standard interface, and that’s what I’ve grown used to, but there’s tons of people all around the world where maybe the first interface that they’re interacting with a computing device with is their voice, using Siri, or different things. So I’m wondering, how do you perceive, as someone that’s like very close in the sort of speech technology space, how do you see the trends shifting in terms of like how serious people are taking things like voice interfaces, or creative uses, like you were talking about, of speech technology, in terms of like practical usage, and like real-world kind of scale, I guess.

Yeah. So I think a quick back-story in terms of kind of how long I’ve been in speech - I started really getting into it in probably 2012. And I did academic research, and that was very fun… But I got into industry because I like building things that people use. I still like writing papers, which is why Daniel and I recently wrote this paper with the great folks at Masakhane, on African text-to-speech… But working with Mozilla, in particular the last few years, or collaborating with Mozilla, because I don’t work there anymore, but I still keep up ties and collaborate with them, and they’re all great folks, working on great things… I think some of the best democratization of speech technology is honestly coming out of Mozilla. But in terms of kind of speech tech being a novelty, or at least the perception of speech tech as a novelty, but actually finding real-world applications for it… I think there used to be this talk about “Keyboards are going to be gone in 15 years.” There was this talk that we’re never going to type again, we’re only going to use our voice… And I never really subscribed to that kind of viewpoint, because I think that mixed modality always makes sense. Sometimes you want to type, just because maybe a baby’s sleeping in the next room, and I’m not going to shout at my computer to wake it up.

I think in general, the most interesting applications of voice tech and machine learning at large are where they augment and support humans doing human things. I don’t think machines taking over completely kind of the functionality of a human makes a lot of sense, which is why for example, let’s say self-driving cars - I think that the technology underlying self-driving cars is super-useful.

[12:02] If I had a self-driving car, I wouldn’t let it drive for me around town. But would I let it parallel park for me? Of course. That’s something that I don’t like doing… So there’s functionality where if you take kind of parts of what the human pipeline is, for whatever task, like going to the grocery store, there’s multiple parts of that that the machine can do better than me. But there’s also parts of it that I can do better than the machine.

So I think a lot of the discussions around machine learning and also voice tech and kind of the place in humanity and where it’s going to be in the next few years - there’s a lot of this kind of dichotomy of “Machines are going to do everything here, and they’re going to completely replace humans there…” I don’t think that that makes a lot of sense, especially when it comes to even voice technology.

I think that there’s a lot of voice technology that’s being used to – let’s say like call center technology. Speech-to-text, speech recognition, I don’t see any time in the near future as replacing people who work at call centers. I see it as being very useful to people who work at call centers; they are the customer in that case, right? Say you’re on a call with somebody whose TV is broken, and you’ve got this transcript in real-time of what they’re saying, but also, you’re able to run NLP on that and come to answers faster; you’re getting kind of recommendations on how to talk to the client. And I think that this kind of using machines to augment what humans normally do, and replace some parts that are maybe particularly tedious, or annoying, or whatever, like parallel parking - I think that makes sense. And I think with voice tech, we don’t exactly know yet how it’s going to shape out, because it’s pretty new technology. I mean, speech recognition and speech synthesis have been around for a while. I think the ’80s is maybe when they first got some kind of more mass…

Adoption, or…?

Adoption, that’s the word. Mass adoption. Dragon NaturallySpeaking - I don’t know if you guys remember that…

I do. I probably still have it somewhere stuffed in an old bookshelf.

Yeah, it’s still around. I remember my parents had that when I was in school, and I tried writing a few papers with it, and I was “Not yet.” But since then, it’s advanced a ton, just in general, the technology. And especially with the speech synthesis, honestly in the last like five years or less, it’s just gotten crazy better.

Five years ago you would never listen to synthetic speech and be “Is that a human, or is that a machine? I can’t tell.” But now, it’s every research paper that comes out from the big labs, it’s “Wow, this sounds like a human. I can’t tell - is that the training data, or is that the synthetic speech?” And now I think the big challenge with speech synthesis is getting it to sound more expressive, more emotional, because I think machines and humans can basically do flat speech identically… But yeah, it’s going to be interesting to see how it all shapes out, but I think that there’s going to be a lot more creative usage that’s not predicted by kind of the die-hard technologists.

So before we dive fully into the technology, I want to follow up on something. As you were talking through that a moment ago, I’m thinking of my own experience, and my family’s, and stuff, and it’s been pretty fascinating for me to see my wife, my daughter, my mother, people that are not technical currently, that are diving into this. And yet, I as a technical person in the AI space, am still tending to default to the keyboard, going back to that… And it’s a different user experience that I’ve seen my daughter gravitate to naturally. Before we dive fully into the technology, what is it about that difference in user experience that seems to make it more accessible, do you think? The idea of speech recognition, natural language processing to parse it all and come up with a good response, and then the speech synthesis coming back is something that is so natural for my 10-year-old daughter, that it’s just – it might as well be one of her friends that she’s talking to. So could you speak a little bit about what that experience is, and how the rest of us that maybe are doing it some, but maybe not in such a completely natural setting - how does that evolve over time? What does that look like?

Yeah, so I think that one reason that we are using voice technologies that are built on lots of other technologies - I think Alexa and Siri and Google Home, Mycroft - I think that any kind of voice-enabled home assistant is useful insofar as it’s immediate… Like, you can’t sit there and wait two minutes – not even two minutes; I’m not even gonna wait 15 seconds for Siri to talk back to me, right? I will be walking the dog… So okay, I will say, I’m somebody who has been working in this space for a long time, as I mentioned, but I’ve never been one who’s had home assistants; one, because of all my kind of privacy concerns. I just don’t want to have them sitting around. I’m very realistic. I know people who work on all the teams… Nobody’s consciously snooping and recording audio, because one, it would just kill their servers, because it’s way too expensive to even stream all that back, and recognize it all, and blah, blah, blah.

So I haven’t had them for a long time in any case, but I recently got an Apple Watch, and I find that I use it every day when I’m walking the dog, because it’s just so convenient, because I can’t – my dog pulls on the leash, and I’ve only got one hand that’s functional… But if my service is bad, and Siri is not replying for like maybe even just ten seconds, I’m “Forget it. I’ll take care of it later.” You know, the attention span of people is, in general, not super-long, and so the technology has gotten faster in general over the years, and that is a big part of making it adoptable. And besides that, there’s kind of the pleasantness of talking to a voice that sounds really human… Like, I don’t know when exactly Apple introduced it, but if I say “Hey, Siri”, at least the voice that I have sometimes will play like a normal American English speaker would reply, exactly. Like “Yes, I’m listening.” It’s not going to say “Yes I’m listening.” It’s going to talk like a human. Something closer to the movie Her. And so I think that the voice being high enough quality that it almost sounds like you’re talking to a human… Not just the quality itself, but what they’re saying. It’s that kind of turn of phrase. But also what it’s connected to. The home assistants now - it’s basically your access to a search engine for the internet, right? You can ask it kind of fact-based questions and get answers that are usually accurate pretty fast… At least they’re more accurate than the GPT-3s… [laughs] But yeah, I think how fast it is, how human it sounds, and the kind of breadth of functionality it has; you can ask it so many different things. And you ask it to do things for you, like to schedule appointments, and so on.

[20:34] Not as a question, but just as a final thing on that topic… It’s interesting, coming through COVID, coming through the pandemic era, it’s changed the way kids interact. And I’ve watched my daughter - they’ll get on and play Roblox, online gaming, and they do a cellular conference call on the side. So all the kids are talking, and they’re playing Roblox, and they include these home devices in the conversation and bring things up. It’s kind of eerie to watch this whole thing happening, because you’re seeing children leaping forward with this technology in a natural way very, very rapidly. Multiple times I’ll just stop and go “Wow, I find that a little bit odd”, and I’m in the space… Anyway.

I really appreciated how you brought up this concept of how people actually use their languages, so how do humans actually speak, which I think is like a really interesting – we’ve sort of done this to ourselves in certain ways, because the speech corpora, for the most part, that we have created, aren’t actually representative of how people use their languages. So things like – recently at ACL I saw this amazing paper, I’ll link it in the show notes, from Text2talk from a group at Radboud University in the Netherlands. I’m not sure if I’m saying that right. But they were talking about like meta-communicative devices, meaning like filler things, like “Hm… Um… Yeah…” You know, like these sorts of things you brought up - these are very powerful communicative devices that people use very strategically in their conversation, but are for the most part considered like noise in our AI data, right? And we clean them out or don’t want to have them.

And I think the other one that you brought up, Josh, was emotion and how you tune sort of the emotional aspect of a synthesized voice. And I know that’s something you emphasize on the Coqui website. Could you describe – maybe it’s in the context of synthesized speech or speech technology more generally, but what are some of your kind of goals as you’ve founded Coqui, and how would you like to do some of these things like synthesized voice in like slightly different ways, or how has your perspective and being in the industry so long informed you about like “Oh, there’s things like this emotional piece that we really need to think about more deeply”?

I think about emotion and speech way too much; especially the last month it’s been really kind of top priority for us, because I mean – we’ve been working for a while on getting the voice cloning side of things working, so optimizing models on speaker similarity, so it sounds like you, like your vocal tract, physically. But getting emotion right is just so hard. And not only getting it right, but there’s this kind of – so there’s getting the model, the neural network to produce speech that sounds appropriately emotional. That’s one side of it. But another side that I’ve been thinking more lately about is how to, from a user’s point of view, somebody who is creating new synthetic speech with some neural network, how do they want to interact with that? How do they think about emotions? Is it something like a color wheel from Microsoft Word, where you can say “I want my color to be…”

[24:18] You wouldn’t even describe it in words; there’s just a pixel that you point to on the color wheel, and you’re like “I want that.” That’s a much better interface, even if you can’t describe it in real human words, right? I mean, you can do whatever RGB coordinates, and some designers maybe understand those intuitively, but most people in the world do not.

So if you think about color and kind of design as thinking about putting emotion into speech… Do we want an emotional color wheel of sorts? Do we want to have a dropdown menu that says, “Make this sound more angry”, “make this sound more sad”, “make this sound more sarcastic”? Or do you want something that’s more free-form, like typing a description of somebody who responds in an angry, but sassy, but sarcastic, but also a little bit sad at the end kind of way… You know, it’s hard.

Yeah, it’s almost like there’s an arc to how you want it to happen, right? Even defining like one emotional flavor for a clip is sort of not enough, in certain ways.

It makes the communication authentic, essentially, which creates trust, which affects that user experience.

So you mean if you’re able to put the appropriate emotion in speech?

Yeah. Because at the end, there’s a human that that model is dealing with.

You can think about it in this way that I was describing - like a designer that’s using Photoshop, right? In terms of using a color palette. You can also think about voice in terms of stage directions, or actor directions, right? I mean, I’ve spent a whole day sifting through scripts from movies and TV shows to just understand how do writers express what they’re trying to get across in the lines? Because if you look at a movie script, it’s not just, whatever, “Batman says this, Joker says that.” It’s “Looking away from the camera distantly, thinking about the future solemnly, and then they say this.” And putting that behind a web-based user interface is a challenge… Even more so it’s a challenge to make a neural network that is that controllable. It’s what we’re working on.
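As a rough illustration of the interface question being discussed here, one way to picture a "stage direction" plus an emotional arc for a single line is as a small data structure. This is only a sketch under stated assumptions: the names `EmotionCue`, `DirectedLine`, and `emotion_at` are hypothetical, and nothing here reflects how Coqui's models or web app actually represent emotion.

```python
from dataclasses import dataclass, field


@dataclass
class EmotionCue:
    """One hypothetical emotion marker placed along a spoken line."""
    start: float      # position in the line: 0.0 = first word, 1.0 = last word
    emotion: str      # free-form label, e.g. "angry" or "sad"
    intensity: float  # 0.0 (flat) to 1.0 (strongest)


@dataclass
class DirectedLine:
    """A script line with a screenplay-style direction and an emotional arc."""
    speaker: str
    text: str
    direction: str = ""                      # e.g. "looking away from the camera"
    arc: list = field(default_factory=list)  # list of EmotionCue objects

    def emotion_at(self, position):
        """Return the cue in effect at a position in [0.0, 1.0], or None."""
        active = None
        for cue in sorted(self.arc, key=lambda c: c.start):
            if cue.start <= position:
                active = cue
        return active


line = DirectedLine(
    speaker="Customer",
    text="I asked for an oat latte and this is not an oat latte.",
    direction="angry, but a little bit sad at the end",
    arc=[EmotionCue(0.0, "angry", 0.8), EmotionCue(0.7, "sad", 0.4)],
)
print(line.emotion_at(0.5).emotion)  # cue active mid-line
print(line.emotion_at(0.9).emotion)  # cue active near the end
```

A structure like this sits somewhere between the dropdown and the free-form description: the `direction` string stays human-readable, while the `arc` gives a model something positional to condition on.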

It’s exciting to hear – one of my favorite parts of honestly working at Coqui is when we’ve – we’ve got so many people who are so much smarter than I am when it comes to speech synthesis, and we’re working away, hacking on something for a week, and then on Friday we share some new voice clips from the new models, and it’s “Wow, that person sounds super-angry.” And it’s just a synthesized voice talking about whatever, getting the wrong coffee. I don’t know, it’s one of my favorite things about working with this.

Yeah. So would you say the mindset that you have at Coqui, and one of the things that you’re wanting to enable is this sort of easy to access, like from a creator standpoint, this sort of configurable and controllable way to synthesize voices? Would that be a synopsis of at least part of what you’re trying to do?

[27:46] No, I’d say definitely. I mean, that’s one side of it, which is very much the kind of business side, customer-facing, making synthetic speech for people who are creating content. That’s one side of it. And another side, which is kind of historically where we came from, is the open source research – I mean, I think that we’re pretty special in terms of the kind of voice cloning companies out there. I mean, saying that we’re a voice cloning company is not doing it justice. So we’re a synthetic speech company, but the amount of work that we do open source… And I think that because we’re open source, we’ve been able to really attract some really smart people. We’ve learned from them, we share and collaborate, and that is honestly one of the most also refreshing parts of the job, is getting people – especially people working on low-resource languages, which is something that I did my whole kind of PhD on… And working on this paper with Daniel, from the BibleTTS, which is coming out of Interspeech, making some of the best synthetic voices for a handful of African languages… I mean, working with everybody on that team was so much fun, because it was like everybody’s so motivated, and also we’re all in such different time zones… It felt like there was like a passing off the torch…

Very distributed.

Yeah. I mean, I think you have that clip, if you want to play it, of kind of what it sounds like.

Yeah, this was a Hausa clip. One of the interesting things about this clip, I think, is it’s one of the out of domain clips. So we tried synthesizing some voices because the voices are out of audio Bibles, and we tried some Bible-related synthesis, but this is actually out of domain. So this is like a news article, but using the same voice. [audio 00:29:38.10]

So Josh, it was, as you mentioned, a lot of fun working on the project related to the synthesized African voices. That was so cool. And also seeing the actual members of these language communities working on technology for their languages… And I think a lot of that was enabled because of a variety of open source tools, but certainly kind of centered around some of these that Coqui has produced.

Did those open source things - were those things that you were working on personally before the company was founded? Or was this a sort of team thing that you started and it was always kind of a part of the strategy that you were building with Coqui?

Okay, so in terms of the kind of core open source technology that we’ve been working on at Coqui, there’s two main sides of it. There’s the speech to text, and there’s the text to speech. Those two projects were projects that we were working on, the founding team, we were all working on for the past almost five years at this point, when we were all working or collaborating with Mozilla. And so it’s been going on for a while. And I think – so for this project in particular, there was a couple other parts that were really helpful that were actually outside of Coqui. One of them, in case people are interested, was the Montreal Forced Aligner, which is maintained by a very hardworking group of academic folks, which made –

It is really nice.

It’s so nice, right? [laughs] It’s nice because it’s built on top of Kaldi, which - for anybody who’s used it, can be a little painful… But the Montreal Forced Aligner wraps it so nicely that you don’t have to worry about all the kinds of how do you put all the stuff inside the black box.

[32:06] But yeah, so the projects - they were started at Mozilla, and the community, the open community grew around them, and there’s long-time collaborators from all over the place, in all different kinds of languages. And with Project Common Voice, which we’ve mentioned before, was really – Common Voice was a project that was created to be the data feeder for the speech recognition side of the open source project. And that’s why there’s this really rich, I think, multilingual kind of heritage to the projects, if you want to call it that, because we’ve been working with kind of traditionally marginalized languages… And those people from those communities - they are so motivated to work with the languages, they care so much, and they get it. They get that this is important, because some of the bigger companies are starting to put out more multilingual work, because of the existence of Common Voice, really; because before that, there was just no data for – I think Common Voice has all the Celtic languages now; it’s got Welsh, and Gallic, and Gaelic, and maybe there’s Manx in there, too… I mean, there’s tons of languages that are just community-driven efforts, or at least their participation in Common Voice and in speech recognition is a completely community-driven effort. So yeah, I mean, if it weren’t for the open source side of things, then none of this would have been possible.

So one side of it is like the models, the architectures, the implementations that are driving the speech to text, text to speech, and like the voice cloning things that you’re doing… But then you also have kind of pre-built models as well. If I’m going through your site, there’s a lot listed there, which seemingly you could get started with out of the gate. Maybe you could describe that kind of ecosystem a little bit, like what’s currently there, how you’ve seen people use it, maybe even in surprising ways.

Oh, yeah, there’s been some funny ways people have used it. So the largest diversity of languages we have is for speech to text, the speech recognition side of things. We set up the codebase so that fine-tuning to a new language is super-easy, even when you have a tiny bit of data. And if you constrain the vocabulary, you can do really cool things, even if you don’t have enough data to make a full-blown speech transcription system.

For example, we had a hackathon… Wow, was it this year? [laughs] I feel like it was earlier this year… Where a team put together a voice-activated 3D chess board, that you could – well, you can; it’s open source, it’s out there. They got it working for English, for Turkish, I think maybe Hindi… And right now, there’s some people who are adapting it for Korean.

And that’s like, “Move pawn to…”, whatever – I’m not a chess player, but I know the sort of things I hear on Harry Potter, or whatever.

Yes. Yeah, it’s exactly that. There was a huge discussion on how to do that well. So now, after that, I know how to move pieces in chess, but before I did not know how to say it out loud.

You also have to capture Harry Potter’s emotion though, as you’re moving the pieces, and all that. That was a very emotional game that they were playing there, so… We’re right on topic here.

[35:59] Yeah, yeah. The cool thing about that is because the models for speech recognition are so small - I think they’re like 46 megabytes for the acoustic model, and then the language model is, I think, really small, because the vocabulary is so small. So these things you can run just on your laptop. You can turn off the WiFi and just have it running locally.
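To make the constrained-vocabulary idea from the chess board example a bit more concrete, here is a minimal sketch of one way to picture it: snapping whatever the recognizer heard onto a small closed set of command words. This is an assumption-laden illustration, not how Coqui's speech-to-text constrains its output (that is described above as a small language model); the `VOCAB` list and the `snap_to_vocab` helper are hypothetical, and only Python's standard `difflib` is used.

```python
import difflib

# Hypothetical closed vocabulary for a voice-controlled chess board.
PIECES = ["pawn", "knight", "bishop", "rook", "queen", "king"]
FILES = ["a", "b", "c", "d", "e", "f", "g", "h"]
RANKS = ["one", "two", "three", "four", "five", "six", "seven", "eight"]
VOCAB = PIECES + FILES + RANKS + ["to", "takes"]


def snap_to_vocab(transcript, vocab=VOCAB, cutoff=0.6):
    """Map each recognized word to its closest in-vocabulary word.

    Words with no sufficiently close match (fillers, noise) are dropped.
    """
    snapped = []
    for word in transcript.lower().split():
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        if match:
            snapped.append(match[0])
    return snapped


# A noisy transcript: a filler word, plus two homophone-style errors.
print(snap_to_vocab("um night to e for"))
```

With so few valid words, near-misses like "night" and "for" have only one plausible target each, which is the intuition behind why a tiny vocabulary can work even with very little training data.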

And actually, right now, the last few weeks, there’s been a group of folks who are working with the Catalan language, who are adapting our speech recognition tools to make it work with WASM, so that it just runs in the browser, super-fast, everywhere. That’s just like a subset of our open source community, who just picked up the tools and are running with it.

And then for the speech synthesis side of things - honestly, one of the easiest places to interact with those models is on Hugging Face. We have a Hugging Face space, I think it’s coqui-AI… And you can just – the Gradio app is really nice, I have to admit… You just type in what you want to say, you click the language and the model you want, and then you get it. I don’t remember exactly how many languages we have for speech synthesis, but it’s growing. And after the Masakhane collaboration, it’s six more languages from Sub-Saharan Africa. So it’s pretty cool.

So I’m just wondering, as you’re doing some of these, how – I’m kind of going back to when you were talking about how you were just thinking about this all the time, and you can’t really turn that off…

Voices in your head…

Exactly. Like, so many topics… What are the types of things that practitioners as opposed to the users need to be thinking about as the field at large is moving forward? Because we’ve asked these kinds of questions of ourselves, and 99% of everyone has the best of intentions, as do you. But how do we make sure that as we really move the state of the art forward in terms of having things like very, very genuine-sounding emotion, and different emotions rippling through this, how do we need to think about that in terms of the effect on the users? Because there’s some amazingly positive things, and potentially, if we screw up, for the very few bad actors out there, there could be negative things as well. So it directly affects kind of mental health, both positive and potentially negative, of the end user. So as someone who’s thinking about this all the time, how do you frame that? How do you frame moving the field forward in a positive and productive way?

So I think that I’m glad you brought this up, because it is a huge part of working in this technology. If you ignore it, you say, “I’m working in voice technology, and I’m just going to work in my bubble, and I’m not even going to care about the fact that somebody might be using our speech recognition for illegal surveillance”, right? Like, if you ignore that that is a possibility, you are doing yourself and also the community a disservice.

So there’s a lot with this… For one, I’d like to point – and hopefully, we can get maybe a link to this… There’s an issue on our GitHub for text to speech, which is an open discussion on ways of mitigating misuse of synthetic speech systems. And that started as kind of – actually, not an issue; it was a GitHub discussion, I think. And it started as this kind of, “Hey, let’s throw some ideas together” and it just evolved, and now it’s just kind of a growing discussion. And I think being open, having these conversations openly is really important, because we’ve got some feedback… There were ideas that came from the people in the community that I’d never even thought of. There’s watermarking audio, which is kind of obvious… But there’s whole layers of that.

[40:04] And then there was somebody who’s doing their master’s thesis on this in particular, and they weighed in… So I think that one first step is to not brush it off, to think about what kinds of misuse are possible… Because there’s so many different kinds, and if you lump them all together, then it becomes too kind of hard of a problem, and it can just kind of lock up; you get brain freeze, or however you want to call it.

But I like to think about kinds of misuse as basically two major groups. There’s people who will misuse technology accidentally, where they don’t think they’re doing something bad, but it just blows up… Which could happen very easily on social media. The Onion, as a satirical newspaper from the United States, famously – I don’t know, every other month there’s an article from The Onion that gets taken seriously by people who don’t know The Onion. And that can cause real harm, right? It can get people very upset, because they have a satirical headline, and it looks like a real news site, but it’s not.

And so that is an example of somebody who’s not – they’re not trying to spread fake news, but because people don’t know the context, because it’s just shared as a screenshot on social media, it loses all context, and it just snowballs into something that it was never intended to be. So there’s examples like that.

You can think of somebody who uses our voice cloning to make a clone of President Trump saying something and they share it on social media… And earlier, we had a voice cloning demo which was – you didn’t need to sign up to have an account to do it. So anybody could do it. And what we did then, because anybody could do it without having us know who they were, we put basically a watermark, a very audible watermark in the background, which is background music. I mean, it’s like a very low-hanging fruit, but anybody listening to it would not think, “Oh, this is an actual recording, a secret recording of Trump in the Oval Office.”

Yeah. You delegitimize it deliberately, to where any user will pick up on that.
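
[Editor’s sketch] The audible-watermark idea described above can be illustrated roughly as follows. This is only a minimal sketch and not Coqui’s actual implementation: the function names, the sine tone standing in for background music, and the `bed_gain` value are all assumptions made for illustration. Audio is modeled as plain lists of float samples in the range -1.0 to 1.0.

```python
import math
from itertools import cycle


def make_tone(freq_hz, duration_s, sample_rate=16000):
    """Generate a sine tone as a stand-in for a background music bed (hypothetical helper)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def add_audible_watermark(speech, bed, bed_gain=0.2):
    """Mix a looping, clearly audible background bed under synthesized speech.

    speech, bed: lists of float samples in [-1.0, 1.0].
    bed_gain: assumed loudness of the bed relative to full scale.
    """
    if not bed:
        raise ValueError("bed must contain at least one sample")
    mixed = []
    # cycle() loops the bed so it covers the whole utterance.
    for s, b in zip(speech, cycle(bed)):
        value = s + bed_gain * b
        # Clip to the valid sample range to avoid overflow after mixing.
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed
```

The point of the design is that the bed stays audible for the full length of the clip, so a listener hears immediately that the audio is not a raw recording.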

And so if you’re making something that’s really, really accessible, easy to – like, the barrier to making a mistake and making something that you think is funny, and sharing it with your friends, and then it blowing up is very low, then I think putting those kinds of roadblocks in is important. Roadblocks is another – if you think about mitigating risk as putting up different kinds of roadblocks… Because at some point, it’s impossible to mitigate all risks; or at least, it’s impossible to guarantee that somebody or an organization that’s motivated enough will not do nefarious things with your code, right? But there are lots of responsible ways to put roadblocks in there.

So right now, we have this voice cloning technology, but you have to sign in, you have to have a real email, and that is a way to have some kind of accountability. And on the speech recognition side of things, we have model cards that are out there that explicitly say “Do not do bad things with this.” It’s more specific than “Do not do bad things.”

And in terms of research collaborations, I think that, especially when you’re working with language communities in which you are not a member, it’s important to have members of the community working with you, so that you know what the risks are. Because the risks for me living in California, the ones that I can perceive, are not the same kinds of risks for – you know, I have long-standing collaborators from Makerere University in Uganda, and we’ve been working on keyword spotting in radio data, ideally to be used by the Ministry of Health, to kind of help inform health policies. And there’s a whole bunch of risks that we spent a long time talking about, and figuring out how to mitigate for that context, because it’s just different.

[44:21] I think somebody brought this up on Twitter - while it’s kind of forefront in my mind - how do you know if you’re working with a language technology for a low-resource or marginalized language, that you’re not doing harm to the community? And I think the simple answer is, if you’re not a part of the community, you have no idea. That’s why you really need to be working with people from the community, if you want to be working on that technology.

I’m glad you addressed it. It’s a relief to hear – you have so much thinking around that, way more than we can cover, but it’s also a relief to know that you’ve kind of put that thought ahead of doing this stuff… So yeah, thank you for that. I appreciate it.

But yeah, I think it’s also like there’s really amazing positive things that come out…


…involving the community, like this Masakhane work that you and I did, Josh, with that community. I don’t speak those languages, but there’s like very simple things that I would have missed in the audio, or in the processing, that were just completely obvious to them… And like it made the results so much better. Also in terms of the quality of the work. So just involving the community… Like, you learned so much, you learned about this use and ethics side of things, but you also generally produced better output and better work. There’s a lot to be said for that, and I’m glad you brought it up.

And actually, I want to say - the way I think about it is I don’t want to be working on a project where I’m involving the community. I want, ideally, to be involved. I want them involving me. The motivation – and I think that with this project in particular… I know I did a lot of the kind of the technical – some of the technical, not even the technical side of things, but it was very much a Masakhane-driven project. And I think that if there’s language communities out there that want to collaborate, I’m like “Yes, let’s do it!” But I definitely want to be the one who’s getting involved, as opposed to trying to pull other people into a project, that might have false pretenses in the first place. Like, I could think this is a great project, and I might be able to convince people that it’s a great project, and it’s their language, but it’s not. At the end of the day, I think that if you get the motivation, the impetus going the other direction, that’s where real good work is done.

Maybe that’s a good sort of way to segue to a close, as we’re coming up on the end here… What would you tell people out there – maybe it’s language community members, because now we do have listeners all over the world… Like, language community members that want to get involved, and sort of like build things with the open source technology that Coqui is a part of. Or maybe it’s people that are creators, are curious about this technology and want to get involved. What would you tell them in terms of joining into this work and helping move it forward in a positive way?

[47:21] I think there’s like a million ways to get involved in a machine learning project, and you do not have to be technical. There’s people who use the tech, and that’s the obvious one to point to… Like, “Oh, I’m involved in the project because I’m using the voice cloning software to clone my voice, sounding like five different characters in my video game. And I’m using that.” That’s one way. But from the open source side of things, there’s so many ways to get involved. Documentation is just a super low-hanging fruit. Documentation is something that really can make or break an open source project. And you can be super-technical, and write super-technical documentation, which is useful, like API documentation… Or you can be somebody who’s writing a kind of best practices playbook on how to use the tech. And honestly, the people who are less technical - maybe they’re just starting, they know a little Python… Those people are able to write tutorials, and beginner-friendly documentation way better than people who have been in it too long… Because if you’ve been in it too long, you’ve just absorbed all of these assumptions that are not intuitive for working with the code.

So I think getting involved with an open source project, an easy way to do it is to join wherever people are talking, whether it’s on GitHub, whether it’s – like, we use Gitter for our chat rooms, and you can pop into the chat rooms and say, “Hey, this is me. I want to get involved. Here’s my skills.” That’s also a very low-hanging fruit. And also, I think – so the Masakhane community is, I think, honestly, the best example of this. They’ve got such an active chat room on Slack. You’ve got people from all across the spectrum, of super-technical to not technical… But the thing that unites everybody is just they love languages, right?

When I think about where I want us to be, at Coqui, for a kind of healthy open source community, I often think about Masakhane and how they’ve done a great job. That’s the whole reason the project started, the speech synthesis project started, is because I enjoyed hanging out in those Slack rooms, because they were fun, and then we just started brainstorming, and then it just evolved.

So yeah, I think that getting involved is very easy. You definitely don’t have to be technical. And a lot of times the non-technical people have more to bring, because they’ve got a fresh view on things.

Beginner’s mind.

Yeah. I definitely appreciate that, and I think it’s a great way to end. I mean, definitely, even with this podcast, it’s great to be part of a wider community that’s doing amazing things, and you sort of get sucked into these amazing stories, with Masakhane or other things… So yeah, I’m really glad that you brought out that side of things, and I appreciate you taking time to speak with us, Josh; really excited to see what’s happening with Coqui, and hope to have you on again in another 80 episodes, to share all the great things that are happening then. Thanks.

Yeah, thanks for having me.

