Practical AI – Episode #128

Next-gen voice assistants

with Nikola Mrkšić from PolyAI


Nikola Mrkšić, CEO & Co-Founder of PolyAI, takes Daniel and Chris on a deep dive into conversational AI, describing the underlying technologies, and teaching them about the next generation of voice assistants that will be capable of handling true human-level conversations. It’s an episode you’ll be talking about for a long time!





Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who’s a principal emerging technology strategist at Lockheed Martin. How are you doing, Chris?

I am doing just fine. How is it going, Daniel?

It’s going great. No complaints. I took a few days off last week, and a short week this week, so… I’m rushing to get things done this week, I guess is how it is.

It’s supposed to recharge the batteries and make you feel all refreshed, but you just end up having to do all the same work and get it done faster, I get it. I’m sorry.

Yeah, I have known people in my life that do a really good job of frontloading some of that before they leave. For some reason, I’ve never been able to figure that out, so…

No, neither have I. I think that’s like a super-power that people have. It’s a vacation super-power. It’s not something I’ve ever acquired, sadly.

Yeah, maybe someday…

Yeah. I have to go into vacations highly stressed out because of everything that I was trying to get done, so yeah, I guess that makes the vacation all the more important.

Maybe so, maybe so. Well, one of those things that I’m definitely working on right now and trying to get out the door is a couple changes to some of our internal speech dialogue related technology… Which is part of the reason why I’m really excited today about the topic, because we’ve got a great guest, the CEO and co-founder of PolyAI, Nikola Mrkšić. Welcome!

[04:22] Thank you for having me. It’s great to be here.

Yeah. So before we get into all sorts of speech and voice and dialogue-related things, maybe you could just share a little bit of dialogue with us about your background and how you got to do what you’re doing now.

Yeah, for sure. I’m the CEO and co-founder of PolyAI, I did a Ph.D. with a guy called Steve Young back at Cambridge with my two co-founders, Shawn Wen and Eddy Su… And you know, we worked on building dialogue, in an academic context, for a long time… Since 2006, when Steve founded the group. Steve started working on this stuff back when speech error rates were about 20%, which means that one in five words that you said would be misrecognized, and typically it would be the one that you needed to get right to understand what the meaning of the sentence was.

Of course!

[laughs] So building a formalism on top that would kind of like let you model the uncertainty, predict all the right things, know when you got something wrong, disambiguate, ask a question, personally confirm something… It’s an art, and I’m sure we’ll talk about it a little bit. It’s the squat or the deadlift of just kind of like doing machine learning and NLP, because it involves natural language understanding, dialogue management, response generation, interacting with the external world, knowledge bases of specific different tasks, all the way into the natural language generation, figuring out what to say in human language, and then producing it again in an audio format which is as human-sounding as possible. So it’s a real compound, and you have to get it all right.

So it’s a really difficult task, it’s why I got into it, it’s why I’ve stayed passionate about it, and my whole team is really a group of people who worked on it long before this new hype of conversational AI came about. We worked on it for a while… Our previous company, also a spin-out from the University of Cambridge, VocalIQ, was acquired by Apple in 2015, to make Siri more conversational, to give it a bit more of an ability to have a back and forth conversation… And in academic terms, the multi-turn task-oriented dialogue is something we’ve stayed passionate about. And at PolyAI we’re building voice assistants for customer service, helping brands create something that’s a superhuman customer experience, for anything that’s short, to moderate, or even high complexity, putting in automated agents that sound at least as good as your best agents. They have answers to all the right questions, they’re always current, up to date, and they’re able to provide a superb level of customer service; and if they’re not, they have also their human colleagues as kind of like supervisors or [unintelligible 00:07:01.16]

You mentioned one thing, a phrase - multi-turn voice-enabled dialogue I think is what it was… So maybe you could just kind of set that in context for some of the – like, what are the categories of dialogue and voice-enabled dialogue that are out there? Probably most people are familiar with Alexa… Is that multi-turn voice-enabled dialogue? What are the categories of things out there that people are doing in terms of interacting via speech?

Yeah, there are many ways you can look at it. I’ll lay out the taxonomy here. I think when you think about Alexa, Google Assistant, Siri, and you ask if they’re multi-turn, if you can really have a dialogue, a conversation, as opposed to just have many questions answered, the truth is primarily they’re single-turn question answering, or kind of like simple task execution systems. But then again, they’re working really hard on making them multi-turn.

[08:03] Now, one reason why it’s really hard to build a general multi-turn voice assistant for consumers of all shapes and sizes is that they have very different requirements, they’re trying to do different things… So it’s actually a task of enormous complexity.

When it comes to the things that we do, they’re a bit less complex in scope, because we build things to help you change your ticket for an upcoming flight, or maybe you’re making a reservation for a restaurant, or you’re trying to debug your router, which stopped working, and you’re having connectivity issues. Or you’re calling your bank and updating your address. These are all things we do.

The one thing that’s important about that task-oriented bit of the nomenclature is that it lets you evaluate. And when you can evaluate, that means you’re doing good science and you can improve. Now, evaluating something that does as many things as Siri or Alexa - it’s hard. Building them is hard, and evaluating them is hard. Knowing what you should be expecting and where product-market fit for them is - it’s one hell of a task.

Yeah. So when you’re saying a turn, just to kind of get into some of this jargon - a turn would be like you say something to your smart speaker and get a response; is that what you mean? And then in your multi-turn - if you’re trying to debug your router, or you’re changing your flight ticket or something, that is likely going to take more interactions than uttering something and getting a response. Is that correct?

For sure, for sure. Yeah, the whole reason we talk about turns is that, for the most part, the dialogue systems today, from the voice assistants of the large tech giants, to automated customer service, to chatbots, are built on what’s known as the turn-taking paradigm. So the assumption - and it’s a strong one, and it’s not something that necessarily holds in human speech, is that you’re gonna wait for me to finish before you start speaking. And the assumption is that you’ve also absorbed all the information that I’ve tried to relay over to you before you started speaking. Then we’re taking turns speaking, and… Yeah. A multi-turn conversation is kind of like anything that takes a lot more than one turn to achieve a task.

I actually wanna take you back for a minute, because you said something interesting… I’m curious about your perspective compared to someone like myself, who doesn’t have your expertise doing this. When you were talking about Alexa and Google Assistant being hard - and you used the word “hard” associated with that several times - I really couldn’t help wonder, as someone who has been doing this as long as you have, starting with those Steve Young days and moving forward to the present, when you say it’s hard, I’m kind of curious what you’re thinking. You’re compressing it all into a single word, but how are you thinking about that? As you were saying that, I kept wondering that.

Okay, it’s a really good question. I say it’s hard because it’s enticing, it’s fun, it’s a big problem, I expect to spend the rest of my life solving it, and to be maybe a small cog in the wheel of how that ends up being solved. It’s a hard problem on a pure academic level, because it’s that compound movement of different NLP tasks that all need to work really well, and they need to communicate with each other, which is something that we’ve not really yet cracked.

The thing at the center of the dialogue system, language understanding, is not a task that’s fully well-defined just yet. Think of speech recognition - in most languages, you say something and there’s exactly one way of writing it down. But natural language understanding - what does it mean? How do you choose to interpret it? What are the things you’re choosing to take away from, even like an order, or something as simple as that? So there, a lot is left to the interpretation, and that means that it stops really being a science, or even a field of engineering where you have a clear metric to beat… Because really, what we’ve shown over the past ten years, especially with machine learning, is that you give something a clear evaluation, and the sheer force and the intellectual power of the people working on it will crush it.

If you think about question answering, the SQuAD dataset.

[12:10] I remember the leaderboard on the Stanford website, I believe, and - you know, at first I think the scores were pretty low. Then all of a sudden, like 3-6 months in, we got to the point where performance was unbelievable. I couldn’t believe that it’s that good. And it’s because when you define a clear scope for a problem, we’ll build a machinery to solve it.

Now, when it comes to building these voice assistants, we don’t know what machinery to build. The truth is we’ve built a lot. [unintelligible 00:12:36.27] I like to compare these assistants to aircraft carriers. When you think about Alexa - 14,000 people building that thing. And if you think about the ROI, they’re not building it because they’re making a ton of money on it. They’re investing in the future. I mean, Amazon always has a math around it, and I hope they’re right… In any case, it’s really good for us, because they’ve actually indirectly funded a big growth in the area. They’ve allowed us to, in turn, build a lot of stuff ourselves. But we were doing it before they got interested.

The problem itself is hard. It’s a good word to use… Because you have to solve a lot of these different problems, you have to solve things that are not just in the domain of machine learning, but also human-computer interaction.

The voice user experience is something that academics tend to overlook. They don’t appreciate that sometimes just the tone or something like that is much more likely to prolong the conversation and imbue the caller with enough patience and goodwill to go through a conversation. Equally, people who are very good at user experience don’t tend to be mavericks at machine learning… And then kind of bringing it all together to build something - it takes a lot of different personas, people… It is hard. So I guess that’s what I mean.

You’re talking about the human-computer interaction, and maybe you could speak a little bit to the way in which people interact with - whether that be like a text chatbot, or a voice assistant or something, is different than how they might interact with another human. What are some of those differences in terms of ways that people interact with those systems, versus their friend and meeting them at the coffee shop.

Oh, for sure. I think that we could spend a lot of time just talking about the differences between voice and chat, and how people interact there… People will dispense with pleasantries and they’ll tend to use shorter sentences when they operate with technology. They’ll swear more; a big chunk of input going into all the large tech companies’ assistants are–

I’m sure you have a huge database now of incredibly explicit and abusive language that people have said to their bot…

We’ve maybe got a few colorful examples… I can tell you for a fact that we have a lot less percentage-wise than the large tech companies, where the number of turns coming into these assistants with profanities can reach double-digit percentages. Part of it is just human nature… Like, what do you do when you can use a new technology that understands you? Well, you swear at it, because why not? “Let’s see what it does.” Hopefully you tried it at some point, or hopefully – I don’t know why “hopefully”, but… You know, things like Siri will have really good back-up mechanisms to be sassy, or to tell you off when you’ve cursed at them… Or even when they think you might have, which can happen, because speech recognition is not perfect.

The other thing that’s always interesting with technology is what do people build first when a new framework comes to mind… I used this in one of our first investor pitches, but… At the time, the top four applications within Alexa were things that allowed the system to read recipes out loud, which is kind of cool, and then the remaining three were meowing, farting and barking. Funnily enough, the top revenue-grossing app when the iPhone was released was something playing one of those kinds of sounds. You can probably guess which one.

I guess it’s a pattern of how technology evolves, but people tend to do these kind of simple things that are just hacks, where they have their fun, and then they go on and build life-changing things.

So Nikola, you were talking a little bit about different applications of voice technology, but also the way people interact differently with even chat versus voice… From your perspective as someone who’s really hands-on working with customers in this space, what makes for a good voice use case? From my perspective, people maybe don’t have a great grasp on yet, in terms of like - yeah, we all think voice technology is maybe gonna be a huge thing, and we can see really cool applications of it, and maybe even really useful applications, but it might be hard for people to visualize what is a good voice application and what are the benefits of that as compared to creating a text-based search, or creating other things… Like, when should I be thinking maybe voice?

As you answer that, can you differentiate between voice and chat, just for people who aren’t intimately familiar with the use cases?

Yeah, for sure. I think that there is a bigger question of like where you want to use voice as an interface to technology, and then there’s just the more narrow question of where you want to use voice or text when dealing with customer service.

The only other interface you really have other than language are good graphical user interfaces… Let’s say like smartphone apps, and the web. Obviously, the language-based ones are better if you’re on the move or if you’re just simply not in front of a computer… Or if you want to do something really quickly. Now, how the whole AR/VR space will evolve - it’s hard to predict, but we know it’s coming. And there, the role of voice in particular is gonna be much larger than what you see when you interact on the web, or with mobile. Mobile in fact is the worst one, because you’re kind of holding your phone, it’s a bit awkward, you’re typically surrounded by people… It’s awkward to speak into your phone. And really, the place where voice on a phone has really been successful is hands-free while you’re driving. There, Siri gets a tremendous amount of usage.

[19:54] When you think about the web, I think that’s where chat is a natural interface for customer service, compared to speaking, often… Because you might be at work, you might be speaking to your bank, or dealing with something, and you don’t want your colleagues to know that you’re actually doing that at work, so chat is pretty useful. Or maybe it’s early in the morning and you don’t wanna wake up other members of your household. But in reality, 60% to 70% of all customer service interactions happen over the phone, and they happen with voice because in this day and age, where you could easily transcribe all this and have all sorts of channels, we’re doing this podcast. Well, while recording it, I see you guys, and you see me, but the end product is just voice, because people can consume it anywhere, and it gives them a pretty good feel for what kind of people we are, how we talk, our style… A lot of emotion goes through that voice, and it’s also a really high bandwidth channel. I can probably type a bit faster than I speak, although it depends, because I tend to speak very fast… But really, I have a lot more fun when I speak, and it lets me express myself a lot more fully.

If you think about just the need to capture that channel when it comes to customer service, when Covid hit, everyone thought big crises like these tend to accelerate technology adoption… And for close to a decade now, companies have invested in digital transformation in order to push people to digital channels - mostly chat, either web chat where there are humans on the other end, or chat with an automated system. The hope there was it’s cheaper.

Some people, especially those heavily invested into these projects, would tell you that it’s the channel of the future, younger people prefer doing it… Well, look - I’m a millennial, I’ve got a Ph.D. in computer science, I grew up playing computer games, and not seeing the sun… And guess what - when I need customer service, I like to call. And I have a bit of anxiety calling in, like most millennials do… But I still prefer to call, because it gets the job done. The alternative is you’re typing, and then someone responds in four minutes, because they’re actually speaking to ten people at the same time… And it’s not really a better experience.

With Covid, people thought “Hey, now’s the time for chatbots to take over.” They’ll go from their 10%-15% of the market share, heavily augmented by the fact that you’re being forced onto that channel, and the hope was now it’s gonna go to like a much higher percentage.

Truth is, Covid hit and call center volumes went up, and all other stuff went down because of social distancing or mandatory lockdowns… But really, people kept calling, and it’s dispelled that myth. Now, in our case it’s been really great for PolyAI, because we built voice-based systems for customer service, and it’s been a big boon, especially getting into those industries that previously might have hesitated to build this kind of futuristic technology… But it’s not going away. And as time passes, you’ve got a smart speaker in every part of your house. At some point you’ll have some kind of wearable that will capture your voice really well… It’s gonna be really convenient to just say “Hey, turn on my thermometer, and order pizza.” We can talk about these scenarios…

I’m curious though, as we’re about to dive into that next - do me a favor and set some context for me for those of us… Like, both of you guys are experts in natural language processing; I’m one of those interested people, but I’m not an expert like you… And we’ve talked about having this multi-turn dialogue and these interactions. As you’re going and solving this for people out there and providing these capabilities that we’re all getting excited about, can you talk a little bit about what it is that you have to be thinking about in that pipeline, as you’re doing multi-turn? What are the things that are a part of that consideration for those of us who are not as intimately familiar with that?

[23:48] Okay. So if you think about it like that, that cycle of building a dialogue system, especially if it’s voice-based, the first step in it is speech recognition - transcribing what you think the user said. If you’re doing it in the best way possible to maximize performance, the output of that is not a single sentence, but instead something that is a lot more complicated. A bit more complicated is an N-best list, so maybe ten different hypotheses of what you might have said.

Let’s say I wanna get Serbian food, and I said it fast. So is it Serbian, Siberian, Syrian…? You’re not sure. Or like “I wanna go with three people.” Did I say three, or did I say free? Well, of course, if we go into that technology, the language model there would basically say “Hey, it’s more likely that they said “three people”. But then again, “free people” is also something that you tend to see quite often in text, so it’s not impossible. So a good system will tell you “I think it’s three people, but it might be free.” And equally, Serbian/Syrian, a few other hypotheses.
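The N-best idea described here can be pictured with a small Python sketch. It is only an illustration: the acoustic scores and language-model probabilities are invented numbers, and `rescore` is a hypothetical helper, not any real recognizer's API. It combines the two scores per hypothesis and normalises them, so the most likely reading comes first while the alternative keeps some probability.

```python
import math

# Hypothetical N-best list: (transcript, acoustic log-probability).
# All numbers are invented for illustration.
n_best = [
    ("i wanna go with free people", math.log(0.50)),
    ("i wanna go with three people", math.log(0.45)),
    ("i wanna go with tree people", math.log(0.05)),
]

# Hypothetical language-model probabilities for each transcript.
lm_prob = {
    "i wanna go with free people": 0.20,
    "i wanna go with three people": 0.75,
    "i wanna go with tree people": 0.05,
}

def rescore(n_best, lm_prob, lm_weight=1.0):
    """Add acoustic and weighted language-model log-scores, then normalise."""
    scored = [(text, ac + lm_weight * math.log(lm_prob[text])) for text, ac in n_best]
    total = sum(math.exp(s) for _, s in scored)
    ranked = [(text, math.exp(s) / total) for text, s in scored]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

ranked = rescore(n_best, lm_prob)
for text, p in ranked:
    print(f"{p:.2f}  {text}")
```

With these made-up numbers "three people" ranks first, but "free people" still carries enough probability to be worth keeping for later turns.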

So then the next thing that comes is natural language understanding. Taking what the user said, and parsing it and saying in some ontology that I have previously defined, that I need to interact maybe with the external world… So let’s take booking, as an example. If I say “Hey, I wanna come in with me and my fiancée.” That actually means two people, and it’s not NLP as in like parsing who the entities are, because while it could be useful in a composite task of counting up how many people there are in the request, really what you need to know is that I’ve initiated a booking request, and how many people have asked for it. And it’s two people. And that’s actually a really hard thing to do, because actually parsing those words and saying “A-ha, two people” - that’s complicated.

Then the next thing is – in very traditional dialogue system literature this would be called a dialogue act, where I’ve kind of like confirmed that the number of people is two, and that will then go off into a dialogue manager that will say “Okay, well do I have something for two people?” And then the system will have to go and say “Request a time for the booking.” Now, request time - if you respond like that, you’d sound a bit like [unintelligible 00:26:00.19] you’d sound like the Terminator, or you sound like a really bad voice assistant.

“Request a time…”

Yeah, for sure. So you need to turn that into “Hey, what time would you like to come in?” So that’s kind of like natural language generation, another big subfield of NLP… And then finally, if you wanna produce it in audio, you have to use a text-to-speech engine. That would convert it into audio, you would play it back…
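A minimal sketch of that last step, turning a system dialogue act into a natural-sounding prompt, might look like the following. The `DialogueAct` schema and the `TEMPLATES` table are assumptions made up for this example; template lookup is only one simple way to do natural language generation and is not claimed to be what any production system uses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueAct:
    """A system action: an act type plus the slot it concerns (hypothetical schema)."""
    act: str
    slot: str
    value: Optional[str] = None

# Hypothetical templates mapping (act, slot) to a human-sounding sentence.
TEMPLATES = {
    ("request", "time"): "What time would you like to come in?",
    ("request", "people"): "How many people will be joining?",
    ("confirm", "people"): "Did you say {value} people?",
}

def generate(act):
    """Look up a template for the act and fill in its value, with a generic fallback."""
    template = TEMPLATES.get((act.act, act.slot))
    if template is None:
        return "Sorry, could you say that again?"
    return template.format(value=act.value)

print(generate(DialogueAct("request", "time")))
print(generate(DialogueAct("confirm", "people", "three")))
```

The text returned by `generate` is what would then be handed to a text-to-speech engine.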

And then the big thing in understanding everything and having a good conversation would be the bigger task of dialogue management - looking at the whole previous set of things that were said, and using it to augment the prediction in every subsequent turn.

You might have thought that I said “free people”, but if I repeat “free people” in the next turn, then that alternative hypothesis is probably true, and the system should figure it out, like “Hey, why is he repeating it? It doesn’t sound right. Let’s try that other one. Did you say “three people”?” You can choose to confirm if you’re uncertain. And there, there’s a lot of machinery around how you handle that probability distribution, uncertainty, a lot of Bayesian methods that come into play. It’s pretty serious, this one.
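The idea of carrying uncertainty across turns and confirming when unsure can be sketched as a simple Bayesian update. This is a toy under stated assumptions: the recognizer confidences are treated as likelihoods, turns are treated as independent evidence, and the numbers, the 0.8 threshold and the function names are all invented for illustration.

```python
def update_belief(prior, observation):
    """Bayes rule: posterior is proportional to prior times this turn's evidence."""
    unnormalised = {v: prior[v] * observation.get(v, 1e-6) for v in prior}
    total = sum(unnormalised.values())
    return {v: p / total for v, p in unnormalised.items()}

def next_action(belief, threshold=0.8):
    """Accept the top value if confident enough, otherwise ask the user to confirm it."""
    value, p = max(belief.items(), key=lambda kv: kv[1])
    return ("accept", value) if p >= threshold else ("confirm", value)

# Start with no preference between the two readings of the party size.
belief = {"three": 0.5, "free": 0.5}

# Turn 1: the recognizer slightly prefers "free" (invented confidences).
belief = update_belief(belief, {"three": 0.45, "free": 0.55})
action_1 = next_action(belief)

# Turn 2: the user repeats themselves and the recognizer now hears "three" clearly.
belief = update_belief(belief, {"three": 0.9, "free": 0.1})
action_2 = next_action(belief)

print(action_1, action_2)
```

After the first turn the belief is too flat to act on, so the sketch asks for confirmation; after the second turn the accumulated evidence is strong enough to accept a value.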

Yeah… So now if you take that series of steps that you’ve laid out, obviously PolyAI is working in all these areas, but I was wondering if you could maybe talk about where do you feel like you’re having to spend most of your time – there’s probably open challenges in each of those areas, but maybe where is the biggest open challenges in terms of advancing this field along that pipeline of things?

Yeah. When it comes to where we focus, one place where we don’t spend a lot of effort is the speech recognition task itself… Because that’s one which is pretty well defined, commoditized, a lot of people are playing there, a lot of progress has been made… The big tech companies are pouring in millions, and that’s great for us, because we’re just getting a better product, that we then get to build on.

So we typically use (often) several speech recognizers in a single deployment to get that variance out, so we can extract the best possible prediction out of all of them. So the more uncorrelated Google and Amazon are, the better our performance gets. But we love that, and we thank them for all their hard work.
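One simple way to picture combining several recognizers is to average their confidences per hypothesis. The sketch below assumes exactly that; the interview does not say how the combination is actually done, and `recognizer_a` and `recognizer_b` are invented outputs rather than calls to any real service.

```python
from collections import defaultdict

def combine(recognizer_outputs):
    """Average per-hypothesis confidence; a hypothesis a recognizer omits counts as 0."""
    totals = defaultdict(float)
    for output in recognizer_outputs:
        for text, conf in output.items():
            totals[text] += conf
    n = len(recognizer_outputs)
    averaged = [(text, score / n) for text, score in totals.items()]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)

# Invented outputs from two recognizers that make different mistakes.
recognizer_a = {"serbian food": 0.40, "syrian food": 0.45, "siberian food": 0.15}
recognizer_b = {"serbian food": 0.55, "siberian food": 0.35, "syrian food": 0.10}

combined = combine([recognizer_a, recognizer_b])
for text, score in combined:
    print(f"{score:.3f}  {text}")
```

Here the first recognizer alone would pick "syrian food", but because the two disagree on their errors and agree on "serbian food", the average recovers the intended reading.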

[28:16] Now, when it comes to the piece where we really excel - and this is where we’re really, really differentiated from your 1,500 chatbot providers that a lot of them claim to do voice, but their idea of doing voice is “I’ll put a speech recognizer and a text-to-speech engine there. It’s gonna be great!” This is why they don’t have many appealing voice applications out there.

The piece that’s really then exciting is what we like to call spoken language understanding, as opposed to natural language understanding, so SLU vs. NLU. And the difference there is you really have to consider the fact that there’s a bunch of different speech recognition hypotheses that you can operate over to really figure out what’s going on. The second bit is you also have to look at what happened previously in that conversation to know again how to tilt the outcomes to improve the accuracy.

And then finally, one thing that we do really well and that’s really important is as the conversation progresses, you can anticipate where the conversation is going to go. If I’ve asked you for how many people are coming in, or if I asked you about – for example, what our systems can do is parse my name right, and Nikola Mrkšić won’t be a common name in English-speaking environments; it’s a hard name by Serbian standards. But if you know that – like, say I told you my phone number and you’re authenticating me, then if you inform the speech recognizer that Mrkšić is coming up, well, then they’re actually quite likely to parse it correctly, even though it’s an impossible collection of syllables in English; it’s very unlikely even in Serbian. But if you know that it’s coming, then you can [unintelligible 00:29:49.27] So that’s really important, and that’s spoken language understanding. That is what lets us do voice really, really well.
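A rough sketch of using dialogue context to anticipate what is coming: once the system knows which name to expect, hypotheses containing that name get a boost. This is an assumption-laden illustration. Real systems may instead pass hints to the recognizer itself, and the scores, the boost factor and the function name are all invented.

```python
def bias_towards_expected(n_best, expected_phrases, boost=10.0):
    """Multiply the score of hypotheses containing an expected phrase, then renormalise."""
    rescored = []
    for text, score in n_best:
        if any(phrase.lower() in text.lower() for phrase in expected_phrases):
            score *= boost
        rescored.append((text, score))
    total = sum(score for _, score in rescored)
    ranked = [(text, score / total) for text, score in rescored]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Invented N-best list for a caller stating their name.
n_best = [
    ("nicola marks itch", 0.6),
    ("nicole merchant", 0.3),
    ("nikola mrkšić", 0.1),
]

# The phone number lookup told us which surname to expect (hypothetical context).
ranked = bias_towards_expected(n_best, expected_phrases=["Mrkšić"])
print(ranked[0])
```

Without the context the correct name sits last in the list; with it, the same hypothesis moves to the top.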

Could you also, just for those who are coming along with us, as you talk about spoken language understanding, which may be a new term for some people, could you also just real quickly define – you’ve kind of talked about some of the qualities of that… Is there more of a formal definition, or is this more of an informal way of addressing it? I’m just kind of curious… As you bring people into the terminology.

Yeah, it’s a formal research problem, and it kind of like touches on different ones… But if we attempt a formal definition here, in a specific dialogue task, where we’re trying to accomplish something, it is this problem of taking an audio stream and turning it into actionable, parsable slot-value pairs, typically, or something like that… So kind of like slots, or things like maybe say date, or location, or number of people… So extracting that structured information that in the backend your logic - so not AI; your pure business logic knows what to do… Either it sends a booking request, it sends a query for a specific kind of information, or something like that.
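To make the split between structured extraction and plain business logic concrete, here is a toy sketch. The extractor is a handful of hand-written rules standing in for a learned model, and the slot names (`people`, `time`) and the `handle_booking` function are hypothetical; the point is only that the backend sees slot-value pairs and no language at all.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def extract_slots(utterance):
    """Toy rule-based extractor returning slot-value pairs (a stand-in for a model)."""
    text = utterance.lower()
    slots = {}
    count = re.search(r"\b(one|two|three|four|five|six)\s+(people|of us)\b", text)
    if count:
        slots["people"] = NUMBER_WORDS[count.group(1)]
    elif re.search(r"\bme and my \w+", text):
        # "me and my fiancée" names no number, but it still means two people.
        slots["people"] = 2
    time_match = re.search(r"\b(\d{1,2})\s*(am|pm)\b", text)
    if time_match:
        hour = int(time_match.group(1)) % 12
        if time_match.group(2) == "pm":
            hour += 12
        slots["time"] = f"{hour:02d}:00"
    return slots

def handle_booking(slots):
    """Plain business logic: it only ever sees structured slot-value pairs."""
    missing = [name for name in ("people", "time") if name not in slots]
    if missing:
        return {"status": "need_more_info", "missing": missing}
    return {"status": "booking_requested", **slots}

print(handle_booking(extract_slots("Hey, I wanna come in with me and my fiancée at 7 pm")))
print(handle_booking(extract_slots("A table for three people please")))
```

When a slot is missing, the backend reports which one, which is the cue for the dialogue manager to ask a follow-up question.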

So rather than NLU, which is, again, relatively complex to define, when it comes to dialogue it’s again this idea of extracting the same kind of information from a written sentence. Now, the thing about a written sentence is that there isn’t any noise injected by the speech recognizer, whereas in SLU there’s an audio file which is not only about a speech recognizer that may struggle to recognize a particularly complicated word that may be from a pharmaceutical, or from a travel domain, or something that doesn’t come up frequently, or it’s a problematic last name, but really maybe it’s just background noise. Maybe it’s the fact that the accent of the person is not something you’re expecting; your models aren’t very good at it. Or it could be that increasingly they’re speaking from two rooms away and your seven microphones in the Alexa device are insufficient to capture what they’re saying… But you know, we’re people, our expectations are growing, and we expect that these things will work for us.

[32:03] That’s what makes the problem fun as well, because it’s kind of like shifting goalposts. Just when we got it to work when you normally speak on the phone’s speakerphone. Once it works on the speakerphone, there’s a baby crying in the background. And then you’re driving and there’s a baby crying in the background, and someone’s talking over you. And then you might wanna switch language… So it’s fun. It’s a hard problem, as I said.

As you were talking through some of the things about spoken language understanding, one of the things that you mentioned was things related to specialized jargon maybe, or particular accents… My question is - let’s say you’re onboarding a new client, they’re in a specialized domain, and they’re trying to create this new voice assistant… At this point, how difficult is it to onboard a person into that? How much data do they have to provide, and how much are you able to transfer things from your other use cases and common data that you have? …maybe both in terms of restrictions between clients, because I’m sure you can’t always share data that you’ve gathered from certain clients and use it to create things for other clients… How does that process work at this point, and how much pretrained models can you use, and that sort of thing?

For sure, for sure. Well, I’m sure most of the listeners of this podcast know about the importance of pretraining for deep learning applications. It’s typically like, figure out how to pretrain well, and then really good things in that [unintelligible 00:34:25.21] follow. So when it comes to natural language understanding, I can tell you, and I can talk about it for hours, about collecting datasets on a thousand, two thousand training examples… And we did this back at Cambridge. My co-founder Shawn had this really good dataset, and a revolutionary paper. He had one of the first papers training an end-to-end dialogue system, and that involved a cool compounded movement. It’s a really well-cited paper, a really good piece of work. But for that paper, he collected a dataset of 600 training examples, and then for another paper of mine I needed a bit more data, so he collected a bit more… And then I’d go through and annotate, he’d go through and check the annotations, I’d do it again… It takes about a collective one week of work, and it leaves you with permanent mental health problems. [laughter] I didn’t mean to crack jokes about mental health, but it is a daunting task. It’s no fun at all.

So when we started PolyAI, we were like “This has to stop. We’re never gonna build amazing things if we’re dependent on doing that…” Because bear in mind, we’re pretty highly qualified for creating this kind of data. So if it takes two people in the last stage of their Ph.D. after years of doing this stuff to create that dataset, that’s not scalable.

So what we then started doing was pretraining representation models for dialogues, that would kind of like look at billions of conversations - things like Reddit, Quora, Twitter - and learn good representations for a dialogue, so that if I give you a set of turns, like you spoke, I spoke, you spoke, I spoke, and then I say “Hey, model, use the representation of the dialogue so far, and the representation of a potential follow-up, to determine whether it’s a good follow-up to that conversation.” So we would pretrain in that way.
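As an editorial aside, the pretraining task described here (score a candidate follow-up against the dialogue so far) can be sketched as a response-selection loss. This is a minimal illustration, not PolyAI's implementation: the bag-of-words `encode` function is a hypothetical stand-in for a trained neural encoder, and the two-example batch is invented. The loss treats every other response in the batch as a negative, which is one common way to set up this kind of objective.

```python
import math
from collections import Counter


def encode(text):
    """Stand-in encoder (assumption): a unit-length bag-of-words vector.
    A real system would use a trained neural encoder here."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def score(context_vec, response_vec):
    """Dot product between a context vector and a candidate-response vector."""
    return sum(v * response_vec.get(w, 0.0) for w, v in context_vec.items())


def response_selection_loss(contexts, responses):
    """Mean softmax cross-entropy where responses[i] is the true follow-up
    to contexts[i] and all other responses in the batch are negatives."""
    c_vecs = [encode(c) for c in contexts]
    r_vecs = [encode(r) for r in responses]
    total = 0.0
    for i, c_vec in enumerate(c_vecs):
        scores = [score(c_vec, r_vec) for r_vec in r_vecs]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]
    return total / len(c_vecs)


# Invented toy batch: each context is paired with its real follow-up.
contexts = [
    "can i book a table for two tonight",
    "when does the pharmacy close",
]
responses = [
    "sure what time would you like your table",
    "the pharmacy closes at nine",
]

matched = response_selection_loss(contexts, responses)
shuffled = response_selection_loss(contexts, responses[::-1])
print(f"matched pairs: {matched:.3f}, shuffled pairs: {shuffled:.3f}")
```

The loss comes out lower when contexts are paired with their real follow-ups than when the pairs are shuffled. Training an encoder means adjusting its parameters to push that loss down over a very large number of conversations.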

[36:21] If you do it like that, then you get a lot of training data out of things like Reddit, Quora, Twitter… And Reddit in particular is an incredible resource, because people talk about all sorts of things on Reddit. And they’re people from all over the world, and in different languages as well. But also - let’s think about English - in all possible different dialects; you will see anything phrased and rephrased there, in a good way… And you can train this thing for a long time.

This encoder (ConveRT) that we have built is something that’s in the family of models like BERT, or a GPT, where it’s pretrained on a lot of data, but unlike those models, it’s not really a language model, it’s an encoding model for dialogues, and it’s purpose-built and purpose-pretrained for conversational AI… So that then, when you use this model to do things like intent detection, value extraction, all these tasks that form that big compound movement of dialogue - it’s a model that’s really powerful. It takes much less data to get to a high level of performance than something being trained from scratch.
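As an editorial aside, the workflow described here (keep a pretrained encoder fixed and fit a small intent detector on a handful of examples) can be sketched as follows. Everything in the sketch is an assumption made for illustration: `encode` is a toy bag-of-words stand-in rather than ConveRT, the intent names and utterances are invented, and nearest-centroid classification is just one simple choice of lightweight classifier.

```python
import math
from collections import Counter, defaultdict


def encode(text):
    """Stand-in for a frozen pretrained sentence encoder (assumption):
    a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def train_centroids(examples):
    """Average the encoded examples of each intent into a single centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for text, intent in examples:
        for word, value in encode(text).items():
            sums[intent][word] += value
        counts[intent] += 1
    return {
        intent: {word: value / counts[intent] for word, value in vec.items()}
        for intent, vec in sums.items()
    }


def detect_intent(text, centroids):
    """Return the intent whose centroid has the largest dot product with the text."""
    vec = encode(text)

    def similarity(centroid):
        return sum(v * centroid.get(w, 0.0) for w, v in vec.items())

    return max(centroids, key=lambda intent: similarity(centroids[intent]))


# Invented few-shot training set: two examples per intent.
examples = [
    ("i want to book a table", "book_table"),
    ("can i reserve a table for four", "book_table"),
    ("what time do you open", "opening_hours"),
    ("are you open on sunday", "opening_hours"),
]
centroids = train_centroids(examples)
print(detect_intent("could i book a table for tonight", centroids))
print(detect_intent("when do you open tomorrow", centroids))
```

With a toy encoder this only shows the shape of the workflow. The point made in the conversation is that a strong pretrained encoder places paraphrases close together, so a few examples per intent go much further than they would with a model trained from scratch.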

There’s been a lot of benchmarking of this model. Salesforce recently came out with a study that confirmed that this is the most accurate – or rather the best thing to pretrain with to get the most accurate models, with a limited amount of data… Or any amount of data, really.

So this is really important… And then what we’re able to do and what we bring into all of our deployments is this model. Then in all these deployments, wherever there’s not sensitive information, it’s just more conversations that are used to subsequently tune that model. But we don’t need to use specific, nitty-gritty details of, I don’t know, how you collect British postcodes, or how you spell Serbian last names. That stuff is a bit more proprietary, we have a lot of different technology that’s used to counter those specific sub-problems… And the truth is we often have to solve some of these challenges that are a bit separate, often very engineering-heavy… But when it comes to that data barrier that people think about, that first step of building a dialogue system, which you need a lot of data to train - like, we need a lot less data because we’ve already spent years pretraining the stack.

Yeah, this brings me to probably my favorite subject… Before the interview I was reading your post about the ConveRT model, and you mentioned that it’s a pretrained speech encoder in multiple languages… And I know that multiple languages is something that’s emphasized on your website. What is your thought process behind that? And maybe let’s say that we’re specifically talking about this language model… How have you gone about setting up that language model such that it enables you to solve problems in multiple languages?

This is a passion of mine… So quite a few years ago now, one of my best friends, [unintelligible 00:39:15.16] who’s a great multilingual NLP researcher - he’s Croatian, I’m Serbian; we met in Beijing over a few beers, and we were like “Well, how do we end up working together later?” He worked on multilingual NLP, I worked on dialogues, so we were like “Hey, can we do something multilingual in this context?” And then we got really interested in – this was a time when word vectors were all the rage; things like the Word2Vec model, GloVe, and all those things; Mikolov, [unintelligible 00:39:44.01] all those guys… And this was the first wave of massively, mindlessly data-driven NLP. And I stand for that, I love that, but I also love languages and these nuances, right?

[39:58] So the question is like “Okay, you train something in English… How do you import it to another language?” Typically, older-school NLP had this pipeline of things running - [unintelligible 00:40:04.21] parsing the sentence structure… And that’s different in different languages, like the subject/verb/object works differently in different languages. The morphology murders you in different languages. If you go from – the word “gender” is a thing that exists in some, but not in others. In one language word order matters, in others you can do whatever the hell you want. It’s really fun. But when you think about then creating a dialogue system that works across all these languages, it’s daunting.

I mean, you can’t just go and translate, because a lot of stuff is lost in translation, and just the multi-sense words in one language could translate into something catastrophically different.

Rhetorical questions?

Yeah, that’s the tip of the iceberg. I mean, even mundane things… The word “bill” might mean something – an account or a bill could be the same word in one language, but they’re not in others, and that’s just very confusing. They trigger different actions that both exist.

Now, what we started doing then was training word vector spaces which would embed complete vocabularies of different languages into the same high-dimensional mathematical representation. So words like chien, Hund, dog, suka, hound, whatever - they were all in this bubble, in one place. Now, of course, there are problems with this, because there are multi-sense words… But you know, the multi-sense ones tend to flow away into a bit of a different direction, and then you have machine learning models trained to operate over those mathematical objects instead of operating over a unitary representation of the word “dog” in Serbian or in English.
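As an editorial aside, the idea of one vector space shared across languages can be sketched with a toy lookup. The three-dimensional vectors below are hand-picked and hypothetical; real cross-lingual spaces are learned from data and have hundreds of dimensions. The sketch only shows what it means for translations to land in the same neighborhood, so that anything operating on the vectors does not need to know which language a word came from.

```python
import math

# Hypothetical hand-picked vectors, keyed by (language, word).
# In a learned space, words with the same meaning end up close together.
SHARED_SPACE = {
    ("en", "dog"): (0.90, 0.10, 0.05),
    ("fr", "chien"): (0.88, 0.12, 0.04),
    ("de", "Hund"): (0.91, 0.08, 0.07),
    ("en", "table"): (0.05, 0.92, 0.10),
    ("fr", "table"): (0.06, 0.90, 0.12),
    ("de", "Tisch"): (0.04, 0.93, 0.09),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word_key, target_lang):
    """Return the word in target_lang whose vector is closest to word_key's vector."""
    query = SHARED_SPACE[word_key]
    candidates = [
        (key, vec)
        for key, vec in SHARED_SPACE.items()
        if key[0] == target_lang and key != word_key
    ]
    best_key, _ = max(candidates, key=lambda item: cosine(query, item[1]))
    return best_key[1]


print(nearest(("en", "dog"), "fr"))
print(nearest(("de", "Tisch"), "en"))
```

A model trained on top of such vectors sees roughly the same input for "dog", "chien" and "Hund", which is why a classifier fitted on one language can carry over to others.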

If you do that, then the beauty of task-oriented dialogue when it’s a specific task is that you don’t need to understand the nuances or the rhetorical questions, you need to understand that someone asked for a table near the disabled toilet. And at that point, the fact that you’re just parsing a limited number of intents means that you’re actually able to do it across different languages at once. And that’s the big thing that we really, really care about - it’s, again, a place where pretraining comes to our rescue, and we’re then able to do these things very well.

The other thing - again, a big thank you to all the cloud providers for the millions they’ve poured into speech recognition research across different languages, because that’s not a thing that we have the budget to do, and they do it pretty well… And you know, that’s a piece we don’t touch, but it’s provided by those companies, and everything else we do in-house.

So then if you have this multi-dimensional space which embeds vocabulary from multiple languages, is the hope then that when you add – so let’s say that you support X number of languages, but then your next client wants to have a dialogue in a new language… Maybe that language is related to one of those you already support, but it’s different. Is the hope then to sort of retrain that and add it in, or transfer-learn from that existing model, which is faster than training from scratch in a whole new language, or…? How do you approach that situation?

It’s a really good question. You could do either… The transfer one seems cheaper and easier, but really, in our architecture the best thing to have is a unified approach that works across all the languages we wanna support at once. So there, really, retraining everything makes the most sense.

Now, bear in mind, we have a single model that is kind of like the nuclear reactor of our system, so that model needs to be retrained. Once it’s retrained, we’re done. That language is in there forever. So on that front - yeah, it’s a lot more heavy lifting than fine-tuning 100 small models, but it provides this unified thing that will in the longer run save us.

This is similar to what PageRank did to search. They created an algorithm that just indexes an incredibly large matrix, and factorizes it. But once it does that, “Well, here you go. Search. Forever.” Interim changes [unintelligible 00:44:13.11] and then I’m done.

[44:17] Whereas previously, in older search engines you’d have to go to a specific industry, and then you’d search there, and there would be a small keyword-based model that would flag the right results, and you’d have special if statements on what’s more relevant… Whereas now you have a unified approach and it works a lot better. We’re trying to do the same for language.
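As an editorial aside, the PageRank analogy can be made concrete with a small sketch. PageRank is commonly computed by power iteration over the link matrix: every page repeatedly passes its score along its outgoing links until the scores settle. The three-page graph, the damping factor of 0.85 and the iteration count below are illustrative assumptions, not details taken from the conversation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.
    `links` maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented toy web: pages "a" and "b" both link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

One global computation ranks every page at once, with no per-industry rules. That is the property the analogy leans on: a single unified model replaces many small special-purpose ones.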

So back at the beginning of our conversation you said something, and I’ve been holding on to it, because I knew it wasn’t yet the point where I could ask… You said, as we were getting into the topic and you were introducing it, that you expected to spend the rest of your life working in this area. That made me as a non-expert in your area really wonder – I think a lot of people that aren’t working in the field would assume that we’re gonna take a few years and solve all this NLP and related stuff…

Yann LeCun famously said that he’s gonna solve NLP in two years…

How many years ago was that?

I think like 6-7 at this point… He’s made great progress.

Yeah, no doubt. No, I heard a similar thing – I think it was Eric Schmidt in a tweet, he was like “Speech is a solved problem.”

I think Eric Schmidt showed a lot more wisdom than Yann LeCun did there, despite probably less intimate technology understanding.

Okay, but I’ve gotta ask this now, I’m curious… Because you said that, and I know that it’s not a static thing, “It’s done.” I know that you’re gonna continue to make great progress; you’ve been doing that, and you’ve been telling us about it through this… So I’m fascinated about where is this going and how does the larger vision for the problem evolve over time toward you as a young man, a millennial - says this gen-X’er who’s significantly older - how is this evolving over your lifetime to where you are remaining impassioned about solving this problem in the long term? What does that look like? I’m really curious about that. I’ve been waiting the whole way through to ask you that.

Well, that’s a tough question. I think that there are a lot of tactical things to solve. Say, customer service - it’s a challenge that PolyAI as a company is focused on, and I think we’ll be working on it for a long time… Because we’re really far from having a voice assistant that you speak with without getting that digital frustration at the start when you realize that “Oh, God, it’s automated.” We wanna make that a non-problem. We want people to call in, “Automated? Fine.” The same way that you log into a website and you’ve never seen the format before, but you figure it out.

I think there’s a lot of work to be put into those things becoming really good. It’s not a small challenge. It’s gonna be an adoption curve that is also partially just about shifting consumer behavior… And voice assistants have done a lot to help there. They’ve shifted this into the realm of the possible and likely, and the new generations, people younger than any of us, are growing up with these things… And that I think is really powerful. Because we might be the last generation which was really, really fluent, and kind of like [unintelligible 00:47:15.29] opening up a terminal, the heavy intricacies of the web… Like, why would you do that if it’s a lot more accessible with a more natural interface?

[47:26] I think that people who are heavily reliant on the web might seem in 20 years’ time to those younger people like those guys that are still using the terminal, or speaking about Assembly. I know that’s not the best and the most fortunate analogy, but… To [unintelligible 00:47:42.07] into the meat of your question - let’s say that in ten years we’ve got voice assistants that are everywhere. They’re endemic, they’re just how you interact with businesses… They’ve enabled us as humanity to move away from doing all of those mundane tasks. What’s next? Well, I think just a general interface with technology.

The deeper answer to your question doesn’t come without understanding what happens in AR/VR, what happens with things like Neuralink eventually, where – like, that’s an absolute necessity. I really want that, because that’s really how we then transcend humanity.

This technology then becomes a big interface of that, of just communicating and understanding and absorbing all that information… And then you could go and fall for singularity, consciousness in AI, how you’re gonna communicate with all that - I don’t know where that’s gonna go; I’m not super-bullish on that. But I know that this problem is gonna get deeper and deeper. [unintelligible 00:48:43.11] where people sat down for a summer and they were like “We’re gonna crack this AI thing.” “Yeah, right. You didn’t even scratch the surface.”


So when we get to the point where we have technology, where voice is completely natural, I think it’s really hard to imagine what the world will look like at that point. It’s gonna be great, it’s gonna be really interesting. We’re not that close; there’s a lot of work. I’ll be past your age by the time that happens.

You’ve got a way to go then… [laughs]

We’ll see.

I’m definitely glad to hear that perspective, and also just good to hear from your great work with PolyAI. It’s been a pleasure, and I know I’ll probably annoy you with all sorts of speech-related questions as time goes on, and I look forward to seeing what PolyAI does… But yeah, thank you so much for joining us, it’s been a pleasure.

Thank you for having me. I’ve had a lot of fun.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
