This is a fancified excerpt from Catherine Breslin on Practical AI #82. Catherine is a Machine Learning consultant at Cobalt Speech who specializes in voice & language technologies, including Alexa. For the full experience, you should listen while you read.
If you say… “Hey, computer, play me some music” and then it starts playing you some music, there’s a number of things that have to have happened for that to come true.
The first thing is that the computer has to wake up and start listening to you when it hears “Hey, computer” (or whatever it is that you’ve decided is going to be the wake word)… And that is often a very small, low-powered speech recognition system sitting on a device or on your phone, listening very specifically for particular words.
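To make that always-on loop concrete, here is a toy sketch of a wake-word detector. Real devices run a small neural keyword-spotting model directly on audio; this stand-in scans a rolling buffer of already-recognized tokens for the wake phrase, just to illustrate the cheap "listen for one phrase" pattern.

```python
# Toy wake-word detector: watch a rolling window of tokens for the
# wake phrase. A real system works on raw audio with a tiny model;
# the token stream here is a simplifying assumption.
from collections import deque

WAKE_PHRASE = ("hey", "computer")  # the phrase the detector is tuned for

def wake_word_detector(token_stream):
    """Yield the index at which the wake phrase completes."""
    buffer = deque(maxlen=len(WAKE_PHRASE))
    for i, token in enumerate(token_stream):
        buffer.append(token.lower())
        if tuple(buffer) == WAKE_PHRASE:
            yield i  # the device would now start full speech recognition

tokens = ["so", "hey", "computer", "play", "me", "some", "music"]
print(list(wake_word_detector(tokens)))  # -> [2]
```

The `deque(maxlen=...)` keeps the check constant-time per token, which matches why this stage can run continuously on low power.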
Then it’s gonna run speech recognition on what you’ve asked it to do. Speech recognition goes from audio to text; it’s transcribing what it is that you’ve asked for. So it says, hopefully, “Play me some music”. Speech recognition systems are not perfect, and they do make mistakes, but we hope that most of the time it transcribes your request accurately enough.
But that’s not enough for the computer to know what to do. The computer has to sort-of bucket that into one of many things that it can do… So you could have asked for playing some music, or you could have asked for buying some music. It has to distinguish those two things. You could have also asked for the weather forecast, or asked for the answer to a factual question… Which are slightly easier to think about.
So there are some things that you ask about which are close together, and some things are further apart, and the computer has to distinguish those with some sort of language understanding technology.
If you’re asking about anything complicated, it not only has to bucket what you’ve asked, but also what particular entities you might be asking about. So you could say “Play me some music by Sting”, and there it has to know that Sting is the name of the artist that you’re actually after, that you’re interested in hearing music from.
So this language understanding technology is going to pick out what you want to do, and the sorts of things that you want to do that with; so the artist you want to listen to, the city you want the weather forecast in, the album you want to hear, the thing you want to buy or add to your shopping basket - all of those things we have to pick out.
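The shape of that problem - an intent plus its slots - can be sketched in a few lines. Real assistants use trained statistical models for both steps; this keyword-matching version, with a made-up artist list standing in for a catalogue lookup, only shows what the output of language understanding looks like.

```python
# Toy language understanding: bucket a transcript into an intent and
# pull out entity slots. Keyword rules and the artist set are
# illustrative stand-ins for trained models and a real catalogue.

KNOWN_ARTISTS = {"sting", "adele"}  # hypothetical catalogue lookup

def understand(text):
    words = text.lower().split()
    if "play" in words:
        intent = "play_music"
    elif "buy" in words:
        intent = "buy_music"
    elif "weather" in words:
        intent = "get_weather"
    else:
        intent = "unknown"
    slots = {}
    for w in words:
        if w in KNOWN_ARTISTS:
            slots["artist"] = w
    return intent, slots

print(understand("play me some music by Sting"))
# -> ('play_music', {'artist': 'sting'})
```

Note how "play" versus "buy" lands in different buckets even though the rest of the request is identical - exactly the close-together distinction described above.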
And then there’s some computer system which is gonna take that request and go and execute it, and actually figure out what music to play back. If you play music back, you might just hear the music start to play, but you might also hear an announcement about the music that’s going to play. Or if you ask for the weather forecast, you might hear it tell you the weather forecast in words…
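That execution step is often just routing: look up which backend handles the intent, hand it the slots, and get back the text to announce. The handler names and responses below are purely illustrative, not any real assistant's API.

```python
# Sketch of the execution step: a dispatch table from intent to the
# backend that fulfils it. Handlers and their responses are invented
# for illustration.

def play_music(slots):
    artist = slots.get("artist", "something you like")
    return f"Playing music by {artist}."

def get_weather(slots):
    return "Here is today's forecast."

HANDLERS = {"play_music": play_music, "get_weather": get_weather}

def execute(intent, slots):
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't do that yet."
    return handler(slots)

print(execute("play_music", {"artist": "Sting"}))  # -> Playing music by Sting.
```

The returned string is what feeds the final stage of the pipeline: it is the announcement the assistant will speak back to you.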
And that text-to-speech technology is the last part of the pipeline. And that’s sort of like the opposite of speech recognition. In this case, you’re going from text and converting it into speech that can be understood.
All together now
So you put these things together in this pipeline - you’ve got:
- speech recognition
- language understanding
- text-to-speech
Which combine together to give you a virtual assistant that acts on what you tell it to do.
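The whole pipeline can be sketched as a chain of functions, one per stage. Every stage here is a stand-in stub (the transcript arrives pre-recognized, the understanding is hard-wired) so the end-to-end flow is runnable, but none of it reflects a real system's internals.

```python
# End-to-end pipeline sketch: recognition -> understanding ->
# execution -> text-to-speech. All four stages are illustrative stubs.

def recognize(audio):        # stand-in for a full speech recognizer
    return audio["transcript"]

def understand(text):        # stand-in for language understanding
    if "play" in text:
        return "play_music", {"artist": "Sting"}
    return "unknown", {}

def execute(intent, slots):  # stand-in for the backend
    if intent == "play_music":
        return f"Playing music by {slots['artist']}."
    return "Sorry, I can't do that yet."

def synthesize(text):        # stand-in for text-to-speech
    return f"<spoken> {text}"

def assistant(audio):
    text = recognize(audio)
    intent, slots = understand(text)
    response = execute(intent, slots)
    return synthesize(response)

print(assistant({"transcript": "play me some music by Sting"}))
# -> <spoken> Playing music by Sting.
```

Swapping any stub for a real component leaves the overall structure unchanged, which is the point of describing the assistant as a pipeline.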
The conversation doesn’t end there. Listen to the entire episode for a deep dive on speech recognition, captioning, transcribing, and much more. You can play it from the start right here 👇
Oh, and don’t forget to subscribe to Practical AI in your favorite podcast app so you don’t miss future episodes and insights. ✌️