Practical AI – Episode #298
Full-duplex, real-time dialogue with Kyutai
featuring Alexandre Défossez
Kyutai, an open science research lab, made headlines over the summer when they released their real-time speech-to-speech AI assistant (beating OpenAI to market with their teased GPT-driven speech-to-speech functionality). Alex from Kyutai joins us in this episode to discuss the research lab, their recent Moshi models, and what might be coming next from the lab. Along the way we discuss small models and the AI ecosystem in France.
Featuring
Sponsors
Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.
Timescale – Purpose-built performance for AI. Build RAG, search, and AI agents in the cloud with PostgreSQL and purpose-built extensions for AI: pgvector, pgvectorscale, and pgai.
WorkOS – AuthKit offers 1,000,000 monthly active users (MAU) free — The world’s best login box, powered by WorkOS + Radix. Learn more and get started at WorkOS.com and AuthKit.com
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome to Practical AI | 00:34 |
2 | 00:35 | Sponsor: Fly | 02:29 |
3 | 03:16 | What is Kyutai? | 02:43 |
4 | 05:59 | French AI ecosystem | 02:42 |
5 | 08:41 | Forming a non-profit | 01:50 |
6 | 10:31 | Connecting to open science | 01:57 |
7 | 12:28 | What makes Kyutai stand out? | 03:46 |
8 | 16:26 | Sponsor: Timescale | 02:21 |
9 | 19:04 | Moshi's capabilities | 03:54 |
10 | 22:58 | History of speech-to-speech models | 07:55 |
11 | 30:53 | Cool things to try | 03:12 |
12 | 34:13 | Sponsor: WorkOS | 02:51 |
13 | 37:13 | Fine tuning data sets | 05:16 |
14 | 42:28 | Model sizes | 02:42 |
15 | 45:10 | Things to come | 03:23 |
16 | 48:34 | Thanks for joining us! | 00:35 |
17 | 49:16 | Outro | 00:46 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack, I’m CEO at PredictionGuard, and joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
I’m doing very well today, Daniel. How’s it going?
It’s going great. I think we talked about this a little bit on the last show, but now we’re officially up against the Thanksgiving break, so a couple days off here in the US, which will be nice… Maybe I can catch up on some of the cool AI stuff that I’ve been meaning to play around with in my spare time. But one of those cool AI things that definitely made its rounds over here at PredictionGuard, and that we were talking about, was the recent advances in real-time speech assistants – in particular what OpenAI was doing, but then also what a lab in France called Kyutai released… And today I’m really excited, because we finally got the chance to have Alexandre Défossez, who is a scientist and co-founder at Kyutai, with us. Welcome, Alex.
Thank you, Daniel, thank you, Chris, for the invitation. Looking forward to discussing the details of Moshi.
Yeah, we’re excited about it. And maybe before we do that, if you could give us a little bit of a background on kind of what Kyutai is, how it came about.
Yes. So Kyutai is a non-profit lab that we launched a year ago in Paris. We have funding from three donors: Xavier Niel, Rodolphe Saadé, and Eric Schmidt. Eric Schmidt is probably the one you know the best, Xavier Niel is a tech entrepreneur – a very successful one now – and Rodolphe Saadé works in logistics. So they gathered together to fund this effort to create a kind of independent lab with a mission to do open source research, at a time when open source is maybe suffering a bit from the competition between some of the major labs. So that’s, I think, a big motivation for everyone on the team.
Basically, we have sufficient capacity to be kind of competitive with big labs. We can’t really fight every battle, but as we show with Moshi, we can definitely bring interesting ideas and innovation to the table.
Yeah, and I find it – I mean, maybe for those here in the US AI ecosystem, we do see a lot of innovation and interesting things happening in France and in Paris… I’m wondering, just out of curiosity, what is the ecosystem like there, and how would you – I mean, you seem to be kind of formed out of part of that. So how has that sort of shaped you, and what is the ecosystem like there?
I think the ecosystem kind of starts with the studies. In France there’s a very strong engineering culture, and also a very strong emphasis on mathematics, which I think provided good soil that initially attracted a number of big American players, like Facebook, which opened a lab there… I think at the time the Facebook AI Lab in Paris was probably the second largest after the Californian one, about tied with New York. So I think that kind of says how attractive the city can be, because it’s not so easy to compete with the attractiveness of America.
So now I think what has changed in recent years is really the kind of independence that’s growing from this initial seeding. For many years there weren’t many truly French organizations where you could get access to a sufficient number of GPUs – large enough clusters to develop machine learning models for a number of applications – and that’s especially the case with large language models… But there’s been a number of events that have led to this diversification of the ecosystem in France. So now I guess there’s a number of big startups, there’s Kyutai, and I think that’s only going to grow.
[00:07:56.09] Also, there’s one specificity in France which I think is very nice, especially for deep learning, and it’s the fact that we can do a PhD as a resident in a private company, or even in a nonprofit. So at Kyutai we’re going to have PhD students; at Facebook, where I partially did my PhD, there were also a number of PhD students. And I think it’s such a great opportunity to get to use graphics cards so early in our careers, even as students. I think that’s very specific to France, and that’s also part of the success we’re seeing at the moment… And that, I think, can only keep growing as we train more and more people in this way.
I’m curious, as you were describing the ecosystem there in France, and how strong it is, what was the specific dynamic with all these for-profit organizations around you, that brought about the desire to have the nonprofit? And how did you find yourself in the middle of that as you were in the formative stages?
I think for me there was a growing will to become a bit more independent. Even though at Meta, for instance, there was a lot of value put on the Paris office, at the same time an American company always makes decisions at its center… So that would be California. And satellite offices always have to kind of bear the consequences of those decisions, no matter the contribution they make to the overall value of the lab. So that was kind of the initial desire: to be a bit more independent in terms of the decision-making, the ability to lead the research.
I got the opportunity – so I was contacted by [unintelligible 00:09:47.09] who was doing his PhD with me at Facebook, at Meta, and then had been at Google, doing very successful research there… So he was part of the first team that was contacted, I think, by Xavier Niel. And the project was initially very appealing, because it was basically business as usual… Doing research, which is what I love the most, having sufficient resources to do it, in a completely independent and French environment. So that was of course very appealing. I didn’t hesitate very long. I guess at first it seemed a little bit too good to be true, but so far so good. So yeah.
Kyutai kind of promotes this idea of open science and democratization of AI, or artificial general intelligence, through open science. Some of our listeners might be familiar with open source, open source AI, or even open access models… How would you define and think about open science as a thing – and in particular, how does that connect to the way in which you envision the building of AI or AGI?
Yes. So I think the two are quite related. Usually open science really comes down to explaining how you arrived at the final results – what mistakes you made, what things you tried, what was important and what wasn’t… So I would say that’s a first part that we’ve been doing really well with Moshi. We released a preprint technical report with a lot of details, which actually took us a bit of time, and that’s something that’s not necessarily… I don’t think we would have dedicated as much time to it if we didn’t have this kind of nonprofit mindset, but I think in the long run it’s kind of important. And then there are several aspects. The open sourcing can go from just the weights to full training pipelines…
[00:11:57.23] So releasing more code around the training of such models is also on our roadmap. We didn’t get a chance to do it yet because – yeah, the paper already took us a bit of time, and we have other things we’re working on. But I think that’s also part of it: explaining exactly how you got to the final results, and not just having a set of weights for one specific task that you’re kind of stuck with if you need to adapt it to something else. That’s kind of the vision of open science, I think.
Could you talk a little bit about what you’re able to do with that model that maybe the commercial labs in the same ecosystem aren’t able to do? And maybe also – is it fairly standard compared with other nonprofits around the world that are doing similar things, or is there something very distinctive about you that maybe other nonprofits you’ve seen, or even modeled yourselves after, don’t have?
Yes, so that’s a good question… I’m not necessarily familiar with all the nonprofits in the AI ecosystem. I know the Allen Institute, for instance, is one of them. I think it’s very – there’s also the Falcon Team, TII… Yeah, I think we’re kind of serving a similar mission. I don’t think there is necessarily a big difference.
Some of them might be more around like contribution to science, for instance like general science or core deep learning… I think for us, we are mostly focused on core deep learning. We don’t necessarily want to compete, for instance, on the purely text-based LLM space. So there’s differences in terms of the choices of the research we’re doing… But yeah, fundamentally, I don’t think there is a big difference. And then your other question was with respect to like other for-profits?
What do you feel is really in your sweet spot, to put it another way, compared to these competitors? It’s easy to recognize all the resources that some of the largest companies in the world have, and will put into their labs, but there’s definitely a place for others out there. And I think that gets missed a lot by the public. So given the space that you’re playing in, what sets you apart from those commercial labs, beyond the advantage of the massive number of GPUs available to them? What are some of those distinct things?
Compared to some of the for-profits, if we take the biggest labs, obviously, I guess we have agility that is not really possible in a super-large company, where every action will have consequences in the stock market, for instance. So the decision process can be really fast. That was the case for the release of the model. For instance, we were able to release it under a commercially friendly license, which would be a bit harder in a larger structure.
Then I think we have a strong – for instance, we have a desire to go more and more towards on-device models. So Moshi is kind of barely on-device. We demoed it on a MacBook Pro, but it was a top tier MacBook Pro, so it’s kind of a proof of concept; it runs on device, just not every device… But I think we definitely have a value there, because a number of for-profits are not going to develop really powerful on-device models, because that would be a potential threat to their… Like, it’s harder to protect in terms of intellectual property. And I think in general, between the bigger players, there is kind of a race to the very top, the very best numbers on the benchmarks, MMLU and everything… And so if it takes 10 times more inference time to beat the others on the benchmark, they are going to do it, because it’s either beating the others on the benchmarks, or kind of leaving the arena. So we’re not really in this mindset. We’re more like – on-device, I think, could have a very large number of applications. It definitely cannot solve all issues… But I think as a non-profit, we won’t have the kind of reservations other for-profits might have about on-device models.
Break: [00:16:18.22]
So Alex, you’ve mentioned Moshi a few times now… Maybe if you could just give those that haven’t heard of this an idea of, first, what is Moshi? And then maybe if you could then after that step back and describe - well, how did the lab, how did Kyutai start thinking about that sort of model or that sort of research direction as a research direction of the lab?
Yes. So Moshi is a speech-based foundation model that also integrates text as a modality. It’s especially built for speech-to-speech dialogue, and especially real-time dialogue. So we put a real emphasis on the model being able to act in a way that’s as fluid as possible, like a real conversation with a human being. One of its characteristics is that it’s completely full duplex, meaning that the model can both listen and speak at any time. So it’s not turn-based, like walkie-talkies – which I think is an important feature, because that’s how we communicate. So we wanted the model to be able to do the same thing.
We also – yeah, as I mentioned, that allows us also to have a very low latency… So we have like around 200 milliseconds between the time the audio leaves your microphone and the time you get a reply that has accounted for that audio. And yeah, at the moment it’s kind of like mostly – we designed it as a speech agent with which you can discuss, ask questions, ask for advice, that could potentially serve as a basis for a much larger use case. That’s why we also mentioned it as a kind of foundation model and also a framework for a number of tasks that would require kind of reacting to your speech, and beyond just being kind of an assistant.
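As a rough illustration of where a figure like 200 milliseconds can come from, here is a back-of-envelope latency budget for a frame-based, full-duplex model. Only the 12.5 Hz frame rate (which Alex mentions a moment later) is taken from the conversation; every per-stage cost below is an assumed, illustrative number, not Kyutai's published breakdown.

```python
# Back-of-envelope latency budget for a frame-based, full-duplex speech model.
# The 12.5 Hz frame rate is the one discussed in the episode; the per-stage
# costs below are assumed, illustrative numbers, not measured Moshi figures.

FRAME_RATE_HZ = 12.5
frame_ms = 1000 / FRAME_RATE_HZ      # 80 ms of audio per codec frame

encode_ms = 10    # assumed: acoustic encoder on the incoming frame
model_ms  = 60    # assumed: one autoregressive step of the ~7B backbone
decode_ms = 10    # assumed: codec decoder producing the outgoing frame

# Worst case, a sound arrives just after a frame boundary and waits almost a
# full frame before it is even encoded; best case it arrives just before.
best_case = encode_ms + model_ms + decode_ms
worst_case = frame_ms + best_case

print(f"frame duration: {frame_ms:.0f} ms")
print(f"model-side latency: {best_case:.0f}-{worst_case:.0f} ms")
# Audio I/O and network transport come on top, which is how an end-to-end
# figure in the ~200 ms range becomes plausible.
```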
And then the second part of the question was how we started working on that… So there were two people on the initial team – Nel and I – who have done most of our research on audio modeling… And then Edouard Grave had been a core member of the initial team of LLaMA, the very first LLaMA at Meta. So we kind of had the right tools. I guess the first reason is basically we sat together and asked “What can we do, and where do we have an edge on the competition?” And I think on this aspect of combining the text knowledge and top of the line audio modeling techniques, we had a real edge compared to other labs. So that was important. And also, there was a sense that speech was becoming an important modality, and what had been done in a number of other modalities was still completely lacking for speech.
So that was back in November. At the time, OpenAI hadn’t made any announcements, so it was still pretty much a new area to cover. So we kind of immediately started working on that. We actually started – so both on Mimi, the codec that we used, with the goal of having a really highly compressed representation at 12.5 Hertz, to get as close as possible to the text, which would be around like three Hertz. Of course, it’s not regularly spaced with respect to audio. And then once we were happy with Mimi, we immediately moved on to the kind of aspect of how do we model the speech, how do we handle the full duplex, how do we instruct the model… A number of challenging questions that arose, all the way to the first public demo in July.
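To make the compression Alex describes concrete, here is the simple arithmetic behind a 12.5 Hz codec. The 12.5 Hz figure and the roughly 3 Hz comparison for text come from the conversation; the 24 kHz sample rate and the eight codebooks per frame are assumptions for illustration.

```python
# Rough compression arithmetic for a neural audio codec like Mimi.
# 12.5 frames/s is from the episode; the sample rate and codebook count are
# assumed values for illustration only.

SAMPLE_RATE = 24_000   # assumed raw audio sample rate (samples per second)
FRAME_RATE = 12.5      # discrete frames per second
NUM_CODEBOOKS = 8      # assumed residual codebooks per frame

samples_per_frame = SAMPLE_RATE / FRAME_RATE     # 1920 raw samples -> 1 frame
tokens_per_second = FRAME_RATE * NUM_CODEBOOKS   # 100 discrete tokens/s

print(f"{samples_per_frame:.0f} samples collapse into one frame")
print(f"{tokens_per_second:.0f} audio tokens per second")
# Spoken text runs at roughly 3 tokens per second, so even a heavily compressed
# audio representation is still much denser than text, which is why the
# per-frame codebook tokens need the special handling discussed below.
```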
That’s great. And just one more kind of background question. Some people might have seen non-real-time agents – agents that would take in audio, transcribe it with one model, use a language model to generate an answer, and then use a third model to generate speech. So that’s one way to process this pipeline. You’re talking about something different here, particularly with these speech-to-speech models, or the kind of full-duplex models that you’re describing. Could you give a little bit of background? How long have people been studying and researching this type of model? And has it really only been possible in recent times to make this kind of real-time speech a reality? Because I think some people – at least public-wise, they may have seen things like Alexa in the past, which processes speech in certain ways… But these demos that they’re seeing from OpenAI, the demos they’re seeing from Kyutai – this is a different type of interaction. So how long has this been possible, and what is the history of the research? I know that’s a hard question, because there’s probably a million things that have been done… But from an overall perspective, how would you view it?
[00:24:20.16] So I guess just to put it in perspective – I’m not necessarily entirely familiar with how Alexa works, but anything that’s kind of pre-GPT would be rule-based, or based on automatic speech recognition, which is actually a fairly old field; even real-time speech recognition has been successful for a while, though not necessarily with the amount of success we see with deep learning. I mean, some of them were already using deep learning before… But then it’s kind of rule-based. So if you don’t formulate your request in quite the right way, it’s quickly going to say “I don’t know”, or just do a Google search.
Then what brought a change of paradigm was all the GPT models, and ChatGPT in particular, with this ability to perfectly understand human requests, no matter how they are formulated. Then to bring that to the audio domain, what you need is the ability for a language model like a transformer to process the audio streams. Ideally, you would think it’s as easy as for a GPT model: you have text tokens in, you predict the next token, and you just need some special characters to differentiate between the request and the reply, and you want to be able to do something similar with audio… But things are not quite as easy with audio. Audio is not as dense in terms of information. You can think of words as being – from an information theory point of view – an almost optimal way of transmitting information, while audio as recorded by a microphone is just a wave that’s oscillating maybe 40,000 times per second, and if you just look at it with your naked eye, it will make no sense. So you need the right representation to be able to feed that into a transformer model, have the transformer understand it, and be able to produce the output, and that has been quite a challenging task.
If we just talk about audio, the first few successes were, for instance, WaveNet, and on top of WaveNet there was Jukebox by OpenAI, which I think was the first “Let’s use a transformer language model to try to model audio.” But I recall from their paper that processing one minute of audio would take eight hours on a top of the line GPU at the time. So obviously, the technology has progressed a lot, and I think some of this progress was especially done by [unintelligible 00:26:46.08] for instance – he’s another co-founder at Kyutai – at Google, with SoundStream in particular, which provided these kinds of discrete representations at a relatively low frame rate… And then very quickly, Nel and his team showed that this could be fed into a transformer. At the time they were using a technique where you would still have many more steps – for one second of audio, you would need to do maybe a few hundred autoregressive steps, which is very costly. One second of equivalent information in text would be maybe three autoregressive steps… So that naturally put a constraint on both your context and the kind of length of the sequences you can generate, and completely ruled out the real-time aspect.
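The discrete representations mentioned here (SoundStream and its successors) are built on residual vector quantization. The sketch below shows that core idea under toy dimensions; it is a generic illustration, not Mimi's actual implementation.

```python
# A minimal sketch of residual vector quantization (RVQ), the core idea behind
# codecs in the SoundStream family: each latent frame is quantized in several
# rounds, each round encoding the residual left over by the previous codebook.
# Codebook sizes and dimensions are toy values, not Mimi's real configuration.
import torch

def rvq_encode(latent, codebooks):
    """latent: (dim,) vector; codebooks: list of (codebook_size, dim) tensors.
    Returns one integer index per codebook."""
    residual = latent.clone()
    indices = []
    for cb in codebooks:
        dists = ((cb - residual) ** 2).sum(dim=1)   # distance to every entry
        idx = int(torch.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]               # next codebook refines the rest
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Toy usage: 4 codebooks of 256 entries over a 32-dimensional latent frame.
torch.manual_seed(0)
codebooks = [torch.randn(256, 32) for _ in range(4)]
frame = torch.randn(32)
codes = rvq_encode(frame, codebooks)
print(codes, float((frame - rvq_decode(codes, codebooks)).norm()))
```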
Then when I was at Meta, I also worked on a similar topic, especially on how to kind of not do as many autoregressive steps, but try to predict some of the information in parallel, and how to organize it in a way that you would have kind of minimal dependency between the different aspects you need to predict. That maybe I guess is a bit hard to say orally, but basically it’s like for each timestamp, instead of having just one token like you would have in text, now you have maybe four, or eight, or 16 tokens… And yeah, you need to make sense of that. You cannot just flatten everything, because that’s just not going to work in terms of performance.
[00:28:13.23] And then there were a number of works… I think the one we use for Moshi is the RQ-Transformer, which models the dependency between those tokens for a given timestamp with a smaller transformer. I guess it was a pretty important algorithmic contribution from – I’m trying to remember who did that, but I don’t have it in front of me… But yeah, so we built on this expertise – the work that Nel had been doing, the work that I’ve been doing, and this RQ-Transformer paper… And that solves the aspect of being able to run a big language model, let’s say 7 billion parameters, take audio as input, and then output audio sufficiently fast for real-time processing.
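The hierarchy Alex describes – one big temporal transformer step per frame, plus a small per-frame transformer over the codebook tokens – can be sketched as below. Layer counts, widths, and vocabulary sizes are illustrative assumptions, and the real models add masking, caching, and the text stream; this only shows where the autoregressive steps go.

```python
# Simplified sketch of the RQ-Transformer pattern: a large temporal transformer
# runs once per audio frame, and a much smaller "depth" transformer predicts the
# K codebook tokens inside that frame one by one. All sizes are toy assumptions.
import torch
import torch.nn as nn

K, VOCAB, D = 8, 2048, 512   # codebooks per frame, codebook size, model width

temporal = nn.TransformerEncoder(   # stand-in for the big ~7B backbone
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4)
depth = nn.TransformerEncoder(      # small per-frame transformer
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
frame_embed = nn.Embedding(K * VOCAB, D)   # one row per (codebook, token) pair
to_logits = nn.Linear(D, VOCAB)

@torch.no_grad()
def next_frame(frame_history):
    """frame_history: (1, T, D) embeddings of past frames.
    Returns the K token indices chosen for the next frame."""
    ctx = temporal(frame_history)[:, -1:]   # one big step per ~80 ms frame
    tokens, depth_input = [], ctx           # depth transformer starts from ctx
    for k in range(K):                      # K cheap steps instead of K big ones
        h = depth(depth_input)[:, -1]
        tok = int(torch.argmax(to_logits(h), dim=-1))
        tokens.append(tok)
        emb = frame_embed(torch.tensor([[k * VOCAB + tok]]))
        depth_input = torch.cat([depth_input, emb], dim=1)
    return tokens

print(next_frame(torch.randn(1, 10, D)))    # toy call with 10 frames of context
```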
And yes, then the other aspect – I guess the one where we brought a lot of innovation – was the full-duplex aspect of having multiple audio streams. So one audio stream for the user, one audio stream for Moshi… And it’s not something you would naturally do with text, because you already have one stream, so going to two streams is kind of a hassle… But if you think of it for audio, all those tokens in parallel already form up to 16 streams that we already had to enter, so it was just like “Okay, let’s just double the number of streams.” So now we have two of them, which are clearly separated. The model is actually trained, during pre-training, to also generate some of the user’s reply, even if at that stage of the training there’s no real – it’s just kind of participants in a conversation that are sampled randomly. Then obviously with the model we released, it only tries to model its own stream… But yeah, that’s kind of the rough line of work that led to Moshi.
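One way to picture the multi-stream layout Alex is describing: at every frame, the model sees its own text token, its own audio codebook tokens, and the user's audio codebook tokens side by side. The sketch below is only an illustration of that idea; the field names and codebook count are assumptions, not the released model's exact layout.

```python
# Illustration of one ~80 ms frame in a full-duplex, multi-stream setup:
# both sides of the conversation are present at every timestep, so there is
# no explicit turn-taking. Field names and codebook count are assumptions.

NUM_CODEBOOKS = 8   # assumed audio codebooks per stream

def make_frame(moshi_text_token, moshi_audio_tokens, user_audio_tokens):
    assert len(moshi_audio_tokens) == NUM_CODEBOOKS
    assert len(user_audio_tokens) == NUM_CODEBOOKS
    return {
        "text": moshi_text_token,           # the model's own words, as text
        "moshi_audio": moshi_audio_tokens,  # what the model says in this frame
        "user_audio": user_audio_tokens,    # what the user says in this frame
    }

# The model only generates the "text" and "moshi_audio" entries; "user_audio"
# is always filled in from the microphone, even while the model is speaking.
frame = make_frame(42, list(range(8)), list(range(100, 108)))
print(frame)
```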
Then of course in audio modeling there are many other techniques that I didn’t mention. In particular, diffusion is very popular. There are many models doing diffusion for music generation, for instance, for TTS, for a number of things… And obviously, that’s not compatible – or that’s much harder to make compatible – with the real-time aspect, where the autoregressive language model is still kind of the more natural and dominant paradigm.
That was really fascinating in terms of understanding, and I definitely learned as you were describing it… I don’t think I’ve heard such an excellent description – not just of Moshi, but of how to get there. What I’m wondering in my head is – I can imagine, as you’re talking, so many cool things to do with this technology… What are some of the cool things that you’ve seen already, or that you guys have tried specifically, that maybe weren’t possible before, or that maybe people could only do at some level with something like GPT-4o – you know, through the API that way? But this is open source, it’s open science, people have a lot more capability… There must be some pretty awesome stuff out there.
I mean, there’s a few things that we’ve done that were really, really funny. For instance, just training on this old dataset from the ‘90s and like early 2000s of phone calls… And then it was not really like an assistant anymore. So it’s just like you end up on the phone with someone random, and they will tell you their name, they will tell you what they think about US politics at the time… And it’s really – it’s kind of a different thing that we tried to keep with the final Moshi, but obviously, with the phase of instruct tuning, we lost a bit of this… I mean, it still quickly falls back to the helpful AI assistant personality that’s maybe not as nice… But that was a funny thing. Basically, we can train it on anything, and then this is going to act like a kind of actor that would pretend to be a certain person in a very realistic way.
There’s a number of things that we’re exploring with this kind of approach – anything that would be speech to speech, or text to speech, or vice versa… Some of them we mentioned in the paper, just with this framework… Because we also have a text stream that we basically use only for the model to be able to output its own words. We don’t actually represent the words from the user; the model only outputs its own words. And by making the text late or early relative to the audio, we can change what the model does: if the text is early, then the audio is just going to follow it, and the model becomes a text-to-speech engine; but if the text is late, and you force the audio to the real signal and only sample the text tokens, that now becomes automatic speech recognition… So I think that shows how versatile this multi-stream approach is. And all of those applications are really streaming. Actually, something we did for the synthetic data was using this kind of approach to generate long scripts – you could imagine generating maybe 15 minutes, or whatever. Those are things that we’re now working on more independently.
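The early/late text trick can be pictured as a simple shift between the two streams. Here is a small sketch of that alignment idea; the two-frame offset and the padding token are arbitrary choices for illustration.

```python
# Shifting which stream "leads" changes the task: if the text stream lags the
# audio, the model effectively transcribes (ASR); if the audio lags the text,
# the model effectively reads the text out loud (TTS). Offsets are arbitrary.

def delay_stream(tokens, delay, pad="<pad>"):
    """Shift a stream later in time by `delay` frames (pad at the start)."""
    return [pad] * delay + list(tokens)

audio = [f"a{t}" for t in range(4)]   # audio frames
text  = [f"w{t}" for t in range(4)]   # text tokens

# Text delayed behind the audio: force the audio to the real recording and only
# sample the lagging text -> behaves like streaming speech recognition.
asr_like = list(zip(delay_stream(text, 2), audio + ["<pad>"] * 2))

# Audio delayed behind the text: the text is known first and the audio follows
# it -> behaves like a streaming text-to-speech engine.
tts_like = list(zip(text + ["<pad>"] * 2, delay_stream(audio, 2)))

print("ASR-like:", asr_like)
print("TTS-like:", tts_like)
```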
And yes, in terms of the general community, I’m not aware of anything in particular. One thing we want to do, though, is to release code to allow fine-tuning, maybe with LoRA, and also make it really easy. Obviously, the pipeline is a bit more complex, because you need audio, ideally you need transcripts, you need separation between the agent you want to train and the user… So we want to help in that regard, and try to make it easier to adapt it to a new use case.
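For readers curious what LoRA fine-tuning amounts to, here is a generic PyTorch sketch of the idea: freeze a pretrained linear layer and learn a small low-rank update on top of it. This is not Kyutai's planned fine-tuning code, just the standard pattern.

```python
# A minimal sketch of the LoRA idea: freeze the original weight and learn a
# small low-rank update on top of it. Generic illustration, not Moshi code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Toy usage: wrap a projection layer; only the tiny A/B matrices get gradients.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # 2 * (512 * 8) = 8192 trainable parameters
```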
Break: [00:34:06.14]
So Alex, you touched a little bit on the data side of this, and also on hopeful future fine-tuning opportunities… But I’m wondering if you could go into that a little more, because we’re able to talk about this sort of thing here, which sometimes we’re not able to do, given the nature of the models that we talk about on the podcast… What was the data situation in terms of the specific training or fine-tuning datasets that you put together and curated, as the model builder, for the model that you’ve publicly released?
Obviously, we had to put together both a pre-training dataset in audio and one in text. Initially, we had to put the text dataset together ourselves, because at the time there wasn’t necessarily an alternative that we could use in terms of license. And also, we wanted to be able to keep training both on text and audio, so as not to have a kind of catastrophic forgetting of the knowledge that comes from the text.
One thing we realized is that it’s basically much easier to have a very wide coverage of human knowledge with text than with audio. And then there were a number of other difficulties, in particular the fact that for the last stage of the training we needed audio with clearly separated speakers. And we also needed some kind of instruct dataset. So for the separation, we bootstrapped things from the Fisher dataset, which is the dataset of phone calls I mentioned earlier. That gave us a good enough base to then be able to train TTS models with separate speakers, in combination with some recordings. So actually – as I was talking about making faster decisions than in larger organizations – at one point we were like “Okay, we need really good, studio quality recordings of people on separate microphones.” So we got in contact with a studio in London, and the next day we were on the Eurostar, recording a few people, which I think was really fun… It’s good to have a break from just launching jobs and crunching numbers now and then…
And yeah, leveraging that, plus the Fisher dataset, we could then train a TTS model that we could make follow specific emotions, and that outputs two separate streams, one for each of the two speakers. And then we used that to bootstrap an instruct dataset.
Initially, we tried to convert the existing text instruct datasets to audio, but we quickly realized that a few scripts specifically tuned for audio would give much better results. One of the reasons is that if you look into some of those existing instruct datasets, they are very geared toward the way we use text models… Maybe some people copy-paste a Markdown table and ask the model to comment on it… There’s a number of entries that are specifically done for benchmark-style questions. So it’s going to be a multiple choice question, and the model just answers B. But that’s not something you’re going to do orally. You’re not going to give four choices and have the model just answer B. We needed a lot more multi-turn, and also shorter replies. You don’t want the model to spit out an entire paragraph for a reply.
So with that in mind, we had to kind of rebuild everything. Edouard did a lot of that. Some of it was pinging existing LLMs, asking “Okay, what are a hundred different tasks we could do with a speech assistant?”, and then for each task, “Give me a hundred possible scenarios.” And then we had another model that we had fine-tuned specifically to follow an oral style – shorter answers, quick changes of turns. We would randomly sample topics and have discussions around them… So we tried to cover different aspects like that, and then we synthesized everything. At the end the dataset was fairly large – I think a few tens of thousands of hours – and it was sufficient to get to the state of the demo.
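The recipe Alex outlines – tasks, then scenarios per task, then short oral-style dialogues rendered to two-speaker audio – could look roughly like the sketch below. `query_llm` and `synthesize_two_speaker_audio` are hypothetical stand-ins for whatever text LLM and multi-stream TTS are available; nothing here reflects Kyutai's actual pipeline code.

```python
# Rough sketch of a hierarchical synthetic-data recipe: tasks -> scenarios ->
# short spoken-style dialogues -> two-speaker audio. The two helpers below are
# hypothetical stand-ins, not real APIs from the Moshi release.

def query_llm(prompt: str) -> list[str]:
    raise NotImplementedError("stand-in for whatever text LLM you have access to")

def synthesize_two_speaker_audio(dialogue: list[tuple[str, str]]) -> bytes:
    raise NotImplementedError("stand-in for a multi-stream TTS model")

def build_synthetic_instruct_set(n_tasks=100, n_scenarios=100):
    dataset = []
    tasks = query_llm(f"List {n_tasks} different tasks a speech assistant could help with.")
    for task in tasks:
        scenarios = query_llm(
            f"Give {n_scenarios} concrete scenarios for the task: {task}. "
            "Keep them conversational: short turns, no markdown, no lists.")
        for scenario in scenarios:
            turns = query_llm(
                f"Write a short multi-turn spoken dialogue for: {scenario}. "
                "Replies must be brief, as in a real conversation.")
            dialogue = [("user" if i % 2 == 0 else "assistant", t)
                        for i, t in enumerate(turns)]
            dataset.append({
                "dialogue": dialogue,
                "audio": synthesize_two_speaker_audio(dialogue),
            })
    return dataset
```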
So even though it was kind of cool that we could bootstrap this entire modality basically from these 1,000 or 2,000 hours of recordings from the early 2000s, and a few hundred hours that we had recorded in the studio, one thing we noticed is that there is still what we call the modality gap. There is still a gap in knowledge between the text model that we started from and the speech side. Actually, as we trained the model, we kept training it on text, so we can always switch it to text mode and ask it a question in pure text – and the model gets much better replies on TriviaQA that way than it does with audio. And that’s, I think, a really fascinating question: how to make the model understand that it’s the same thing. At the same time, it’s very easy for it to treat them as two different modalities, especially with the pre-training on audio, where it gets kind of random audio, not necessarily focused on giving the right answers all the time… We could recover some of that with the instruct phase, but I think there’s still work to do for it to be as simple and efficient as a text model, and to really become super-useful and factual.
I’m curious if - and you may have mentioned it; you mentioned 7 billion parameters earlier, but… Is that the size of the model? Is it a 7 billion parameter size model?
Yes, it is 7 billion parameters. So as I mentioned, it’s an RQ-Transformer architecture – actually, I’ve found the author again; it’s Doyup Lee and his collaborators who first published this model – which has the main backbone transformer and a small transformer that just tries to predict the different acoustic tokens. That one is much smaller – I don’t have its exact size in mind – but in terms of runtime, inference time, it’s negligible. Most of the knowledge and decision-making happens in the big 7 billion parameter transformer.
How did you pick the model being that size, and also as an addendum to that, what is your perspective on kind of relatively smaller models versus the relatively larger models? How do you see that?
Yeah, I guess when we started, 7 billion was kind of the minimum size for large language models. Now 2 billion and 3 billion, down to 1 billion parameter models, especially with the advances in distillation techniques from bigger transformers, have become very efficient – they are now as good as 7 billion parameter text models from a year and a half ago. But at the time, when we started, we were like “Okay, we don’t know exactly how much compute, how much capacity it’s going to take to solve the task, so we don’t want to take too many risks.” 7 billion was well-charted territory, and at the time a pretty good balance. Now that we know that we can solve the task with 7 billion, obviously we want to try to go lower, and that’s something we’re exploring… Because the way we see things, it’s going to be very hard and probably not super-useful to try to put all the thinking and problem solving capacity into Moshi; we want it to be smart enough to have a direct conversation, understand what the user wants, and potentially then access other sources for more complex answers. That would also allow a more plug-and-play approach: when you have a new text language model, you don’t necessarily want to retrain the audio part from scratch. So the way we see it, it’s going towards a smaller model managing these direct, low-latency interactions, delegating some of the work to a larger model when needed. So for sure, now that we know it works with 7 billion, we’ll try smaller, so that we can run on a much larger number of devices.
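A quick footprint calculation helps explain why going below 7 billion parameters matters for on-device use. The quantization formats below are generic assumptions about how such a model might be packaged, not a statement about how Moshi actually ships.

```python
# Back-of-envelope weight memory for a 7B-parameter model on device. The listed
# precisions are generic assumptions, not Moshi's actual packaging.

PARAMS = 7e9

for name, bytes_per_param in [("fp16/bf16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:>9}: ~{gb:.1f} GB of weights")
# fp16/bf16: ~14.0 GB  -> roughly why the demo needed a top-tier MacBook Pro
#     8-bit: ~ 7.0 GB
#     4-bit: ~ 3.5 GB  -> a 1-3B model at low precision fits far more devices
```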
I guess you already started talking about additional things that you want to try with Moshi and these types of models in the future… But maybe stepping back a little bit as we get close to the end of the episode here – when you as a researcher in this area look towards the future, whether it’s work that you all are planning to do internally, or just things going on more broadly, what are some of the most exciting things for you as you think about the next years of your work, the things that you’re following and looking at? What’s on your radar, and what are you excited to participate in and see happen in the coming months?
Okay, in the coming months… That’s a good question. I mean, I think one topic that I’m interested in at the moment is the question of whether we’re one day going to be in a post-transformer era. I love transformers, and I love not having to wonder anymore… I mean, if we look at the set of hyperparameters to train those models, they have been frozen for maybe two years, two and a half years. The architecture is frozen… Which is good, because now we mostly focus on just making the right data to solve problems, and there’s a lot we can do. At the same time, I think I would be really excited to see advancements that could happen either on the optimization side or the architecture side. We’ve seen a lot of interesting work in this area, but at the moment we’re more at parity – we’ve found other ways of doing kind of the same thing, but nothing has really won on a decisive aspect or feature that transformers don’t already do sufficiently well. There’s been tons of engineering going into it, so each time you think “Oh, maybe the quadratic cost is bad”, people are like “No, you can just hardcore optimize your CUDA kernel and now it’s no longer your problem.”
But yeah, I think in terms of just the scientific excitement, that’s one thing that I want to keep my eye on. Obviously, at the same time there is a lot of competition going on just applying the current model. It’s not necessarily easy to free time and mental space to try to think about those issues. So that’s one aspect.
And yeah, then I’m also curious about how the framework aspect is going to evolve. Working day to day with these technologies really feels like you’re back in the seventies, in the pre-C era, where you have to think about the CUDA, the code is different for each architecture, there’s a lot of abstraction leakage… You’re not going to write a nice function; you need to write kind of dirty things, you need to do the equivalent of pointer arithmetic all the time… So that’s another thing.
So maybe I’m not replying to your question of what’s coming in the next few months, but longer term, sometimes I just imagine myself in 10 years, when you can just write your attention kernel in a few lines of code in a dedicated language and get almost perfect code. I think that would be amazing – to just explore more things more easily. But we’ll see. So yeah, two big potential changes, but I think something’s going to happen in the coming years.
Yeah. Well, thank you very much for sharing your perspectives with us, and also thank you for the way that you and the Kyutai team are inspiring many people out there that are working on open models, open source, open science, and kind of just generally collaborating in this space. We really appreciate kind of what you’re doing as part of that, and thank you for taking time to chat with us. It’s been great.
Yeah, thank you very much for the invitation and the opportunity to present, and hopefully we’ll have some others in the future.
Definitely. Yeah.
And enjoy Thanksgiving.
Our transcripts are open source on GitHub. Improvements are welcome. 💚