Democratizing ML for speech with David Kanter, executive director of MLCommons (Practical AI #164)

You might know about MLPerf, a benchmark from MLCommons that measures how fast systems can train models to a target quality metric. However, MLCommons is working on so much more! David Kanter joins us in this episode to discuss two new speech datasets that are democratizing machine learning for speech via data scale and language/speaker diversity.

45 minutes
Recorded Jan 12, 2022
Published Jan 19, 2022

Featuring

David Kanter – GitHub, X
Chris Benson – Website, GitHub, LinkedIn, X
Daniel Whitenack – Website, GitHub, X

Notes & Links

Press Release about MLCommons datasets: MLCommons™ Association Unveils Open Datasets and Tools to Drive Democratization of Machine Learning
NeurIPS Papers:
- People’s Speech Dataset
- Multilingual Spoken Words Corpus (MSWC)
Gradient article: New Datasets to Democratize Speech Recognition Technology
Blog posts for more insight:
- People’s Speech
- Multilingual Spoken Words Corpus (MSWC)
Downloads:
- People’s Speech
- Multilingual Spoken Words Corpus

Transcript

Daniel Whitenack

Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

I’m doing very well. Happy New Year, as we are in the early winter of 2022 now, recording this.

Daniel Whitenack

Yes, the very cold early winter of 2022. Yeah, I think our listeners will probably have heard another episode this year, but this is the first that we’re recording in the new year. So Happy New Year to everyone, again. But it’s always good at the new year to revisit certain things or look forward to things that you thought about last year. And it’s cool that we have David Kanter with us, who is the executive director at MLCommons, because we had a great conversation with David last year, and it’ll be good to catch up on all the amazing stuff that MLCommons is doing. So welcome, David.

Thank you. It’s great to be back. Happy New Year to everyone. I also have to point out - it is only cold if you aren’t in California.

Daniel Whitenack

Yes. [laughs] Well, I certainly am not, at least judging by my walk into the co-working space today. Definitely not California. Although our listeners can’t see, but you are wearing a hoodie zip-up type of thing, which–

Oh, no, no, no. It’s not a hoodie, it’s a fleece, but it’s also–

Daniel Whitenack

It’s a fleece.

I’ve gotta have my own logo.

Daniel Whitenack

Right. It’s swag. Yeah.

Yeah.

You’re looking good, man.

Daniel Whitenack

Sure. Sure.

Yeah. I’m in San Francisco, for those who don’t know, and we should be clear - San Francisco weather is not L.A. weather.

Daniel Whitenack

Yeah, that is true. Yeah. I’ve been there on occasion when that is apparent. So yeah, welcome back… I know we had some introduction to MLCommons in our last episode, which we’ll link in our show notes. But for those maybe that haven’t heard that or maybe want a refresher, could you just give us a little bit of an idea about what MLCommons is, why it exists, and what are some of the things that you’re doing?

Yeah, absolutely. So the mission of MLCommons is really making machine learning better for everyone. So I think of our goal as being how do we stimulate innovation in ML in a way that really benefits society and the whole world.

[04:22] So we were started– it was a very informal collaboration starting in 2018, but we actually formed a non-profit in 2020 that I am the founder and executive director of. We’ve got an amazing team, and it’s an industry consortium, so we’re bringing together a lot of heavyweights in the ML world from the systems side, from the software side, from the cloud side, who are all focused on this.

That mission of making ML better for the whole world - we have three real pillars to that. One is benchmarks and metrics. And so the first thing that a lot of people know us for is MLPerf, which are the industry standard benchmarks for the speed of training neural networks and doing inference. Actually, one of the things that’s been really cool since we’ve talked is– you know, I think when we first talked, we might have only had two of the benchmarks in our suite up and running. And we’re now, what I like to say is covering from microwatts to megawatts. So the smallest systems we’ve measured performance on are maybe tens of microwatts, really deeply embedded IoT devices, up to the world’s number one supercomputer, which is 20 megawatts, and everything in between. And so you can see what’s really cool.

I have some slides that I use in Keynote saying, “Hey, if you look at the two and a half years we’ve been around, you would expect that just through Moore’s law alone, ML solutions would be about 2.5x faster.” And that’s great, right? We want to make your life easier, Daniel, when you’re doing the work… But if you look at the MLPerf data that we have, actually, it’s more like 16x to 30x faster. And so it’s really cool to see when you get benchmarks and when you get metrics and everyone starts really rowing in the right direction, the kind of momentum you can get in the industry. So that’s one example of ways we can make ML better for everyone, is just being able to train bigger models and do it faster, and really bring those capabilities out.
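As a rough illustration of the arithmetic behind that comparison, the sketch below assumes Moore's law means a doubling roughly every two years (the doubling period is an assumption, not something stated in the conversation) and sets the result against the 16x to 30x range quoted for MLPerf.

```python
def moores_law_speedup(years: float, doubling_period_years: float = 2.0) -> float:
    """Expected speedup if performance doubles every `doubling_period_years`.

    The two-year default is an assumption made for this sketch.
    """
    return 2.0 ** (years / doubling_period_years)


# Two and a half years, as mentioned in the conversation.
expected = moores_law_speedup(2.5)  # about 2.4x under the two-year assumption

# Range quoted in the conversation for MLPerf results.
observed_low, observed_high = 16.0, 30.0

print(f"Expected from Moore's law alone: {expected:.2f}x")
print(
    "Improvement beyond that baseline: "
    f"{observed_low / expected:.1f}x to {observed_high / expected:.1f}x"
)
```

Under that assumption the baseline lands close to the "about 2.5x" figure, which suggests most of the quoted gain comes from something other than transistor scaling alone.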

The second pillar, the one that brings me here, is our datasets. And I like to think of datasets as really being the raw ingredient for ML. If the industrial revolution was powered by iron and coal - not the most environmentally friendly stuff - we’re really powered by data, right? And I think at SIL, you guys understand that just as well as everyone else; if you want to start talking in new languages or working with new texts, you can’t get from English to Urdu without some new data, right? And bringing large scale, open, nicely curated data is a huge boon for the industry, because even at some of the biggest shops in the world, places like Google or Amazon, their researchers want to use public data because the whole point is that you’re sharing techniques that everyone can use in driving the industry forward.

Daniel Whitenack

Yeah. That’s really interesting. Sometimes I get the question when I’m working on various datasets, what the benefit– and like you said, we deal with this at SIL, too. We have a bunch of data in our archive or wherever that linguists have gathered over a very long time… And sometimes the question comes up, well, first off, why would anyone in the industry be interested in this strange data? And we can answer that. But then secondly, why would we want to make this open? How would that benefit our organization to work on open datasets, versus closed and proprietary stuff.

[08:02] When you’re having those conversations, how do you express that to people? Maybe the business case for contributing to open datasets? Because you do engage with large organizations on the datasets you work with, so how is that phrased to their leadership?

Yeah. So I think it depends on the companies. There’s some companies where their data really is a critical differentiator and they probably don’t ever want to open it. But even in the case where that is true– so I’ll give you an example. There’s a company, Criteo, in Europe, that actually was really fantastically cooperative and opened up one of their datasets for our use. It was an older dataset that they weren’t super concerned about a competitor having access to, right? And they said, “Look, if you use this dataset for your benchmarks, systems are going to get faster for what we want to do.” So this is a way that you can kind of, in some ways, punch above your weight in getting the attention in the industry. And I’m sure you’ve talked to folks about how when you start training on a dataset, the models begin to specialize a bit for that. And so if you want your use case to be popular, making it open is a really good one.

And then a lot of our member companies, if you look at someone like Intel or NVIDIA, more open data means more ML, means more people doing cool things with computers, means more sales. So in a lot of cases, I think you hit the nail on the head, which is as a nonprofit and industry consortium, I can’t force people to do what they don’t want to do. I can provide persuasion and I can try to help them understand where our interests are aligned. And I think there are a lot of ways where that’s great.

Chris, you work in the defense and intelligence community. So I had a really cool meeting late last year with some folks there, and one of the things they were saying that they love about open data is certain parts of that community have real difficulty spending money through normal channels.

Daniel Whitenack

Indeed.

And one of the interesting things, specifically about speech, is by and large, the intelligence community tends to care about languages that are not very commercially viable for regular products, right?

Daniel Whitenack

Sure.

There’s not as much interest. Let’s rewind to 10 years ago. If you asked the CIA or whoever, “What language would you love to have an automagical translator?”, they’d be like, “Oh, Arabic, for sure”, right? And they wouldn’t have said English or Mandarin, which are probably the two most popular speech dataset flavors out there. So yeah, I think just open data is hugely powerful in enabling researchers and everyone to work together on these problems.

Do you think there’s a cultural aspect to it in terms of people getting used to this idea? Because when you position it like that, you can clearly see the business case for it. I know in my experience, it seems like some years back we went through this with open source software, and people struggled to see that business case and understand, and now it has swept almost all aspects of business. Do you envision that benefit, in terms of being able to influence how things go and participate in that on a larger scale, being more widely adopted over the years ahead?

Yeah, I think so. I think one of the other things that I’d say is when you look at open data, even where you might be differentiating via your own data, in a lot of cases, it’s going to be additive, right? So let’s say you want to do a medical transcription thing. You might not want to donate to the wide world all of your work on translating and decoding specialized medical terms like tachycardia, right? But if you were going to train a natural language model, some version of say BERT, you might start with BERT on a large public dataset like Wikipedia or a crawl of the web, and then you’d say, “Well, I’m going to add on my special sauce later, to fine-tune or supplement or augment that.”

[12:16] So one of the ways I think about this is that getting from zero to product is a really big push, and you need the data to get there, but it’s not the only thing you need. There’s also the whole process of how do I productize it? How do I test it? How do I make sure it’s not going to do wild and crazy things that I don’t expect? And if I as an organization can provide some open data and get the whole world a few steps down that path and accelerate things, that’s just good for everyone.

Daniel Whitenack

I’m curious, I’ve got a follow up for you on that, and that is – I know that there are people listening to this, because it’s in my head as well, that are thinking “My organization needs to do better on that.” And as we talk about the decision point about how do I make the business case and what should go open, what should we keep as our secret sauce, and all that, any guidelines on how to make such an evaluation? I realize that use cases vary a lot, but I have had conversations with people where they either want to just throw everything open or they want to do everything as their special sauce, but there doesn’t seem to be a method to the madness on what constitutes the two. How would an organization do that, recognize that by going open, selfishly, they get the benefit of being able to steer things, but they still get to keep their secret sauce? Any thought about how to make such an evaluation?

Yeah… So the other thing I was actually going to mention is one of the things that our organization is very dedicated to is the maintenance and upkeep of these. So if you think of ImageNet, that really started this round of innovation in machine learning. And it was ImageNet, and then the realization that, “Hey, AlexNet plus GPUs plus ImageNet beats humans at image recognition. Oh, my gosh”, right? Now, ImageNet is fantastic, it’s amazing, but now it’s old, it hasn’t been updated, there’s some legal issues associated with the licensing… So imagine we’re doing speech-to-text, and we did that in 2019. That model, that dataset would not have anything about COVID or coronavirus or any of these other things. So you want to keep these things up to date, and that does cost some amount of money. And so by offloading that to the community, or MLCommons, we’re curating our stuff - it’s part of where our budget goes - that can be very helpful in terms of combining community resources. And so that’s another aspect to the business case.

But to answer your question, I think it is very easy to slide into these binaries, right? It’s a very natural way of thinking. And where we’ve had a lot of success, I think, is usually where there is some sort of research folks involved or some sort of specialists who can think about this and reason about it - these kind of endeavors inevitably involve your legal department.

That’s a great point.

And we don’t want to get folks in trouble, and so there’s all sorts of issues about how is it collected? Do you have the right permissions, GDPR? And a lawyer is probably not going to be able to give you the insights as to what is commercially valuable, what is commercially threatening, could it be monetized in other ways, or “No, is this just the thing that makes sense.” So you kind of need some element of the business and potentially technical community. And so we’ve found that that is oftentimes a very helpful driver. One of the things that is beautiful about the machine learning community is I think openness is in our DNA. I think for a lot of researchers, that is a default mode of thinking, and that is helpful.

[15:57] to [16:58]

Daniel Whitenack

So David, as MLCommons was thinking about this pillar of datasets, and obviously there’s a lot of different datasets that could be valuable to the ML community, and all sorts of data, whether that be one type of data or multimodal data, or all sorts of things - how did MLCommons settle on an initial maybe focus on speech data? Because I know we’ll be transitioning to talk about a couple other very impressive datasets that were recently released, but they’re both speech datasets or speech-related datasets… So how did that come about?

Yeah. So when I think about the dataset problems that I want my organization to– or the organization I serve to tackle, I sort of think about it as two sides of the equation: return and investment. And by return, I really mean what’s the impact, right? I want to be working on problems that’ll have an impact, where we’re going to provide some very clearly differentiated, very valuable things that the community doesn’t have. If all you do is produce something that already exists with a slightly friendlier license, that can have a big impact. But it’s not as great as, “Now one of our datasets is the first in many, many languages”, right? That’s a qualitative change in the landscape; that’s really exciting.

And then the other side of that is, do we have the right people to do it? And what is the investment that’s involved there? And can we bring that skillset together? Are we uniquely positioned to do that?

For us, speech was very much ground zero for a couple reasons. One is we actually want to build up the infrastructure around building these datasets, because part of the theory behind our organization is that actually a lot of the plumbing behind the scenes - there’s some commonality to it, and sharing that can be helpful. I mean, obviously, there’s greater alignment between two speech datasets than speech and vision, but still, there’s a lot of just general wrangling you’ve got to do. And also, some of the leads on our project had prior experience building speech datasets, and there was a very real desire to have– and all of our datasets are permissively licensed, for both research and commercial use. And there’s a real desire to build something big enough to train an end-to-end speech model on, and then begin to tackle diversity. So those things all kind of lined up for us in that there was a gap in what was out there, and we had the right expertise and we thought that the impact would be big.

Daniel Whitenack

Yeah, so maybe you could just describe - generally, there’s two large datasets that you’ve released recently. We’ll link, of course, to the specific information about those, and people can obtain them and download them and start working with them… But maybe just give us a picture of each of those and what the goal was with each one.

Yeah. So the first one is called the People’s Speech, and I think I talked with you guys about this last year… And that is 30,000 hours of labeled speech data, and it’s intended for speech recognition purposes, but you can use it for a lot of other things. We’ve had people who downloaded it to do denoising filters for acoustical applications. And it’s, like I said, permissively licensed conversational speech, so it’s not just read audiobooks, as was common before. And it’s very big. It’s 30,000 hours. It’s about, I think, two to three terabytes.

[20:40] And part of the point here is we’ve got– you know, when I think about speech, I think about three dimensions to the datasets. One is size, one is the languages, and then the other is the context, and do you have noisy speech, or is it something really clearly recorded? Are there kids playing in the background, and other things like that? And so this is putting the stake in the ground on the size, and we can work on the other things. And like I said, it’s big enough to train an end-to-end model, so kind of over the 10 to 15 thousand-hour tipping point. So that is going deep and big in the size dimension.

And then the second one, which is nicely complementary, is the Multilingual Spoken Words Corpus. And this is exciting, because this is really pushing the boundaries on the diversity angle from the number of languages. And so that - it’s not for speech recognition, but for keyword spotting, so recognizing keywords; it’s a modest number of keywords. And it is 23 million clips that are each about one second long, covering 340,000 keywords, with 115,000 different source speakers in 50 languages. And the thing to me that’s really cool about that is those 50 languages cover five billion human speakers. So this is the majority of the world’s population. And for most of these languages, this is the first existing dataset for keyword spotting in those languages, and certainly, the first under such a nice permissive license. And we’re talking about languages that are not supported by these home assistants, right? I don’t want to pick on one company, but if you look at a lot of the home assistants or things like that, they might support a dozen languages; but there’s a lot of languages where there’s not a lot of support. And so bringing that capability to these new communities is really exciting. It was very cool to see some of the people tweeting about, “Hey, my language is in here”, or getting emails about “How can I add my language?” So that’s pushing forward in a different dimension, but again, also speech.
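To get a feel for the scale of those figures, here is a small back-of-the-envelope sketch. It uses only the numbers quoted above and treats "about one second" as exactly one second per clip, which is an approximation.

```python
# Figures quoted in the conversation for the Multilingual Spoken Words Corpus.
clips = 23_000_000
keywords = 340_000
speakers = 115_000
languages = 50

# Assumption for this sketch: every clip is exactly one second long.
seconds_per_clip = 1.0

total_hours = clips * seconds_per_clip / 3600
clips_per_keyword = clips / keywords
clips_per_speaker = clips / speakers

print(f"Approximate audio: {total_hours:,.0f} hours across {languages} languages")
print(f"Average clips per keyword: {clips_per_keyword:.1f}")
print(f"Average clips per speaker: {clips_per_speaker:.0f}")
```

These are plain averages; the per-language distribution discussed later in the episode is far from uniform.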

So I’m kind of curious… I’ve worked with datasets from whatever the task that I’m focused on, but I’ve never done anything at the scale of having to put the data together. What’s involved in that? If you’re looking at you have the keyword spotting and you have the speech recognition use cases and you have massive amounts of content that you’re doing, I would imagine that there is a level of organization and effort required way beyond what most of us that are just doing typical day-to-day machine learning models are having to deal with. How do you even start that? How do you approach such engagement?

Hm… That’s a really good question. So our team is pretty small in terms of the number of engineers. So I’m not sure that– you know, resource-wise, it’s not like we’re coming in with 10 people, or whatever, full-time, or anything like that. I would say, I think the thing that we really wanted to focus on was– because we knew, as you said, this is a big problem; I think we had to focus on things that scaled initially. A really good example of that is we did a back-of-the-envelope calculation of, “Okay, what would it take to label the data manually?” And I think we calculated it would be on the order of about $10 million.

[laughs] That’s a bit of a budget right there.

[24:10] Right. Yes, that’s several years of budget for my nonprofit. And we’ve got to keep the lights on, we’ve got to pay the employees… So that was out of the question. And so we actually– I think one of the things we really wanted to focus on was building the tools and leaning into compute rather than manual labor. And so we were able to label the data and generate the labels using computer systems for under about $20,000. So I think it’s attention to that and building up the tools. And all of the tools are open source, by the way. Not only can you go and get the dataset, but you can get the tools we used to create it, and we’d love to see people updating them, fixing bugs, whatever it is, and see community adoption of all of those pieces. But again - yeah, I think it does come down to the right tools, right? If you’re building one house, you want one set of tools. If you are building a 50-story apartment building or a dozen houses, you’re going to have a slightly different infrastructure.
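The gap between those two estimates is easier to see as a ratio. The sketch below assumes both figures refer to the same 30,000 hours of People's Speech audio (the conversation implies this but does not state it) and treats the "under about $20,000" figure as an upper bound.

```python
# Estimates quoted in the conversation.
manual_cost_usd = 10_000_000   # back-of-the-envelope cost of manual labeling
automated_cost_usd = 20_000    # upper bound quoted for compute-based labeling

# Assumption for this sketch: both estimates cover the full 30,000 hours.
audio_hours = 30_000

cost_ratio = manual_cost_usd / automated_cost_usd
manual_per_hour = manual_cost_usd / audio_hours
automated_per_hour = automated_cost_usd / audio_hours

print(f"Manual labeling is roughly {cost_ratio:.0f}x more expensive")
print(f"Manual: about ${manual_per_hour:.2f} per audio hour")
print(f"Automated: about ${automated_per_hour:.2f} per audio hour")
```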

Daniel Whitenack

Yeah. I would be kind of curious to maybe dig a little bit into that, because when you’re creating a labeled dataset, people might think, “Oh, I want gold standard data”, which necessarily in their mind means human-labeled, without a machine in the loop of creating the gold standard data. So maybe describe a little bit more like why this machine process can create something that’s useful to train a machine on, I guess, would be the question.

Right. Or put another way, okay, so the snake is eating its own tail - does that actually work?

Daniel Whitenack

Right.

Right. Or are we left to where like there’s no snake anymore. So first of all, I think the technical team had some great instincts here. And one of the big hypotheses is, as you say, a lot of people will focus on extremely high-quality data. But it’s also pretty common that if you have some bad data in there, it may not really be a huge problem. And so I think part of this is accepting that a large amount of modest quality data– and actually, we can talk about some of the details, because in generating the labels, we actually did figure out or try to estimate how many of them were good. And a big chunk of it was done sort of perfectly; some of it was done to sort of human quality, some of it was worse. And I think there was a bit of a gamble there, that that would ultimately prove to be useful. But we do know that machine learning is very good at handling rough inputs, and so it seemed like a pretty good hypothesis there. Also, the ways that a machine in the loop for labeling goes wrong are probably going to be a lot different from the ways a human goes wrong.

I’m just kind of curious… I’m sure Daniel already knows this, but as you’re putting these datasets together, is it kind of one shot and you’re done and it’s there, or is there a management of that dataset over time? Do you go back and re-label with new tools or anything, or do you move on to a whole new dataset? How do you think about producing things?

I guarantee you, like any engineer, we tried a few things, and the first things did not work. I’ll give you an example. I know we evaluated several different forced aligners. So we started for People’s Speech by scraping data where there was a transcript and the audio, but they weren’t necessarily temporally aligned, which you need to train a speech-to-text model. So we did our own transcription using Kaldi, with an n-gram language model. And so we came up with our estimated transcription, and then we ran forced alignment on that to get the timestamps right.

[28:11] And all of this audio came with subtitles. But just as an example, sometimes the subtitles are a little wonky, right? You might get a subtitle – maybe the language spoken is French and the subtitle is in English, or maybe it’s a description of a picture, or something like that. So you can get all sorts of really interesting problems. And so we tried different aligners, and we did find one that we thought solved some of these problems; it couldn’t solve all of them, but we definitely tried out a bunch of different things. And to us, one of the proof in the pudding sort of things was actually training a speech-to-text model on our dataset and seeing how comparable it was to using something like LibriSpeech, which is much smaller, but is sort of the gold standard today.

And we got in the ballpark of LibriSpeech, which we were like, “That’s good enough”, but I will tell you, we did not start out in the ballpark of LibriSpeech. There’s all sorts of problems. You have mismatched transcripts where things are just– but you find the problems and you hammer them down, one by one. That’s my experience. I don’t know, Daniel, what about you?
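One generic way to flag the kind of mismatched transcripts described here is to compare the machine transcription of a segment with the subtitle that came with it and drop segments where the two disagree too much. The sketch below is only an illustration of that idea using word error rate; it is not the pipeline the People's Speech team used, and the function names and the 0.5 threshold are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def keep_segment(subtitle: str, machine_transcript: str, max_wer: float = 0.5) -> bool:
    """Keep a segment only if subtitle and machine transcript roughly agree.

    The threshold is a hypothetical value chosen for illustration.
    """
    return word_error_rate(subtitle, machine_transcript) <= max_wer


# A subtitle in one language paired with audio transcribed in another is rejected.
print(keep_segment("hello world", "bonjour le monde"))
# A near match is kept.
print(keep_segment("the cat sat on the mat", "the cat sat on a mat"))
```

A filter like this trades recall for precision: a stricter threshold discards more usable audio, while a looser one lets more mismatched labels through.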

Daniel Whitenack

Yeah, I mean, this is a never-ending process, right? And I think – like, you were talking about ImageNet and others, this is only… Things get stale and need to be updated, and language, especially, is always changing very rapidly. And like you say, if something was released before COVID and you need COVID health words and stuff after COVID, it needs to be updated. As part of your release, do you have a mechanism for people to make contributions or feedback or human evaluation type of feedback or anything like that?

Yeah. So again, we’re big on community. We have an open Discord channel that you can drop in and tell us what’s great, tell us what doesn’t work. When we released it, as you might expect, the initial sets of comments were, “This is really amazing, but we found this hitch.” So we’re nailing those down. And we’ve got a Google group and mailing list for this. Again, all the code is open source, so if folks want to file bugs and– I mean, the thing to me that is most exciting is now that it’s released, a year from now, what are people doing with it? To me, that’s really the sort of thing that’s very cool.

One of the other things I would actually mention just briefly - an example of a scale thing that we had to work on is when we were initially doing the forced alignment, one of the things we found is that just using out of the box software was not sufficiently fast. And so we actually had to go through and optimize our forced aligner, both the acoustic model and language model, and get them running on an accelerator, on a GPU. And at that point, we were able to run at 250x real-time. And so that’s an example of paying attention to scaling and systems where it’s like, yeah, 30,000 hours is a lot. You don’t want to be renting an Amazon instance for 30,000 hours, because a) we got to release it. But okay, you cut that down by a factor of 250 and it’s like, “Okay, we can rent a few of those”, right? So there’s all sorts of things that go into this.
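The effect of that speedup is simple to work out. The sketch below uses the two numbers from the conversation and assumes a single accelerator running continuously at the quoted rate; the instance count in the last step is a hypothetical example.

```python
audio_hours = 30_000        # size of the dataset, as quoted
real_time_factor = 250      # audio hours processed per wall-clock hour, as quoted

# At 1x real time, alignment would take as long as the audio itself.
baseline_days = audio_hours / 24

# Assumption: one accelerator running continuously at 250x real time.
compute_hours = audio_hours / real_time_factor
compute_days = compute_hours / 24

# Hypothetical: spreading the work evenly across four instances.
instances = 4
wall_clock_hours = compute_hours / instances

print(f"At 1x real time: {baseline_days:,.0f} days")
print(f"At {real_time_factor}x real time: {compute_hours:.0f} hours ({compute_days:.0f} days)")
print(f"Across {instances} instances: {wall_clock_hours:.0f} hours")
```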

Daniel Whitenack

So being with SIL, I think I would probably get fired if I didn’t ask a little bit about, particularly, the dataset with the spoken words, and your attempt to, like you say, bring in a bit of diversity into speech datasets that were out there. As you look back on that, how does this dataset stack up against maybe– like, what are other examples of datasets out there for spoken words, and how does this change the picture in terms of the diversity of languages?

That’s a great question, and that hits on the things that I’m most proud of, in some ways. So I think the gold standard is Google Speech Commands. And I am not as much of an expert on the keyword spotting area as the speech-to-text, but my understanding is that’s a dataset with about 105,000 one-second utterances, covering 35 words in English. And it’s great. Again, fantastic community resource. And if you go back to what I said, we’ve got 340,000 keywords in 50 languages with 23 million clips. So there’s 50 different languages we’re covering, and some of them have very good coverage. We bucketed it into three categories: low resource languages, which is under 10 hours of data… So that’d be something like Georgian, Tamil, Vietnamese, Arabic, and that’s about half of the languages. And then we have medium resource languages with between 10 and 100 hours, and that’s 12 languages. Some examples of that are like Czech, Ukrainian, which is actually– that’s the first Ukrainian dataset of its kind, for sure. Turkish, Portuguese, Indonesian… And then we’ve got high-resource languages, which is over 100 hours. And that includes actually some reasonably obscure languages like Basque, Catalan, Persian, and then, of course, some standard ones like English, Welsh, et cetera.
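The three buckets described here amount to a simple threshold rule on hours of audio per language. The sketch below encodes it; the handling of exactly 100 hours is an assumption, and the language names and hour counts in the example are placeholders rather than real corpus statistics.

```python
def resource_bucket(hours: float) -> str:
    """Classify a language by hours of audio, following the thresholds quoted above.

    Assumption: exactly 100 hours counts as medium, since high is "over 100".
    """
    if hours < 10:
        return "low"
    if hours <= 100:
        return "medium"
    return "high"


# Placeholder values for illustration only; not actual dataset statistics.
hours_by_language = {"language_a": 3.5, "language_b": 42.0, "language_c": 250.0}

buckets: dict[str, list[str]] = {"low": [], "medium": [], "high": []}
for language, hours in hours_by_language.items():
    buckets[resource_bucket(hours)].append(language)

print(buckets)
```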

So to me, the really exciting thing is – first of all, it’s a lot of languages. And also, some of the work in this paper was about, okay, when we say low-resource, does that mean you can’t use it? Or what does that really mean? And actually, what we’ve found is that you can use a lot of the low resource languages, I think, for few-shot training and fine-tuning examples, right? And so that is really powerful in terms of bringing out new capabilities. And I think the blog post I wrote was cheekily like “Giving a voice to five billion people”, right?

Daniel Whitenack

I think especially speech… You know, text is a slightly different scenario, but especially with speech, you actually– we always talk at SIL about the long tail of languages, which is you’ve sort of got the top 100 and then you drop off very quickly in terms of resources that are out there all the way out to 7,000 something languages. So you actually don’t have to go very far. As soon as you get past the first 10 or so top languages in the world with speech data, it drops off very rapidly. And so yeah, it’s really cool to see this effort that pushes that out there, and that’s really encouraging, and my brain’s already going with ideas of how to use this.

[36:05] I think it’s also interesting, looking at it, how you’ve– I mean, it’s one thing to release data, but also it’s very useful, at least in my view, to provide additional metadata and annotations along with those words. Like, I see you have parts of speech and semantic categorization, which is all really interesting and something that I think will be one of those things that drives maybe surprising uses of the dataset. So yeah, I’m not sure when that came up in your planning, but I’m glad you went for that extra kind of metadata information.

Yeah. This is part of the curation process, and how can you improve things over time and make it more useful. And part of it is - you know, we talked to the member organizations along the way as we were doing this to get some feedback on what would be useful… And the other thing I should mention is, actually, the Multilingual Spoken Words Corpus is actually a dataset generator, so you can dial in your own keywords, right? And I think because of that capability, it’s useful to know what parts of speech are there, and what are the content types and the topics and domains. So to me, it’s part of making it low-friction and usable, and also just understanding really what’s going on. And I think that kind of analysis is super-important for large-scale datasets.

To jump around a little bit and talk about the People’s Speech, one of the things we did is we randomly sampled about 5,000 hours to find out what kind of background noise we had… Because if you’re going to train a model, you want to know. And we found lots of music in the background, conversation, basketball bounces, all sorts of things. And ultimately, as a data scientist, just getting this is really good. And I should mention, for the Multilingual Spoken Words Corpus, it’s built on top of Common Voice. So what we have is this really cool pipeline that can pull in a dataset. And what’s cool about Common Voice is it’s ordinary people. So anything that goes into Common Voice will eventually get incorporated into ours. And there’s going to be good background noise and different kinds of recording circumstances. But I think it really helps, to the extent that we can, to try to characterize that.

Kind of in the spirit of “If you build it, they will come”, what kind of member organizations are you pulling in now that you have these new datasets out there? And as you’re attracting new participants, how are you organizing them, and how are you able to put them all together in such a way that it continues the development forward in a cohesive way? How do you manage that process?

Oh, that’s a good one. Well, so we released these datasets in the middle of December; it hasn’t been that much time, so I don’t know if we’ve gotten any new members to join yet, but certainly, part of my goals for 2022 is getting new members… And getting folks who are excited about contributing and using.

Look, there’s a lot of ways that folks can contribute. If you’re using the dataset and you give us feedback, that’s super helpful. I mean, obviously we have to get funding from somewhere, and so we do want new members. I would say that as an organization, a lot of our folks are very benchmark-centric and focused on MLPerf, but I’m looking at, as we evolve the organization, how can we allow for participation from folks who are much more data-centric? The motivation there is a little bit different, right? It may be that it’s not, “Oh, pay us to join this consortium.” It may be, “Do you think this is a great effort? Cut us a check”, right? There are a lot of government and other organizations - Daniel, you’re at one - that are very invested in speech diversity, and they might maybe consider doing grants or something like that. So this is something we’re just starting to explore.

Yeah. My mind is already thinking of follow-up conversations that I need to have with you, David, but…

That’s great.

[40:09] Yeah. I hope others, and our listeners will be, as well. So as you move forward this year, since this is a first recording for the new year, what is in the future roadmap for MLCommons, both in terms of the datasets, but maybe other things too that you’re excited about diving into?

So the first thing is, specifically on these datasets, I think of datasets as sort of – we want them to be living datasets. It’s like a garden, you’ve got to prune it, you’ve got to water it. And so one of the things that I think MLCommons is uniquely positioned to do is to be that organization that does maintain it for the community. So we’ve got engineers whose full-time work is to help maintain and improve this. So there’s that incremental work to get better, maybe add a little bit more data, but we are looking at new datasets that, again, we think will push the needle forward. There’s a vision dataset that we’re looking at that should have some really new, nice, diverse aspects to it, compared to what’s currently out there. I think that, on the dataset side, is a big thing for us.

And then we’ve got some other projects that actually we announced at the same time. There’s the data-centric AI movement, which is at a high level saying, “Look, we have a lot of competitions where people are showing off their models.” I saw on Twitter Yann LeCun was talking about, “Do we want to use transformers or convolution-based networks for vision?”, right? And with convolutions having been the gold standard for a while, but there’s a lot of really interesting research focus on using transformers in that area. That’s very model-centric. And one of the things that we want to do is focus on data-centric AI, because we think that is really powerful.

We have an initiative called DataPerf, and the idea is to say, “Hey, let’s run a competition where instead of showing off the coolest model, you show off things like data augmentations, or different splits that can help get better accuracy, faster time to train, things like that.” And so I’d also love to see folks doing cool things with Multilingual Spoken Words, or People’s Speech, and stuff that we’ve never even thought about.

So I think that’s some of the stuff ahead that really excites me. And we may also– I should mention, one of the motivators around People’s Speech - it’s not the size of a dataset that someone like Google or Amazon’s going to train on, but it’s definitely moving in the right direction. And so having that for our benchmarking at MLPerf is actually very nicely synergistic. So maybe we’ll start building some bigger benchmarks with that, especially because one of the things we wanted to focus on is making these datasets very legally easy to use, right? They’re all CC BY. Tell the world we’re awesome because we built the dataset and do whatever you want, right?

Well, David, thank you very much for coming back onto the show a second time. These are a little bit special for me because I get to sit here not only with you, but with Daniel also and the expertise that he has in language. I kind of feel like a kid in a candy shop every time I talk to you on this, and I learn a lot. So really, really, really cool work that you guys are doing after talking in the previous episode mostly about MLPerf, and now getting to learn more about datasets and everything. So thank you very much for your time and for coming on the show.

Looking forward to next year’s conversation.

Absolutely.

Absolutely. No, that would be great.

Yeah. Alright, we’ll see you, David.
