Practical AI - Episode #167
AI in Africa - Voice & language tools
with Kathleen Siminyu, Kiswahili ML Fellow at Mozilla
In the third of the "AI in Africa" spotlight episodes, we welcome Kathleen Siminyu, who is building Kiswahili voice tools at Mozilla. We had a great discussion with Kathleen about creating more diverse voice and language datasets, involving local language communities in NLP work, and expanding grassroots ML/AI efforts across Africa.
Transcript
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I'm joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
I'm doing very well today, Daniel. How are you doing?
Doing wonderful, because today is another episode in our series of spotlight podcast episodes, AI in Africa. This is a joint initiative between the International Development Research Center in Canada, Practical AI and The Changelog, and a GIZ FAIR Forward project, all of which are involved in one way or another with the Open For Good Alliance. As part of these episodes, it's been really wonderful to have with us a guest co-host, Joyce Nabende, from the Makerere AI Lab at Makerere University. Welcome, Joyce.
Thank you, Daniel. Hi, Chris. Thank you.
Hey there.
It's just another opportunity for us to discuss on this podcast, and I'm excited to be here.
Yes, it's always wonderful to have you with us, Joyce. Why don't you - I'll sort of pass it over to you to introduce what we'll be talking about today.
Right, thanks, Daniel. So over the past podcasts we've been trying to look at different studies on Practical AI, particularly on the African continent. And we are very excited for this episode, because we are going to focus on AI, but looking at community building in Africa, and especially looking at the very exciting field of natural language processing.
For this podcast today we are very happy to have Kathleen Siminyu with us. Kathleen is currently a Kiswahili machine learning fellow with Mozilla. But before Mozilla, there's a lot of work that Kathleen has been involved in. Kathleen, you're very welcome to this podcast.
[04:21] Thank you for having me. Thank you very much.
Alright, so I think we're ready to dive in. Over to you, Kathleen - give us an introduction to NLP for African languages: what work have you done in NLP, and what has it been like for the last couple of years in the space of NLP for African languages?
Okay, thanks again, Joyce. So I will start with a bit of an introduction to myself and how I got into NLP. My background is in maths and computer science, and with that I went into data science. I worked briefly in industry for a company in the telecommunications space that's headquartered in Nairobi, Kenya. One of our products was SMS. And through working at this company I came to realize that support for African languages when it came to NLP and just tools or digital tools (if we could say so) was pretty lacking in comparison to, say, your English. And this got me pretty interested in NLP, and I wanted to build particularly resources for African languages.
So yeah, this is where my interest began, and then I ventured into academic research at this point. My next job was with IDRC, or rather AI4D. I was a regional coordinator of AI4D for a couple of years, and this work involved mobilizing funding for communities. IDRC has funding which is going into AI research in Africa, and I helped a lot with that in terms of identifying communities that could make use of this funding.
So I'll back-track a bit and say that during my role in industry, I did a lot of community work. I began a lot of community building, first with the Nairobi Women in Machine Learning and Data Science. This was sort of my first foray in the field, as I was a data scientist that worked in the industry. And then along with transitioning into academic research, I encountered Masakhane. They were doing exactly what I wanted to do in terms of focusing on resources for African languages, so my organizing also transitioned into that. And in the middle, I attended several Deep Learning Indabas. The Indaba is a movement to strengthen African machine learning. One of the events we have is an annual summer school. I attended one, the first one in South Africa, in Stellenbosch. After which we got to talking and they realized that there was an ML/AI community in Nairobi, courtesy of the meetup I was organizing... And so we organized a second one together in preparation for the third one, which we then came to host in Nairobi in 2019.
So I've spoken a lot, but let me see - I'll cap it by saying that I have worked a lot with ML and AI communities in Africa, and that brought me to the intersection of my interests in NLP, and the existence of Masakhane, with our focus on African languages. So here we are - it's probably where I spend most of my organizing time at the moment. I am very excited by the fact that we have a wealth of diversity from the African continent in terms of people who are working on languages, and it's languages that they care about. So a wealth of diversity in terms of African languages included.
[08:05] I'll plug in the fact that as recently as last week we finally had Masakhane registered as an entity, so we are officially the Masakhane Research Foundation. Up till this point we've been really just organizing as a group of people, and that's been great. But yeah, I guess 2022 is going to be the beginning of us starting to see how we can formally organize beyond running on volunteer capacity, and what it looks like when we organize and can put funding to some of our planning. I will stop there... I kind of forgot what the question was, so I'll throw it back to you guys.
That's so wonderful. I really appreciate all that context, Kathleen. I wonder, for those out there listening to this podcast who have never heard of Masakhane and the things that they're doing - could you just kind of give a picture of who's involved in Masakhane, how they interact, and what sorts of activities they're doing?
Okay. So Masakhane was founded in 2019, and this actually took place at the Deep Learning Indaba that was hosted in Nairobi, Kenya. It was founded by Jade Abbott and Laura Martinus. Basically, they did a paper for machine translation involving a bunch of South African languages, and created benchmarks. And part of the work done in this paper, they came to present at the Deep Learning Indaba, with the idea that if they create a notebook which is very easily replicable, then people can take it upon themselves to train similar models for their languages, therefore likely creating first benchmarks... Because very little research exists for African languages.
That is how it began, with a focus on machine translation. The initial work that they did was based on the JW300 dataset. JW is Jehovah's Witnesses, and they are an organization that through their work evangelizing in Africa for many years have translated the Bible to many local languages in Africa. So the notebook leveraging this dataset made it super-easy for literally anyone to start up a notebook, put in the language code for the language you would like to work with, and in an hour or two have a benchmark trained. And it gained a lot of momentum in that way. I like to think that even though many of these languages have not been the point of focus in research in the past, or in terms of product development for markets that can pay for products - they haven't been a focus from that perspective, but this gained a lot of momentum because of the setting. Because the Deep Learning Indaba is bringing loads of young people from across the African continent, who given the fact that we've had two previous Indabas, have basic skills.
So some effort in capacity building had been done, and it was sort of a sweet spot in terms of many of us have attended previous Deep Learning Indabas, and we have the basic skills, and we now have a desire to specialize in something... And here comes Jade and Laura, and they have a notebook, and they're telling us "Hey, this is how you can start in machine translation."
So it gained a lot of traction... Personally, I've trained models for Kenyan languages, and I think that's how many people rationalized it, in terms of "What languages do I care about? What languages do the adjacent communities speak?" and then went on to do that. If you could find a dataset for your language, then you went ahead and trained a machine translation model.
We wrote a really great paper. Two, I think. One I'd like to highlight is a participatory research [unintelligible 00:11:54.28] in African languages. This one in particular because it describes the ecosystem for machine translation to be successful [unintelligible 00:12:08.10] in an African context.
[12:12] So one problem is access to data, and we talk about the fact that typically, content creators create data. But in an African context, these content creators may not have access to keyboards, or they may not have access to digital dictionaries, which hinders the development of data to start with. So there's a whole ecosystem description, and I'll leave it at that for now. But then we've seen great success, first in machine translation, with being able to do very multilingual and inclusive work. Then, progressing from there, we've realized that individuals who were participating and contributing to Masakhane were not only interested in machine translation, but in other NLP tasks as well. So at this point, we sort of took a step back and generalized and now have membership or participation from people building in NER, in speech, and basically just a wide scope of tasks.
I've got a quick question for you... I was interested in the dataset that you mentioned, the JW dataset being a Bible translation... When you look at a lot of different Bible translations, and kind of the language may not be completely what you typically would talk in today - did that present any kind of challenge, or was that generalized enough to where that didn't affect it, or anything? I was just curious about that being the basis, and what challenges that might present that were unique to the dataset.
That's a super-interesting question... We encountered a phenomenon which - I don't know if we called it, or it is known as, the biblification of systems. One example I remember is - actually, I don't remember what word, but in machine translation systems a certain word kept on being changed to "canon". Because "canon" shows up many times in the Bible, but you do not use "canon" often in conversation.
Right.
So it has been a challenge to rely only on Bible data, and to answer your question, it has not generalized well... Although, actually, interestingly, after a year or two of working with the JW300 dataset, the organization actually pulled it from the internet. So it was scraped from the website, and the organization claimed that they were not aware of this dataset existing, and had not given permission for it... I mean, we've been trying to ask them for many months to see if they would make it open anyway, because it has inspired a lot of our research, and we're still waiting to see how that turns out... But then - yeah, just to bring in the fact that access to data is a huge problem, and in many cases we face IP and copyright issues, even with data that is accessible.
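One rough way to look for the kind of "biblification" described above is to compare how often words appear in the training text against how often they appear in everyday text from the target domain. The sketch below is only an illustration under stated assumptions: the function name `overrepresented_words`, the add-one smoothing, and the two tiny English snippets are all made up for the example and are not taken from Masakhane's work or the JW300 data.

```python
from collections import Counter


def overrepresented_words(train_text, target_text, top_n=5):
    """Rank words by how much more frequent they are in train_text than in target_text.

    Hypothetical helper: uses whitespace tokenization and add-one smoothing,
    which is enough for a quick look but not a substitute for proper analysis.
    """
    train = Counter(train_text.lower().split())
    target = Counter(target_text.lower().split())
    vocab = set(train) | set(target)

    # Add-one smoothing so words missing from one side do not divide by zero.
    train_total = sum(train.values()) + len(vocab)
    target_total = sum(target.values()) + len(vocab)

    ratios = {
        w: ((train[w] + 1) / train_total) / ((target[w] + 1) / target_total)
        for w in vocab
    }
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(ratios, key=lambda w: (-ratios[w], w))[:top_n]


# Made-up toy snippets standing in for religious training text and everyday speech.
religious_like = "the canon of scripture the canon is holy the canon"
everyday_like = "the bus is late the market is open today"

print(overrepresented_words(religious_like, everyday_like, top_n=3))
```

Words that rank near the top of such a list are candidates for terms a translation model may over-produce when it is used outside the domain it was trained on.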
So at the moment a lot of activities unfortunately have to start from a point of dataset creation, which I wish was not the case, but we also have found that it's an easy way to start including and up-skilling individuals... Because someone can start out on our project today, labeling data or creating data or evaluating a model, but then progressively pick up a lot more valuable skills and grow those skills. It's a blessing and a curse, but mostly a curse, because as NLP researchers, I wish we had the luxury of just accessing data for whatever languages we wanted to work on.
So Kathleen, I'm actually in the Masakhane Slack group and I just took a quick look at the involvement. I see that - at least at the time I'm looking - there are 1,304 people in the Slack group... which is pretty amazing, that this community has grown in the ways that it has. I'm wondering - you know, looking back on how the community was formed, and the activities you did, and how you welcome people, and those things - do you have any insights into why you think it has grown the way that it has? And maybe looking forward, now that it's an entity and a foundation, what characteristics of the community would you like to ensure carry forward? Any thoughts?
Yeah... So in terms of it growing, I'd say it was unprecedented, even for me. I will attribute it, first of all, to Jade's never-ending energy, I swear. Masakhane would not be what it is without Jade. I sometimes even struggle with just keeping up with all the conversations that go on in chat. It can be very overwhelming.
I think one of the reasons why it has grown, again, is just the fact that these languages are very underrepresented in digital platforms. But that is not reflective of the communities that use these languages. They say there's 7,000 languages on Earth, and maybe 2,000 languages in Africa. And many of those are living, with thriving language communities that use them every day. They are just not used on digital platforms, and unfortunately, because of sociolinguistic factors, they're not used in formal spaces. At least in Kenya I know for a fact that parents are more likely to encourage their children to become better at English than their mother tongue... Because with English you can walk into any office and have a conversation and potentially get a job, versus your mother tongue, which is only useful at home or in whatever local context you find yourself.
[20:23] But that doesn't take away the fact that these languages are used, and at least from a sentimental point of wanting them to be preserved and captured on digital platforms, I think that's one huge reason why Masakhane has been so relatable to students and budding researchers across the continent.
I'll also say we have a lot of room for absolute beginners. Again, it's really easy for someone to walk in today - young people have loads of energy, and many will come and they'll say "Hey, I want to do something." And you know what - they can add to literally every single project, the fact that they're multilingual. So if we have an NER project and you walk in today and you say "Hey, I want to do something, and I speak Kiswahili in addition to English", then we can say "Hey look, there's this project where you can go and here's a Kiswahili text, and you can start labeling." And that's a way to start involving someone.
And they'll be interested in "Okay, what else can I do? What is this person doing? What is that person doing?" So I think just the fact that literally anybody who is multilingual can participate... And then we've been progressively working towards pathways for capacity building, so we have several groups that now run every 3-4 months. There's one that's particularly for beginner NLP... So any beginner who's involved with something in the community can also plug into that and start gaining more skills. Last week I was on the call and Julia offered to do one for machine translation.
And I'll say - let that be a segue into another factor, because we've gained quite some recognition globally, and the wider NLP community has also been super-supportive. So in terms of more experienced researchers wanting to know how they can be of use, and then proceeding to actually support us, right? So Julia has a Ph.D. She works at Google Translate and she shows up probably for 90% of Masakhane activities. If you ever want to debug - I remember my experiences of training those initial models for Kenyan languages, she was literally always available on Slack. So I'll also shout-out to the fact that we've received a lot of support, a lot of mentorship support, and that's continuing to happen, and I hope it continues to happen for an extremely long time.
So the second part of the question, now that we already have the entity and are growing somewhat exponentially, characteristics that I would like to see... First I think is a very distributed nature of leadership. Masakhane is like a storm; it's not a coordinated storm, it's just something that started happening and now it's got a life of its own. I love the fact that it's not reliant on any one person; so there's never a day where person X is sick and "Okay, now we can't hold the meeting because person X is supposed to chair it." That's never happened, because literally everyone is empowered to walk in and chair the meeting. We have a template, it's recorded, and everybody else can catch up.
We see that a lot also with the leadership in projects. So we've never sat down and said "Hey, we want to make NER a focus", but somebody in our community did. And they not only ideated, but they went ahead to organize regular meetings, and recruit people, and come and give updates at our weekly meeting... And that turned into a paper that we've had accepted to EMNLP.
So distributed leadership - I would like to see this continue. Something we are a little worried about is the fact that having an entity - well, it's a great thing, first of all, but having an entity means that we can receive funding. And we have been able to receive funding in the past through organizations that we have collaborated with... But having our own registered entity presents an opportunity to fund our research.
[24:27] So far we've mostly run on volunteer efforts, and the results have been great. So our one concern is that now when there's funding, a decision has to be made about what gets funded, what doesn't get funded, and we're worried about how that scales. Will it stifle the distributed leadership, especially on projects, or will it [unintelligible 00:24:48.05] forward? So I'd like to see us maintain that, despite funding - I'd hate us to turn into an organization where the characteristics change because now money is available.
Something else I'd like to see us grow into is just further supporting pathways for individuals who are interested in productizing their work. Again, I'll go back to the fact that we have many young people for whom this may be their first interaction with any AI topics, or their first interaction with NLP... And we pretty much have a great pathway for them to advanced academic careers, because they're taking part in projects, which means they're writing papers, some of these papers are getting accepted at leading conferences, they're being mentored by leading researchers in the field... Many have gone on to get accepted into masters and Ph.D. programs which are amazing, internship opportunities... Basically, there's a really great pathway for people who want advanced research careers, but I'd like to see us build a similar pathway that is just as strong for individuals who want to productize and build companies. So that's something I'm hoping we can grow into.
And then one of my absolute favorite things about Masakhane is that we have a lot of female leadership. As a woman in tech, this is something I look at. As someone who has organized communities for women, this is something that's close to my heart. And I absolutely love the fact that we have pretty good female representation. I attribute it to the fact that this is a movement that was started by women, and I think it's very powerful signaling.
I'll tell you, whenever I have been in a position to be accepted in a company that I wanted to join, I'll probably go to their website and see if there's women on the team. And if there's none, I start to ask myself, "Do I want to be the first one?" But if there is one, then that's one person I can write to and say "Hey, can we have a chat about what working for company X is like?"
So I think it's a very powerful signal that it was started by women, and I love it that there is a little sisterhood, or perhaps a not so little sisterhood that is part of Masakhane, and I would love for that to continue to be the case.
Joyce, I'm curious - as a researcher in a university setting working on problems related to what the Masakhane community is doing, how has this sort of groundswell of community building in Masakhane influenced your research team, and maybe the things you're able to do, and the way you're able to engage? And then maybe you have a follow-up question for Kathleen with regard to some of the things on your mind as you think about your research group and engaging in the community building.
Yeah, thanks. Just listening to Kathleen is really very enlightening...
Super-inspiring.
Yeah. And the background and the roots of Masakhane that I never knew about. So that's very interesting. And I think for us how it's benefitted the lab is - unlike Masakhane, that is wide and does many things because it's a large community, we set out in the NLP space building out in speech, and now in machine translation, like she mentioned, for the language that we care about, which is Luganda, the main language in Uganda.
[28:15] And Masakhane has really come in to support the researchers in the lab. As you said, if there are people who need help, maybe they're running their models and they are stuck... I've seen many of the people in my research lab go to Masakhane, put in a query, and get responses. And this has enabled them to move much faster with their model development... Unlike if it was just a closed community for us, or unlike if you went and wrote an email to one of the researchers that maybe developed a model, or did something... But here it's on Slack, it's within that community, and they get faster responses.
Or maybe you find a situation where someone has encountered the same issue, maybe with the models... It's easier for them to even respond. And I like that it's open. They can just put in a question there and then you get an answer. So that sort of community-building is something that's very critical, that the lab is also learning to adopt and leverage as well.
But also listening to Kathleen speak right from the beginning, I think Kathleen you said community, community building... Community is very important to you, and I know that with the new role that you have with Mozilla that there's also a lot of community building involved in there. So can you maybe tell us more about that, the current work that you're doing around community building with Mozilla? And Kiswahili in particular, your language that you're passionate about.
Yeah. Okay, so my current role is machine learning fellow at Mozilla, and I'm working particularly on Common Voice, and particularly in Kiswahili. So I'm supporting work to build a Kiswahili dataset on Common Voice... And this is starting literally from the collection of sentences. But I should step back and say that it's starting from community building, again, like you highlight. So we want Kiswahili speakers to care about this work; we are working to communicate to them that the existence of this dataset is something that is of benefit to them, because it's intended to be a digital public good, and we're working to build ties with organizations that are already working in Kiswahili.
So I realize that we have the tech capacity, but then we're walking into a space where people have been working for years to build language communities, and I'd like for us to be sensitive to that. So reaching out to the language boards, the universities that have linguistic literature departments, and such other communities, just to get input from them. And then this work begins, of course, with collecting text. Common Voice is a project for building speech recognition, and we started with text because we can't have audio without [unintelligible 00:30:58.08]
So building relationships with these organizations basically is also in our best interest, because then we can find avenues to get texts from them if they are existing, or build programs that can enable us to create text in the course of our work.
And then we've learned so far from linguists especially that there's a lot of diversity in Kiswahili speakers. This is something that I innately knew as a Kiswahili speaker, because listening to someone speak, there's a lot of nuance; I can potentially tell what part of the country they come from, if they're Kenyan, I can potentially tell if they're not Kenyan, or are Tanzanian, or are from the DRC. And then there's an additional level of nuance that is apparent among people for whom Kiswahili is their mother tongue. And this is a distinction that I'll say even I wasn't aware of, because many of us in Kenya and in Tanzania learn Kiswahili because it's a national language in the country, but then it's actually someone's mother tongue, and like many African languages, there are related dialects.
[32:06] So we are also learning a lot of the nuance, the fact that for people for whom Kiswahili is their mother tongue, if they listen to someone speak, they know "Okay, you're originally [unintelligible 00:32:15.00]" or "You're from this particular part of the Coast", and they sort of label all the rest of us as off-country Kiswahili speakers. Off-country is like away from the Coast, the East-African Coast, which is the home of Kiswahili.
So learning about all this nuance and recognizing that it would be great if we could capture this diversity in the dataset - that's also been an interesting journey. Working with the linguists... The way this work will - at least the engagement with the linguists, that's currently shaping up as subsets of the datasets that are representative of the various dialects and variants of the language. So there's also nuance depending on what other language, because... To try and explain further, in Nairobi for example - Nairobians rely heavily on English, and it becomes apparent in our Kiswahili speech... Because we speak Kiswahili and it sounds like it's a translation, as opposed to it sounding like speech that is naturally in Kiswahili... Versus someone from the DRC, now I imagine they would mix their Kiswahili with French, and that's another level of distinction. And to people who are native Kiswahili speakers, to them we are all just off-country Kiswahili speakers.
So there's layers, and it's super-fascinating, but again, at this point we are really thinking about it in terms of subsets of the entire dataset, which can be used to first fine-tune to various contexts, so that if you're building an application for the Coast, then we have datasets which make it easy for you to fine-tune to that context. Or if you're building for Nairobians, or if you're building for people in inland Tanzania, then there's datasets that you can use to fine-tune... But then beyond that, to evaluate performance. Because at the end of the day, I tend to think that Kiswahili could do to other African languages what Western languages have done to all African languages, which is to be the focus of research and funding and development, and at the expense of others. Now Kiswahili is being spoken in parts of Southern Africa, and I think that's amazing, that we are now pushing for potentially one language to be spoken across the African continent. But then I sort of worry that that may come at the expense of smaller languages, or other languages in general... Because now parents may start to think, "Hey, you should learn Kiswahili for upward career mobility." But then that may make the mother tongue come in second place, or be ultimately forgotten.
But going back again to the work with the linguists, we also want evaluation datasets, which can mean that we ensure that there's at least some minimum performance for all diverse speakers of Kiswahili. There's also a very strong gender thread in all our work. We realize that women tend to be - well, I'll start with the fact that speech recognition systems generally perform worse on women. And this can probably be attributed to an imbalance in whatever the original dataset is. Fewer women contribute to this dataset, and that could be for a multitude of factors. So we're being very intentional about creating spaces for women to contribute to the dataset, because the challenges that they face may be unique. But beyond that, we would like for them to also be part of developing [unintelligible 00:35:48.08] distribution.
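The idea of evaluation subsets that guarantee "at least some minimum performance" for every group of speakers can be sketched as a per-group word error rate (WER) check. This is a minimal sketch under stated assumptions, not the Common Voice tooling: the field names (`variant`, `gender`, `reference`, `hypothesis`), the helper names, the 0.1 threshold, and the four sample rows are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, kept one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples, key):
    """Aggregate WER for each value of `key`, e.g. 'variant' or 'gender'."""
    errors, words = defaultdict(int), defaultdict(int)
    for s in samples:
        e, n = word_errors(s["reference"], s["hypothesis"])
        errors[s[key]] += e
        words[s[key]] += n
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


def groups_above(wer, max_wer):
    """List the groups whose WER exceeds the chosen threshold."""
    return sorted(g for g, v in wer.items() if v > max_wer)


# Invented toy rows; real evaluation sets would be far larger.
samples = [
    {"variant": "coast", "gender": "female",
     "reference": "habari ya asubuhi", "hypothesis": "habari asubuhi"},
    {"variant": "nairobi", "gender": "male",
     "reference": "ninaenda sokoni leo", "hypothesis": "ninaenda sokoni leo"},
    {"variant": "nairobi", "gender": "female",
     "reference": "asante sana rafiki", "hypothesis": "asante sana rafiki"},
    {"variant": "coast", "gender": "male",
     "reference": "karibu nyumbani kwetu", "hypothesis": "karibu nyumbani kwetu"},
]

for key in ("variant", "gender"):
    wer = wer_by_group(samples, key)
    print(key, wer, "above threshold:", groups_above(wer, max_wer=0.1))
```

Reporting the worst group rather than only the overall average is the design choice that matters here: an average can look fine while one dialect or gender is served much worse.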
Well, Kathleen, I'm really excited to hear about the work that you're doing in terms of building these datasets, and the way that you're thinking about having things like a strong gender thread in all your work, and why that's important... I know myself, I've been challenged in this conversation to think about ways in which myself as a practitioner can be involved in creating more diverse datasets, getting involved in these types of communities.
[36:18] I was wondering if you could maybe close us out by talking about how from your perspective - like, if practitioners are listening to this podcast and they have a desire to maybe contribute to work related to this, either building language diversity into text or speech or other datasets, or maybe there's people in local language communities that are listening to this and wanting to get involved, wanting to build up datasets that could promote this set of technology with their language, what are ways that these two groups can get involved in this work? What recommendations would you make to them? How would you recommend that they connect with people doing this work and get involved, so that the community can grow?
Thank you for this question. So first to the researchers who could potentially contribute to this work - I'll highlight the model I've seen with Masakhane, because I think that works great. It could be through mentorship. I think there's definitely ways to find groups of junior researchers who belong to these language communities, or are under-represented groups that could benefit from working or collaborating with you. So that's one thing [unintelligible 00:37:32.20]
Second is in the event that researchers are working or are keen to work on low-resource languages, it's possible that they'll have funding to create these datasets, so I'd challenge them to - beyond looking for native speakers and having them create a dataset, which you then take away and go work on in your lab in isolation, I'd challenge them to use that as an opportunity, again, to mentor. So it may be that they're not only looking for a native language speaker, but they're looking for a native language speaker who is also a junior researcher interested in this field. And so they can start contributing by creating the dataset, but then you can also create avenues for them to contribute to what comes next in terms of the analysis and the actual work that goes on after the dataset has been created. I think that's another model that would be great, and give much more value in addition.
Then turning back to language communities, I'm going to again at this point highlight Common Voice, because I think it's pretty amazing that Common Voice as a platform means that you as a language contributor or a language community don't have to start thinking about this work from a standpoint of "Okay, what tools do I need to collect the data? Where am I gonna store the data? What's the infrastructure going to be like? How do I access the data?" and all those dynamics. I think it's as simple as if you have access to the internet and can access the Common Voice portal, then you can start creating, in this case a speech recognition dataset. It's as simple as that.
And if you look a little, I'm willing to posit that there's opportunities for you to identify other such free resources or platforms where you can start creating datasets. Beyond that, I like to think of language communities, particularly language communities that are not already African. So look at Kiswahili, for example - it's a huge language, there's loads of speakers, there's loads of interest from people off of the continent in having Kiswahili resources exist. But I personally see it as an entry point to other local languages that are spoken in the places that Kiswahili is spoken. So if a speech dataset exists for Kiswahili, it's possible to then take that text and translate it into another local language that you speak if you're a Kiswahili speaker. And in that way, it could be a machine translation dataset that comes out of it. It could go on to become a speech recognition dataset.
[40:06] But then I'd also challenge these communities to then think about licensing. And licensing is a pretty interesting topic in the field of NLP, because on one end we have the very well-resourced researchers who don't ask for permission and just scrape everything off of the web, and then there's the language communities and the junior researchers who work very hard to create these datasets and then are not necessarily the ones who get fast publication or most interesting publication because of constraints like skills or resources.
So I would challenge them to think about licensing that can [unintelligible 00:40:39.27] their needs. And it may be that you say it's useful or it can be used for academic purposes, but non-commercial ones. Or it may be that you say you don't want the dataset used in any manner until individuals from your community can be the ones building the solutions. Or it may be that you say "Hey, this is the language that we speak, and there's 100,000 of us, and nobody cared about this language before we started building the dataset, so we actually reserve the right to control all the solutions that are built from this dataset." Because at the end of the day, you are the ones that will be directly affected by whatever those solutions are.
So basically, I think as we bring more languages online, we should empower those communities to start thinking about how they can center their needs. Because they don't need to just create resources and make them available for the most resourced or the most skilled to swoop in and build tools which they then package and resell to the language communities that worked hard to do it. And bringing these two together - I don't know. Maybe these things I've highlighted already in my response, but I'd like to leave it there I think.
I definitely think that those are great jumping in points for our listeners who are interested. We will definitely include links to the Masakhane community and the Common Voice platform in our show notes. So please take a look at those, and get involved, and think about how you can start thinking about contributing in various ways, or mentoring, or whatever it might be.
Thank you so much for joining us, Kathleen. It's been a real pleasure to talk with you. I really appreciate you bringing your perspective, and the hard work that you're putting in on these problems. Thank you so much.
Thank you for having me, Daniel, and everyone else of the Practical AI podcast.
Our transcripts are open source on GitHub. Improvements are welcome.