Hugging Face is increasingly becoming the “hub” of AI innovation. In this episode, Merve Noyan joins us to dive into this hub in more detail. We discuss automation around model cards, reproducibility, and the new community features. If you want to engage with the wider AI community, this is the show for you!
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How’re you doing, Chris?
I’m doing okay, Daniel. It’s so good to be talking to you today.
It is, it’s wonderful. It’s a little bit of an odd day… I’m joining from a hotel in Dublin, because I’m attending ACL, which for those that don’t know, is one of the big NLP conferences in the research world, and that’s been fun. It’s been tiring, because I just forgot how tiring an in-person conference was.
Is it everyone else dragging around, too? Are they all just kind of slumped over…?
I think so. There’s not consistent coffee in all the places, which is unfortunate, so that’s tough… But one thing that I have heard mentioned quite a bit at ACL consistently is Hugging Face, and we’re really privileged today to have Merve from Hugging Face with us. She’s a developer advocate engineer at Hugging Face, really creating a lot of great content on the web, and great tutorials, and also making really significant contributions on the open source side. I’m really excited to have you with us, Merve.
I’m so happy that you have invited me. Thank you so much.
Yeah, great to have you here. I’m wondering if you could give us a little bit of the backstory of how you got connected with Hugging Face and NLP, maybe more generally. What was the state of Hugging Face when you joined? Because we’ll be talking a lot about how it’s progressing in this episode, but I’m wondering what its state was when you joined and what got you excited about it at that time.
Great question. I have met NLP at my senior year at university, actually. Literally, my first project was text mining and doing a classification using Naive Bayes in R about climate change sentiments. If people believe it or not, but I have scraped some tweets and classified them in a data science class that I have taken. And I was like, “I’m going to make this my job.” I later joined two bootcamps, did masters, and started working somewhere as a machine learning engineer. I did everything I could. I worked three years as a machine learning engineer, doing mostly natural language processing and also a bit of analytics on the side. I was building chatbots, I was doing information retrieval tools, and so forth, and I was using Hugging Face back then.
[04:07] Prior to that, prior to building information retrieval tools, and also like putting Hugging Face into my chatbot with the BERT embeddings and such, how I met Hugging Face was that I have watched Thomas Wolf’s video on the future of NLP. He’s explaining it so well, and I have become a fan. Once I have seen him posting about this community sprint on datasets, I was like, “Can I join as well?” and then I started contributing to Hugging Face. Then I later tried my chances with the audio sprint as well. I met people over there and I have learned a lot of things. They have taught me about CI/CD styling, formatting, and contributing to open source, actually. It was my first time contributing to open source and I was so excited when my first PR got merged. I kept helping other people out over there and I was like, “I’m going to be that person that is going to help people out in the sprints.” [laughs]
That’s so amazing. I love that story because it is very intimidating for a lot of people to figure out that process of committing in open source. Of course, some communities are more welcoming than others. I know Hugging Face is very welcoming, and there’s a lot of great discussion that happens, and all of that. This is a little bit off topic, but I think it’s on-topic, at least as far as the open source side of Hugging Face… What sort of advice would you give people that are looking maybe to start contributing to the open source side of data science, or machine learning, or that side of the world, the open source community? …maybe not even just with Hugging Face, but there’s so many great tools out there, whether it’s spaCy or TensorFlow itself, or all sorts of things.
Basically, libraries like scikit-learn or Hugging Face transformers occasionally have sprints in which the contributors are talking to you, they are giving you issues, or like for instance if we are going to train models, there is a list of languages that models need to be trained on, and there is a dataset, so all you have to do is to actually train the model and improve it.
There are a couple of sprints, also same with scikit-learn, as far as I know. You can get help from contributors actively, rather than working async on GitHub. I would suggest being aware of those sprints and community events. Like in Hugging Face, we have a lot of them. For instance, recently we had a sprint about renewing the docs and adding type hints and other stuff for the TensorFlow side of transformers, which was a good first contribution, in my opinion. I think sprints are a good way to begin. Other than that, it’s just good first issues on the repository.
You mentioned that you were kind of building chatbots before joining Hugging Face. Could you tell us a little bit more about that, and maybe how that shaped what you perceive as what was needed in NLP tooling? It sounds like you’ve found a lot of what you thought was needed in terms of Hugging Face and transformers, but how did that process of trying to build a chatbot – what did that teach you or help you learn, or maybe introduced you in terms of challenges for people wanting to do that sort of thing?
It heavily depends on what you’re building, actually. In my first job, I was building an automation bot that was talking to you and automatically creating appointments for you, or cancel your appointment in the background for service companies.
[08:14] Over there, I was mostly doing machine learning parts. It is usually about how you solve the text classification problem and improve your data. I was using [unintelligible 00:08:25.21] open source. If you are having a narrow domain chatbot like, I don’t know, a pizza ordering chatbot or whatever, it’s very easy to solve the problem, because mostly you are solving a text classification problem at the end, trying to understand the end user. And to iterate over your model and everything, it’s easy.
For my second job, it was really hard, because I come from an applied math operations research type of background and I do not have any developer background. And I had to do the API side and learn Flask and everything. I was doing both the backend and the chatbot itself, and I was also building this tool that helps the – we had some researchers. The chatbot - it felt like a friendly chatbot replica, but you could ask questions to it about your life standards, or you can ask questions like, “Hey, I cannot sleep. What can I do about it?” We had a researcher team that was looking into these answers in the papers and looking for statistical evidence that a certain thing is good for your health. But the chatbot was rather hard to make, because basically, conversational agents are divided into two. You have stuff like BlenderBot, DialoGPT, generative models with which you can basically talk about anything, like T0, whatever.
The second part is chatbots that are based on intent and action, which is you have to write your own training data, it’s not anything like Zero-Shot or whatever. You have to define all of your actions to every single intent. It’s hard to make a generalization over this and still be in control of what your bot is going to say, because we know that these language models are a bit biased. They tend to be sexist, racist, rude sometimes, so it’s a hard problem to solve. So that’s why I quit on doing that, because I kind of gave up.
I would rather work in a chatbot that had narrower domain, because it’s not solvable basically without language models. With language models, I would rather not put or run a language model in front of an end user freely, with no filtering or whatever.
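As a rough illustration of the intent-and-action style of chatbot described above, here is a minimal sketch in plain Python. It is not the system Merve built: the intents, keywords, and replies are hypothetical, and keyword overlap stands in for the trained text classifier a real bot would use. The point is only that every intent has to be mapped by hand to an action, which is what keeps the bot's replies under control.

```python
import re
from typing import Optional

# Hypothetical keywords per intent. A real bot would train a text classifier
# on hand-written example utterances instead of matching keywords.
INTENT_KEYWORDS = {
    "book_appointment": {"book", "schedule", "make"},
    "cancel_appointment": {"cancel", "drop"},
}


def book_appointment() -> str:
    return "Sure, which day works for you?"


def cancel_appointment() -> str:
    return "Okay, your appointment has been cancelled."


def fallback() -> str:
    return "Sorry, I did not understand that."


# Every intent is wired by hand to exactly one action.
ACTIONS = {
    "book_appointment": book_appointment,
    "cancel_appointment": cancel_appointment,
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return the intent whose keywords overlap most with the utterance."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = {name: len(tokens & words) for name, words in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def reply(utterance: str) -> str:
    """Run the action for the detected intent, or the fallback if none matched."""
    action = ACTIONS.get(classify_intent(utterance), fallback)
    return action()


print(reply("I want to cancel my appointment"))
```

Because the bot can only say what one of the predefined actions returns, it cannot produce the unfiltered output a generative model might, at the cost of having to enumerate every intent up front.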
I also helped them out… There was this research team and I have built a tool that would answer their questions from the research papers. For that, I have used sentence transformers, which I was using through Hugging Face. I cannot forget how I used pipeline for the first time, and I was incredibly amazed… I was like, it’s just one line of code and I can just get answers to my question? I just passed my model – I have fine-tuned the model based on some… Basically, there was this biomedical BERT, and I have fine-tuned it on some tasks for information retrieval, and I called pipeline and I was like, “Is this for real? Like, does this actually work?” I was pretty amazed. And then later, I looked into it and I was like, “They made an abstraction over all of the pre-processing, inference, and post-processing and put it in a box.” I’m like, “How smart is this? It’s like an engineering marvel. It’s just amazing.”
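To make the "pre-processing, inference, and post-processing in a box" idea concrete, here is a toy sketch in plain Python. It does not use the transformers library or a fine-tuned model: the class name is hypothetical, and word overlap stands in for real model inference. It only shows how three stages can hide behind a single call.

```python
import re


def _words(text: str) -> set:
    """Lower-case a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


class ToyQuestionAnswering:
    """Hypothetical stand-in for a question-answering pipeline."""

    def preprocess(self, question: str, context: str) -> dict:
        # Turn raw text into "model inputs": question words and candidate sentences.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
        return {"question": _words(question), "sentences": sentences}

    def forward(self, inputs: dict) -> list:
        # "Inference": score each sentence by its word overlap with the question.
        return [(len(inputs["question"] & _words(s)), s) for s in inputs["sentences"]]

    def postprocess(self, scored: list) -> dict:
        # Turn raw scores into a user-facing answer.
        if not scored:
            return {"answer": "", "score": 0}
        score, answer = max(scored, key=lambda pair: pair[0])
        return {"answer": answer, "score": score}

    def __call__(self, question: str, context: str) -> dict:
        # The caller sees one call; the three stages stay inside the box.
        return self.postprocess(self.forward(self.preprocess(question, context)))


qa = ToyQuestionAnswering()
# Made-up context, used only to exercise the sketch.
context = "Caffeine late in the day can delay sleep. Regular exercise is linked to better mood."
result = qa("What can delay sleep?", context)
print(result["answer"])
```

With the real library, the single call would wrap tokenization, a forward pass through the fine-tuned model, and decoding of the answer span, but the shape of the abstraction is the same.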
[12:04] I had a totally different follow-up question a minute ago, but I’m actually wanting to ask you about this. I think a lot of the folks that we talked to on the show have come from developer backgrounds, and they kind of already have that, and they’re moving into other skills. You’ve come in the reverse way from that, and you had this moment there… What was the hardest thing, as you were transitioning into this skill set, and as you’re talking about this history? That was kind of an “A-ha” moment that you had… What was the hardest thing to move into being able to be productive?
Basically, in my previous job, I was just shipping stuff. The quality of my code wasn’t nitpicked, and everything; my PRs weren’t passing in a long time. In here, because I am working with very big teams and very big codebases, I can see how I can refactor things or how I can improve things. I’m mostly learning development, in a way, and also how UX matters. I feel like it’s a billion-dollar question, how you handle your UX and how you develop tools… Because most of my time at Hugging Face is actually passing with either developing a tool for people on the Keras side… I’ve recently started working on scikit-learn as well.
On the other side, I’m just building fancy demos to showcase people what transformers can do, or other libraries can do for machine learning in general. I have realized later on that UX actually is hard when you do not come from that background. Also, how you can improve your code - it’s just endless. There will always be someone nitpicking your code, and it’s just the most beautiful thing, because you keep learning from that.
So I am just grateful to work here. I feel like I did improve myself from the start. But at first, it was hard, because previously I was only optimizing my models, and nobody questioned my code that much.
Merve, you’ve already mentioned a number of things that I’d love to dig into a little bit deeper, because there’s all sorts of pieces of the puzzle that fit into what Hugging Face is, and the ecosystem. I was wondering if you could help us just set the stage for this discussion. You have Hugging Face, you have model and dataset stuff, you have transformers, somehow Keras, and you even mentioned scikit-learn… Could you just give us an overview of how you would see the Hugging Face ecosystem and how the various pieces fit together?
Yeah, sure. What Hugging Face is working on - if you were to ask to a random person in the company, I feel like the answers would differ. But I feel like the most important thing on Hugging Face is actually Hugging Face Hub. The reason why is because – basically at Hugging Face we are trying to solve open source machine learning in general, and this involves a couple of problems. One of them is reproducibility of your experiments, and also how easy to infer your models are such that people can just go and stress test your models and see if it works for your use case.
[16:06] Another thing is – you know, the essence of the open source ML in general is can your model actually be used by someone else for their own use case? Which is not likely for the most of the tasks, like tabular data-related stuff, but it applies for computer vision tasks, audio classification tasks, at least within language. Or it applies for NLP, because your features are usually universal. If not, it’s language-specific. But at least for computer vision, you can just go ahead and just pick object detection or like image segmentation model and use it in your use case.
With the hub, we are actually trying to do this, and we want to get people to declare the limitations of their model, declare the biases in their model, so that we can have good open source models on the hub. We don’t only have transformers on the hub, we have various libraries. For instance, we host the Stanza models from Stanford NLP. We have Keras models, we are integrating various libraries in NLP - Keras, PyTorch models, you name it.
For instance, with Spaces, what we want to do is we want to get people to see if a thing is possible. For instance, I can just demonstrate a very small thing, like a product, and do a PoC to my colleagues. There are a couple of use cases and things you can use Hugging Face hub for your end-to-end workflows. But my favorite thing inside this - I think Spaces right now, because… I’m a masters student. The most painful thing for me, and I know for the TAs and professors, is to actually reproduce my project, like setting up the environments, running it, and you have to specify how to do that. Instead, I’m just sending them a Spaces link of my project. For instance, this year, I have the Fourier Transform space with Streamlit, and I’ve just sent it to them.
It’s also good for… Like in my previous job, I was a machine learning engineer, and I have built – in my first job, I was looking for ways to just put my model out there, and I had zero idea how to use [unintelligible 00:18:34.23] or whatever. It was so hard for me. Like, why would I be spending my time, especially if you’re in a startup, like, you do everything - why would I spend my time just to… I’m not even putting this into production. I just want to showcase this to a client, or the end-user. Why would I spend most of my time just trying to put this over there through – I don’t know, just build a demo, and that doesn’t even look good with… I don’t know, Flask can just channel it with [unintelligible 00:19:13.25] or whatever.
I think when I was on board, being on board, Spaces was in beta, and not so many people had access to it. When I discovered Spaces and Streamlit and Gradio, I was like, “This really touches many pain points on that side,” especially if you’re a data scientist. Most of the data scientists are actually statisticians, or math folks who do not have development background, but are working in startups. So it’s actually very smart to just write five lines of code and just drag and drop your app.py file into Spaces and voila, you can just show it to your clients or your end user, or your teacher, or your family, or your favorite pet.
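For a sense of what those few lines of code might look like, here is a minimal sketch of an `app.py` using Gradio. The `greet` function is a hypothetical placeholder for whatever model call you want to demo, and `build_demo` is just a helper name chosen for this example. Gradio is imported inside the helper only so the plain function can be exercised without the library installed.

```python
def greet(name: str) -> str:
    """Placeholder for the function the demo wraps; a model call would go here."""
    return f"Hello, {name}!"


def build_demo():
    """Build a text-in, text-out web demo around greet."""
    # Imported lazily so greet() can be tested without Gradio installed.
    import gradio as gr

    return gr.Interface(fn=greet, inputs="text", outputs="text")


# In a Space's app.py, the file would end by starting the app:
# build_demo().launch()
```

Uploading a file like this to a Space whose SDK is set to Gradio should be enough for the demo to be served, assuming the Space's requirements cover any extra libraries the wrapped function needs.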
[20:05] I would question that as you’ve been taking us through this, it almost starts with the fact that as a – you know, I kind of haven’t gotten to the expertise that you’ve gotten to, but you’ve taken us through this development as you’ve taken this journey of learning… And the ecosystem around Hugging Face has grown tremendously over that time, and the tools are getting amazing. As you are communicating this to people who are getting into it, you’ve got a big challenge just communicating the ecosystem and all the things that are available. But how do you also – you clearly, from what you were just talking about, remember that beginner’s mind… So as you’re bringing new people into the community and teaching them how to be effective and productive in what they’re doing within the ecosystem, how have you managed to stay grounded in that way so that you can accomplish both? You can sync with them at that beginner level and yet you can get them up to that point where they can run themselves.
It’s really hard actually, because there is so much good stuff in the ecosystem. It’s just understanding the user journey and what they’re going through and trying to touch where you can fix their problems in their journey. Hugging Face recently started to invest in tabular data as well. Because I was previously a data scientist, I know what an average data scientist does.
I think a couple of things you can do is, for instance, use the datasets library to host your datasets; on most platforms, you cannot host datasets that are more than 100GB, by the way. And Hugging Face datasets allows you to do that, and you can even stream your dataset. Like, take your dataset, just do an exploratory data analysis. If you want to do a presentation and if you don’t want to show people a notebook, you can do that through Streamlit or Gradio, with graphs about your data, or a profile.
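The streaming idea can be sketched without any particular library: exploratory statistics are computed over an iterator of rows, so the full dataset never has to fit in memory. The helper name, the column, and the fake rows below are hypothetical. With the datasets library, the assumption is that the rows would come from `load_dataset(..., streaming=True)`, which yields one dictionary per example.

```python
from itertools import islice
from typing import Dict, Iterable


def summarize_column(rows: Iterable[dict], column: str, limit: int = 1000) -> Dict[str, float]:
    """Compute count, mean, min, and max of a numeric column over a row stream."""
    count, total = 0, 0.0
    minimum, maximum = float("inf"), float("-inf")
    # islice stops after `limit` rows, so only a prefix of the stream is read.
    for row in islice(rows, limit):
        value = float(row[column])
        count += 1
        total += value
        minimum = min(minimum, value)
        maximum = max(maximum, value)
    if count == 0:
        raise ValueError("no rows to summarize")
    return {"count": count, "mean": total / count, "min": minimum, "max": maximum}


def fake_stream():
    """Stand-in for a streamed dataset: yields one row (a dict) at a time."""
    for age in (22, 35, 41, 58):
        yield {"age": age}


stats = summarize_column(fake_stream(), "age")
print(stats)
```

The same function would work on any iterable of dictionaries, which is why it makes a reasonable first pass before building charts for a Streamlit or Gradio presentation.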
After that, you can just train your model and push it to the hub and build a space for it, so that you can show what your model is capable of. You can just put your baseline and let people test it so that it works. I feel like the answer changes a lot according to what you are working on, and what side you are on on the equation. Are you a machine learning engineer? Are you a data scientist? Also, it changes according to the person you are asking to. I really like to ask people about their journey and see what type of problem we can solve with that. For instance, for end-to-end things, it’s more like that.
On the other hand, if you’re an NLP person, you can again take a dataset, train a model with transformers… For the previous use case, you cannot do much with transformers, because it’s not used much in the tabular data. But we have a couple of integrations for the various libraries.
And I can say, for instance, the types of problems we are solving - for instance, we want you to reproduce your experiments, and we want other people to know that models have limitations, and everything. For instance, currently, what I’m working on on the hub is, for scikit-learn at least, I want to enable collaboration for scikit-learn. I am currently designing automated model cards for scikit-learn in which it automatically produces a model card that has your model’s attributes, and also the dataset’s attributes. I have done the same for the Keras. For instance, I have put inside metrics from model history… I have put model’s architecture using graphviz.
[24:02] You can also have TensorBoard logs and you write one line of code to just push your model to the hub and let it post your TensorBoard logs and your model card over there. Sometimes, if your model is working out of the box, then there is an inference widget as well. Same way, with one line of code you can just load your model.
Yeah, we want to tackle reproducibility this way, so that people know that this model has this metric, it has these hyper parameters, this training. We want to version them this way.
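As an illustration of what an automatically generated model card could contain, here is a minimal sketch that renders hyperparameters, metrics, and declared limitations into Markdown with a small YAML metadata header. This is not the tooling Merve is building: the function name, the layout, and the example values are all hypothetical. The only assumption about the Hub is that a model card is a README file whose metadata sits in a YAML block at the top.

```python
def render_model_card(model_name, hyperparameters, metrics, limitations, tags=()):
    """Render a Markdown model card from plain Python values."""
    # YAML metadata block at the top of the card.
    lines = ["---"]
    if tags:
        lines.append("tags:")
        lines.extend(f"- {tag}" for tag in tags)
    lines.append("---")

    lines += ["", f"# Model card for {model_name}", ""]

    # One table per kind of information, so readers can see how to reproduce the run.
    for title, header, values in (
        ("Hyperparameters", "Hyperparameter", hyperparameters),
        ("Metrics", "Metric", metrics),
    ):
        lines += [f"## {title}", "", f"| {header} | Value |", "|---|---|"]
        lines.extend(f"| {key} | {value} |" for key, value in values.items())
        lines.append("")

    lines += ["## Limitations", "", limitations, ""]
    return "\n".join(lines)


# Made-up values, used only to exercise the sketch.
card = render_model_card(
    model_name="churn-classifier",
    hyperparameters={"n_estimators": 100, "max_depth": 5},
    metrics={"accuracy": 0.91},
    limitations="Trained on a small sample; not evaluated on other regions.",
    tags=["sklearn", "tabular-classification"],
)
print(card)
```

For a scikit-learn estimator, the hyperparameter dictionary could come from the estimator's `get_params()` method, which is what makes this kind of card possible to fill in without manual effort.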
For NLP, again, you can just train a model, push it to the Hugging Face hub and the inference widget opens, or you can build a demo with, again, very few lines of code. Because Gradio for instance - Gradio has the same philosophy as transformers, I would say. It leverages Hugging Face’s pipelines to load an interface. When you call interface.load on a model, it automatically knows what type of inputs that model takes, what type of output that model gives… It will just create the interface for you in one line of code. I’m always amazed by the abstractions done to save your time as a developer. I think every user has a different story. But I would first get to know the person and tell them, “Hey, you can do it like this. You can utilize Hugging Face hub like this,” because otherwise it’s incredibly distributed. There are so many things in the ecosystem.
I have also done a project called Hugging Face Tasks. I am still maintaining it. I have come up with this when I was on board. Basically, I have worked with a lot of software developers who wanted to build machine learning products. I know that these people, they just need to know basic Python; if they want to do a POC to data scientists to actually express what they want… Because for the POC, they do not need to know much about machine learning. All they have to do is just to go to hub, filter the models, find the model according to their use case, and just call pipeline or inference API on it.
Most of the people do not know that, and they also do not know what the tasks are capable of, like what you can do with an object detection model, or what you can achieve with a named entity recognition model. So I wanted to show them, “Hey, if you want to build X products, then you can just filter for these models and just call pipeline on that model and use it, and check the model’s metrics. If this metric is on this level, then this means that model is good. This model takes X as an input, and outputs Y as an output. That’s why you can use it for information retrieval.”
It’s also a bit complicated from the machine learning side; so many fancy things going on… But you actually do not need all of that. You just need to know which task is suited for you, and you just need to call it. I have developed this with this beautiful team of developers, and we have released that; I think it was around January or February. I just want to go and tell every single software developer, “Hey, you just need to know this, and you do not need to learn machine learning from scratch.”
So Merve, I can definitely hear just the great respect that you have for your team, and also this collaborative environment that you’re obviously a part of. I know that even just today - the day that we’re recording this anyway - Hugging Face announced more collaborative features and community features on the hub. I was wondering, from your perspective and how you’ve grown to work internally with the Hugging Face team on different models, on different tools, and that sort of thing, what are you excited about in terms of the collaborative features of Hugging Face and what this might enable for the future of the hub?
Good question. We have announced pull requests in the community feature today, in which you can open a pull request to someone else’s repository, and this repository can be a model repository, which contains the model and the model files, like a configuration, or a tokenizer if it’s an NLP model. It has a model card that improves reproducibility and open source machine learning. You have dataset repositories in which you have dataset cards and the datasets themselves, or it can be a space repository in which it has the application file, or if you do not host your model on Hugging Face, it might have your model…
This way, people can improve each other’s work, like we do in GitHub. In here, we do not actually want to duplicate the work of GitHub. But given Hugging Face is mostly focused on models and the infrastructure as well - we use Git Large File Storage (LFS) to host models and datasets that are very big.
Previously, we were versioning datasets and versioning models, datasets, spaces… Why not do pull requests on them? This might mean - for instance, I have a big TensorFlow model and people want to use it in PyTorch. For instance, someone has a PyTorch model, but I want to use it in the TensorFlow ecosystem, because TensorFlow has nice production level tools in the TensorFlow extended. So I can just port it, but if I will also want to contribute those [unintelligible 00:30:51.11] to the repository, then I can just do that. I can just open a Pull Request in order to contribute those TensorFlow [unintelligible 00:31:00.13] to that repository.
Or if someone has a space that is broken or needs to be improved - I don’t know, by means of anything, like it needs a description or something like that, or like limitations that I have found in that space… If it has a bias that I have stress tested and needs to be declared, then I can just open a pull request or just discuss that “Hey, I have found this bias in your model. Either let’s declare this, or try to improve the model.”
[31:35] Or if I have a dataset, then I can just tweak stuff in the data set itself and just contribute that, and also have discussions regarding the models. Because for instance, if someone has a specific model that is using a different [unintelligible 00:31:52.10] is problematic, I just want to go and tell them, “Hey, you can improve your model like this” or “You can improve your space like this. If you were to cache this function, then your space would be faster.” I just want to go and tell them that, but I wasn’t able to do it because there was no way… Except for like there’s a Twitter handle on people’s profiles, which - I think if I were to just go and tell them, it would be a bit creepy. [laughs] So it’s nice that now we have a discussion section which I can just tell people, “Hey, if you were to do this, that would be faster”, or I can just open a pull request and let them see how their space is improved, because then they can just clone and just pull and just serve it on their local, or just make another space and just see before merging how my work looks like on their space or model.
Yeah, I’m curious… I think that’ll be a really big change, because you’ve referenced GitHub, and if you think about what GitHub did for the open source world by coming into being, Git was already there. The social aspect and that collaborative aspect, it fundamentally changed the community worldwide. It was not the same thereafter.
How do you envision the social changes, or what do you aspire to or hope for are the changes based on this? Do you think it will propel the community in that same kind of massive shift that we saw in the broader open source world?
Good analogy, Chris.
I just love to follow up with people and see what they are starting, see interesting projects over there. I feel like at some point, we might evolve to that as well. Someone else starts a space and that’s a really interesting one, so let me just go and look. Even more, maybe like messaging, or whatever.
I am not in control of this, I just know that we are also trying to somehow increase the collaboration, and that’s what GitHub actually achieved. There are awesome libraries out there where people contribute to, but for machine learning side, it’s not the optimal thing to use. For instance, recruiters or technical interviewers - they wouldn’t go to all of my GitHub machine learning projects; and even if they did, they won’t understand anything. But for instance, I have spaces in which someone could just go ahead and see that, “Hey, this person is actually doing in this space what I’m looking for. How did they achieve that? Maybe I should hire them.”
Or just hosting model weights, like very heavy model weights and just cloning them is a pain. Why would I just want to clone everything in a repository that is a model – I would love to see if that model works first, through a widget or space, and just do that. So for that side, it’s more optimal, and we are looking for ways to increase the collaboration and give people a better UX collaboration with features like this. I’m also excited to see what’s next. I feel like what’s next is not [unintelligible 00:35:41.23] and such. I’m quite excited for that. So let’s see what happens. I am not fully in full control of the hub roadmap, but it’s mostly about the collaboration.
[35:56] Yesterday, there was another feature launched, and it was… So in model cards, you have a metadata section in which you define languages, and everything. You can also define the models that are in a specific paper. It redirects you to the paper itself, where you can see which model is actually there.
Again, we are also investing in evaluation, sort of like Papers with Code leaderboards, in which you can see which model is state of the art in the task, in the Hugging Face hub, so you can directly use that model. It’s more about evaluating the model and doing a leaderboard-style thing. What we’re trying to do is mostly about, again, open source machine learning, rather than a social media network… But it might evolve. [laughs]
Yeah, I have a digression that includes an ACL story. I forgot who I saw on Twitter yesterday, I forget who from Hugging Face said, “We’re going to announce something tomorrow”, and I said, “Great, I’ll check at some point–”
I think it’s Julien.
Yeah, Julien. So I was like, “Okay, I’ll check”, but I was in talks most of the day. The last talk I went to today was called Quality at a Glance. It was from the Masakhane group, which works in African languages, [unintelligible 00:37:24.01] and others. And they analyzed a whole bunch of open datasets, crawled datasets that are on Hugging Face as well, but they’re sort of used widely… And they looked at the quality of those datasets, and found very interesting and disturbing things. I think in common crawl aligned, there’s certain languages… All of the data in that language is not in that language; like, 0% of the data labeled in that language is in that language. One of those I think was the Romanized Arabic; it includes 0% Romanized Arabic.
I was thinking about this as I was leaving ACL, and then I was like, “Oh, yeah, I’ve got to check Twitter to see what Hugging Face’s thing is.” Then I looked and I was like, “Make PRs on datasets.” I was like, “Oh, I need to circle back around and go right back in there and talk to them about how we can get some PRs on [unintelligible 00:38:22.19] other datasets.” It was just perfect timing.
Yeah, someone else can just open a discussion about how that language doesn’t [unintelligible 00:38:32.24] We have a really great ethics team. You probably know them… We have Meg, we have Sasha, we have Yassi… They’re just doing amazing work. We recently have [unintelligible 00:38:48.11] And every time something happens, like we see an inappropriate use case around humans for instance, like - the use cases around the personal identifiable information is actually sometimes problematic. We do stress test the spaces and models to reach out to the people, “Hey, your model might have a bias. Would you declare it?” So we do actually care about the limitations around models and also ethical restrictions regarding the biases, personal information and everything, as much as we can.
It’s a hard problem to solve because it’s all philosophical. In the end, ethics is a bit philosophical, but how we can actually put that in practice is a big question. In the case of the Hugging Face hub, we do care about this in the models that we train, in BigScience, or the models that we have on the hub. If it has a bias, we declare it. We care about the data and everything.
[40:07] Yeah, it’s so important. I really appreciate Hugging Face really taking a stance there and putting a lot of effort into that. As we close out here pretty soon, you’ve mentioned a bunch of things that you either are working on or have worked on as part of the open source ecosystem around Hugging Face. I’m wondering, what’s that thing that keeps you up at night, or the thing that’s on your mind, that you’d love to do or dig into, but you haven’t yet? What excites you or maybe is something you want to dig into at some point in the future?
In Hugging Face, there’s a certain group of people like me that do not really have specific things. In Hugging Face nobody actually tells you, “You should do XY.” You just go ahead and pick a responsibility, and that’s your thing from that moment.
What I’ve done so far - I did the Tasks, and I did a Keras integration, for which I did model cards and TensorBoard and stuff. I get so happy whenever I see a Keras repository with a model card inside, because I know that people actually find it useful and keep using it.
I am currently working on how we can use the Hugging Face Hub for tabular data. I’m working with an amazing scikit-learn core contributor who is currently at Hugging Face; he’s Adrin. We are currently working on a package that is focused on how we can improve the production capabilities of scikit-learn… Because you use, for instance, pickle, which can run arbitrary code on your machine if you just pull any pickle file and de-serialize it. It’s a bit of a hard problem to tackle. We also want to post whatever information we can have about the model. Currently, I’m working on that side.
For instance, if it’s a tree-based model, you can visualize the tree. If it’s a clustering model, depending on what type of clustering model that is - it can be like a dendrogram, or a visual, like with a PCA… Or if you have a linear model, you can put the hyperplane… I am trying to put that stuff in, and also what the model has learned, through feature importance, Shapley values and everything. What I want is for people to call one line of code and just push their models to the Hub, which will create these model cards with various pieces of information. We are all supporting Gradio, and I’m also working on how we can leverage Gradio for tabular data stuff as well. Because previously, Gradio was mainly focused on modalities like text or audio or computer vision. And the components, if you take a look at the documentation, are focused on that; you can just drag and drop a dataset and automatically visualize everything regarding that dataset, and that’s quite magical.
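The pickle concern raised a moment ago can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `Payload` class is hypothetical, and the `eval` call evaluates a harmless string expression where a malicious file could name any importable callable. It does not show how the package mentioned above addresses the problem.

```python
import pickle


class Payload:
    """Hypothetical object standing in for a malicious pickle's contents."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair when serializing and
        # calls callable(*args) at load time to "rebuild" the object.
        # eval is used here with a harmless expression; an attacker could
        # reference any importable callable instead.
        return (eval, ("'code ran during unpickling'.upper()",))


# Bytes that could just as well have come from a downloaded model file.
blob = pickle.dumps(Payload())

# Deserializing does not give back a Payload; it runs the stored call
# and returns whatever that call produced.
result = pickle.loads(blob)
```

The point is that `pickle.loads` executes the call recorded in the bytes, so loading a pickle from an untrusted source amounts to running untrusted code.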
I remember the first time I used pandas-profiling, and also data analysis baseline library… There are a couple of libraries that enable you to profile your datasets, train baseline models, do an AutoML, like TPOT… I remember using them, and I was like, “This is witchcraft. This is so good.” [laughs] It saves so much time. Recently I realized, we can actually do that on the hub.
Currently, I am building two tools. One is profiling a dataset; it’s like a Gradio interface that does it. We have recently released in Gradio something called Blocks, which is more flexible than an interface. You have tabs, and you have rows, and you can put multiple things inside.
I’m currently building two spaces, like I said. I either build, I maintain, or add something to the Hugging Face hub library, or something else, or do demos… Currently, I’m working on something that is like an auto EDA, like a pandas-profiling. Another thing is an AutoML tool, sort of.
These types of things also save a lot of time and also lower the barrier to entry, I think, because you have a baseline model and it will push your best baseline model to the hub, and maybe you can create a space really easily on that, which you can later take to your local data scientist and say, “Hey, I want this, but an improved version.”
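The auto EDA idea described above can be sketched in a few lines of plain Python. This is a rough illustration only, not the API of pandas-profiling or of the tools being built: the `profile_columns` helper and the toy rows are hypothetical, and a real profiler reports far more than this.

```python
import statistics


def profile_columns(rows):
    """Summarize each column of a list-of-dicts table (hypothetical helper)."""
    report = {}
    columns = rows[0].keys() if rows else []
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary = {
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        # Treat a column as numeric only if every present value is an
        # int or float (bool is excluded even though it subclasses int).
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if is_numeric:
            summary["kind"] = "numeric"
            summary["min"] = min(present)
            summary["max"] = max(present)
            summary["mean"] = statistics.mean(present)
        else:
            summary["kind"] = "categorical"
            summary["most_common"] = statistics.mode(present) if present else None
        report[name] = summary
    return report


# Toy data, invented for this example.
toy_rows = [
    {"age": 31, "city": "Dublin"},
    {"age": 45, "city": "Istanbul"},
    {"age": None, "city": "Dublin"},
]
report = profile_columns(toy_rows)
```

Here `report["age"]` holds the missing-value count and basic statistics for the numeric column, while `report["city"]` holds the unique count and most common value for the categorical one.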
I really love this. I’ve been a big fan of Hugging Face for - I don’t know, since I saw Thomas’s video. I think it’s been three years or something, I don’t remember. And I’m still a big fan of Hugging Face. I go and talk at Python conferences and people approach me and say, “I’m a big fan of your conference.” And I’m like, “Me too.” [laughter]
That’s great. Well, we’re certainly big fans here, and we appreciate the way that you’re building community and collaboration around AI and datasets and all of these things. Yeah, we really appreciate your work, Merve, and we appreciate you taking time to talk to us. It was fun.
Thank you so much for inviting me. I could talk about Hugging Face all day and I would worry that people would get bored of me, so it’s nice to meet people like you who are fans of Hugging Face as well.
Of course. We will talk to you soon.