Practical AI – Episode #290

Towards high-quality (maybe synthetic) datasets

with Ben Burtenshaw & David Berenstein from Hugging Face


As Argilla puts it: “Data quality is what makes or breaks AI.” However, what exactly does this mean and how can AI teams properly collaborate with domain experts towards improved data quality? David Berenstein & Ben Burtenshaw, who are building Argilla & Distilabel at Hugging Face, join us to dig into these topics along with synthetic data generation & AI-generated labeling / feedback.

Featuring

Sponsors

Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.

WorkOS – A platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their application. Add Single Sign-On (Okta, Azure, Google, Microsoft OAuth), sync users from any SCIM directory, HRIS integration, audit trails (SIEM), free magic link sign-in. WorkOS is designed for developers and offers a single, elegant interface that abstracts dozens of enterprise integrations. Learn more and get started at WorkOS.com

Eight Sleep – Take your sleep and recovery to the next level. Go to eightsleep.com/PRACTICALAI and use the code PRACTICALAI to get $350 off your very own Pod 4 Ultra. You can try it for free for 30 days - but we’re confident you will not want to return it. Once you experience AI-optimized sleep, you’ll wonder how you ever slept without it. Currently shipping to: United States, Canada, United Kingdom, Europe, and Australia.

Notes & Links


Chapters

1 00:00 Welcome to Practical AI 00:44
2 00:44 Sponsor: Fly 03:06
3 03:56 What does data collaboration mean? 03:22
4 07:18 Understanding your data 02:40
5 09:58 How to start curating data 03:14
6 13:12 Practical steps to scale 03:30
7 16:52 Sponsor: WorkOS 03:21
8 20:23 Traditional & new use cases 04:28
9 24:51 Virtues of smaller models 02:13
10 27:04 What Argilla looks like 03:52
11 30:55 User backgrounds 03:26
12 34:21 The non-technical POV 03:50
13 38:23 Sponsor: Eight Sleep 02:31
14 41:09 AI feedback 03:41
15 44:50 Hallucination issues 01:20
16 46:10 What is Distilabel 03:58
17 50:08 Usage & adoption 02:47
18 52:55 Where things are going 02:39
19 55:34 This is muy bueno 00:42
20 56:15 Outro 00:46

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, where we’re building a private, secure gen AI platform, and I’m joined as always by Chris Benson, who is a Principal AI Research Engineer at Lockheed Martin. How are you doing, Chris?

Great today, Daniel. How are you?

It’s a beautiful, beautiful fall day, and a good day to take a walk around the block and think about interesting AI things, and clear your mind before getting back into some data collaboration, which is what we’re going to talk about today. Chris, I don’t know if you remember our conversation… It was just me on that one, but with Bengsoon Chuah, who talked about Broccoli AI, the type of AI that’s healthy for organizations… And in that episode, he made a call out to Argilla, which was a big part of his solution that he was developing in a particular vertical. I’m really happy today that we have with us Ben Burtenshaw, who is a Machine Learning Engineer at Argilla, and also David Berenstein, who is a Developer Advocate Engineer working on building Argilla and Distilabel at Hugging Face. Welcome, David and Ben.

Thank you. Great to be here.

Hi. Thanks for having us.

Yeah, so like I was saying, I think for some time maybe, if you’re coming from a data science perspective, there’s been tooling maybe around data that manages training datasets, or evaluation sets, or maybe MLOps tooling and this sort of thing… And part of that has to do with preparation and curation of datasets. But I’ve found interesting – I mentioned the previous conversation with Bengsoon, he talked a lot about collaborating with his sort of subject matter experts in his company around the datasets he was creating for text classification… And that’s where Argilla came up. So I’m wondering if maybe one of you could talk a little bit at a higher level… When you’re talking about data collaboration in the context of the current kind of AI environment, what does that mean, generally? And how would you maybe distinguish that from previous generations of tooling, in maybe similar or different ways?

So data collaboration, at least from our point of view, is kind of the collaboration between both the domain-level experts that really have high domain knowledge and actually know what they’re talking about in terms of the data, the inputs and the outputs that the models are supposed to give within their domain… And then you have the data scientists or the AI engineers on the other side of the coin, who are more technical; they know from a technical point of view what the models expect and what the models should output. And then the collaboration between them is now even higher, because nowadays you can actually prompt LLMs with natural language, and you actually need to ensure that both the models actually perform well, and also the prompts, and these kinds of things. So the collaboration is even more important nowadays. And that’s also still the case for [unintelligible 00:07:11.15] models and these kinds of things, which we also support within Argilla.

I guess maybe in the context of - let’s say there’s a new team that’s exploring the adoption of AI technology, maybe for the first time… Maybe they’re not coming from that data science background, the sort of heavy MLOps stuff, but maybe they’ve been excited by this latest wave of AI technologies… How would you go about helping them understand how their own data, the data that they would curate, the data that they would maybe collaborate on is relevant to and where that fits into the certain workflow? So yeah, I imagine someone may be familiar with what you can do with a ChatGPT or pasting in certain documents or other things, and now they’re kind of wrestling through how to set up their own domain-specific AI workflows in their organization… What would you kind of describe about how their own domain data and how collaborating around that fits into common AI workflows?

[00:08:18.04] Yeah, so something that I like to think about a lot around this subject is like machine learning textbooks… And they often talk about modeling a problem, as well as building a model. There’s a famous [unintelligible 00:08:29.18] cycle. And in that, when you model a problem, you’re basically trying to explain and define the problem. So I have articles and I need to know whether they are a positive or negative rating. And I’m describing that problem, and then I’m going to need to describe that problem to a domain expert or an annotator through guidelines. And when I can describe that problem in such a way that the annotator or the domain expert answers that question clearly enough, then I know that that’s a modeled and clear problem, and it’s something that I could then take on to build a model around. In simple terms, it makes sense.

And so I think when you’re going into a new space like generative AI, and you’re trying to understand your business context around these tools, you can start off by modeling the problem in simple terms, by looking at the data and saying “Okay, does this label make sense for those articles? If I sort all these articles down by these labels, or by this ranking, are these the kinds of things I’m expecting?” Starting off at quite low numbers, like single articles, and kind of building up to tens and hundreds… And as you do that, you begin to understand and also iterate on the problem and kind of change it and adapt it as you go. And once you’ve got up to a reasonable scale of the problem, you can then say, “Alright, this is something that a machine learning model could learn.”

I guess on that front, maybe one of the big confusions that I’ve seen floating around these days is the kind of data that’s relevant to some of these workflows. So it might be easy for people to think about a labeled dataset for a text classification problem, like here’s this text coming in, I’m going to label it spam or not spam, or in some categories… But I think sometimes a sentiment that I’ve got very often is “Hey, our company has this big file store of documents.” And somehow I’m going to fine-tune, quote-unquote, a generative model with just this blob of documents, and then it will perform better for me. And there’s two elements of that that are kind of mushy. One is - well, to what end, for what task? What are you trying to do? And then also how you curate that data then really matters. Is this a sentiment that you all are seeing? Or how for this latest wave of models – how would you describe if a company has a bunch of documents, and they’re in this situation, they’re like “Hey, we know we have data, and we know that these models can get better… And maybe we could even create our own private model with our own domain of data.” What would you walk them through to explain where to start with that process, and how to start curating their data maybe in a less general way, but towards some end?

I think in these scenarios it’s always good to first establish a baseline or a benchmark… Because what we often see is that people will come to us or come to the open source space and they say “Okay, we really want to fine-tune a model, we really want to do a super-extensive RAG pipeline with all of the bells and whistles included, and then kind of start working on these documents.” But what we often see is that they don’t even have a baseline to actually start with. So that’s normally what we recommend.

[00:11:58.06] Also, whenever you work with a RAG pipeline, ensure that all of the documents that you index are actually properly indexed, properly chunked. Whenever you actually execute a pipeline, and you store these retrieved documents, based on the questions and the queries, in Argilla or in any other data annotation tool, you can actually have a look at the documents, see if they make sense, see if the retrieval makes sense, but also if the generated output makes sense. And then whenever you have that baseline set up, from there, actually start iterating and kind of making additions to your pipeline. “Shall I add re-ranking potentially to the retrieval if the retrieval isn’t functioning properly? Shall I add a fine-tuned version of the model? Should I switch from the latest LLaMA model of 3 billion to 7 billion?”, or these kind of things. And then from there on, you can actually consider maybe either fine-tuning a model if that’s actually needed, or fine-tuning one of the retrievers, or these kind of things.
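To make that concrete, here is a minimal sketch of what logging such a baseline might look like. The `retrieve` and `generate` functions are hypothetical stand-ins for whatever retriever and model you are evaluating, and the saved rows could just as well be pushed into Argilla or another annotation tool for review.

```python
import json

# Hypothetical stand-ins for your own pipeline; swap in your real
# retriever and generator. These names are illustrative, not from Argilla.
def retrieve(question: str, k: int = 3) -> list[str]:
    # e.g. a vector-store or keyword search over your chunked documents
    return ["<chunk 1>", "<chunk 2>", "<chunk 3>"][:k]

def generate(question: str, chunks: list[str]) -> str:
    # e.g. a call to whichever LLM you are evaluating
    return "<model answer>"

questions = [
    "What does our policy say about refunds?",
    "Which plan includes SSO?",
]

# Baseline run: keep every retrieved chunk and answer so a domain expert
# can judge both retrieval and generation before adding any bells and whistles.
rows = []
for q in questions:
    chunks = retrieve(q)
    rows.append({"question": q, "chunks": chunks, "answer": generate(q, chunks)})

with open("rag_baseline.json", "w") as f:
    json.dump(rows, f, indent=2)  # review these rows in Argilla or any annotation tool
```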

As you’re saying that, you’re speaking from this kind of profound expertise you have, and I think a lot of folks really have trouble just getting started. And you asked some great questions there, but I think some of those are really tough for someone who’s just getting into it, like which way to go on some of the selections that you would go with that… Could you talk a little bit about the kind of – like, go back over the same thing, but kind of make up a little workflow, this kind of hands-on on just like “You might see this, and this is how I would decide that”, just for a moment, just so people can kind of grasp kind of the thought process you’re going through. Because you kind of described a process, but if you could be a little bit more descriptive about that. When I talk to people, once they get going, they kind of go to the next step, and go to the next step, and go to the next step… But the first four or five big question marks in the beginning, they don’t know which one to handle.

I can add some practical steps onto that that I’ve worked with in the past, if that’s alright.

That’d be fantastic.

Yeah, so one thing that you can do that is really straightforward is actually to write down a list of the kinds of questions that you’re expecting your system to answer. And you can get that list by speaking to domain experts, or if you are a domain expert, you can write it yourself. And it doesn’t need to be an extensive, exhaustive list. It can be quite a small starting set. You can then take those questions away and start to look at documents or pools and sections of documents from this lake that you potentially have, and associate those documents with those questions, and then start to look if a model can answer those questions with those documents. In fact, by not even building anything. By starting to use, say, ChatGPT, or HuggingChat, or any of these kinds of interfaces, and just seeing this very, very low, simple benchmark - is that feasible? Whilst at the same time, starting to ask yourself, “Can I, as a domain expert, answer this?” And that’s kind of where Argilla comes in, at the very first step.

So you start to put these documents in front of people with those questions, and you start to search through those documents, and say to people “Can you answer this question?” Or “Here’s an answer from a model to this question, in a very small setting.” And you start to get basic, early signals of quality. And from there, you would start to introduce proper retrieval. So you would scale up your – you would take all of your documents… Say you had 100 documents associated with your 10 questions. You put all those 100 documents in an index, and iterate over your 10 questions, and see “Okay, are the right documents aligning with the right questions here?” Then you start to scale up your documents and make it more and more of a real-world situation. You would start to scale up your questions… You could do both of these synthetically. And then if you still started to see positive signals, you could start to scale. And if you start to see negative signals, “I’m no longer getting the right documents associated with the right questions…”

I personally would always start from the simplest levers in a RAG setup, and what I mean there is that you have a number of different things that you can optimise.

So you have retrieval, you can optimise it semantically, or you can optimise it in a rule-based retrieval, you can optimise the generative model, you can optimise the prompt… And the simplest movers, the simplest levers are the rule-based retrieval (the word search), and then the semantic search.

So I would first of all add like a hybrid search. What happens if I make sure that there’s an exact match in that document for the word in my query? Does that improve my results? And then I would just move through that process.
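As a rough illustration of those two levers, the sketch below scores a toy set of questions against a toy set of documents with a semantic model plus a crude exact-word-overlap bonus. The model name and the 0.1 keyword weight are arbitrary choices for the example, not anything the guests prescribe; it assumes the `sentence-transformers` package is installed.

```python
from sentence_transformers import SentenceTransformer, util

# Toy corpus and question set; in practice these come from your documents
# and the question list you wrote down with domain experts.
documents = [
    "Refunds are issued within 14 days of cancellation.",
    "Single Sign-On is available on the enterprise plan.",
    "Our offices are closed on public holidays.",
]
questions = ["How long do refunds take?", "Which plan has SSO?"]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(documents, convert_to_tensor=True)

def keyword_bonus(question: str, document: str) -> float:
    # Crude rule-based lever: reward exact word overlap between query and document.
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return 0.1 * len(q_words & d_words)

for q in questions:
    q_emb = model.encode(q, convert_to_tensor=True)
    semantic = util.cos_sim(q_emb, doc_emb)[0]  # cosine similarity to every document
    scores = [float(s) + keyword_bonus(q, d) for s, d in zip(semantic, documents)]
    best = max(range(len(documents)), key=lambda i: scores[i])
    print(q, "->", documents[best])  # eyeball whether the right document wins
```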

Break: [00:16:43.03]

I’m guessing that you all – you know, the fact that you’re supporting all of these use cases on top of Argilla on the data side makes me think… Like you say, there’s so many things to optimize in terms of that RAG process, but there’s also so many AI workflows that are being thought of, whether that be code generation or assistance, or content generation, information extraction… But then you kind of go beyond that. David, you mentioned text classification, and of course there’s image use cases… So I’m wondering, from you all, at this point – you know, one of the things Chris and I have talked about on the show a bit is… You know, we’re still big proponents and believe that in enterprises a lot of times there is a lot of mixing of rule-based systems, and more kind of traditional, I guess, if you want to think about it that way, machine learning, and smaller models… And then bringing in these larger gen AI models as kind of orchestrators, or inner query layer things… And that’s a story we’ve been kind of telling, but I think it’s interesting that we have both of you here in the sense that – like, you really, I’m sure there’s certain things that you don’t or can’t track about what you’re doing… But just even anecdotally, out of the users that you’re supporting on Argilla, what have you seen in terms of what is the mix between those using Argilla for maybe what people would consider traditional data science type of models, like text classification or image classification type of things, and these maybe newer workflows, like RAG and other things… How do you see that balance, and do you see people using both, or one or the other? Yeah, any insights there?

I think we recently had this company from Germany, [unintelligible 00:22:17.21] over at one of our meetups that we host, and they had an interesting use case where they collaborated with this healthcare insurance platform in Germany. And one of the things that you see with large language models is that these large language models can’t really produce German language properly. They’re mostly trained on English text. And that was also one of their issues. And what they did was actually - they had a huge classification and generation pipeline combining a lot of these techniques where they would initially get an email in that they would classify into a certain category, then based on the category they would kind of define what kind of email template, what kind of prompt template they would use… Then based on the prompt template, they would kind of start generating and composing one of these response emails that you would expect for a customer query request coming in for the healthcare insurance companies… And then in order to actually ensure that the formatting and phrasing and the German language was applied properly, it would then, based on that prompt, regenerate the email once more. So prompting an LLM to kind of improve the quality of the initial proposed output. And then after all of these different steps of classification, of retrieval-augmented generation, of an initial generation and a regeneration, they would then end up with their eventual output.

So what we see is that all of these techniques are normally combined. And also, a thing that we are strong believers in is that whenever there is a smaller model or an easier approach applicable, why not go for that, instead of using one of these super-big large language models? So if you can just classify “Is this relevant or is this not relevant?” and based on that actually decide what to do - that makes a lot of sense.

[00:24:11.20] And also, one of the interesting things that I’ve seen one of these open source platforms out there, Haystack, using is also this query classification pipeline, where they would classify incoming queries as either a key terminology search, a question query, or actually a phrase meant to prompt an LLM. And based on that, actually redirect all of their queries to the correct model. And that’s also an interesting approach that we’ve seen.
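A toy sketch of that routing idea might look like the following; the heuristics are deliberately naive placeholders (a real system, like the one described, would use a trained classifier), and the handler strings stand in for actual search and LLM calls.

```python
def classify_query(query: str) -> str:
    """Very naive placeholder for the query classifier described above;
    a real system would use a trained classifier rather than heuristics."""
    q = query.strip()
    if not q:
        return "chat"
    first_word = q.lower().split()[0]
    if q.endswith("?") or first_word in {"who", "what", "when", "where", "why", "how"}:
        return "question"   # route to the RAG pipeline
    if len(q.split()) <= 3:
        return "keyword"    # route to plain keyword / BM25 search
    return "chat"           # route straight to an LLM prompt

def route(query: str) -> str:
    kind = classify_query(query)
    handlers = {
        "keyword": lambda q: f"[keyword search] {q}",
        "question": lambda q: f"[RAG pipeline] {q}",
        "chat": lambda q: f"[LLM prompt] {q}",
    }
    return handlers[kind](query)

print(route("refund policy"))
print(route("How do I cancel my subscription?"))
print(route("Please draft a polite reply explaining our refund policy"))
```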

Quick follow-up on that. It’s just something I wanted to draw out, because we’ve drawn it out across some other episodes a bit… You were just making a recommendation, kind of go for the smaller model versus the larger model. For people trying to follow - and there’s the divergent mindsets - could you take just a second and say why you would advocate for that, what the benefit, what the virtue is, in the context of everything else?

I would say smaller models are generally hostable by yourself, so it’s more private. Smaller models, they are more cost-efficient. Smaller models can also be fine-tuned more easily to your specific use case. A lot of what we see people coming to us about is actually fine-tuning LLMs… But even the big companies out there, with huge amounts of money and resources and dedicated research teams, still have difficulties fine-tuning LLMs. So within a retrieval-augmented generation pipeline, instead of fine-tuning an LLM for the generation part, you can actually choose to fine-tune one of these retrieval models, which you can fine-tune on consumer-grade hardware, very easily, on any arbitrary data scientist’s or developer’s device. And then instead of having to deploy anything on one of the cloud providers, you can start with that.

And a similar reasoning for a RAG pipeline - whenever you provide an LLM with garbage within such a retrieval-augmented generation pipeline, you actually also ensure that there’s less relevant content, and the output of the LLM is also going to be worse.

Yeah, I’ve seen a lot of cases where – I think it was Travis Fischer who was on the show, he advocated for this hierarchy of how you should approach these problems… And there’s maybe seven things on his hierarchy that you should try before fine-tuning. And I think in a lot of cases I’ve seen people maybe jump to that. I forget which one of you said this, but “This naive RAG approach didn’t get me quite there, so now I need to fine-tune”, when in reality there’s sort of a huge number of things in between those two places. And you might end up just getting a worse-performing model, depending on how you go about the fine-tune.

One of the things - David, you kind of walked through the example of the specific company that had these workflows that involved a variety of different operations, which I assume – Ben, you were mentioning earlier starting with a test set, and that sort of thing, and how to think about the tasks… I’m wondering if you can specifically now talk just a little bit about Argilla… Specifically, people might be familiar generally with data annotation, they might be familiar maybe even with how to upload some data to “fine-tune” some of these models in an API sense, or maybe even in a more advanced way, with QLORA or something like that… But could you take a minute and just talk through kind of Argilla’s approach to data annotation and data collaboration? It’s kind of hard on a podcast, because we don’t have a visual to show for people, but as best as you can help people to imagine “If I’m using Argilla to do data collaboration, what does that look like in terms of what I would set up and who’s involved? What actions are they doing?” That sort of thing.

[00:28:17.01] Argilla - there’s two sides to it. So there’s a Python SDK, which is intended for the AI/machine learning engineer, and there’s a UI, which is intended for your domain expert. In reality, the engineers often also use the UI, and you kind of iterate on that as you would, because it gives you a representation of your task. But there’s these two sides.

The UI is kind of lightweight. It can be deployed in a Docker container, or on Hugging Face Spaces. It’s really easy to spin up. And the SDK is really about describing a feedback task, and describing the kind of information that you want. So you use Python classes to construct your dataset settings. You’ll say, “Okay, my fields are a piece of text, a chat, or an image”, and the questions are a text question, so like some kind of feedback or a comment, for example; a label question, so positive or negative labels, for example; a rating - let’s say between one and five - or a ranking. So “Example one is better than example two”, and you can rank a set of examples.

And with that definition of a feedback task, you can create that on your server, in your UI, and then you can push what we call records, your samples into that dataset. And then they’ll be shown within the UI, and your annotator can see all of the questions, they’ll have nice descriptions that were defined in the SDK… They can tweak and kind of change those as well if you need in the UI, because that’s a little bit easier…

You can distribute the task between a team… So you can say “Okay, this record will be accepted once we have at least two reviews of it.” And you can say that some questions are required and some aren’t, and they can skip through some of the questions.
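For readers who want to picture the SDK side, here is a minimal sketch of that flow using the Argilla 2.x-style Python classes; the server URL, API key, and dataset name are placeholders, and exact class names or arguments may differ in the version you have installed.

```python
import argilla as rg

# Placeholder connection details; point these at your own Argilla server.
client = rg.Argilla(api_url="https://your-argilla-server", api_key="YOUR_API_KEY")

# Describe the feedback task: what the annotator sees (fields) and what you ask (questions).
settings = rg.Settings(
    guidelines="Label each article as positive or negative and leave a short comment.",
    fields=[rg.TextField(name="article")],
    questions=[
        rg.LabelQuestion(name="sentiment", labels=["positive", "negative"]),
        rg.TextQuestion(name="comment", required=False),
    ],
    distribution=rg.TaskDistribution(min_submitted=2),  # record is complete after two reviews
)

dataset = rg.Dataset(name="article-feedback", settings=settings, client=client)
dataset.create()

# Push records (the samples annotators will see in the UI).
dataset.records.log([{"article": "The new release fixed every issue I had."}])
```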

The UI has loads of keyboard shortcuts, like numbers and arrows and return, so you can move through it really fast. It’s kind of optimized for that. And different sort of screen sizes. One thing we’re starting to see is that as LLMs get really good at quite long documents, some of the stuff that they’re dealing with is like a multi-page document, or a really detailed image, and then a chat conversation. And then we want like a comment and a ranking question. So it’s just like a lot of information on the screen. So the UI kind of scales a bit like an IDE, so you can drag it around to give yourself enough width to see all this stuff… And then you can move through it in a reasonably efficient way with the keyboard shortcuts and stuff.

Interesting. And what do you see as kind of the backgrounds of the roles of people that are using this tool? Because one of the interesting things, from my perspective, especially with this kind of latest wave, is there’s maybe less data scientists, kind of AI people, that that’s their background, and more software engineers, and just non-technical domain experts. So how do you kind of think about the roles within that, and what are you seeing in terms of who’s using the system?

For us, I think it’s - yeah, from the SDK Python side, it’s really still developers. And then from the UI side, it’s like anyone in the team that needs to have some data labeled with domain knowledge. Often these are also going to be the AI experts. And one of the cool things is that whenever an AI expert actually sets up a dataset, besides these fields and questions, they can actually come up with some interesting features that they can add on top of the dataset. They are also able to add semantic search, attaching vectors, or semantic representations of the records, to each of the records, which actually enables the users within the UI to label way more efficiently. So for example, if someone sees a very representative example of something that’s bad within their dataset, they can do a semantic search, find the most similar documents, and then continue with the labeling on top of that.

[00:32:25.23] Besides that, you can also, for example, filter based on model certainties. So let’s say that your model is very uncertain about an initial prediction that you have within your UI, and it’s really interesting for the domain expert or for the data scientist to go and have a look at that specific record or that range of uncertainties, and then based on that, the labeling or like the data curation, or whatever you would like to call it, becomes way more engaging and way more interesting.
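As a sketch of how those suggestions and vectors get attached on the SDK side, something like the following should be close; it assumes the Argilla 2.x-style `Record`, `Suggestion`, and vector classes, plus a `VectorField` declared in the dataset settings, so treat it as illustrative and check the current SDK docs.

```python
import argilla as rg
from sentence_transformers import SentenceTransformer

# Assumes ARGILLA_API_URL / ARGILLA_API_KEY are set in the environment.
client = rg.Argilla()

encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["This invoice looks fraudulent.", "Thanks for the quick delivery!"]

records = []
for text in texts:
    records.append(
        rg.Record(
            fields={"article": text},
            # Model prediction plus its confidence; low-score records can be
            # surfaced first so experts review the uncertain ones.
            suggestions=[rg.Suggestion(question_name="sentiment", value="negative", score=0.41)],
            # Vector used for "find similar records" semantic search in the UI.
            vectors={"mini_lm": encoder.encode(text).tolist()},
        )
    )

# Assumes the dataset settings also declared:
#   vectors=[rg.VectorField(name="mini_lm", dimensions=384)]
dataset = client.datasets(name="article-feedback")
dataset.records.log(records)
```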

And on top of that, another thing that we are starting to explore is actually using this AI feedback and synthetic data within Argilla as well, and that’s actually one of the other products that we’re working on, and it’s called Distilabel.

So nowadays what you can do with LLMs is also actually use LLMs to evaluate questions, for example, to evaluate whether something is labeled A, B, or C, or whether something is a good or bad response, and you see all kinds of tools, open source tools out there. That’s also a thing that we are looking at for integrating with the UI, where instead of doing this more from a data science SDK perspective, users without any technical knowledge would actually be able to tweak these guidelines that Ben highlighted earlier, and then say, “Okay, maybe instead of taking this into account, you should focus a bit more on like the harm that potentially is within your data, or the risks that are within your data.” And then you would be able to prompt an LLM once again to kind of label your data, and then you wouldn’t directly need the Python SDK anymore.

I was thinking about, as you were describing that - I work at a large organization, and we certainly have a lot of domain experts in the organization I work at, that are either non-technical or semi-technical… And as users, they will sometimes find it intimidating kind of getting into all this as they’re starting a project. Could you talk a little bit about what it’s like for a non-technical person to sit down with Argilla and start to work in a productive way? What is that experience like for them? Because it’s one thing – like, the technical people kind of just know; they dive into it, they’re going to use the SDK, they’ve used other SDKs… But there can be a bit of hand-holding for people who are not used to that. Could you describe the user experience for that non-technical subject matter expert coming in, and what labeling is like, and just kind of paint a picture of words on what their experience might be like?

Yeah, I mean, one thing I guess I’d start off by saying is that Argilla is kind of the latest iteration of a problem that has existed for a long time in machine learning and data science, about collecting feedback from domain experts. And it’s kind of gone through spreadsheets, and various other tools that were substandard, and really bad user experiences, where domain experts were asked for information, that information was extracted, and then models have been trained really poorly on that information.

So as a field, we kind of know that it’s something that we have to take really seriously, and that’s kind of what Argilla is built on top of; that’s part of our DNA as a product, is like optimizing the feedback process as a user experience problem. And so when the user sits down to use Argilla, the intention is that all of the information should be right there in front of them, inside their single record view. So what that means is they’ve got a set of guidelines that are edited in Markdown, they can contain images, links to various pages or other external documents if they need, and they can just kind of scroll through that; it’s always there, it’s always available.

[00:36:11.04] They’ve then also got like basic metrics. So they’ll know how many records they’ve got left, how many they’ve labeled, they can view their kind of team status and see what’s going on. And then on the left, they have their fields which they can scroll through, and on the right they’ll have a set of questions.

As I said, they can move through these with keyboard shortcuts, and they can switch the view so that they can scroll kind of infinitely, or they can move into a kind of page swiping… Which - yeah, if you’re looking at really small records, so like a couple of lines that you’re just assigning a single label to, you can do that in bulk. So as we said, you could use a semantic search, give me all the records that are similar to this one, and I’ll bulk-label those. Or you could search for terms inside those records, and you can bulk-label those… And then once you’re finished, you’ll know about it.

And one of the interesting things that I’ve done personally quite often is sit together with the domain experts and their AI engineers to kind of walk them through how to configure Argilla most usefully, for both of them. And then the domain experts bring a lot of things to the table, like “I want to see this specific representation. What if we could do this? What if we could do that?” Then the AI engineers think about the data side of things. “Is this possible from our point of view, from our side?” And then I, as a mediator so to say, try to make the most out of the Argilla configuration.

And that’s also how we see this collaboration process going, where domain experts really work together also with AI engineers, because AI engineers or machine learning engineers actually know what’s possible from the data, what it means to get high-quality data for fine-tuning a model… Because whenever a domain expert comes up with something that’s useful for them in terms of labeling, it doesn’t necessarily mean that it’s actually proper data that’s going to come out of there in terms of fine-tuning a model. And that’s also a part of, I guess, the collaboration that we’re talking about.

Break: [00:38:14.12]

I want to maybe double-click on something that, David, you just said in sort of passing, which I think is quite significant… And I don’t know if – some people might have caught it, but when you were talking about Distilabel, you also talked about AI feedback. So AI feedback and synthetic data. So I’d love to get into those topics a little bit, maybe first coming from the AI feedback side. I think this is super-interesting, because - Ben, you talked about how this is a kind of more general problem that people have been looking at in various ways, from various perspectives, for a long time, in terms of this data collaboration labeling piece… But there is this kind of very interesting element now where we have the ability to utilize these very powerful, maybe general-purpose, instruction-following type of models to actually act as labelers within the system, or at least generate drafts of labels, or feedback, or even preferences, and scores, and all of those sorts of things… So I’m wondering if one of you could speak to that.

Some people might find this kind of strange, that we’re kind of giving feedback to AI systems with AI systems, which seems circular, and maybe like “Why would that work?” Or maybe that just kind of produces some weird feelings for people… But I think it is a significant thing that is happening. So yeah, either of you would want to kind of dive into that… What does it specifically mean in AI feedback? How are you seeing that being used most productively?

So when we create a dataset, either manually, or with AI feedback or AI generation, we have all the information there to understand the problem. We have a set of guidelines, we have a set of labels, definitions for those labels, with documents, and definitions to those documents. We give those to a manual annotator, or we’ll go out and collect those documents, and we give those documents to the manual annotator. And we’re trying to describe that problem so that the person understands it to create the data.

We can essentially take all of the same resources and give those to an LLM, and get the LLM to perform the same steps. So there’s two parts to that. There’s a generative part where the LLM can generate documents… So let’s say we’ve got 100 documents in our dataset, but we want 10,000. We can say “Generate a document like this one, but”, and add variation on top of that. And we can fan out our dataset, our documents from 100 to 10,000. We could then take those same documents, or a pool of documents from elsewhere, and we could get feedback on that. So that could be qualitative feedback. “Tell me which of these documents are relevant to this task.” “Tell me which of these documents are of a high quality, are concise, are detailed”, these kind of attributes. So we could filter down our large dataset or our generated dataset to the best documents.

We could also add labels. So we could say, “Tell me which of these documents relates to my business use case or not”, these kind of things. Apply topics to these documents… And then we can, in doing so, create a classification dataset from those labels. Or we could, in one example, take a set of documents and use a generative model to generate questions, or queries about those documents. And we could use that to create a Q&A dataset, or a retrieval dataset, where we generate search queries based on documents.
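A bare-bones sketch of that AI-feedback step, using the OpenAI Python client as an example judge; the model name, prompt wording, and label set are illustrative choices, and the drafted labels would still go to a domain expert for review in a tool like Argilla.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [
    "Our enterprise plan includes SSO and audit logs.",
    "I had pasta for lunch and it was great.",
]

def ai_label(document: str, task: str) -> str:
    # Ask the LLM to draft a label; keep temperature at 0 for repeatable drafts.
    prompt = (
        f"Task: {task}\n"
        f"Document: {document}\n"
        "Answer with exactly one word: relevant or irrelevant."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

labels = [ai_label(d, "Identify documents about our product's security features") for d in documents]
print(list(zip(documents, labels)))  # these drafts still go to a human for review
```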

When you’re doing that and you’re generating the datasets with another model, how much do you have to worry about hallucination playing into that? It sounds like you have a good process for trying to catch it there, but… Is that a small issue? Is that a larger issue? Any guidance on that?

That’s one of the main issues, definitely. It is probably the main issue. And so really, it’s about both sides of that process that I described, that generating side and that evaluating side. So you get the large-language models to do as much as possible to expose hallucination by evaluating themselves. And typically, you’re getting larger models to evaluate, so that they’re a more performant model and they should hallucinate less.

The task of identifying hallucinations is not the same as generating a document. So typically, LLMs are better at identifying hallucinations and nonsense. If you give them the context, then they are not generating it. And so you combine that within a pipeline, and then you would take that to a domain expert, in a tool like Argilla. And so that’s really why we have these two tools, Distilabel and Argilla. Because without Argilla, Distilabel would suffer from a lot of those problems.

[00:46:09.28] Yeah. And I guess that brings us to the second tool, Distilabel, which I know has something to do with this synthetic data piece as well… And I’m really intrigued to hear about this, because I also see some of what you have on the documentation about what people are building with Distilabel… I do know a couple of datasets, like the OpenHermes dataset, the Intel Orca DPO dataset… These are datasets that have been part of the lineage of models that I’ve found very, very useful. So first off, thanks for building tooling that’s created really useful models in my own life. But beyond that - yeah, David, do you want to go into a little bit about what Distilabel is, and maybe even tie into some of those things and how it’s proven to be a useful piece of the process in creating some of those models?

I think - yeah, the idea of Distilabel kind of started [unintelligible 00:47:10.27] half a year ago, more or less, or maybe a year ago, where we saw these new models coming out, like Dolly from Databricks and Alpaca from Stanford, where there were datasets being generated with OpenAI frontier models, being evaluated with OpenAI frontier models, and then published and actually used for fine-tuning one of these models. So apparently there were research groups or companies kind of investing time in this. But what we also saw is when we would kind of upload these datasets into Argilla, and actually start looking at the data, that there were a lot of flaws within there. And then whenever [unintelligible 00:47:51.14], which is one of these specific papers that really started to scale this synthetic data and AI feedback concept, came out, we thought “Okay, maybe it’s worth looking into a package that can actually help us facilitate kind of creating datasets, that we can then eventually curate within Argilla.” And that’s when we started to work on the initial version of Distilabel.

So it’s kind of like an application framework, like LlamaIndex or LangChain, if you’re familiar with those, but then specifically focused on synthetic data generation and AI feedback. So what we try to do is organize everything into this pipelining structure, where you have either steps, which are about basic data operations, or tasks, which are about prompt templates and prompting. And for prompt templates, you can think about either providing feedback, maybe rewriting some initial input that you provide to the prompt template, or maybe like ranking, or like generating from scratch, or these kind of things. And then these tasks are actually executed by LLMs, and these are then all fit together within a pipelining structure.

The thing for these tasks is that nowadays we actually look at all of the most recent research implementations or most recent papers, and we try to implement them whenever they come out and are actually relevant for synthetic data generation. So you really go from that kind of finicky prompt engineering, so to say, to well-evaluated prompts that we’ve implemented.

And the nice thing about our pipelining structure is also that we run everything asynchronously. So there’s multiple LLM executions being done at once, which will really speed up your pipeline. And on top of that, we also cache all of the intermediary results. So as you can imagine, calling the OpenAI API can be quite costly, and whenever you run a pipeline, a lot of things can go wrong. But whenever you actually rerun our pipelines within Distilabel, you actually have these cached results already there, so you would avoid kind of incurring additional costs whenever something within the pipeline breaks.
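For a feel of the shape of such a pipeline, here is a rough sketch along the lines of the Distilabel 1.x quickstart; module paths and class names are written from memory of those docs, so verify them against the current documentation before relying on this.

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-data-demo") as pipeline:
    # A step feeding seed instructions, and a task that prompts an LLM with them.
    load = LoadDataFromDicts(data=[{"instruction": "Write a short FAQ entry about refunds."}])
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    load >> generate  # connect the step to the task

if __name__ == "__main__":
    # use_cache=True means a rerun after a failure reuses already-computed rows
    # instead of calling the API (and paying for it) again.
    distiset = pipeline.run(use_cache=True)
    print(distiset)
```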

[00:50:06.00] Yeah, that’s awesome. And I know that one element of this is the creation of synthetic data for further fine-tuning LLMs, to increase performance, or maybe to some sort of alignment goal, or something like that. But also, I know from working with a lot of healthcare companies, manufacturers, others that are more security privacy conscious in my day job, part of the pitch around synthetic data is maybe also creating datasets that might not kind of poison LLMs with a bunch of your own sort of private information that could be sort of exposed as part of an answer, that someone prompts the model in some way, and this data is embedded in the dataset, and all of that. So yeah, I would definitely encourage people to check out Distilabel. You said it’s been around for half a year… How have you seen the kind of usage and adoption so far?

The usage and adoption has been quite good, in terms of the number of datasets that have been released. So you mentioned the Intel Orca DPO dataset, which was an example use case of how we were initially using it, where we had this original dataset that had been labeled by Intel employees, with preferences of what would be the preferred response to a given prompt. And we actually used Distilabel to kind of clean that, based on prompting LLMs ourselves to re-evaluate these chosen/rejected pairs within the original dataset, filtering out all of the ambiguity. So sometimes the LLM wouldn’t align with the original chosen/rejected pair, and based on that, we were actually able to scale down the dataset by 50%, leading to less training time, and also leading to a higher-performing model.

And that was one of the really famous examples that kind of inspired some people within the open source community to actually start looking at Distilabel, to start using Distilabel to generate datasets. There are some Hugging Face teams that actually have been generating millions and millions of rows of synthetic data using Distilabel, and that’s pretty cool, to see that people are actually using it at scale.

And besides that, there’s also these smaller companies, so to say, [00:52:34.02] the German startup that I mentioned before, using it to also rewrite and resynthesize emails within actual production use cases.

[00:52:48.25] That’s really fascinating. You guys are pushing the state of the art in a big way. With the work that you’ve done in Distilabel and Argilla, where do you think things are going? When you’re at the end of whatever your task of the day is, and you’re kind of just letting your mind wander and thinking about the future, where do each of y’all go in terms of what you think is going to happen, what you’re excited about, what you’re hoping will happen, what you might be working on in a few months or maybe a year or two? What are your thoughts?

I suppose for me it’s about two main things… And the first would be modalities. So moving out of text and into image, and audio, and video, and also kind of UX environments… In Argilla, but also in Distilabel, that we can generate synthetic datasets in different modalities, and that we can review those. That’s a necessity and something that we’re already working on and we’ve already got features around, but we’ve got kind of more coming.

And then the second one, which I suppose is a bit more far-fetched, and that’s a bit more about kind of tightening the loop between the various applications.

So between Distilabel, Argilla and the application that you’re building, so that you can deal with feedback as it’s coming from your domain expert, that’s using your application and potentially Argilla at the same time, so we can kind of synthesize on top of that to evaluate that feedback that we’re getting, and generate based on that feedback… So we can add that into Argilla and then we can respond to that synthetic generation, that synthetic data. And then we can use that to train our model, this kind of tight loop between the end user, the application and our feedback.

Yeah. And for me, it kind of aligns with what you’ve mentioned before, Ben - the multimodality, smaller, more efficient models, things that can actually run on a device. I’ve been playing around with this app this morning that you can actually load a local LLM into, like a smaller Qwen or a LLaMA model from Meta… And it actually runs on an iPhone 13, which is really cool. It’s private. It runs quite quickly.

And the thing that I’ve been wanting to play around with is the speech-to-speech models, where you can actually have real-time speech-to-speech. I’m learning Spanish at the moment… And one of the difficult things there is not being confident enough to actually talk to people out on the streets, and these kinds of things. So whenever you would be able to kind of practice that at home, privately, on your device, kind of talk some Spanish into an LLM, get some Spanish back, maybe some corrections in English… These kinds of scenarios are super-cool for me whenever they would be able to come through.

Muy bueno.

Yeah, this is muy bueno. I’ve been really, really excited to talk to you both, and would love to have you both back on the show sometime to update on those things. Thank you for what you all are doing, both in terms of tooling, and Argilla and Hugging Face more broadly in terms of how you’re driving things forward in the community, and especially the open source side… So thank you both. Thank you for taking time to talk with us, and hope to talk again soon.

Yeah, thank you. And thanks for having us.

Thank you.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
