Practical AI – Episode #223

Creating instruction tuned models

with Erin Mikail Staples, developer community advocate at Label Studio

At the recent ODSC East conference, Daniel got a chance to sit down with Erin Mikail Staples to discuss the process of gathering human feedback and creating an instruction-tuned Large Language Model (LLM). They also chatted about the importance of open data and practical tooling for data annotation and fine-tuning. Do you want to create your own custom generative AI models? This is the episode for you!

Featuring

Sponsors

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com

Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Typesense – Lightning fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!

Notes & Links

Chapters

1 00:00 Welcome to Practical AI
2 00:43 Erin Mikail Staples
3 02:09 Open source attendees
4 03:54 The key to RLHF
5 05:35 Tooling for RLHF
6 07:33 Humanities in data science
7 11:22 Label Studio's workflow
8 15:41 The open data ecosystem
9 21:04 Do data labeling
10 22:33 Exciting changes coming
11 24:15 DevRel(ish) and other resources
12 25:13 Goodbyes
13 25:45 Outro

Transcript

Play the audio to listen along while you enjoy the transcript. 🎧

Hello, this is Daniel Whitenack. I am here on-site at ODSC East in Boston, the Open Data Science Conference, and I am super-excited because I get to sit down with Erin Mikail Staples, who’s a developer community advocate at Label Studio. And yeah, what do you think of the conference so far, Erin?

It’s been – first-off, super-fan of what you’ve been doing, and anybody who’s creating stuff out there in this space, especially with the current Zeitgeist and explosion of interest in AI and machine learning.

It’s a little crazy…

It’s a little wild. It’s a little wild. I would be lying if I said I wasn’t newer to the field myself, but it’s been something I’ve been very fascinated by… But all that being said, this conference is really cool for seeing just the breadth of, first off, people here. There are people who are very new to the industry, people who came just to learn more for the first time… But then there are people who have been practicing for years and years, and this is their third or fourth time at ODSC… And I’m also really intrigued by the number of people concerned about data integrity here.

Yeah. Lots of interpretability, integrity, reliability type talks.

Yeah, lots of reliability. The other big one is missing data, and how we approach these problems, especially with the rise of foundational models and generative AI. How does that impact things in the long run? Those are crucial conversations to have, I think.

Yeah, definitely. And what sorts of different players in this space are you seeing at this conference, both in terms of open source, or different targets, like MLOps platforms, that sort of thing - how do you see that developing?

First off, I’m personally a huge fan of open source. It’s not only how I learned to code in the first place, but just a big believer in the ecosystem. I’m a huge believer in open data, I’m a participant in Open Data Week…

And you’re wearing a PyLady shirt, which is awesome.

Yeah, yeah. I’m a member of PyLadies… So again, super-important, I think, to have all these things in the ecosystem… But one of the things I think stands out is there are so many new innovations that if you’re starting a tech stack from ground zero, it’s really fun to see all the different players in the game. So selfishly, working at Label Studio, one of the best things about being in this space right now is we’re a cool platform, because we can integrate with so many different data types, which means I get to play with almost every other tool, workshop, or player in the ecosystem… Which is selfishly fun. It means I get more things to integrate with or build. As always, we’re huge fans of what the Pachyderm team is always doing… We work very closely with them. We’ve also got a lot of friends and fans in the DVC crew. They’re not here this conference round, but I did get to work with them at PyCon, and it was really amazing to see the work that they’re coming out with. The [unintelligible 00:03:36.08] crew is always fun to see around… So that’s always exciting…

Cool. Yeah, yeah, there’s so many awesome things going on, and I’ve seen maybe three or four open source packages that I – I don’t know if I’ve been ignoring, or I haven’t heard about… So that’s one of the fun things about coming to these things. I know also you gave a recent talk at PyData Berlin, about reinforcement learning from human feedback, I believe was the topic. Could you tell us a little bit about the general pitch or angle on that talk? …which is definitely a key topic these days with all the instruction-tuned models that are coming out, and all of that… So what was your angle in terms of what you were thinking about there?

Yeah, so I think one of the cool things is it was a talk that Nikolai and I gave; Nikolai is the CTO and one of the co-founders of Label Studio. And what we did at Berlin is we really made sure to expand on this idea that yes, these generative models, these larger models are becoming the norm. Yesterday I was talking to someone who’s like “I got interested in AI because I made 1,000 things with Midjourney.” I’m like “Cool!” And I’m very fascinated by, and I’m a believer - like, I don’t care how you got into it, but just the curiosity to show up to a conference and learn more is very fascinating.

But explaining to someone how it works, and then also explaining the best practices behind it is really important. Personally, I have a journalism background, I have a liberal arts background, and I think it’s really important that we incorporate the humanities in technology for the long run. So when it comes to reinforcement learning, all of these large generative models - they can all be made just a little bit better with the human signal that we can provide. And we can say a lot of things, like get into prompt engineering, which is a whole other topic, but it will never be as good as if you can retrain on your own dataset with subject matter experts, or for a specific use case or condition that you’re trying to steer the output towards.

Yeah. And I think one of the things that’s been on my mind recently is this topic, reinforcement learning from human feedback, especially with what’s gone on with ChatGPT… Sometimes it feels out of reach for day to day data scientists. Like, I could leverage this model, but what is the tooling around reinforcement learning from human feedback? How could I use that framework, or use tooling around that to impact my own models, or my own life? How could I connect my domain experts’ input and their preferences into a system that I’m designing? Do you have any thoughts there?

[06:11] Yeah. So one of the examples I love to point to is actually what Bloomberg did, probably in early April now… They took the financial data that they had - many of us know Bloomberg from Bloomberg News, but the company actually goes all the way back to the financial terminal that was used for stock trading. They have these massive amounts of financial data, and how do they stack on top of it? How do they access that data even faster, and train it for the best use case that they have? Currently, our larger models can’t do that. They’re not experts in financial data. They’re not combing just financial data. But what Bloomberg did is they took it and they retrained and they built the things – I probably fangirled over… Sorry if you’re on the Bloomberg team and I fangirled over you at PyCon, because I definitely was –

Great work.

I was like “This is the coolest thing ever. I’ll use this as an example. Also, I learned machine learning off of your repo. Okay, thanks, bye.” But we do have a model; if you want to learn and see reinforcement learning in action, there is an open source repo. It was built by myself, Nikolai, and Jimmy Whittaker, who we have as a data scientist in residence at Label Studio and Heartex, and also at Pachyderm. All of it is open, you can play around with it. It’s based off of GPT-2 right now, so you can go have some fun and get your hands dirty. And it’s all runnable within a Google Colab notebook.
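For a rough feel of what that kind of Colab does, here’s a minimal sketch of a single RLHF step on GPT-2 using Hugging Face’s trl library. This is an illustration, not the Label Studio/Pachyderm repo itself; the exact trl API has shifted between releases, and the hard-coded reward stands in for a real human preference score gathered through annotation.

```python
# Minimal, illustrative RLHF step on GPT-2 with Hugging Face's trl library.
# Not the repo Erin describes; the reward below is a stand-in for a real
# human preference score collected from annotators.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen copy
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One prompt, one generated response, one (pretend) preference score.
query = tokenizer("What is an opossum?", return_tensors="pt").input_ids[0]
generation = model.generate(query.unsqueeze(0), max_new_tokens=20,
                            pad_token_id=tokenizer.eos_token_id)
response = generation[0, query.shape[0]:]

reward = torch.tensor(1.0)  # pretend a human ranked this answer highly
ppo_trainer.step([query], [response], [reward])
print("one PPO step done")
```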

That’s awesome. So you mentioned it being run in a Google Colab notebook, which I think is awesome… And using also a bit of a smaller model to start with, and we’ve seen a lot of directions towards smaller open models that are accessible to data scientists, with Llama, and other things that… How do you see that trajectory going? And how will that impact day to day practitioners in terms of what they’re able to do with this sort of technology?

I think the biggest thing – I’m actually going to zoom out to answer this… The biggest thing we need to think about is context. What are you using a model to solve, or AI to solve, or ML to solve? And the more that I’ve been diving into these conferences and the ecosystem - especially at a blended conference, where you have folks that are not necessarily deep in the field, or not an ML practitioner, or are new to ML - it is so easy to lose that context. And there’s a meme I always point to, that it’s like “Oh, we’re an AI-backed so and so”, and it’s like “JK! We’re not really AI - we’re just calling the API and putting a nice, pretty, shiny frontend on it”, which is no shade to anybody who is putting a frontend on a GPT API. There is no shade at all to that. But it’s like - think about what you need a model for in the first place, or what you want to use machine learning for; that context is so important. I’m currently playing around with a Naked and Afraid dataset, just to play around. There’s an open source dataset out there that is a…

That’s awesome. Like videos, or…?

No, it’s context from the TV show of how many days they survived.

Oh, so literally, statistics and features of the different survival situations?

Yeah, it’s like country, their name, gender, and then how many days they made it.

And climate.

Yeah. Based on that – yeah, that’s so intriguing. I watch a lot – so, a confession… I also watch Alone, which is another survival show… I’m a huge fan.

Oh, yes. I’m a terrible reality TV junkie. That is how I de-stress - reality TV.

So I always wonder… I have this conversation with my wife, where I’m like “Could I do it?” And maybe with a model trained off of your survival dataset, I could say “I’m from here, and this is my background. Could I survive?” I don’t know…

Yeah. And I can’t take credit for the original dataset. It is someone who I’ve made friends with in my reality TV Subreddit. So if you need to know where I spend my time… [laughter]

[09:59] That’s awesome.

But he runs a SQL database. It is actually very good – he’s very awesome about updating it; it’s available on Reddit. I can share it with you, and you can post it in the links. But I’m just playing around with the dataset for fun. But in this context, I’m playing around, building demos, and just having some fun, teaching myself some new skills. I don’t need a large foundational model for that. And I think going back to your original question of “Well, all these models are getting smaller, and more accessible, and we can run it in a notebook…” We don’t need the high-powered compute and models every single time. And if we stop and think about the context of the problem that we’re trying to solve, it can give us a lot of answers, and it can save us time, energy and computing power. That’s why I get really excited about being on the data labeling side. Again, I have a background in humanities, I’m a self-taught programmer, but I think – I don’t wanna be like “We need more people like me in data science”, but we need more of the humanities in data science, because we’re missing the context.
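As a concrete (and entirely hypothetical) illustration of that point: for a small tabular dataset like the one Erin describes - country, gender, climate, days lasted - a few lines of scikit-learn are plenty. The column names and rows below are invented for the sketch, not the actual Reddit dataset.

```python
# Illustrative sketch: a tiny tabular model instead of a foundation model.
# Column names and rows are made up for the example, not the real dataset.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "country": ["Brazil", "Botswana", "Panama", "Belize", "Namibia", "Colombia"],
    "gender":  ["F", "M", "F", "M", "F", "M"],
    "climate": ["jungle", "savanna", "jungle", "jungle", "desert", "jungle"],
    "days_lasted": [21, 14, 21, 10, 7, 21],
})

# One-hot encode the categorical features and fit a small regressor.
X = pd.get_dummies(df[["country", "gender", "climate"]])
y = df["days_lasted"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict for one of the encoded rows; a real pipeline would need to handle
# unseen categories, but this keeps the sketch short.
print(model.predict(X.iloc[[0]]))
```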

Yeah, we recently had a guest on the show that was talking about the intersection of art history and computer science, and how computer scientists who are analyzing and doing computer vision could actually learn a lot from what we know about art, and how scenes are composed, or how art has changed over time, and how the features that they’re actually engineering are connected to some of those things… So yeah, I think that there’s a lot of different areas where this could apply, and domain experts are so important. And I assume that with all of this reinforcement from human – reinforcement learning from human feedback; I always mess it up…

It’s okay. I’ve been doing the same thing. It’s like R, L, H, F. I get it.

Yeah. Especially since you’re from the Label Studio side, could you give a general picture or workflow for people, like “Hey, I maybe want to take one of these models - GPT-2, Llama, MPT now, whichever one it is - but I also want to gather some domain expert feedback, and eventually get to some type of instruction-tuned or fine-tuned model off of that.” Could you just give a general picture of what that looks like in today’s world?

Yeah. And we’ll try – and I feel like this is better when you have a whiteboard and a diagram and some arrows…

Oh, for sure. Yes, it’s hard.

I’ll do a quick walkthrough. So first off, you’ll create a sort of prompt. So typically, these models work with a prompt, and then you’re given a large language model. And then you start to train it. Usually, what happens when you’re training these models is you get a set of two outputs. In this case we can use “What is an opossum?”, because we’re opossum fans at Label Studio. I feel like that’s natural. And you can be like “An opossum is a marsupial creature” or “An opossum is a great character for memes.” Technically, both of those are correct. But depending on context - and this is where that human signal side comes in - one answer is more correct than the other.

So if we were training, let’s say, an opossum meme bot thing, or a meme bot generator - let’s go that direction, we’ll have some fun with it - we would take the latter answer, “an opossum is a great animal to make memes”, and that would be the better answer. If we were going for “What type of animal is this?” in, say, a biology homework assignment, we’d probably pick the marsupial one. But this just gives insight into how the details that you give your annotation team can really directly influence the model. That’s the labeling side.

When we move this on, all of this is put through as the results from human feedback; your answers are ranked. I did a binary situation, so just two options, but you can have a multitude of options that you put in. It is all weighted. It is then looped back around - this is when you wish that we had a whiteboard - to a reward or a preference model. And this reward or preference model tells you “Hey, I probably want to go for answers that look like this.” Now, computers don’t speak memes, or marsupial, or biology textbooks, but they do know patterns and trends, which is what they pick up on. So based on the context clues that we give them, this preference model will start to prefer those types of answers.
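To make that ranking-to-reward-model step concrete, here’s a minimal sketch (mine, not from the talk) of the standard pairwise preference loss used in RLHF: the reward model learns to score the annotator’s chosen answer above the rejected one.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for binary comparisons: the loss shrinks as the
    reward model scores the human-preferred answer further above the other."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy scores for the two opossum answers in a meme-bot context.
chosen = torch.tensor([1.3])    # "a great character for memes"
rejected = torch.tensor([0.2])  # "a marsupial creature"
print(preference_loss(chosen, rejected))  # small loss, since the ranking is respected
```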

[13:59] Now, it’s really important that these reward or preference models also hold in place the original things that we had, that the model knows - like how language is structured, or other things from our original model that we enjoyed and liked… Like “Language is always structured like this. Here’s a proper noun. We like to capitalize the first letter of sentences.” Things that are important, that we kind of overthink sometimes, when talking about generative language models, at least.

After that, we want to make sure that we’re not just gaming a system. Models are – again, I don’t think models are sentient. They’re kind of just math and numbers; they’re just trying to game a system. They’re playing – I always compare it to – it’s Moneyball, essentially. Baseball fan here… So it’s Moneyball; you’re statsing out the system. And so that they’re not just giving you what you want to hear every time, you’ll have to calculate an error rule in there. So put in an error metric or an update rule, and it basically says, “Alright, we’re gonna almost like dunk you down a little bit, so you’re not too perfect.” And that’ll prevent unwanted model drift.

Then once you’ve done that a few times, we’ll combine that with a copy of your original model that you had. Again, you’re kind of doing that checks and balances, making sure it doesn’t run away. After that, you will have a tuned language model. And then rinse, wash, repeat until you’ve got that model right where you want it, send it off to production, and then talk to your friends in the other parts of your MLOps ecosystem, and it’ll come in handy.
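That “dunk you down a little bit” step - the error metric or update rule, plus the copy of the original model - is often implemented as a KL-style penalty that docks the reward when the tuned model drifts too far from the frozen original. A minimal sketch, with made-up numbers:

```python
import torch

def penalized_reward(reward: torch.Tensor,
                     logprobs_tuned: torch.Tensor,
                     logprobs_original: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Subtract a divergence penalty so the tuned model stays anchored to the
    frozen copy of the original model (the 'checks and balances' step)."""
    kl_estimate = (logprobs_tuned - logprobs_original).sum()
    return reward - kl_coef * kl_estimate

# Toy numbers: a high raw reward gets trimmed a bit if the model has drifted.
print(penalized_reward(torch.tensor(2.0),
                       torch.tensor([-1.0, -0.5]),   # tuned model's token log-probs
                       torch.tensor([-1.2, -1.1])))  # original model's token log-probs
```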

Yes. Awesome. Yeah. And I hope that we can link some of your slides from that talk in our show notes…

They’re awesome… Including emojis, and the full deal, which helps. So make sure and check out the show notes. The link to the slides will be in there, so you can take a look at these figures while you’re listening to the show.

One follow-up question on this, talking about gathering this feedback data… People can think about it like “Okay, in the context of my company, or where I’m working, I’m gonna gather some of this data, tune a model…” But what is your perspective on the open data ecosystem, and what would you encourage people to think about in terms of data that they could make openly available to help others who are also trying to do this? Or the other way around - people that are searching for maybe a place to start. What does the open data ecosystem look like right now, and how important is that as this field advances?

Yeah. First off, you’ve got me on my other favorite soapbox of the moment, and this goes back to my days when I was a journalism student, working in journalism… But open data is one of my favorite topics to geek out on. Basically, it was something that really came about as part of the Obama administration; the administration established federal funding for a lot of our public and civic data as a part of government accountability and transparency. So there were actual federal grants that went out to make a lot of our civic data public.

So there’s a really cool example - I believe it’s the city of Philadelphia that actually built a Sim City-like game off of their public dataset. It’s so cool. It came out of a grant. Super-fascinating. I’ll link it to you, and it’ll be in the plethora of show notes on all of that. But open data is just open, freely accessible, freely usable data that is made available to the public. I love open data, I’m a participant in Open Data Week… But it’s been federally funded, and it’s not always the best thing to be federally funded. We all know how government grants go… And if you aren’t aware how government grants go, they’re very niche-specific, and they run out, and they’re not always maintained, and it’s not always the cool, sexy job that we have. So these datasets aren’t always the best maintained.

What a lot of these early machine learning models were built on - and what a lot of machine learning models still use - is these open datasets, and they’ve given opportunities for people like myself to even learn how to do data science. I learned Python in Open Data Week. I remember going back and being like “Let’s get the traffic data in New York City.” And it’s the basics - using curl, and getting things started first, like “Can you query an API?” They’re not the most organized datasets out there, they’re not the cleanest… Sometimes you get some really messy garbage data… The 2020 census is actually a great example.

[18:06] I was speaking to someone yesterday at the conference about this - the 2020 census was the first time that we were able to do it digitally. Well, she gave the example of like “Hey, I started the census on my phone. Oh, no, the pot boiled over. Oops, I accidentally counted myself twice in the census.” Or “I didn’t fill out my address.” Or “Now I’ve got two people, or a person who lives at this address, or a typo…” Crap. Now that’s a very messy dataset. So open data can be a problem.
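To make the “can you query an API?” starting point from a couple of paragraphs back concrete, here’s a minimal sketch of pulling a few rows from an NYC Open Data (Socrata-style) endpoint in Python. The dataset ID is a placeholder, not a real resource.

```python
# Minimal sketch of querying a Socrata-style open data API, like NYC Open Data.
# The dataset ID below is a placeholder; find real ones at data.cityofnewyork.us.
import requests

DATASET_ID = "xxxx-xxxx"  # placeholder resource ID
url = f"https://data.cityofnewyork.us/resource/{DATASET_ID}.json"

response = requests.get(url, params={"$limit": 5}, timeout=30)
response.raise_for_status()

for row in response.json():
    print(row)
```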

Let’s go to the practical application of this. If you are working in open data, or you are interested in getting more involved in open data, one of my favorite practices is: if you’re publishing a story, making tutorials, making content, put your data out there, and put out how you processed it. It’s one thing to put your data out there, but also share how you processed it. In journalism, you have this phrase, “How you frame the story is how you tell the story.” Leaving out details, context, or even how you came across the source can influence how the story comes across.

Yeah, for sure.

And we see that especially evident in data-driven journalism and solutions journalism, which is interesting. And it also can be really damaging to trust and reputation. But I think ML runs the same risk right now, if we’re not transparent about “Here’s how I prepared the dataset. Here’s how I trained an annotator”, or “Here’s the tools that I used”, or “Here’s how I obtained the data in the first place.”

Yeah. And like you were saying, certain things, like how you give instructions to a data annotator, or how you set up your prompt - that has such an influence on the downstream performance of these things, but it’s very frequently… I’ve definitely found the instructions you give data annotators are something very often left out of the story of how people tell what they did. It’s like “Oh, we gathered this data with these labels.” Okay, well, I can imagine my own set of instructions for getting those labels, but it could result in a totally different thing happening, with all sorts of biases and other things that go into that.

I mean, I have a perfect case example of this. In January, many of the team members at Heartex and Label Studio met up. Basically, we got our entire customer success and sales team, and the community side of things, and a bunch of our support engineers to all sit together, and we had a data labeling competition for fun at the end. And I had just finished “How to get started with data labeling” and best practices, and I was like “Easy. I’m going to kick all of your butts.” I was totally going in like hot shit, and everything, and thinking… Well, I sped through. I was like “Whatever. Next. Great. Done.” I sped through, because speed was a metric, but also accuracy. Well, I sped through this thing, because I was like “Whatever. I’m going to ace this. I know the keyboard shortcuts. My systems are set up.”

Yeah. [laughs]

I had the lowest accuracy score of everybody.

Oh, man…

My data was all– I was like “You failed, Erin.” I was like, “Man, I’m gonna go embarrass myself right now, after all that crap I just talked…”

Yeah, yeah. I think that’s the other thing… I don’t know if you have an encouragement here, but data scientists out there who have not actively participated in the data labeling process… Yeah, that’s such a learning experience, because it gives you perspective. Even if in the future you’re not part of one of those processes, it gives you good questions to ask. If someone gives you this data set that was labeled, you should probably ask a few follow-up questions about “How did that go? What did you do there?”

Well, in academic research you actually have to disclose things like “Did you pay your annotators? Or how did you prepare the annotators when you were doing research?” Because that can put so much of a bias on a model that is built off of that data. And academically, you can’t get peer-reviewed studies done without disclosing that information. It’s part of data ethics now. And one of the biggest things - and we don’t talk about it enough - is how do you pay your annotators? Or do you outsource your annotators? Which isn’t saying that’s a bad thing to do, but again, we have to remember that so many of these models… And I think a lot of times it actually is probably – I’m gonna guess here, I don’t know, but I’d be even wondering if the smaller models that are generated, because they’re generated at home, or people dorking around on their computer… They might even have more bias, because we’re not training an annotator. I know when I’m goofing around with my Naked and Afraid dataset, I’m not annotating – I’m playing some 30-second goofing around stuff, and watching YouTube videos, just seeing what’s out there. I’m not doing the work… Which is a problem.

[22:32] Yeah. I guess bringing things full-circle a little bit… We started talking about some of these players, and MLOps, and sort of the ops around this process… We talked about human feedback, reinforcement learning, we talked about open data… What excites you about the trends that we’re seeing and what impact they could have on our industry moving forward? Maybe that’s related to people that weren’t able to participate in this process before, the tooling is better and so they can… Or maybe it’s something totally different, around tasks, or other things that you see in the future. What are you personally excited about looking forward as you bring this stuff together?

First off, I’ve been really impressed with what the Hugging Face team is doing. I noticed the Hugging Face shirt… The Hugging Face Spaces have been amazing. We do have a Label Studio Hugging Face space, but the ability to get up and going in the browser has been super-awesome. There was a talk I went to at PyData Berlin that was running Streamlit, so they’re running entire Python-based models right in the browser, and tools… I think there’s – it’s Binder, I believe – another tool that’s doing, again, something very similar to notebook processing; all in the browser, it makes it more accessible than ever before… And it’s just really exciting, especially as – I love that we have more people interested in this industry, but it’s also not only the interest, but the tools to do it correctly and ethically… And again, jumping on my soapbox here - this is why the open data is so important. So when we can, putting our sources, our references, building in the public, building in open source, and making almost a – I don’t wanna say paper trail, but a show-your-work sort of process is really important for the future.

Yeah. Awesome. That’s great. And as we close out here, where can people find you online? And also, tell us a little bit about your own podcast, which sounds awesome, and includes pickles.

Yeah. So I am available online at Erin Mikail on all the platforms, or erin.bio has a link to everything that I’m at. You can also chase me down at Label Studio; so it’s Label Studio, but the last .io is like [unintelligible 00:24:41.23] Join the community, come hang out with me there. We have an upcoming town hall, and we’re getting into more workshops… So I’m very excited about that.

I also run the Dev Relish Podcast. It’s everything about Dev Rel and -ish. Also, you know, naturally, some people made sourdough bread; I got into fermentation. We’ve got a fun pickle fact, and cool pickle logos, because you’ve got to relish the developer moments in open source.

Yeah. Well, this was definitely not a sour experience. I’ve relished it very much. Thank you so much for joining, Erin. It’s been a great pleasure to talk to you, and looking forward to following up with all the cool community stuff you’ve got going on. Again, people, check out the show notes, and thank you so much.

Thank you so much. This was quite a big dill that we had going on here… [laughs]

Good one, good one.

Our transcripts are open source on GitHub. Improvements are welcome. 💚
