Practical AI – Episode #290

Towards high-quality (maybe synthetic) datasets

with Ben Burtenshaw & David Berenstein from Hugging Face

As Argilla puts it: “Data quality is what makes or breaks AI.” But what exactly does this mean, and how can AI teams properly collaborate with domain experts to improve data quality? David Berenstein and Ben Burtenshaw, who are building Argilla and Distilabel at Hugging Face, join us to dig into these topics, along with synthetic data generation and AI-generated labeling and feedback.

Sponsors

Fly.io: The home of Changelog.com. Deploy your apps close to your users with global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.

WorkOS: A platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their application. Add Single Sign-On (Okta, Azure, Google, Microsoft OAuth), sync users from any SCIM directory, HRIS integration, audit trails (SIEM), free magic link sign-in. WorkOS is designed for developers and offers a single, elegant interface that abstracts dozens of enterprise integrations. Learn more and get started at WorkOS.com

Eight Sleep: Take your sleep and recovery to the next level. Go to eightsleep.com/PRACTICALAI and use the code PRACTICALAI to get $350 off your very own Pod 4 Ultra. You can try it for free for 30 days, but we’re confident you will not want to return it. Once you experience AI-optimized sleep, you’ll wonder how you ever slept without it. Currently shipping to: United States, Canada, United Kingdom, Europe, and Australia.

Chapters

No. | Start | Title | Duration
1 | 00:00 | Welcome to Practical AI | 00:44
2 | 00:44 | Sponsor: Fly | 03:06
3 | 03:56 | What does data collaboration mean? | 03:22
4 | 07:18 | Understanding your data | 02:40
5 | 09:58 | How to start curating data | 03:14
6 | 13:12 | Practical steps to scale | 03:30
7 | 16:52 | Sponsor: WorkOS | 03:21
8 | 20:23 | Traditional & new use cases | 04:28
9 | 24:51 | Virtues of smaller models | 02:13
10 | 27:04 | What Argilla looks like | 03:52
11 | 30:55 | User backgrounds | 03:26
12 | 34:21 | The non-technical POV | 03:50
13 | 38:23 | Sponsor: Eight Sleep | 02:31
14 | 41:09 | AI feedback | 03:41
15 | 44:50 | Hallucination issues | 01:20
16 | 46:10 | What is Distilabel | 03:58
17 | 50:08 | Usage & adoption | 02:47
18 | 52:55 | Where things are going | 02:39
19 | 55:34 | This is muy bueno | 00:42
20 | 56:15 | Outro | 00:46
