Practical AI – Episode #290
Towards high-quality (maybe synthetic) datasets
with Ben Burtenshaw & David Berenstein from Hugging Face
As Argilla puts it: “Data quality is what makes or breaks AI.” But what exactly does this mean, and how can AI teams properly collaborate with domain experts to improve data quality? David Berenstein & Ben Burtenshaw, who are building Argilla & Distilabel at Hugging Face, join us to dig into these topics, along with synthetic data generation & AI-generated labeling / feedback.
Sponsors
Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.
WorkOS – A platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their application. Add Single Sign-On (Okta, Azure, Google, Microsoft OAuth), sync users from any SCIM directory, HRIS integration, audit trails (SIEM), free magic link sign-in. WorkOS is designed for developers and offers a single, elegant interface that abstracts dozens of enterprise integrations. Learn more and get started at WorkOS.com
Eight Sleep – Take your sleep and recovery to the next level. Go to eightsleep.com/PRACTICALAI and use the code PRACTICALAI to get $350 off your very own Pod 4 Ultra. You can try it for free for 30 days - but we’re confident you will not want to return it. Once you experience AI-optimized sleep, you’ll wonder how you ever slept without it. Currently shipping to: United States, Canada, United Kingdom, Europe, and Australia.
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome to Practical AI | 00:44 |
2 | 00:44 | Sponsor: Fly | 03:06 |
3 | 03:56 | What does data collaboration mean? | 03:22 |
4 | 07:18 | Understanding your data | 02:40 |
5 | 09:58 | How to start curating data | 03:14 |
6 | 13:12 | Practical steps to scale | 03:30 |
7 | 16:52 | Sponsor: WorkOS | 03:21 |
8 | 20:23 | Traditional & new use cases | 04:28 |
9 | 24:51 | Virtues of smaller models | 02:13 |
10 | 27:04 | What Argilla looks like | 03:52 |
11 | 30:55 | User backgrounds | 03:26 |
12 | 34:21 | The non-technical POV | 03:50 |
13 | 38:23 | Sponsor: Eight Sleep | 02:31 |
14 | 41:09 | AI feedback | 03:41 |
15 | 44:50 | Hallucination issues | 01:20 |
16 | 46:10 | What is Distilabel | 03:58 |
17 | 50:08 | Usage & adoption | 02:47 |
18 | 52:55 | Where things are going | 02:39 |
19 | 55:34 | This is muy bueno | 00:42 |
20 | 56:15 | Outro | 00:46 |