Practical AI – Episode #117
Getting in the Flow with Snorkel AI
featuring Braden Hancock
Braden Hancock joins Chris to discuss Snorkel Flow and the Snorkel open source project. With Flow, users programmatically label, build, and augment training data to drive a radically faster, more flexible, and higher quality end-to-end AI development and deployment process.
Notes & Links
- Snorkel AI
- Snorkel OSS
- Snorkel Blog
- Snorkel AI | Twitter
- Snorkel AI | LinkedIn
- Snorkel Best of VLDB paper
- Snorkel Drybell collaboration with Google
Welcome to another episode of the Practical AI podcast. My name is Chris Benson, I’m a principal emerging technology strategist with Lockheed Martin. Unfortunately, Daniel was not able to join us today, but I have a guest that I’m excited to talk to today. I have Braden Hancock, who is the co-founder and head of technology at Snorkel AI. Welcome to the show, Braden! How’s it going today?
Thanks, glad to be here. Doin’ well.
Well, I was wondering if you would start off telling us a bit about your own background, and let us understand how you got to where you’re at, and then I’m looking forward to asking you more about Snorkel AI.
Yeah, absolutely. As you mentioned, I’m currently a co-founder and head of technology at Snorkel AI. The company has been around for about a year and a half now, maybe coming on two, and it’s been a blast.
Before that, I was a Stanford Ph.D. student, along with all the rest of my co-founders; that’s actually the origin story of our company. That’s what brought me to the Bay Area in the first place. I was actually not in computer science originally; I came from mechanical engineering, and just found myself consistently being drawn to machine learning, and then finally, I saw the writing on the wall and made the jump myself, going into grad school.
I’m just curious, because we hear that kind of story a lot, where people are coming in from an industry that you may or may not expect that to happen… When you were still doing mechanical engineering, what was the draw into machine learning for you at that point? I’m just curious what it was that started that process of sliding over?
Yeah, I completely agree with you; I think we see a lot of people make that shift, and it’s perhaps not too surprising, just given how rapidly the field is growing; the people have to come from somewhere. But for me, I think part of it is just how much faster the science is in computer science; the fact that you can iterate so much more quickly. An experiment can be run in seconds, or minutes, certainly set up and run in a day, often, for certain experiments… Compared to when you’ve got a mechanical rig and there are parts, and – you know, one bad transistor and the whole thing is kind of suspect… There’s just so many more failure modes and such longer timeframes that I kept coming back to – I want to be able to answer questions quickly, and I can do that so much faster in CS.
[04:15] And do you think that’s gonna be a situation that we see over and over again with people in various industries pulling in – we’ve seen a certain amount of that; I tease Daniel a lot… We ended up talking to people that come from physics a lot, and one way or another they found their way over. Do you think that’s gonna be very typical, with mechanical engineers constantly finding the need to use machine learning to get their jobs done, whether or not they jump over to the dark side or not?
I’m sure we’ll continue to see plenty of people jump in from over there, for different reasons probably. I think for a lot of people you end up finding that the best way to do your job is to use machine learning, and then you realize “Hey, this is actually a really cool tool. I think I’d like it to be more than just a tool for me.” And then you lean in and start diving in in a more permanent way, rather than just in sort of an applied sense.
Gotcha. I’m just curious - what was your first experience as you started getting into machine learning, before you made the full jump? What was that thing that was drawing you in? What kind of models were you doing? What was your tooling that made you think it might be time to make a shift?
You know, this maybe just shows how thick I am; it actually goes all the way back to high school that I had my first dabbling in machine learning, and loved it, and didn’t realize then that I should have just embraced it wholeheartedly from the get-go. But there was a lucky break for me - there was an internship program for high school students near Wright-Patterson Air Force Base in Ohio where I grew up… So I was on a project, using MATLAB of course, the lingua franca of mechanical engineers; not the Python of machine learning engineers, but… Yeah, so the task that I was assigned to was using genetic algorithms to design better airfoils, so some non-gradient-based optimization. And I thought it was so cool that even after I lost my MATLAB during the school year, I had to go back to high school - this was after my junior year; I used Excel, and I had a separate tab for each generation of the genetic algorithm, and I tried to recreate it there, because I was still, of course, a lousy programmer, but just thought the ideas were so neat.
Very cool. So is it genetic algorithms, or what actually pulled you in? And going from that, and thinking all the way to now as a co-founder at Snorkel AI, what was the cross-over right there that got you to Snorkel AI?
I’d say from the very beginning one of the ideas that drew me in was there should be this different interface for getting things done, for transferring information from an expert into a program that can now do work for you. I think historically, that’s very imperative code. Very much like describe exactly what you want done, step by step. And that was less interesting to me. It felt a little bit more like “Your job’s just to translate from one language to another.” But the cool thing about machine learning or AI in general I think is that you get more a sense, in the right setup, of “If you can tell me what’s good, then I can find it.” There’s better synergy between the human and the computer, where now I can show you what I want, even if I don’t know how to get there, and you can get there, where “you” here is the computer, of course.
So I think that’s the broader idea that was really appealing to me all along the way, that had me coming to machine learning. Then throughout my Ph.D. I kind of dove into that problem much more deeply, of “What really is the best interface for getting domain knowledge from an expert into a model?” Those are themes that I explored for multiple years, along with the co-founders I ended up working with, and they led us to Snorkel, and then Snorkel AI now.
Just to dive in there a little bit - was there a particular itch that you were scratching in that context, that actually led to Snorkel AI? Was there something you can relate, where it’s like “Well, guys, we’ve gotta solve this particular issue. This is something that we need to dive into”, that might have been the specific genesis?
[07:54] Yeah, so one thing that my Ph.D. advisor was fantastic about - it was Chris Ré at Stanford - he was very good at making sure that the problems you’re solving actually will matter to people; they actually solve real problems. And part of the way you do that is by – you know, on most papers we would try and have real-world collaborators, work with another company or research organization, or government entity, or something where we can make sure that this actually solves your problem, so people are more likely to care; this is likely going to stick and have a potential to make real impact.
So very early on in my degree we were looking at what is the effective bottleneck for new machine learning applications; what is it that stops people from solving their problems as quickly as they’d like to? The realization came that that bottleneck is almost always the training data. We saw the writing on the wall - deep learning was blossoming right about then, we saw these super-powerful models, feature engineering’s becoming a lot less necessary; a lot of that can be learned now. But with the one caveat of “You can do all this if you have just mountains of perfectly-labeled, clean training data, ready to go for your specific task”, and that in reality never exists, of course. So that, I’d say, was the real impetus for this line of work - this is what stops people; in academia, it’s “Download the dataset and then do something cool with it.” But in industry, it’s–
Get the data.
I mean, steps 1 through 9 is “Where am I gonna get my data? Do I have enough of it? Is it clean enough? These annotators are doing the exact wrong thing. I can clarify the instructions… Is this good now?” It’s iterating, and 80% of the work is making that training set. After that, pulling off some state-of-the-art model in the open source and running that - that’s the easy part.
Yeah, it’s funny… You would think of AI as – I think people outside our industry look at this and think we’re doing this dark magic of AI, and producing the model, but every time we talk to somebody, it’s always trying to get set up to do that. It’s getting to the starting line of doing the actual modeling itself that people are struggling with.
So tell us a bit about Snorkel AI. How did that blossom out of this experience that as a co-founder you were having, as well as what the others were driven to do as well? Can you tell us a little bit about your co-founders, and just how the whole thing got started?
Yeah. We feel very lucky at Snorkel AI to have the founding team that we do. It’s a little bit larger than you typically have; there are five of us. It’s Chris Ré, who I mentioned was my Ph.D. advisor, myself, and then three other previous students. We were all sort of in the same cohort. Alex Ratner, Paroma Varma, and Henry Ehrenberg… All of us began our grad school experience at about the same time, picking up different projects… And all of us were just drawn to these ideas. We ended up collaborating in almost every combination you can think of between the four of us on different papers through those years.
In the beginning it was like “This is an interesting idea. Let’s run a quick experiment, pull up a Jupyter Notebook, test some of these ideas.” Then it really seemed to work, so then it became a workshop paper, and then a full paper, and eventually a best-of paper, and an open source project, and then an open source ecosystem, and other derivative projects, and lots of collaborations…
We helped a few different organizations make industry-scale versions of this internally to really prove out the concept. A paper with Google, for example, that we were able to publish. And by the time that we were at the end of our degrees it was clear that there was just such a dramatic pull for this. The ideas were very well validated at that point, over probably 35 different peer-reviewed publications, but maybe more importantly, a whole bunch of different organizations that independently had seen success with these approaches, almost always from working with us [unintelligible 00:11:32.27] through the process.
So we just learned so much through that time about what you would really need to take this proof of concept and make it something that could be repeatable and with a relatively low barrier to entry, that doesn’t require a room full of Stanford Ph.Ds to make it successful. And that’s part of what motivated the company: the chance to now make this fully supported, enterprise-ready, and able to be shared with a whole bunch of different industries and company sizes, and in different work areas.
[12:01] Before you dive into the specifics of the product and service offerings, could you talk a little bit about what you did learn? Because with that opportunity to be doing the academic work and to progress through that over time, and have that insight before you ever actually start the new company. Can you talk about what that learning process was like and what were some of the things that had a big impact specifically, conceptually? And then from there, I’d like to go on into how that was realized in the company itself.
Yeah, absolutely. And I completely agree. I think it was a huge, huge advantage for us to have that – I mean, a really much larger period of time than you would ever get as a startup to do the learning phase. We were able to succeed, and fail, and try different variations, and really push the boundaries and intentionally try to find “Where does this fail?” Because as an academic, that’s the hat you wear. It’s like, “Let’s really suss out, let’s do every ablation we can think of. Let’s figure out, does this work for text? Does this work for video? Does it work for very dependent and correlated data?” The whole variety of the space that you can imagine, we were able to test. So it meant that by the time that we were building out the “final version”, the enterprise version, we were able to bring all these different learnings to bear as part of that design.
So if I was trying to structure categorically the lessons that we learned, I think one of the big ones was interfaces. As a grad student-supported open source project, you don’t have a lot of time to polish up the frontend for people, so it’s in the form of a Python package. And if there are unit tests, you’re lucky. Of course, we cared about that, but there’s not necessarily an incentive. Unit tests don’t lead to papers. It just means you have a more stable development as you work.
So it was fine, and it worked well, and we of course did support it as much as we could, but one thing we did realize is – you know, we were writing a lot of the same code over and over again. There were certain templates for labeling functions. I think we’ll talk more about those later, but third-party integrations, or patterns of sequences of steps that people would try, that would get lost between the forest of scripts and notebooks… Whereas if you can set up a properly-structured interface and GUI, as well as other access points, you can really dramatically improve the likelihood of success. So that’s one category, the interfaces, I’d say.
Okay. What else did you learn along the way? Was interfaces the primary driver there, or were there any other key lessons there?
Yeah, so interfaces was a big one. If I was grouping it into other areas, I’d say there’s also infrastructure, there were intuitions that we gained, and baking those in, and the user profiles, or interaction points. I can say a word about each of those…
On the infrastructure side, I think that one’s fairly self-explanatory. As a company, if you’re going to depend on a piece of software, you need it to have certain things. Basic security, and logging, encryption, and compatibility with the data formats that you care about, and dependency management, parallelization, all these things that of course, of course you want in your software you’re gonna depend on, but that again, just aren’t necessarily a part of research code. That’s meant to be more of a proof of concept.
Sure. Making it real comes down to really kind of classical software development things that you need in place to deploy from software… And I think that comes back to a point that we run into a lot on the show, and that is the fact that you can’t really separate the AI from the software the AI is running in. It sounds like y’all had a realization about that even before you got the organization launched.
Absolutely. I’d say another big piece of this is – again, as an academic you often test these ablations, you’ll test a very specific problem, and “Can the model learn what I need it to?” But in the wild, you often have actually just a problem you need to solve, and you don’t necessarily care how that’s solved; you just want a high-quality system. And so you don’t just have this one model that’s ready to go with the data that you care about, that has an output that is exactly what you care about. It’s a pipeline - you’ve got pre-processing steps, you’ve got business logic, you’re chaining together multiple models, or multiple operators. Some are heuristic and some are machine learning-based.
[16:08] So this actually gets at one of the big differences, I’d say, in terms of fundamental value out of the Snorkel open source, versus Snorkel Flow, the business product. The latter is much more focused on building AI applications. An application that solves your problem from end-to-end, rather than just a point solution for a part of the pipeline that is making a training set or training a single model.
Braden, just a moment ago you were talking about Snorkel open source and Snorkel Flow. Could you now define what each of those are and describe what the differences in the two are?
Yeah, absolutely. So if you go to Snorkel.org, that’s the website for the open source project that, again, began almost four years ago at Stanford, and served as sort of our testing ground and proof of concept area for a lot of these ideas around “Can we basically change the interface to machine learning to be around programmatically creating and managing and iterating on training sets?” So that’s what that is. It’s pip-installable, you can pull it down now; it’s got 4,000-something stars, and is used in a bunch of different projects.
Snorkel Flow is the offering – it’s the primary product of Snorkel AI. It’s based on and powered by that Snorkel open source technology, but then it just sort of expands to much more. It is now a platform, not a library; it comes with some of those infrastructure improvements that I mentioned before. It also bakes in a whole lot of the intuitions that we gained from the years of using the open source. There are certain ways that you can guide the process in a systematic way to creating these programmatic training sets or improving them systematically, really completing the loop, so that at every stage of the way you have some sort of hint at “What should I focus on next to improve the quality of my model, or of my application?”
So that platform, Snorkel Flow, is meant to be this much broader solution for supporting the end-to-end pipelines, not just the data labeling part, baking in a bunch of these best practices, tips and tricks that we learned over the years, of essentially writing the textbook on this new interface to machine learning. And it includes also some of those interfaces, like an integrated notebook environment for when you do want to do very low-level custom, one-off stuff… But also some much higher-level interfaces, like those templates I mentioned for labeling functions.
There are a number of ways where it can be a truly no-code or very low code environment for subject matter experts who don’t necessarily know how to whip out the Python and solve a problem, but do have a lot of knowledge that’s relevant to solving a problem.
Gotcha. Actually, to dive a little bit deeper into both sides of that, let’s start with the open source and build on that. What would be a typical use case where somebody would go to Snorkel.org and do the pip install, read the docs, and what are you offering with that and through those libraries, what’s available… And then in a minute, I’ll obviously ask you the other side, about taking it to that next level. But if you could kind of give us a sense of what the open source side experience is like, what the benefit of the libraries are, that’d be fantastic.
[19:39] Yeah. So if you go actually to Snorkel.org there’s a section that is tutorials, and we walk through a number of different, fairly simple, but meant to be instructive tutorials for different ways you could use the library. Often, one of the most intuitive places to start with that is on text-based problems. There also are a couple of demonstrations there for how to apply it to images, and then we’ve got research papers as well I can point people to for working with time series, or video, or things like that.
One very simple example, one that we actually rely on in our primary tutorial just because it’s very interpretable and almost everyone has the domain expertise necessary for it is training a document classifier. In this case, we could say the document will be emails, and you wanna classify these as spam or not spam.
One way you could do this in a traditional machine learning setting is get a whole bunch of emails that are sort of raw and unlabeled, look at them one by one and label them as “This one’s spam, this one’s not spam, that one’s spam”, and eventually you’ll have thousands, or tens of thousands, or hundreds of thousands of emails that you need, to train some very powerful deep learning model to do a great job.
But when you do this process, if you’d ever tried to label a dataset, you do find that very quickly there start to be certain things that you rely on to be efficient, or that are basically the science to you for why you should label things a certain way. An easy example here might be lots of spam emails try and sell you prescription drugs. So you may see the word “Vicodin” in an email, and that’s pretty clear to you this is not a valid business email, this is spam, and you can mark it as such. And you might eventually label over 100 emails that have the word Vicodin, and all of them are spam, for approximately that same reason, among other things. There’s other content in the email, but that’s what tipped you off.
So if you could instead just one time say “And if you see the word ‘Vicodin’ in the email, good chance that this is more likely to be spam, rather than (we’ll call it) ham, or not-spam.”
You could write that, apply that to hundreds of thousands of unlabeled data points, and get in one fell swoop hundreds of labeled examples. And those labels may not be perfect; there may actually be a couple examples in there, some small portion where it actually was valid; someone was asking “Did you see where my Vicodin was put?” I’m not sure. I won’t guess.
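As a rough illustration of the idea Braden describes here, a labeling function can be sketched in a few lines of plain Python. This is not the Snorkel API; the names `lf_mentions_vicodin`, `SPAM`, `HAM`, and `ABSTAIN` and the sample emails are assumptions made up for this sketch.

```python
# Hypothetical label values for this sketch (not taken from the Snorkel library).
SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_mentions_vicodin(email_text: str) -> int:
    """Vote SPAM if the email mentions 'vicodin'; otherwise abstain."""
    return SPAM if "vicodin" in email_text.lower() else ABSTAIN


# A tiny stand-in for the pile of unlabeled emails.
unlabeled_emails = [
    "Cheap Vicodin, no prescription needed!",
    "Meeting moved to 3pm tomorrow.",
    "Did you see where my Vicodin was put?",  # legitimate email, labeled wrongly
]

# One rule, applied to every email at once.
votes = [lf_mentions_vicodin(text) for text in unlabeled_emails]
print(votes)
```

The third email shows the noise mentioned above: the rule marks a legitimate message as spam, which is why a single rule is treated as a vote rather than as ground truth.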
But basically, these noisier sources of supervision can be then much more scalable, much faster to execute, easier to version control and iterate on than individual labels are… And if you can layer a number of these on top of each other, and basically then let their votes be aggregated by an algorithm, one that we developed at Stanford, you now have the ability to get - maybe not 100 perfect labels, but 100,000 pretty good labels, and it takes about the same amount of time. And as we’ve seen time and time again in recent years, the size of the dataset seems to keep winning the day when it comes to getting high-performance with these models.
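The layering of several noisy rules can be sketched as follows. Note the swap: the algorithm developed at Stanford is not described in detail in this conversation, so this sketch substitutes a simple majority vote, with the share of agreeing votes standing in for a confidence score. All function names and rules here are hypothetical.

```python
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1  # hypothetical label values


def lf_mentions_vicodin(text):
    return SPAM if "vicodin" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_vicodin, lf_many_exclamations, lf_mentions_meeting]


def majority_vote(text):
    """Return (label, confidence); (ABSTAIN, 0.0) when nothing votes or votes tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN, 0.0  # tie between SPAM and HAM
    label, n_agree = counts[0]
    # Confidence here is just the fraction of non-abstaining votes that agree.
    return label, n_agree / len(votes)


print(majority_vote("Cheap Vicodin!!! Buy now!!!"))
print(majority_vote("Meeting moved to 3pm tomorrow."))
print(majority_vote("See you at lunch."))
```

A majority vote treats every rule as equally reliable; the point of a learned aggregation step is to avoid that assumption, so this should be read only as the simplest possible baseline.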
Yeah. So essentially, that open source library is helping you scale out your labeling, so that you get to the meaningful thing, meaning that you’re actually starting to create models faster. So a way to overcome that.
That’s right, yeah. It essentially is a way of building and managing training sets very quickly, often at a much higher rate of production, as well as just much larger magnitude.
So at what point, if you’ve been doing this for a while and you’ve found that utility in the libraries and such, what is a typical scenario that you’re finding with customers, where they do need to level up? Maybe they’ve used the open source software for a while; maybe they had already been doing it, even prior to you creating the company. But now it’s time – you mentioned platform, specifically… What is it that they are now facing, that is a clear step-up and they need the enterprise approach at this point?
Yeah, so I’d say there are a number of different reasons for this, and it’s a little bit different which elements of the grab bag are most important for different customers, but I can list a few of those… So one of the big ones is just the guidance. I think with the proof of concept library, the open source, over the years of using it, we knew what to look for; how accurate is accurate enough for a labeling function, how many do I need, how should I come up with ideas for what a valid labeling function could be, how could I integrate external resources that I may have, like a legacy model that I wanna improve on, or maybe an ontology that belongs to the business, that has information in, and how should I integrate that.
[24:11] So there’s a lot of what would otherwise be folk knowledge if you’re using the open source that you just only get through experience, that we’ve been able to really bake in and support in a native, first-class, guided way in the platform, and that’s a big difference-maker for a lot of people.
Gotcha. As we’re talking here, I’m looking through your website, and I went into the platform and I noticed that you’re kind of segregating out the different processes, with label and build, integrate and manage, train and deploy, analyze and monitor… Why that particular segregation? What is it that the platform brings to each of those capabilities? How are you guys envisioning this process, and if you have any insight, what is separating that from other options that you may see in the marketplace?
Yeah, so I’d say that that label and build is probably the piece of that pipeline that overlaps most with the open source, in the sense that that’s the area where you’re going to write labeling functions, and then likely aggregate these right into effectively training labels; confidence-weighted labels for these unlabeled examples that you can now train on.
That manage and version piece up next - that speaks to when you have not just a one-off project, when your goal is not just to fill a table in a paper, but really to build something that you have confidence in, that you can come back to, that you can point to in the case of an auditor, and whatnot… There’s extra value in managing all these different artifacts. You’ve got often many applications that you care about and many teams working on it, many different artifacts that you create, whether that’s models, or training sets, or sets of labeling functions… So there is an element here that’s as well just the data management side of things, and tracking and versioning and supporting all of those types of workflows.
On the modeling side, that is entirely unique to the platform with respect to the open source. We have a bunch of industry-standard modeling libraries integrated with the platform, so if you do want to train a scikit-learn model - sure; or some of the Hugging Face transformers right there. Flair is another one. XGBoost. So a lot of these libraries we’ve kind of unified behind a simple interface, so that it can be a sort of push-button experience to try out a number of different things, and hyperparameter tune, and whatnot… But with the goal really being of – you’ll find most of the time you’ll get the biggest lift by actually improving the training set rather than the model.
I guess that actually moves us on to the fourth part, which is analysis. We have a whole separate page with a bunch of different components that effectively take a look at how your model is currently performing, and where it’s making mistakes, and why it might be making those mistakes, and then makes concrete recommendations for what to do next.
In some cases it’s “Yes, actually your training set looks pretty good. The learned labels that we’re coming up with actually line up pretty well with ground truth. So if you’re making mistakes here, it’s probably because – it’s your model now, so you need to try a more powerful model, or hyperparameter tune a little bit differently.” I think that’s where a lot of machine learning practitioners naturally go, immediately to the model and hyperparameter tuning, when in reality almost always the far larger error bucket is there are whole swathes of your evaluation set that have no training set examples that look at all like them.
There are basically just blind spots that your model has, and now in the platform you can go ahead and click on that error bucket, go look at those 20, or 100, or however many examples where none of your labeling functions are applying, so this is not reflected at all in your training set, and write some new supervision that will add effectively examples of that type to your training set, so that the next model you train will know something about those types of examples, and can improve upon them.
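The blind-spot check described here can be sketched in plain Python: find the evaluation examples on which every labeling function abstains, then add a rule that covers them. This is an illustration only, not how Snorkel Flow implements its analysis page; the helper names `uncovered_examples` and `coverage`, the rules, and the sample emails are all assumptions.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1  # hypothetical label values


def lf_mentions_vicodin(text):
    return SPAM if "vicodin" in text.lower() else ABSTAIN


def lf_mentions_meeting(text):
    return HAM if "meeting" in text.lower() else ABSTAIN


def uncovered_examples(texts, labeling_functions):
    """Return the texts on which every labeling function abstains."""
    return [t for t in texts if all(lf(t) == ABSTAIN for lf in labeling_functions)]


def coverage(texts, labeling_functions):
    """Fraction of texts that receive at least one non-abstain vote."""
    if not texts:
        return 0.0
    missed = len(uncovered_examples(texts, labeling_functions))
    return 1 - missed / len(texts)


evaluation_emails = [
    "Cheap Vicodin, no prescription needed!",
    "Meeting moved to 3pm tomorrow.",
    "You have won a free cruise, click here.",
    "Claim your lottery prize today.",
]

labeling_functions = [lf_mentions_vicodin, lf_mentions_meeting]

# The "error bucket": examples no current rule says anything about.
blind_spot = uncovered_examples(evaluation_emails, labeling_functions)
print(blind_spot)


# New supervision written after reading the blind-spot examples.
def lf_prize_language(text):
    lowered = text.lower()
    return SPAM if ("won" in lowered or "prize" in lowered) else ABSTAIN


improved = labeling_functions + [lf_prize_language]
print(coverage(evaluation_emails, labeling_functions), coverage(evaluation_emails, improved))
```

The loop is the point rather than the specific rules: inspect what is uncovered, write a rule for it, and re-measure before retraining.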
Sounds good. I’m also looking at some of the different solutions that you have, that are listed, from document classification, named entity recognition, information extraction… I’m kind of curious, since as you’re looking at this - and you guys clearly found a gap in the marketplace from the perspective that you were coming from… What makes your approach to each of these problems – because these are fairly classical problems; sentiment analysis, anomaly detection… What are some of the ways that you think you’re adding value, that you weren’t finding out there. What is that special sauce, to some degree, that you guys were really looking to introduce into the marketplace with this platform?
[28:24] Yeah, I think what really moves the needle is the fact that with this approach and with this platform, machine learning becomes just more practical, more systematic, more iterative. So all of these different problem types you mentioned, different ones – I think on the website right now we mostly focus on the [unintelligible 00:28:41.13] ones, but again, we’ve seen these used successfully, and we’ll continue to build out the areas for applying this to other modalities as well… But this paradigm is really agnostic to the data modality and most problem types. At its heart, it is a machine learning problem where you have a training set and you have a model, and when your model is making mistakes, it’s often due to what is or isn’t reflected clearly enough in your training set.
So for any of these problems, there are different types of labeling functions that you write for a classification problem versus an extraction problem, or whatnot… But fundamentally, once you scrape off that top layer, it looks very similar. So this platform really is meant to solve a wide variety of problem types, and work in a whole bunch of different industries and verticals and whatnot… Because again, under the hood, they’re all relying on the same, basic, fundamental principles about how machine learning works. And it was with that in mind that we built the platform.
That was a good introduction… I am curious though - earlier in the conversation you talked about some of the third-party integrations, and along with that, I’m kind of thinking from a workflow standpoint… Could you describe a little bit about how you might integrate in with other tools that are widely used within this industry? What kind of integrations do you have, and how that really helps the practitioner get through the process of modeling that they’re trying to do?
Yup. One of the things that we learned from the open source project was the importance of having intuitive, natural, modular interfaces to different parts of this pipeline. The labeling functions as well, the models, all that. So we kept that design principle very much in mind as we designed the platform, and we’ve made sure that every step of the pipeline can be done either in the GUI, or via an SDK that we provide.
So that means that you can write labeling functions via these nice GUI builders that we’ve got, or you can define completely arbitrary black box labeling functions via code in the notebook, push those up, and then they’re treated the same way in the platform. Same thing with the training sets; you can create a training set and then go to the models page and identify the model that you want, set up your hyperparameters and train it there with a button, or you can use the SDK to export your training set, train your own model, and then just re-register the predictions, push them back up, just some very lightweight, assign certain [unintelligible 00:32:03.25] certain labels, and then use the analysis page to still guide you.
[32:07] So it means that we’re able to interact with a whole lot of different customer types and workflows that have different requirements. Some people know we really just need to use our proprietary model; we know that nothing works as well as this does. That’s totally fine. At that point you can pull things down from the platform and then push up the results when you’re done.
We’ve got a lot of training labels already available from crowd workers, or it’s just as a natural part of our product we’re always getting feedback that we can use, but we’d really like to be able to be systematic about how we patch up failure modes that we have, and so we wanna use the platform, the analysis tooling especially, but maybe also the models. So for them, they’re able to start in that way.
So really, any piece of this can be – that’s the test we use for ourselves, is “Can I complete an application in Snorkel Flow without ever opening up that tab of my browser?” and the answer is yes… Which makes it ultimately a flexible, I guess, platform for integrating with other workflows you may have.
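[Editor's sketch] The code-defined labeling functions described in this answer can be illustrated with a few lines of plain Python. This is a minimal sketch, not the Snorkel Flow SDK or the Snorkel open source API: the function names, the keyword heuristics, the label constants, and the example texts are all hypothetical, and a simple majority vote stands in for whatever aggregation the platform actually performs.

```python
from collections import Counter

# Hypothetical label values for a sentiment-style task; ABSTAIN means "no opinion".
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_refund(text):
    """Vote NEGATIVE when the text mentions a refund (illustrative heuristic)."""
    return NEGATIVE if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    """Vote POSITIVE when the text thanks someone (illustrative heuristic)."""
    return POSITIVE if "thank" in text.lower() else ABSTAIN


def lf_contains_praise(text):
    """Vote POSITIVE on simple praise words (illustrative heuristic)."""
    lowered = text.lower()
    return POSITIVE if "great" in lowered or "love" in lowered else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_contains_praise]


def apply_lfs(texts, lfs):
    """Build a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain when nothing fired or the top two tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, top_n), *rest = counts.most_common(2)
    if rest and rest[0][1] == top_n:
        return ABSTAIN
    return top


texts = [
    "Thanks, the new release is great!",
    "I want a refund, this is broken.",
    "Thanks, but I still want a refund.",
    "Shipping took four days.",
]
label_matrix = apply_lfs(texts, LABELING_FUNCTIONS)
training_labels = [majority_vote(row) for row in label_matrix]
print(training_labels)
```

Rows where every function abstains, or where the votes conflict evenly, stay unlabeled in this sketch; the labeled rows would form the training set that gets exported to whichever model the practitioner prefers.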
Gotcha. So even though you guys are several years, given the work, ahead of time, getting into the company… You mentioned that you’re about a year-and-a-half into the company’s existence, which is pretty early in the lifetime of an organization… Recognizing that it takes time to get things out the door, and stuff, what other gaps are you seeing in the industry that is more of that itching that you wanna scratch? Whether it be short-term, or longer-term, what are you envisioning Snorkel Flow evolving into, and what kinds of problems that you’re not addressing today necessarily are you thinking about addressing for the future? When you guys are getting together and hanging out and talking about what-ifs, what are some of those what-ifs that you’re willing to share?
Yeah, so a few different things come to mind… One of them is that, as I mentioned a couple times, there are different modalities to consider, and the way that you write labeling functions over images is fundamentally different than the way that you write labeling functions for text. So just given where the market pull was initially, we’ve started focusing on text, but we absolutely plan to bring in some of that other research we’ve done as time goes on, over the coming months and years.
I’d say in addition, another area that’s really interesting to us, so where we would have this unique leg up based on the approach that we’re taking, is the monitoring side of things. When you acknowledge that most applications are gonna go deploy, it’s not “Great! I’ve got my model now. Deploy it, and set it and forget it.” Test distributions change, the world shifts. People talk about different topics; different words get different meanings. Covid was not a part of the discussion a year ago, and now it’s a huge part of the societal fabric of what gets talked about on social media.
So the fact that you do very frequently need to iterate on your models, improve them, as well as you’d like to know preferably more than just a single number - the accuracy of my model, is that going up or down? It’s really interesting to see what types of examples am I starting to get more right or more wrong? What subsets of my data are diverging, basically, from what they were when I was trained? What’s really interesting is after you’ve written these labeling functions, there are essentially a whole bunch of different hooks into your dataset.
They each observe different slices of your data that have different common properties, and these could effectively become monitoring tools for you, because you can now observe how those labeling functions increase or decrease in coverage over time when applied on the new data that’s streaming through your application, and inform you when – you could basically set up automated alerts showing you “Now is the time to go and update things” or “Here’s some suspicious activity going on”, based not just on “Did the number go up or down?”, but “We’re seeing movement in different parts of the space where your model is operating. Take a look.”
That maybe appeals more to the technical/nerdy side of things, but I think it’s a really interesting problem, one where you’ve got that information. You have already identified for you these very interesting angles on your problem, and so why not use those to help guide the post-deployment life of a model.
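[Editor's sketch] The monitoring idea in this answer, watching how labeling function coverage moves on new data, can be sketched as follows. This is an illustrative assumption rather than a description of any Snorkel Flow feature: the labeling functions, the example texts, the helper names, and the 0.3 alert threshold are all hypothetical.

```python
ABSTAIN = -1


def lf_mentions_covid(text):
    """Fire when the text mentions covid (illustrative heuristic)."""
    return 1 if "covid" in text.lower() else ABSTAIN


def lf_mentions_travel(text):
    """Fire when the text mentions flights or travel (illustrative heuristic)."""
    lowered = text.lower()
    return 1 if "flight" in lowered or "travel" in lowered else ABSTAIN


LFS = {"mentions_covid": lf_mentions_covid, "mentions_travel": lf_mentions_travel}


def coverage(texts, lfs):
    """Fraction of texts on which each labeling function does not abstain."""
    if not texts:
        raise ValueError("coverage is undefined for an empty batch")
    n = len(texts)
    return {
        name: sum(lf(text) != ABSTAIN for text in texts) / n
        for name, lf in lfs.items()
    }


def coverage_alerts(reference, current, threshold=0.3):
    """Names of labeling functions whose coverage moved by more than threshold."""
    return sorted(
        name
        for name in reference
        if abs(current[name] - reference[name]) > threshold
    )


# Hypothetical batches: data seen at training time versus data streaming in now.
last_year = [
    "Booked a flight to Denver.",
    "Travel plans for summer.",
    "New phone arrived today.",
    "Looking for lunch ideas.",
]
this_week = [
    "Covid cancelled my flight.",
    "Working from home due to covid.",
    "Covid testing site opened.",
    "New phone arrived today.",
]

reference = coverage(last_year, LFS)
current = coverage(this_week, LFS)
alerts = coverage_alerts(reference, current)
print(reference, current, alerts)
```

Here the covid function goes from firing on none of the reference batch to three quarters of the new one, so it is flagged, while the smaller move in the travel function stays under the threshold. Per-function signals like this are the "more than a single number" view described above.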
[36:09] At this point, as you’re answering that, I wanna ask - within the limits, obviously, of what you can share - customers that you have, what are some of the really interesting things that you’ve seen customers doing with this? …particularly things that were outside of what you might have expected. The kinds of things – you know, we all have problems; everybody in this industry has areas of focus that we’re addressing… What are some of the things that made you surprised, and people went “Oh. Okay. I hadn’t expected to see that.” Or just were plain cool. Just something that someone’s doing that’s just like “Wow. I love having our platform involved in that.”
Yeah. Two things that I’ve found personally very cool - one of them is the privacy preservation aspect of this approach. That was not necessarily a top priority or top-of-mind when we were developing these techniques at Stanford. It was often on problems where it’s just “I’m trying to get a good result. I want high quality. How can I get high quality?” But it’s been really cool to see different companies that have the very desirable goal of “We’d like to have our data being seen by fewer humans. We’d like to have fewer people reading your emails, fewer people seeing your medical or financial records; how can we do that while not sacrificing the quality of our machine learning models?” So it’s been really interesting to see them, and working with them, coming up with these setups where now they can take a very small, approved subset of the data to give them ideas for how to write labeling functions, or label a test set to give them a sense of how quality is overall.
But then the vast majority of their data never gets seen by a human now. They can take these programs they’ve developed to go label those automatically, use them to train a model, and then get back just the final weights of the model. It’s really neat to see, and I’d love to see that thread continue… Not just from a privacy preservation standpoint, but also - we keep seeing articles about the PTSD that you get as an annotator over these awful domains. You hear about some for social media…
Yeah, there are some horrendous ones…
Even during the Stanford days, we worked with DARPA on a project for human trafficking; in their case, it was more out of necessity of keeping up with a very rapidly-moving environment, where it’s these adversarial settings, so your training set is always losing its value because things are always changing, so they needed to be able to create training sets very quickly, and they did with Snorkel, which was cool… But also, conveniently now, there are that many fewer people who [unintelligible 00:38:25.25] sitting in front of these awful human trafficking ads. So I think the privacy standpoint is very cool.
I think another interesting application we’ve seen was we had one customer who – and I’ll try to appropriately obfuscate here, but they had an application that was affected (we’ll say) by Covid. When you suddenly have the stock market plummeting and there are certain risks associated with that for different businesses, and we were in the middle of a POV engagement with them, so they’re test-running the product to see how it worked for them… And they came to us and said “Okay, this was not part of our scoped work, but this suddenly matters a lot to us, and our typical process would take about a month. Do you think you can help us? Could we try and use Snorkel to get some result faster?”
And since it was very early on and we hadn’t necessarily had a lot of time to train using the platform, we said “Sure, we’ve got some ideas. Give us a sec.” We threw three of us in a war room for the day, ordered some burritos and hacked away, and by the end of the day we were able to extract the terms that they needed with over 99% accuracy on their application. That was achievable with a model that was trained on tens of thousands of examples, which we didn’t need to label. We were able to quickly come up with “What are the generalizable rules or principles here that we could use to create a training set to train a model that now can handle edge cases and things much better than these rules?” and then get the high quality that they needed. So that sort of live action, the nerds save the day kind of moment…
It’s a good story.
…it’s super-cool to see.
[39:57] You’ve raised several interesting points there, one of which is the fact that in real life, as this technology is more pervasive, these dynamic, ever-changing datasets are a reality we have to contend with… I mean, are you seeing the industry getting more flexible at large? Obviously, in terms of thinking about the fact that that’s something that has to be accommodated, but I would expect that that is something that has to be addressed more and more. Do you have any insight or any thoughts into where we’re going in terms of us moving along this curve from these static-label datasets that we were talking about, historically, at the beginning of the conversation, to this dynamic, especially since Covid has struck, the ever-changing world on a day-to-day basis - what’s that trajectory look like, and how are you guys preparing for that?
I think we’re definitely seeing an increased awareness of some of these issues. I think a lot of companies are still trying to figure out how to address it in the right way. We see companies realizing that – you know, schema lock-in is becoming this problem for us, because real problems change, our objectives change, we learn more about the problem… What we thought was a positive or negative classification problem is actually positive, negative or neutral, and then our old labels are garbage now, because you don’t know where the neutrals are, and the positives and the negatives…
So people are being burnt by some of these problems… I think that’s part of the reason why we’ve had such early success with inbound interest, more than we could even handle at first… Because people are aware now of some of the costs that come with machine learning. The promise of machine learning is very much being broadcast, how it’s the future and it solves a lot of problems, but there do end up being these very practical – I won’t say necessarily limitations, but gotchas, or costs really, that you need to be aware of.
I see this reflected a little bit in the way that companies are starting to prioritize more the ability to see “Where did my model learn this?” That auditability…
Yes. We’re touching on AI ethical issues. You’ve talked about auditability and privacy and such… You totally see that, you’re kind of maturing your way through the process here.
[42:03] Exactly. So they realize that that’s important in a way that they maybe didn’t before. I think they’re also realizing just from an economic standpoint that training data is not a one-time cost. This is a capital expenditure, this is an asset that loses value; there’s a half-life to these things… So you start seeing these ongoing, regular, cyclical budgets to get the training data, even just for a single application – not “We need more data to train more models for more applications”, but to keep this application fresh and alive.
That’s a great insight right there.
Yeah. It’s super-interesting, and it changes the way that you account for the cost of different applications you might use, because there’s a certain way that you maintain imperative (we’ll call it) Software 1.0, and there’s a different way that you maintain this machine learning-based Software 2.0 way of solving a problem. It’s something people are learning, and I think that’s all an interesting part of the conversations that we’re having with different customers as they realize how this can maybe change the way that they approach their machine learning stack in general.
Okay. As we wind up here, I’m finally getting to ask what is always my favorite question anytime we’re getting to talk to someone such as yourself… Blank sheet of paper, what are you excited about right now in the space of machine learning and AI? What is the thing that has captured your imagination, whether it’s work-related or whether it’s not work-related? Something cool out there… What’s got you going “That’s the thing I’m really interested in tracking, either on my own, or through the company, or whatever?” What’s cool?
That’s a very good question.
There’s a lot, I know…
So many things… I’d say there are a number of areas that are super-important; super-hard, but super-important. And I’m glad to see that they’re getting the attention that they deserve, or at least that we’re trending in the right direction. And that stems around the privacy, the fairness, the bias… A lot of that I think is just super-hard. If anyone says that they’ve got a solution to that problem, I’d be very dubious… But I think we are marching toward progress there, and that’s something that I’m certainly gonna watch with great interest, and hope that we can be a part of the solution there. That’s one piece.
What I think may be a little closer to my personal research agenda in history - a lot of that’s centered around how you get signal from a person into a machine. So a lot of my research through the years has been seeing how high up the stack can we go. There’s this figure in my dissertation that compares basically the computer programming stack to the machine learning stack.
Computer programming – computers run on these ones and zeros; they run on individual bytes and bits, but nobody writes ones and zeros code. We write in higher-level interfaces, like a C, or even like a SQL or something, that compile down sometimes multiple times into this low-level code that you’re then gonna actually run on. And I’d say similarly, machine learning runs on individually labeled examples; that’s how we train it, that’s how we express information to it. But it feels fairly naive, actually, to one-by-one write these ones and zeroes, write these trues and falses on our individual examples.
So I think that there’s a lot of really interesting things that can be done around higher-level interfaces of expressing expertise that then in various automated or just sort of assisted ways can eventually result in the training sets that have the properties you need to actually communicate with your model, use the compiler, essentially, the optimization algorithm that’s in place to transfer that information.
That’s a fairly high-level description, but I think that there are interesting things yet to be done there.
Well, thank you for sharing that, I appreciate it. Braden, thank you so much for coming onto Practical AI. That was a great conversation, and looking forward to our next one. Looking forward to having you back sometime soon.
Absolutely. Thanks for having me.
Our transcripts are open source on GitHub. Improvements are welcome. 💚