Practical AI – Episode #295
Creating tested, reliable AI applications
get fully-connected with Chris & Daniel
It can be frustrating to get an AI application working amazingly well 80% of the time and failing miserably the other 20%. How can you close the gap and create something that you rely on? Chris and Daniel talk through this process, behavior testing, and the flow from prototype to production in this episode. They also talk a bit about the apparent slow down in the release of frontier models.
Featuring
Sponsors
Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.
Timescale – Purpose-built performance for AI. Build RAG, search, and AI agents on the cloud with PostgreSQL and purpose-built extensions for AI: pgvector, pgvectorscale, and pgai.
Eight Sleep – Up to $600 off Pod 4 Ultra. Go to eightsleep.com/changelog and use the code CHANGELOG. You can try it for free for 30 days - but we’re confident you will not want to return it (we love ours). Once you experience AI-optimized sleep, you’ll wonder how you ever slept without it. Currently shipping to: United States, Canada, United Kingdom, Europe, and Australia.
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration
1 | 00:00 | Welcome to Practical AI | 00:46 |
2 | 00:57 | Sponsor: Fly | 02:29 |
3 | 03:33 | Thanksgiving preparations | 01:24 |
4 | 04:57 | Agents in production | 01:30 |
5 | 06:27 | AI ceiling & current hype | 02:11 |
6 | 08:39 | Level of transformation | 02:10 |
7 | 10:49 | Current models are mostly good enough | 05:56 |
8 | 16:55 | Sponsor: Timescale | 02:17 |
9 | 19:34 | Robust AI workflows | 05:04 |
10 | 24:39 | Finding the right workflow | 06:08 |
11 | 30:47 | Transition from notebook to code | 03:02 |
12 | 34:06 | Sponsor: Eight Sleep | 02:34 |
13 | 36:44 | Testing and integrating | 03:22 |
14 | 40:07 | Sketching out a good framework | 07:17 |
15 | 47:23 | Roles have shifted | 01:57 |
16 | 49:20 | Outro | 00:46 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
In this Fully Connected episode of the show, Chris and I will keep you fully connected with everything that’s happening in the world of AI, discuss some of the latest trends, and share some learning resources for you to level up your machine learning and AI game.
I’m Daniel Whitenack. I am CEO at Prediction Guard, where we’re creating a private, secure AI platform. And I’m joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
Doing very well, Daniel. I know you’re out traveling, and ironically, I think I’ll be where you are next week, but I think you’ll be gone by then.
Swapping places.
There you go. We’re trading geographies here.
Yeah, yeah. I don’t know why, November always seems to be heavy conference, event, summit, on-site month for me. I don’t know exactly why that is. It’s sort of the last little bit before the end of the year, maybe. I don’t know.
It’s to make you earn that vegan turkey that you’re going to enjoy.
Exactly, exactly. Yeah, I got it all picked out, so we’re ready for tofurkey on Thanksgiving, for sure.
Excellent.
And speaking of other things to celebrate, I wanted to mention before we hop into other discussions today that our good friends over at the MLOps community, so Demetrios and his crew, they’ve run a series of virtual conferences, one about LLMs in production, and another about data engineering and AI. And their latest in this series is called Agents in Production, which sounds very exciting. As you’re listening to this episode, if you’re listening to it right when it goes live, you can still probably catch the event live, but you can also catch the content afterwards, I’m sure, as it’s recorded. It looks like an amazing conference, talking about AI agents moving from R&D to reality. “Are you ready?” So go check it out. Their events are always great, so I wanted to mention that up front, at the beginning of the show. Chris, do you have any active AI agents in your life?
You know what? I actually don’t right now, but I probably should. I feel bad that I can’t say yes to you on that. But you know what? I’ve also read recently that the uptake on agents has been a lot slower than was expected. It was kind of one of those hype things, and I think it’s pretty hard. So maybe something to discuss.
Yeah, maybe along those same lines, I see a lot of news articles, especially related to - I think over the past couple of weeks or whenever it was - people realizing that OpenAI wasn’t going to release GPT-5. I don’t know if that was expected this year and not released in the timeline that people thought… And also some indications that maybe that next jump in the functionality of these AI models is proving more difficult than was originally thought.
One question I had related to that, Chris, was - let’s say we never get GPT-5. We’re just stuck with all the models that we have now; no more models are made in the world. What do you think the value of AI and its integration across the enterprise, business, and our personal lives would be? Do you think it would still have the hyped transformative effect that people are talking about?
No, I mean – and we’ve talked a little bit about that in terms of hype cycles and stuff on previous episodes… If you postulate that we’re hitting a ceiling right there, I don’t think it’s no more models. I think what happens is that there’s more open source that comes along and at least catches up to where some of the leading ones are, and the value of a commercial model ends up being less, because you have more open source options out there. And we’re seeing that in industry anyway. I mean, not everybody wants to pipe their data out to OpenAI or some of the other organizations doing the same thing.
[00:08:16.24] So I think the availability of open models is going to happen regardless in terms of the uptake on that, and I think that would just kind of force that to happen sooner – if the leading models are no longer new ones coming out, that are better and better to chase, they catch up and it kind of commoditizes the whole space even faster.
Yeah. I think one of the things I was wondering was, let’s say that - regardless of whether it’s an open model or a closed model… So if I told you, Chris, you’re an AI engineer, actively integrating gen AI functionality, and the best model you’re ever going to get on the open side, if you’re using open models, is maybe Llama 3.1, or whatever. And on the closed source side, maybe it’s the latest Claude, or GPT-4o, or whatever that might be. If that were the case, would that be like a “Man, I don’t think we’re going to be able to do all of what we had hoped to do with AI”? Or do you think it’s more of a – yeah, what is the level of transformation that you think we could still get with the current generation of models, let’s call it?
So it’s kind of funny, and I know we’ve talked about this, and some of our listeners will remember some of the previous conversations, but there’s a lot more to AI than just the gen AI models. They’ve gotten all the spotlight the last couple of years, but there’s a lot you can do. And honestly, without going into detail, the things I think about all day every day, gen AI is not the center of it. It’s not the stuff in AI that I care the most about. It’s not what’s making me most productive. So are there many things you can do with gen AI to be productive? Sure. And we’re still learning how to do that. And I think that’s harder than people realized to get there, and I think that’s one of the reasons it’s plunging down into the trough of disillusionment in the hype cycle, as people are frustrated… But we will have lots of gen AI things. But I think it reminds us, as we’ve said recently, to look at the larger landscape of AI capabilities out there, and other things that we used to be excited about are incredibly productive these days. And yet we’re not talking a lot about them. Deep reinforcement learning remains amazing in what it can do. And if you combine that with robotics and other areas, there’s lots of really productive work being done out there, but it’s not getting much media attention.
Where my mind goes is that the current models that are available, if you think about that general-purpose reasoning gen AI model, are good enough, in my opinion, to do most tasks at the orchestration layer. And what I mean by that is, let’s say that you wanted to do time-series forecasting. I don’t think that – whatever model you look at on the gen AI side, it’s not the best time-series forecaster. However, there’s really good tools for that, that already exist. Facebook Prophet, or something like that. And you can use the gen AI model as almost the frontend to tools like that. You can say “Hey, I want to know what my revenue is going to be in six months”, do a forecast or something like that, and use the model to extract the data that’s needed to make that forecast and maybe call a tool like Prophet or something to actually do the forecast and get something back.
[00:11:53.10] So I think even if we got stuck with the models as they’re out there, to your point, there’s a variety of purpose-built tools and non-gen AI tools out there, whether they be just rule-based tools, or APIs, or machine learning models, or statistical models or whatever, that can do a variety of the really important tasks that we want to do. And the gen AI models that we have now could serve as a way to orchestrate between those tasks, and to create some really appealing workflows and automations and flexible interfaces, and all of this stuff.
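As a rough sketch of that orchestration pattern - assuming a hypothetical `extract_forecast_request` helper that prompts whatever LLM you use to pull structured parameters out of a question, with the actual forecasting handed off to Prophet - it might look something like this:

```python
# Sketch: the LLM is only the orchestration/front-end layer; Prophet does the forecasting.
import pandas as pd
from prophet import Prophet


def extract_forecast_request(question: str) -> dict:
    # Hypothetical: in practice this prompts your LLM of choice to return structured
    # parameters, e.g. {"metric": "revenue", "horizon_days": 180}. The LLM never does
    # the math itself.
    raise NotImplementedError("wire this to your model provider")


def forecast_metric(history: pd.DataFrame, horizon_days: int) -> pd.DataFrame:
    """history uses Prophet's expected columns: 'ds' (dates) and 'y' (values)."""
    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_days)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(horizon_days)


def answer_forecast_question(question: str, history: pd.DataFrame) -> pd.DataFrame:
    params = extract_forecast_request(question)  # e.g. "revenue in six months" -> 180 days
    return forecast_metric(history, params["horizon_days"])
```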
So where my mind was going with that is I’m not that concerned if it takes a while for GPT-5, or the next – in my opinion, when I think about the rest of the time I’ll have to be a developer and AI practitioner, for the rest of my career I could keep myself busy no problem with all the things I have access to, and create some really interesting products and tools and features and integrations, et cetera.
I think that’s a great insight right there. If you look at all of the different jobs and workflows people have out there in their careers, and if you think of all these tools that we currently have today, without having to go forward, I would argue that it would be a very practical next step for Practical AI to be able to start assessing your workflows, assessing where these different tools and the models we have can make a difference, and doing some process re-engineering to figure out how you can do that. And I think that the vast majority of organizations out there have not done that sufficiently. They might have done that with a workflow or two, but they haven’t gone through – especially large corporations have thousands and thousands of them. There are so many places where productivity can be enhanced by finding the places where people are struggling through their own workflows, and where those align well with the capabilities in these models; that might be a place they want to invest a little bit and get a great long-term benefit out of it. But the next model coming, whatever model we’re talking about, whichever line of models, family of models, always gets the attention, rather than the kind of grunge work of going through your processes and finding where you can save a whole bunch of effort, a whole bunch of time in a matter of moments, to increase productivity.
Yeah, I think that there will be – just to draw a parallel maybe… I know that people have drawn comparisons between this wave of AI technology and the onset of the internet and the web, that sort of thing. I think you could see that the basic components, or many of the most impactful web-based technologies that have shaped culture and shaped our lives - the building block components of those were around from the very early days of the web. And there are sort of generational jumps, like the advent of streaming; all of what we now consume via streaming would not have been possible over certain types of internet connections and technology.
So there’s certainly generational shifts, but the building blocks were enough. And probably some of the people in those early days, working with those building blocks, could not have imagined the transformative and kind of culture-defining effects of those basic building blocks. And so I think we’re in a similar scenario, where the building blocks of what we have with AI, whether that be gen AI or non-gen AI, are enough to – I don’t think it would be too far to say transform certain elements of our culture; it sounds sort of grandiose, but I think that’s sort of what’s coming. But there will likely be the kind of generational jumps, whether that be GPT-5 or another model family or whatever; there will likely be generational jumps that we also don’t anticipate yet. But the tooling that we already have, the building blocks we already have, are enough to create transformative technologies and products and systems.
I would agree. I think maybe there’s another show where we talk about what we think the transformation of society and culture is in the future. I don’t think that’s this show right now, but I would agree with you; with what we have today, we can go a long way. And to your point, I was in college when the web came into being, so I do remember exactly those very early building blocks, and trying to imagine… And this is not the same world that we live in today that it was then.
Break: [00:16:49.01]
I’ve been occasionally teaching workshops again. I’ll be at QCon SF next week, so those of you that are around QCon, I’ll look forward to seeing you. It looks like a good event. But some of what’s come up in workshops for me recently is how to think about your AI workflows going from prototype to some level of production. And I think these are things that we’ve talked about on the show before, prior to gen AI, in terms of how you want to be testing and monitoring and thinking about deployment of AI-based workflows… But I’m guessing there’s a lot of people maybe joining the show from different backgrounds after this gen AI phase. And I connect this to the sentiment that people often, with this technology, are able to get to a point really quickly where they see an amazing workflow kind of take shape. And it works amazingly well some of the time. Like half the time. And then the other half of the time it fails miserably. But they get a taste of the goodness, and they maybe don’t know how to get the rest of the way. And I thought of an interesting parallel, because some people are using these kind of low-code/no-code AI workflow builder tools, right? Whether that be something like Flowise, or Gumloop, or Dify, which are these little interfaces where you can string things together. Or it’s built into tools like Alteryx or something like that, that’s maybe a little bit more enterprise-focused. But they build out this workflow, and it sort of does this thing and works like half of the time and not the other half of the time… And maybe they’re using AI calls as part of that. And it struck me that this is sort of like - I don’t know if you remember, Chris, back in the day we had a phase of our AI podcast life where we were really trying to convince people that they shouldn’t run notebooks in production… Do you remember those days of data science?
Yeah, I do. That’s a little ways back, but yes, I do. Indeed.
So for those that aren’t familiar, when I say notebook, I’m referring to like a Jupyter notebook. So this is an interactive web-based code editor. And if you imagine, maybe some of you that have used Mathematica in the past - similar. But you kind of go into the screen, there’s a cell that you can put code in, you can execute that cell, you can take notes, you can execute another cell of code… And all of that state is saved. So you can execute cell one, and then go down to cell five, and re-execute cell five, and then go up to cell three, and re-execute cell three, and then go down to cell seven, and re-execute cell seven. And what happens is - so if we just rewind our mind back to the olden days… I’m a data scientist, I’m building a model, or creating a workflow, and I’m doing this in a notebook… That sort of workflow generally is good for experimentation, and produces really, really terrible code, just by its nature.
So it’s really good for experimentation, but if I’m hopping around all the time, I don’t really understand what the state in the background is. I’m hopping around between cells… I could give you the same notebook, and you could never reproduce what I did. Even though it’s the same exact code, you could never reproduce my exact steps. And I was just struck by the fact that the way that people – it’s almost like we forgot that that doesn’t work that well, and now we’re just doing it not in notebooks; we’re doing it in these low-code/no-code tools, or with agents that jump around between various tasks… And the same reason why notebooks are really terrible at producing good, reliable code is, I think, the same reason why people are taking these AI workflows from tools that they’re using and aren’t able to make them robust and reliable. So that’s my hot take for the day. What’s your thought, Chris?
[00:24:10.03] No, I think that’s great. First of all, I’ve gotta say - boy, it’s already making me feel aged, again, in a different way, the fact that… It wasn’t that long ago that it was all the hotness of Jupyter Notebooks that we were talking about, and that was the cool thing…
Yeah, and even like products managing all your notebooks, and such…
Yeah. It feels like you just gave a eulogy for Jupyter Notebooks, to some degree. It’s a reminder that things are changing constantly. So you bring a great point that we’re taking some of the same challenges that we had in that environment, and we’re just recreating them in the newer tools that are out there.
There was another company I worked at - I won’t name the company - before I was at Lockheed… And at that company, I remember thinking there were people at the time that knew - they knew the AI modeling bit, and there were people that knew the software bit, but they never seemed to cross over. And when you raise the point about your kind of in-process development workflow, and then how do you actually get that to some level of production, I think there’s a lot of people out there that aren’t going to know that. And unless times have really changed in that area - and my gut says they probably haven’t; people tend to focus on the thing that they want to do. What is the right development workflow, and how do you start getting to that production environment? I know you’ve gotten tons and tons of experience at that in recent years. How do you think about it? Can you frame it a little bit before you dive into it?
Yeah. Well, I was thinking about it in light of this parallel to what we went through in the data science world with notebooks, and these kind of ad hoc workflows that execute some of the time and not other times, depending on how you execute them… And in reality, the answer to running that code in production is not the “download as Python script” thing. That will never work, because the state and the workflow are not preserved.
How that actually gets productionized, or would be productionized in the past, is taking the logical steps that are being executed in that workflow, and taking those out of the notebook, and embedding them in actual code, in this case Python code, in functions or classes, and attaching tests to those functions or classes, just like a software engineer would do, because this is software engineering… And then figuring out – again, doing the testing on the frontend of that, whether that’s a UI system, or that’s an API, or whatever it is, to make sure that the behavior that you were testing in your notebook actually works. And that kind of sucks, because it’s a reimplementation, to some degree.
Maybe you don’t have to throw everything out. Like, you’ve got something working in your notebook, and you can bring it through and have it work… But it does take actual work to go from that notebook state to the production code.
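To make that concrete, a minimal sketch of pulling one step out of a notebook - here a hypothetical text-chunking cell - into a plain function, with a pytest-style test that pins down the behavior you were previously eyeballing interactively:

```python
# chunking.py - logic that previously lived in a notebook cell, now a testable unit.
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


# test_chunking.py - the behavior you eyeballed in the notebook, pinned as a test.
def test_chunks_respect_word_limit():
    text = "word " * 450
    chunks = chunk_text(text, max_words=200)
    assert len(chunks) == 3
    assert all(len(chunk.split()) <= 200 for chunk in chunks)
```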
And so I think if you just look at one of these tools – I do think some of these tools, the kind of low-code/no-code assistant builders, workflow builders with AI stuff, are useful, and can be useful for building out nice workflows for your personal life, like email assistant stuff, or automations to turn news articles into podcasts, or whatever the thing is that you want to do. But ultimately, these tools have their own opinionated way of tracing and testing and debugging. The same way that debugging a Jupyter notebook - you have slightly different tooling, you have a slightly different workflow than if you were debugging regular code.
[00:28:07.27] And so I think part of the answer, unfortunately - and I guess this is my hot take, if there is one - is I think that we’re gonna see a similar dynamic in the AI engineering world, where like a business person… Except I think the roles are different here. So in a similar way to before, the data scientist would build a workflow in a Jupyter notebook, and maybe a software engineer would integrate that into actual code that’s tested, and has some form that resembles actual code… And the data scientist is probably interacting with the business person.
In this case, it’s slightly different role-wise, because the data scientist almost isn’t there. But the business person might go into a tool like Gumloop or Flowise or Dify or whatever, and build out a tool that takes market analysis things and generates – I don’t know, articles or summaries that go into some emails that are sent out to the company, or whatever workflow that they created. And it has a series of steps, and they’re like “Yeah, this works.” But it kind of does work, but it kind of doesn’t work, because they haven’t thought about all the edge cases, and it’s hard to debug, it’s hard to know when it’s down…
And so I think now it’s like that business person bringing that workflow, if it really truly does need to be scaled across an organization, or released as a product on its own; you just sort of have to take those steps out, and actually put them in functions, put them in classes in your code that can be tested. And we can talk about the methodology of testing here in a second.
So I think that the low code/no code things are cool and awesome, and have their place, just like notebooks have their place. And I still use primarily Google Colab; not Jupyter, locally, but Google Colab notebooks. I still use notebooks, I just realize their limitations, maybe sometimes better than other times. But I realize their limitations, and then I eventually write software.
So I think it’s a similar thing with these tools that are up and coming, is they’re great, and they allow for quick prototyping and business people to get their workflows and their ideas into a workflow that operates… But ultimately, this has to become software, if the intention is to make it a feature that you release, or something that scales across your organization, or something like that.
Let me ask you… I’m just curious - when you’re in Colab, and you’re doing that, and you decide it’s time to write software, how do you make your own transition? What do you do? Having done this for so long, what’s your transition look like? Are you staying in Python? Are you converting some of that over into Go, or something else? Or how do you think about it?
It depends, of course, case by case. But generally, I would say, if I’m doing it in Colab, it probably means that I’m doing something that requires Python. And so that’s – I don’t know, doing something in LangChain, or PyTorch, or whatever the thing is. Then I think when I’m ready, I essentially have two ideas in my head of where that’s going to live. Either it’s going to live in a REST API… Because you’re going to have to make this functionality available to the rest of your software. So it’s either going to live in a REST API, it’s going to be integrated into some software you’re already supporting, so that already has a codebase… Or it’s going to be run as a script, kind of an offline script at a certain cadence, or something like that.
[00:32:00.23] And so if it’s the API scenario, I have a bunch of code that I’ve written over time with FastAPI. I can just copy one of those projects, rip out the stuff that is irrelevant, and put in the stuff that’s relevant, kind of copying over from the notebook. If it’s more of the native software integration, I think that really depends on what kind of application it is, what the architecture is, that sort of thing. And so that might involve a change of language, or a change of the type of infrastructure that you’re using, or the type of database you’re connecting to, or whatever that might be.
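For the API scenario, a minimal FastAPI sketch of that copy-and-adapt pattern - `run_workflow` here is just a stand-in for whatever logic was lifted out of the notebook:

```python
# Minimal FastAPI wrapper around extracted notebook logic (a sketch, not a full service).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class WorkflowRequest(BaseModel):
    text: str


class WorkflowResponse(BaseModel):
    result: str


def run_workflow(text: str) -> str:
    # Placeholder for the steps copied over from the notebook
    # (model calls, post-processing, etc.).
    return text.upper()


@app.post("/workflow", response_model=WorkflowResponse)
def workflow(req: WorkflowRequest) -> WorkflowResponse:
    return WorkflowResponse(result=run_workflow(req.text))
```

Run with something like `uvicorn main:app`, and the notebook’s functionality is now behind an endpoint the rest of your software can call and test against.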
And that’s an area that I’m keenly interested in, because historically, we’ve been developing these models in Python, and then deploying them in Python, mainly because that’s where the tools and stuff are still at. But there’s a number of use cases, especially as we go forward and we’re looking at autonomy, and we’re looking at robotics, and things like that, where in many of those cases Python is not the best language for the platform that you’re deploying to. And so you have this incongruity between the development environment and a distinctly different production or deployment environment that you’re trying to target.
And so in my world, there are many things that might start in Python, that probably should end in something like Rust, given what we’re trying to accomplish here. So I think that that still remains a very immature deployment kind of arena to work in, and I’m rather hoping that in the years to come maybe we see more tools from tool providers and open source in that arena that can actually cross over from one language to another, to make sure that it’s always the right one for what you’re dealing with.
Break: [00:33:50.18]
Well, Chris, I kind of started talking about the testing and integration of some of these workflows, and how I see that playing out… From my experience in talking with people, there’s some general confusion around how to – so let’s assume that you’re convinced that I want to rip out these various pieces of workflow that have maybe been prototyped in a low code/no code tool, and I want to put them into some software that’s an API, or a UI, or a script, a data pipeline, whatever that is… Let’s assume that you’re convinced of that. Then the question comes “Well, okay, now I have this function or I have this class in code that executes some sort of AI call, a call to an AI model. How do I test that, and what considerations might need to be in place around that?” And I often find that this breaks people’s mind.
This is also something that I think we’ve dealt with for a long time in the data science world, which is it’s just very – I guess overall it’s very interesting to me that these same types of things are popping up, but with a new audience. I don’t know if you remember back in the days of data science, when there were data scientists, they would create a model, and that model has a certain level of performance, a 90% accuracy or something. So it’s going to be wrong some of the time. So you put that model into – maybe it’s a fraud detection model; fraud or not fraud. You put that model into production, you integrate it into a software function, and now the question comes “Well, how do you test that model?” Because it’s not always going to give the same response, and it’s not always going to be right. I don’t know if you remember these discussions happening a lot in the data science world…
Yeah. I think you just wrote a eulogy for data science as well, the way you’ve phrased that. [laughs] Oh, my goodness.
But this one always really intrigued me, because my background is in physics. If we said “Oh, this model is not deterministic, so we can’t test it” - if we took that approach in physics, we basically wouldn’t have any of the technology that we have today, because it’s all based on quantum mechanics, and everything is a probability distribution. So there is a way to test things that behave non-deterministically, like AI models, and maybe people just need a bit of a reminder about that. And I often kind of break this down into a few categories… But yeah, I don’t know. Do you come across people with this sort of mindset, especially in integrating LLMs, or something like this?
All the time. I think there’s a lot of people out there… I think everyone’s still kind of figuring that out, quite honestly… If they’re not in the business that you’re in, where you’re dealing with that constantly. I think that’s one of the big unknowns with folks in general, is “How do I go about testing this? I want to get in the workflow.” I don’t think people know.
[00:40:07.01] At least – you can only do so much in a 45-minute podcast, but to at least sketch out a good framework for people to think about: number one, I think you should have tests in your code for each step of the process. So if you have an LLM-based workflow and the first step of that is translating something into Spanish, and then the next step is summarizing it in one sentence, and the next step is embedding that one sentence in a template, and then the next thing is generating an image for it – whatever kind of string of things you have going on, you should have tests for each of those subtasks in the chain of processing… Which partially also gets to why testing agents is hard, which is, I think, an interesting thing to maybe circle back to at the end.
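A rough sketch of that shape, assuming a hypothetical `complete(prompt)` wrapper around whichever model is being called - each subtask becomes its own small function, so each can be tested in isolation rather than only end to end:

```python
# Each step of the chain is its own function, so it can be tested on its own.
# `complete` is a hypothetical wrapper around whatever LLM/API you actually use.
def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")


def translate_to_spanish(text: str) -> str:
    return complete(f"Translate the following text to Spanish:\n\n{text}")


def summarize_one_sentence(text: str) -> str:
    return complete(f"Summarize the following text in exactly one sentence:\n\n{text}")


def fill_template(summary: str) -> str:
    # Deterministic step: no model call, trivially unit-testable.
    return f"Weekly digest: {summary}"


def run_chain(text: str) -> str:
    return fill_template(summarize_one_sentence(translate_to_spanish(text)))
```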
I think that’s just good software engineering, what you’re describing. If you took it out of the AI world and talked about software functions, each one is a discrete function that does something, and you want to test it on its own, even though they’re all connected together to do something. I think that’s just really sensible.
Yeah. And the agent stuff makes this maybe a little bit more difficult, which we can come back to, but let’s assume that you have a workflow that you just want to execute over and over again, which is probably most enterprise use cases. So you split that up into subtasks; you have subtasks that you can test. The next thing I would recommend is to have people think about creating a set of tests in three categories. And this comes from the ideas of behavioral testing.
Just take the fraud detection piece for a second… So you’re asking fraud or not fraud. The first category of tests you want to think about is minimum functionality tests, which would be “This is the most fraudulent thing I can think of. It should always –” like, the most fraudulent Russian characters, Nigerian prince, whatever, take your pick - it should always be labeled fraud, right? 100% of the time; that is minimum functionality. These are not the most in-depth of tests, but they should pass 100% of the time, no matter what you do to the model, no matter what you do to your system. These are 100% pass. And you can do the same thing with LLMs. Even though it’s not a classifier, you can say “I’m creating a bot that–” it gives all the information about Prediction Guard. If I ask who the CEO of Prediction Guard is, it should always return the same name, right? That’s minimum functionality. That’s a pretty easy question. It should be embedded in the knowledge. These are things that should 100% be returned, and that you could test for deterministically - like, “Does that name appear in the response?”, that sort of thing.
The second category in fancy terms might be called invariant perturbations, but basically in non-fancy terms, changes in the input that shouldn’t produce a change in the output. So the classic example of this is if I ask an LLM to do a sentiment analysis of a statement, and the statement is “I love the United States. It is so amazing. It is so great”, I get positive sentiment returned. If I change the United States to Turkey, and say “Turkey is so great. It is amazing. It is wonderful” - in theory, regardless of what you personally think about the United States or Turkey, that should always return positive sentiment. That is an invariant change in the input. So you can make changes in the formatting, you can make changes in the ordering of things, and all of these should leave the output unchanged.
[00:44:12.18] And then of course, the final one would be the necessarily variant changes, meaning a change in the input should definitely produce a change in the output. Like, if I change “I love the United States” to “I do not love the United States”, I should actually have a change in the output, and that is a very easy thing to see.
And so what you do is you create a table of minimum functionality tests, a table of invariant tests, and a table of variant tests, and if you have those full tables, you can basically probe the behavior of the model and the sensitivity of your model to changes. And this sensitivity is really the thing that people get hung up on with these workflows. They do not realize how sensitive the models are to small changes in the input. So this allows you to gauge the sensitivity of your system as a real number, based on how many of these tests pass, and then work systematically to improve that.
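Pulling those three categories together - a sketch assuming a hypothetical `classify_sentiment(text)` wrapper around your model that returns "positive" or "negative"; the tables are the artifact, and a small report function turns them into numbers you can track:

```python
# Behavioral test tables for a hypothetical classify_sentiment(text) -> "positive"/"negative".
# Minimum functionality rows must always pass; the other two tables give you a
# sensitivity score you can track and try to improve over time.

MINIMUM_FUNCTIONALITY = [
    ("I love this product. It is amazing.", "positive"),
    ("This is terrible. I hate it.", "negative"),
]

# Pairs of inputs whose outputs should NOT differ (invariant perturbations).
INVARIANT_PAIRS = [
    ("I love the United States. It is so great.", "I love Turkey. It is so great."),
    ("This product is wonderful.", "this product is wonderful."),  # formatting change only
]

# Pairs of inputs whose outputs SHOULD differ (variant changes).
VARIANT_PAIRS = [
    ("I love the United States.", "I do not love the United States."),
]


def behavior_report(classify) -> dict:
    """Run all three tables against a classify(text) callable and return pass rates."""
    min_ok = sum(classify(text) == expected for text, expected in MINIMUM_FUNCTIONALITY)
    inv_ok = sum(classify(a) == classify(b) for a, b in INVARIANT_PAIRS)
    var_ok = sum(classify(a) != classify(b) for a, b in VARIANT_PAIRS)
    return {
        "minimum_functionality": min_ok / len(MINIMUM_FUNCTIONALITY),  # must be 1.0
        "invariant": inv_ok / len(INVARIANT_PAIRS),
        "variant": var_ok / len(VARIANT_PAIRS),
    }
```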
Interesting. So I’m trying to just kind of put all that together for my own learning purposes, and I’m trying to think how we can apply that to a workflow directly. How do you actually fit that into the nuts and bolts of moving it into production? Where do you do that in your workflow?
Yeah, so I would take the steps – let’s say I have a workflow with five steps. I take each of those steps and I produce five functions, or five classes, or however that fits into your code. And for function one, corresponding to step one, I create a table of each of those tests, with “table” meaning just input equals output. The same as if you were testing an API - like, if I get this input into my API, I should definitely return this. And so you create that table, and a set of unit tests, or whatever testing framework you use, to go over each one of those examples in your table and check the output to make sure it corresponds with what you expect, to get either a passing or a not passing score.
So the steps would be “I have my five steps of my workflow, I split each of those up into a function or class or whatever the relevant programming object is, and then I develop these sorts of tests for each of those functions or classes.”
Now, the one question that might come up here is “Well, should my model always pass? Should that function always pass all of those tests?” And what I tell people is “Well, it should always pass a hundred percent of the minimum functionality tests”, because you’ve defined those from the start as minimum functionality. So if you don’t have minimum functionality, then your software shouldn’t be released.
And then for the other ones, basically, I would say you should never be regressing on those. They give you a sense of what the sensitivity of your model is to those variations, and you would not want to regress; you would want to systematically make those better. So some people might treat those as a certain percentage or threshold that the system needs to stay above, or something like that.
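One way to encode that policy in a test suite, building on the previous sketch - minimum functionality must be perfect, while the other suites only need to stay at or above a stored baseline (the module name, `classify_sentiment`, and the baseline numbers here are all hypothetical):

```python
# Sketch: minimum functionality must pass 100%; the other categories must not
# regress below the best score achieved so far (the stored "baseline").
from behavior_tables import behavior_report, classify_sentiment  # hypothetical module from the sketch above

BASELINE = {"invariant": 0.80, "variant": 0.90}  # hypothetical stored baseline scores


def test_minimum_functionality_is_perfect():
    report = behavior_report(classify_sentiment)
    assert report["minimum_functionality"] == 1.0


def test_no_regression_on_behavioral_suites():
    report = behavior_report(classify_sentiment)
    assert report["invariant"] >= BASELINE["invariant"]
    assert report["variant"] >= BASELINE["variant"]
```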
Gotcha. That takes us back to your point earlier… It just sounds like data science, good data science right there.
Yeah. Yeah. Well, it’s interesting because the roles have shifted, right? I think we went through those phases of data science being very Wild West, all the way up to like good engineering practices and testing and all of that. And we kind of now throw out the data scientist in the middle and we have business people developing these workflows, and trying to integrate them into software… And yeah, there’s a lot of reminders and learning I think that needs to be done.
And that connects to where we started out the day, which is agents in production. Agents are hard to test, because you don’t know the workflow up front. An agent determines what steps it’s going to accomplish on the fly. And so if you don’t know that workflow up front, then there’s some interesting things that you might need to do to test those. But maybe we’ll save that for another episode, and/or people could join the great learning opportunity that is the Agents in Production event from the MLOps community.
Oh, absolutely. You know what - you started the show asking what agents I had, and I had to say “No.” I didn’t have any going. You got me thinking, and now that we’re talking about workflow, and testing, and getting those agents working, we’re going to have to come back to this topic. I’m going to have to bring something though to discuss.
Yeah. We’ll come up with some agent ideas and maybe work through the testing of those. I like that.
Sounds good. Okay, thanks a lot for the insights today.
Yeah. Thanks, Chris. I hope you have a good rest of the day, and we’ll swap places geographically next week…
There you go.
…and work our way towards tofurkey.
That’s perfect. Sounds like a November.
Our transcripts are open source on GitHub. Improvements are welcome. 💚