Practical AI – Episode #291

Practical workflow orchestration

with Adam Azzam from Prefect


Workflow orchestration has always been a pain for data scientists, but this is exacerbated in these AI hype days by agentic workflows executing arbitrary (not pre-defined) workflows with a variety of failure modes. Adam from Prefect joins us to talk through their open source Python library for orchestration and visibility into Python-based pipelines. Along the way, he introduces us to things like Marvin, their AI engineering framework, and ControlFlow, their agent workflow system.


Sponsors

WorkOS – A platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their application. Add Single Sign-On (Okta, Azure, Google, Microsoft OAuth), sync users from any SCIM directory, HRIS integration, audit trails (SIEM), free magic link sign-in. WorkOS is designed for developers and offers a single, elegant interface that abstracts dozens of enterprise integrations. Learn more and get started at WorkOS.com

Shopify – Sign up for a $1/month trial period at shopify.com/practicalai

Notion – Notion is a place where any team can write, plan, organize, and rediscover the joy of play. It’s a workspace designed not just for making progress, but getting inspired. Notion is for everyone — whether you’re a Fortune 500 company or freelance designer, starting a new startup or a student juggling classes and clubs.



Chapters

1 00:00 Welcome to Practical AI 00:34
2 00:35 Sponsor: WorkOS 03:21
3 04:09 Workflow orchestration 05:24
4 09:32 Common pain points 05:05
5 14:37 What makes ML orchestration different 06:06
6 20:57 Sponsor: Shopify 01:32
7 22:46 Intro to Prefect 07:07
8 29:52 Conversion experience 07:21
9 37:25 Sponsor: Notion 02:03
10 39:43 Adam's friend Marvin 04:49
11 44:32 Agentic approach 07:18
12 51:50 What's next? 05:11
13 57:01 Thanks for joining us 00:36
14 57:38 Outro 00:46

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack, I’m CEO at PredictionGuard, where we’re building private, secure Gen AI, and I am joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing today, Chris?

Doing fine, Daniel. Maybe a little bit too much coffee on my side. I’m just ready to go. I want to get into the show, man.

You’re fired up.

I’m fired up. I’m on it.

Yeah. Well, might you potentially need to take all of those things that you’re fired up about and orchestrate them?

Oh, yes. Indeed.

…maybe into some type of workflow?

Most definitely. Most definitely.

Well, we have a great guest for you today then… We have with us Adam Azzam, who is a principal product manager at Prefect. Welcome, Adam.

Hey, Chris. Hey, Daniel. Thanks for having me.

Yeah, I mentioned in the pre-show that this is the second episode we’re recording in a row based on some shout-outs from our friend, Bengsoon Chuah, who was building some awesome Broccoli AI things, practical AI that’s healthy for your organization… And he mentioned Argilla, which we just recorded with, but also Prefect… And yeah, we’re just really excited to hear about such a practical thing, that definitely overlaps with, hopefully, a podcast that’s trying to make some of this stuff practical.

So yeah, I’m wondering if you could tell us a little bit – maybe personally it sounded like you kind of had a background in wrestling through some of these workflow orchestration things, even before joining Prefect. So how did you kind of come into thinking about these types of problems, and what’s needed there practically from the developer standpoint?

I stumbled into workflow orchestration by accident. It was a total necessity for a previous startup that I was working on… I was working on basically a career co-pilot or like a job search co-pilot in early 2023. And there we were using medium to large language models to basically help job seekers conduct job searches in the background, and then we would let them have conversations with jobs, and we would also be able to explain and deduce why they were good fits for particular roles that we matched them with. And what that meant from an engineering point of view is I had to take your routine callout to a large language model, and then now I had to do this millions of times a week. And when you do this at the scale of millions of times, for different types of pipelines, whether or not it’s extracting schemas from particular applicant tracking systems or job postings, whether it’s conducting a conversation with a job seeker, you just start seeing exactly how these things fail and break down at scale…

And so when I went to my brain trust and was like “Look, I’m having a hard time dealing with resiliency issues at scale. What should I use for something like this?”, I heard the usual players, like Airflow. So I got that spun up, and it felt like I had to learn a completely new language… And when I got stuck debugging a particularly nasty Airflow job, I turned to a good friend of mine who was running the conversational AI team at Square, and he was like “Oh, cool. I’ll help you debug this”, and then he pip-installed Prefect.

And so we spent that entire afternoon just rewriting all my pipelines in Prefect, and that’s really what kind of – you know, you speak about sort of Broccoli AI… That is what turned a very well-meaning, well-designed consumer AI app into something that, from an engineering perspective, was durable and resilient. I don’t want to presume that everybody knows what workflow orchestration is. It was a thing that – I didn’t really know what it was until I needed it.

[00:08:12.09] I think we need a definition from you, actually.

I like to think about workflow orchestration like this: you tell us what to run, where to run it, when to run it, and what to do when it fails, and Prefect makes it extremely easy to author those rules on top of your typical script that you’ve got running locally.

So if you’ve got your Hello World script, you’ve got your callout to OpenAI script, you’ve got your “go scrape this web page, and then go extract a bunch of data from it”, we give you the tools to say “Great. This works locally. But now if I want to run it on a massive K8s instance, or I want to run it on AWS, or I want to run it on Azure”, you can do that in a line of code.

If you want to run it every third Thursday when there’s a full moon, we make it incredibly easy to express. When you want to say – if you start getting a bunch of 503 errors because the resource is down, or maybe you start getting 429 errors because LinkedIn is onto your scraping work - you’re sort of, let’s say, building a pipeline off of it - you can control the custom behavior of saying “Well, pause this workflow, or maybe start using a different IP address.” So we allow you to bake in different handlings of failure, and sort of native resiliency into that. So retries, idempotency, caching, transactions.
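To make that concrete, here is a minimal sketch of what authoring those rules looks like in Prefect; the URL, function bodies, and retry settings are just illustrative:

```python
from prefect import flow, task


# Retry up to 3 times, waiting 30 seconds between attempts, to ride out
# transient 503s or 429s from the upstream resource.
@task(retries=3, retry_delay_seconds=30)
def scrape_page(url: str) -> str:
    import urllib.request
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()


@task(retries=2)
def extract_fields(html: str) -> dict:
    # Stand-in for the "extract a bunch of data from it" step.
    return {"length": len(html)}


# The flow itself is plain Python; where and when it runs is configured
# separately, when you serve or deploy it.
@flow
def scrape_and_extract(url: str) -> dict:
    html = scrape_page(url)
    return extract_fields(html)


if __name__ == "__main__":
    scrape_and_extract("https://example.com")
```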

So before we dive into kind of the golden path –

I can see you’re intrigued, Chris.

I am intrigued. Our audience may only be listening to us, but we’re all looking at each other on video, and stuff, and I guess I definitely – I am intrigued. So go back for just a moment to the painful parts of workflow orchestration; talk a little bit about – because I get that Airflow… Sorry to the Airflow folks out there. Airflow wasn’t the right thing for you. You went to Prefect and it solved that. But before we get to that, what was the pain - and I’m not trying to pick on Airflow, but just in general, with workflow orchestration, what is it that hurts so bad, for people to understand kind of the pain you came through?

Yeah, so when you’re at a startup, you have to do every single job. And so let me talk about pain number one. Pain number one is I’m like a trained data scientist, I had to overnight become like a machine learning engineer and a DevOps engineer just to get stuff out into the world… And how to sanely go from something that I have working locally, to something that can operate at real scale in the cloud, was – you know, there are specialists who can make that work really well. At the time there weren’t a lot of tools to make that intuitive, and I had other things to worry about… So a workflow orchestrator with a good, ergonomic interface to infrastructure was absolutely key for me to go from essentially the design work of the product to actually getting it live, and accruing value to users. That’s step one, which was like – I was a big infrastructure dummy, and this allowed me to approximate a very smart infrastructure person with a few lines of code.

I would say that from the actual workflow side, infrastructure aside, is that when you’re orchestrating say LLMs in particular, they can fail for a whole host of reasons. At the time, when you didn’t really have good structured outputs - I’ll say what that means, which was… You know, at the time, if I had tens of thousands of job posts that I was trying to extract information from, even though they were just one single blob of unstructured text, you would say “Here’s the JSON blob that I expect out of this. I expect keys that tell me what the title is, what the location is, the salary…” And so I would take a document and try and ask OpenAI and say “Look, here’s the schema that I expect out of this. Can you extract this information?” and it would fail for a whole host of reasons.

[00:12:07.08] One, the API for OpenAI was brittle at the time. So it would fail because the resource was unavailable. And then sometimes it would fail because the information wasn’t available in the actual document I was trying to extract from it. Then you would have all sorts of parsing issues where what it was returning to you wasn’t valid JSON.

So there was this whole host of cascading errors: either the actual resource was down, there were problems with the infrastructure I was working with, the data quality was bad, the generation from the LLM was bad, or there were like syncing issues when I wanted to go put that into a data warehouse at the end of it. And so we saw so many different cascading layers of failure - they weren’t going to come up every single time, you wanted to react to them differently, and you wanted to be able to express those contingencies in code.

And when I wasn’t working with a workflow orchestrator, if the machine that I was on failed over, and I was say 90% of my way through a very expensive job, I would lose the entire state of my workflow, and I would have to start over from scratch.

And when things are linearly ordered, nice and serialized - maybe that’s not that big of a deal. You’re saving records as you process them, so you know how to restart over, you’ve got a cursor that you built yourself with your database… But for things that require like “First, I’m going to do these four things, and then they’re going to map into these two jobs. Then I need to go do 1000 things that come from that, reduce them to two things”, keeping track of the state of where you are in that pretty, say, complex graph of dependencies - if you fail on a single node, where do you restart from? And depending on how big those jobs are, it can get really expensive if you don’t treat failure as a first-class citizen. And so I would say that was the big thing, which was, when I was building this, I was spending about 5% of my time writing code for the upside, like “If everything works, this is how the code should run”, I had to spend 90% of my time handling failure… And since workflow orchestrators allow you to handle that failure gracefully, and as a first-class citizen, it made my life a lot easier. Does that answer your question, Chris?

It does. No, it’s a great answer.

Yeah, maybe just a follow-up on that… So you mentioned some of the things specifically with Open AI kind of failure modes, different things that were happening early on… I’m wondering if you could talk specifically – so workflow orchestration is a larger idea than maybe just machine learning pipelines or AI-related workflows. But what is it specifically about maybe these sorts of workflows that a lot of us are trying to build now that maybe further stress some of these issues? And/or maybe even providing some examples within that would be helpful. What makes machine learning workflow orchestration or AI workflow orchestration different and the same than just sort of like general workflow orchestration?

If I return back to – I gave like five or six theses of things where I was encountering failure a lot… And I think that some of those things are sources of failure that many folks are familiar with. I’m calling out to an external service, and the service is flaky, and it’s bad… That’s existed forever. As long as people are building data pipelines, upstream sources being flaky - that makes sense. Hitting deterministic errors of like - I’m ingesting data, but somebody added a new field, or they removed a field, and I’m dependent on that, and now my pipeline’s broken. Or I’m scraping something, and target.com, instead of labeling the name of their product with a div whose name or ID is this, they’ve changed it to that, and now all of my data is corrupted, because I couldn’t detect that in real time.

[00:16:12.02] And then the last piece is the loading part, where you’ve gotten, you’ve cleaned all your data, and now you want to go put it in a persistent place where you can go query it, or do analytics on it. So that’s like classical stuff. The extraction, calling out to an external service, the transformation that you’re doing deterministically, and the loading. So the sort of ETL business of all of that. That’s like a persistent problem that exists far before Gen AI, or ML workflows… And that’s sort of what’s been category-defining for workflow orchestration. That’s like the single case that people usually bring in an orchestrator for - when they’re doing ETL-type jobs.

I would say that what’s unique these days is that since workflows in LLM land are now more dynamic, so you really can’t plot out every single thing that’s going to happen from the start, they are now basically – we’re dealing in English now, or you may not know the full space of their responses at the beginning of a workflow. So I think that LLMs introduce a dynamism component that’s hard to reason about, and has kind of escaped classical workflow orchestration.

I think the second piece is that the nature of errors here just feels totally new. So the fact that you can ask an LLM for a particular shape of a response, and then you can get parsing errors out on the end of it - that’s a new source of failure that’s now buttoned up with, say, some commercial LLM providers that give you very structured, guaranteed output, so you don’t see as many parsing errors… But now – I like to joke that you can lead an LLM to JSON, but you cannot make it think, where - like, you can say “Look, I’ve got this job description, and the title is… I’m going to give you a schema that says the job title, the location”, or whatever. And sometimes you’ll say the title, it’s required, it has to be a string, and the response you get out is “I’m sorry, but I could not find a title.” And when you’re doing this at tens of thousands of jobs, now you also have to reason about “Okay, I’ve now gotten past the parsing error, but the error was pushed down deeper in the stack.” Now there are data quality errors that I have to reason about, that I didn’t really have to account for in last generation’s ETL. Things were much more deterministic, we had stronger contracts about what you’re going to get.

And then I would say the last piece is – what makes this harder is that so far… And I hate to keep throwing out random definitions. I hate being a merchant of complexity, and talking about why things are super-hard… But trying to at least motivate why this is a new source of difficulty, right? Like, we’ve got tools to handle this, but why do we have these tools in the first place? And the last piece is around like agentic workflows. Now, this is a buzz term… So what do I mean when I say “agentic workflows”? Everything that I’ve talked about so far, of like I get a document, and I want to extract stuff from it. Or maybe I want to classify it, or I want to summarize it. These are all sort of modern takes on classical ML problems, right? You don’t have to bring as much training data, you don’t have to train a model first… You’re basically throwing the weight of the compressed internet at every problem that you come across.

But with agentic workflows, what I mean by that - those are things that operate in a loop, are able to call out to external tools should it choose to, and can create, refine and reflect on its own plan, which means when you’re orchestrating agentic workflows, you have to do this interplay between who’s doing the orchestration… There are some times where I’m coming correct with a plan, and I’m saying “First extract this topic, then classify it, then write an email and send it off”, but now I have to be able to add resiliency to a workflow that I’m unaware of at the beginning of it. So if it decides “Call out to this tool, call out to this API”, I now need to be able to reason about resiliency for a workflow that I don’t have any visibility to at the beginning.

And so I would say that those last three pieces around parsing, around not knowing your full decision space at the beginning, and then how that feeds into now having to hand off some bits of orchestration to the LLM to create its own tasks that you now have to execute - I think that’s what makes the new generation of orchestration a much harder and a much more interesting problem.

Break: [00:20:44.08]

So going back – that was a great explanation there before the break. And earlier I was driving you back to the pain, and you’ve kind of carried us forward… So now I want to dive into kind of how you get it all fixed. Wondering if you can start introducing us to Prefect Core, and talk about the open source aspect of it, and how it’s fixing, and then we’ll continue from there later. But I’d really love that – give me an intro to it now.

So Prefect is an open source Python library for workflow orchestration. You can pip-install Prefect today. Last month we put out our latest major version, so Prefect 3 is out into the world, and it’s free to use.

So what Prefect enables you to do is - when we talk about where did all this pain come from… Where the pain came from were classic problems like “I’ve got this flaky resource, but I know that it’s alive.” The resource isn’t dead, it just – I knew that when I tried to access it, it was overwhelmed, so I need to try again… Prefect makes it incredibly easy to say “Look, if you ever see this task fail, retry it. Here’s how many times you can retry it.” If you want to go really deep on it, in the same line of code you can say “Here’s how long you should wait between retries if you want to be respectful of the resource.”

So Prefect makes it really easy to add retries, Prefect makes it really easy to do caching. So what I mean by that is - often when you’re building LLM workflows… You know, we talked about if you have really, really complex dependencies between different parts of your workflow, if something fails over and you need to run it again - well, you can either like try and pinpoint the exact dependencies that you have, blah blah blah, or you can say “Well I’ve already successfully executed this stuff, and it has all the same inputs… So I’m just gonna hold on to a reference to what its output was, and then now I can just zoom back to where I was.” So I don’t have to sort of – we keep track of the state of your workflow, but we make it really easy to say when you want to recompute an answer, or when you want to just sort of recall it in a very inexpensive way.

And we also make it very easy to add transactional logic. What I mean by this - if you’re a database nerd, you’ll know what transactions are. They’re just a way of like collocating work with each other, and then being able to control what happens if something fails. So maybe I want to have task 1 and task 2, but if I see a failure in task 2, I want to be able to express how to undo the fact that task 1 already executed. So Prefect makes it super-simple to say “If I see a failure in a group of tasks, I want to go and undo maybe some writes that I did”, which is really helpful when you’re trying to build applications that are like building up knowledge about a particular user, where you get to the end of something, you realize that you’ve made a reasoning mistake, and now you need to walk back through the previous stuff that you’ve committed to, say, your knowledge graph, or your vector database, or what have you.

Those are like three nice features that we make it really, really easy for folks to express. We have handles for things like timeouts… If you’ve got a really sensitive operation and you want to put an SLA on it that says “This thing can’t take more than 10 seconds to run”, we can fail it out. And then we give you really good ways of custom-handling those errors. So if something fails for one reason, say OpenAI’s API is down, you can just cancel the rest of your workflow. If it’s not down, now you can reschedule it to retry later.
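A rough sketch of those pieces under Prefect 3 follows; the transaction and rollback hooks reflect one reading of the Prefect 3 docs and should be treated as an assumption, and the extraction and write logic are stand-ins:

```python
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash
from prefect.transactions import transaction


# Cache by input hash: rerunning the flow with the same document reuses the
# earlier result instead of paying for the LLM call again. A 10-second
# timeout fails the task out if it hangs.
@task(cache_key_fn=task_input_hash,
      cache_expiration=timedelta(days=1),
      timeout_seconds=10,
      retries=2)
def extract_record(document: str) -> dict:
    return {"title": "Data Scientist"}  # stand-in for an LLM extraction


@task
def write_record(record: dict) -> None:
    print("writing", record)


# Rollback hook: runs if the surrounding transaction fails after
# write_record has already executed, so the write can be undone.
@write_record.on_rollback
def undo_write(txn) -> None:
    print("undoing write")


@task
def validate(record: dict) -> None:
    if "salary" not in record:
        raise ValueError("record is missing a salary")


@flow
def pipeline(document: str) -> None:
    record = extract_record(document)
    with transaction():
        write_record(record)
        validate(record)  # a failure here triggers undo_write
```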

And I think that may happen, because - I mean, everyone just kind of walked out the door a few days ago…

[laughs]

Yeah, hot takes abound today… [laughter]

I’ve gotta say, as somebody who built on OpenAI in the early days - like, I’m still such a fan; I’m wishing the best for that team, for sure… But yeah, I think they’ve had a rough couple of months. So I’ll say that that’s like classical stuff where you’ve got your Hello World Python function. This is not complicated. It’s literally like retries equals three. It’s timeout equals 50 seconds. You slap a decorator on top of your function, and then it basically gives your Python code superpowers. And this is stuff that normally you would have had to write your whole own framework to do, and here you can do it in a couple lines of code.

I would say that the other piece around Prefect, the other two core value propositions are around infrastructure. So if you have something that’s working locally, you don’t have to be like a DevOps engineer to get it working on Kubernetes, if that’s your flavor… But honestly, if you’re like “Look, I’ve got this thing working in this Docker container locally on my machine. I want to get it working on like Amazon’s ECS”, it’s really like a one-click deploy in order to get it working somewhere else.

And the last piece is around observability. We haven’t talked about it, but observability is really this pair that comes with orchestration where if orchestration is really the practice of like “Crap, how do I get this stuff to actually run, and how to react to it when it fails?”, there are some times when it just cascades and fails all the way down. And no matter how much you account for everything, everything goes wrong, and you need to be able to see what went wrong, so that you can replan. You’re not going to be able to handle the entire universe’s amount of complexity ahead of time, so when it does fail, you need to be able to learn from those mistakes as a human being, and actually write in the logic to handle that for next time.

[00:28:17.18] So observability is really this element of “This thing failed. How do I provide the breadcrumbs to figure out why it failed? Which data provider was the one that caused me to fail? Was it a parsing error? How often am I seeing that parsing error?” If it’s OpenAI’s fault, now there’s a bridge between observability and orchestration. “If OpenAI has failed more than 80% of my requests in the last 10 minutes, now I’m gonna switch to Anthropic”, and we make it easy to switch those two things out for each other.

And so I would say that between just adding your most native retry caching transactions, making it dead simple to submit to infrastructure, and then the last piece around really having a clear understanding of how things are working, and importantly how to figure out when they fail - those are really the core value-adds of Prefect.

Yeah. And I would definitely recommend people check out the Prefect docs. There’s a quick start in there… Just to give people a visual of this, I see you have this kind of converting your Python script into a Prefect workflow type of thing… That’s really helpful; there’s sort of these definitions of a workflow related to getting information about a GitHub repository, like contributors, and repo information… And then those are in Python functions, like definition [unintelligible 00:29:38.14] contributors function, and then above that you kind of put these decorators task and flow decorators with options to convert this into a Prefect workflow.
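The quickstart example being described here looks roughly like the following; the function names follow the description above, and the exact docs code may differ:

```python
import httpx  # ships as a Prefect dependency
from prefect import flow, task


@task(retries=2)
def get_repo_info(repo: str) -> dict:
    resp = httpx.get(f"https://api.github.com/repos/{repo}")
    resp.raise_for_status()
    return resp.json()


@task(retries=2)
def get_contributors(repo: str) -> list:
    resp = httpx.get(f"https://api.github.com/repos/{repo}/contributors")
    resp.raise_for_status()
    return resp.json()


@flow(log_prints=True)
def repo_report(repo: str = "PrefectHQ/prefect") -> None:
    info = get_repo_info(repo)
    contributors = get_contributors(repo)
    print(f"{repo}: {info['stargazers_count']} stars, {len(contributors)} contributors")


if __name__ == "__main__":
    repo_report()
```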

So part of my follow-up on that, just to kind of really hone in practically, to give people a sense of what it’s like to do this… So let’s say that I’ve converted my Python code into one or more Prefect workflows. In my understanding, if I’m looking at the docs right, I can run that with just python and my workflow script, but obviously, like you said, there’s a deployment element to this… So could you talk a little bit about – I know one of the things that’s always been a struggle in my past is maybe getting something working locally, and then, because I’m going to run it in a different environment or something like that, I deploy my workflow on top of other infrastructure. And it’s hard to connect that local development and debugging with the staging and production environments… So practically, what does that look like in the Prefect world, in terms of going from that Python command to run your workflow locally, to watching things flow through your workflow in a nice dashboard that you have pulled up in production?

So I’ll do my best to do this without a screen to share.

Yeah, it’s hard.

So you had said “Talk to me about the experience of taking my Python code and converting it to a Prefect workflow.” For folks that are listening, it’s as simple as “from prefect import flow”, and then @flow on top of your function. That’s all the work it takes to take a Python function and give it superpowers, basically. To add all the stuff that we’ve talked about. And that doesn’t take away any sort of attributes of that function. If I take my Hello World or my ETL flow that runs locally and I start adding Prefect things on top of it, it still runs locally. And so it’s not like – we don’t sort of require additional infrastructure just to have your stuff run. We detect whether or not you’re running it locally. And if you’re running it locally, we execute it as if it’s regular Python functions.

[00:32:03.29] Now, there’s kind of two stories of “Well, this thing is working on my machine. How do I get it to run somewhere else?” We have two ways of doing this - and I’m sorry for the vocab lesson - which is let’s say that… What’s the crawl/walk/run for anybody trying to get something to run remotely? It’s you start off on your machine, and then it’s like “Well, I’m gonna go get a server somewhere, and I’m just – if I can close my laptop and go to bed and this thing is still running on my server, that is what counts for me as running it remotely.” That’s the next step in the hierarchy of needs.

So for that, if it can run locally on your server - you go to EC2, you SSH in, you’ve got your console, you pull your GitHub repo, you pip-install your requirements… You’re still able to execute that flow on your machine, and now if you execute that function and you hit say like .serve, now in the same way that you would start a FastAPI application, it is now running, you can specify a schedule that it runs on… It exposes an HTTP endpoint that you can call out to if you want to invoke it on demand, and it listens to events.

So maybe you don’t want to just run it on a schedule. Maybe sometimes you just want to be able to hit a specific endpoint to manually invoke it, maybe from your Django app or something like this, or maybe you want to emit an event into the world that says “Look, when my user signs up, when my dataset is ready, I want to have this ETL flow go and operate on it. Because my data is not always ready at 6 a.m. Sometimes it’s not uploaded till 6:02.” So sometimes you want to have things be a bit more dynamic. That’s what happens in a [unintelligible 00:33:47.10]
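As a sketch, the .serve step being described looks something like this; the deployment name and cron schedule are invented for illustration:

```python
from prefect import flow


@flow
def nightly_etl() -> None:
    print("running ETL")


if __name__ == "__main__":
    # Starts a long-running process on this machine. The flow runs on the
    # cron schedule, can be kicked off on demand through the Prefect API/UI,
    # and can be wired up to event triggers.
    nightly_etl.serve(name="nightly-etl", cron="0 6 * * *")
```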

So where are we in the hierarchy of needs? I had it on my laptop, but it stopped running when I shut it. I go and I get an EC2 instance, I have it running there… Now I can shut my laptop and it’s still running. And then now how do you do this at a massive scale? And how do you do this auto-scaling with respect to the amount of work that you have, right?

If your EC2 instance - you hit it 10,000 times, and all of those things, every invocation requires downloading some data, that machine’s gonna fail over quick. So what happens when you want 10,000 things to spin up just as many machines and you want to fan out? That would look like my flow.deploy.

So I’m here on my laptop, I’ve got my Hello World flow… I would literally just say “If name equals main”, like I’m writing any other script, flow.deploy, and the sort of autocomplete in your IDE is gonna tell you “Alright, what do you want to name this thing? Where do you want it to run? Do you want it to run on ECS? In Amazon? Do you want to run in Google Cloud Function? Where do you want it to run? Can you point us to – do you have this in a Docker container somewhere? If so, we’ll just like pull it from whatever Docker registry that you have. Or do you want to point us to a GitHub repository?”
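And the .deploy call looks roughly like the following; the work pool and image names here are hypothetical placeholders:

```python
from prefect import flow


@flow
def nightly_etl() -> None:
    print("running ETL")


if __name__ == "__main__":
    # Registers the flow against a work pool (e.g. ECS, Kubernetes, Cloud Run)
    # and points it at a container image; workers in that pool provision
    # infrastructure per run and fan out as needed.
    nightly_etl.deploy(
        name="nightly-etl-ecs",
        work_pool_name="my-ecs-pool",
        image="my-registry.example.com/etl:latest",
    )
```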

And one of the nice things is when you build a tool like Prefect that treats failure as a first-class citizen, you see the failure modes of tens of thousands of people trying to get their code to run, and so this really allows us to build an ergonomic experience that’s like “Look, here’s the minimal stuff that we need to know in order to have a high-reliability guarantee that this is gonna work” when you shut your laptop, when you submit it to a Kubernetes cluster, what have you.

So the experience is really pretty dead simple. If it’s running locally on your machine, you can hit .serve, and now it starts a process on your machine. If you want it somewhere else, it’s .deploy, and we can guide you through exactly how to point it at remote infrastructure. Does that answer your question, Daniel? I hope I’m not glossing over anything.

It does, yeah. That’s great. I just have always felt this as a pain in my own life… So I often like to ask the selfish, practical questions on this podcast when I get the chance for them.

Yeah. You know, one thing that you had said which I did gloss over is - you had said “What’s the experience of deploying?” and then “Tell me about a shiny dashboard that you see”, or something like this. So once I go through all the work and I have my workflow running out in the world, I have this really beautiful UI that I can log into. And when I access that UI, what it displays for me is it basically tells me – it’s almost like standing on the platform above the factory floor, right? You can see everything that’s in progress. You can see the box of widgets that everything has produced at the end, and you can see everything that was broken in the process.

And so if you’re the type of person that treats your workflows as cattle and not as pets, it’s very easy to see “Yeah, you had 10,000 jobs that succeeded. Here’s the 10 that failed.” You can easily click in, see “Okay, the 10 that failed… Why did they fail?”

If you’re on Prefect cloud, which has a very generous free tier you can run a business on, we do AI summaries of the errors… And so we will summarize like “Look, these workflows are failing for these reasons in natural language”, so that you don’t have to dig through a 10,000-line stack trace just to find out that you had an out of memory error.

Break: [00:37:17.06]

So Adam, I want to ask you a question. I understand that you have a friend named Marvin. I’m wondering if you can tell me a bit about Marvin.

Yeah, so Marvin means a lot of things at Prefect. Prefect, if folks don’t know, is an homage to Ford Prefect from Hitchhiker’s Guide to the Galaxy, and the genius alien that – well, I think Marvin’s a bit more of the genius. But the alien at the beginning of the book, that’s just like “Come on, let’s get out of here”, introduces him to the world, or to the universe…

I just reread that last week, by the way. Just completely randomly there. I just had to say that.

Amazing timing.

For folks listening, this is not like paid actors, I promise. And then Marvin is his terribly depressed android companion, that just kind of has the curse of knowledge, knows everything…

“Everybody hates me…”

Exactly. And so Marvin started off as a mascot at Prefect. And [unintelligible 00:40:46.15] So we had Marvin the duck, which was this lesson of like often the fastest way to triage failure is by rubber-ducking… So we would send like ducks off to new customers to put on their desks, so they could literally rubber-duck with Marvin… And then Marvin was the name of our first internal LLM-powered Slack bot, which is still alive today. It serves a community of 30,000 data engineers that show up in our community Slack every day, who have questions about Prefect.

So if somebody’s trying to configure particularly complex behavior, maybe that’s a blind spot in our docs, Marvin is hooked up to all of our docs, all of our GitHub issues… Everything that’s ever been written about Prefect, Marvin is hooked up to. And so in Prefect today, if you’re a community user, you can show up and you can say “Hey, Marvin, I saw this stack trace, and I thought I’d configure this correctly. What’s going on?” And Marvin will be like “Oh, well, actually, if you configure it like this, you maybe want to check out this doc or this video.”

And so Marvin has really created this personalized interface to learning Prefect, that docs never can be. Docs are always kind of shooting for the middle, I don’t know, 80% of technical prowess, the middle 80% of use cases, and LLMs, just as they’ve done for pretty much every other product, create a human interface [unintelligible 00:42:11.26] So somebody can show up as they are, and get a personalized interface to our docs.

The last piece is Marvin is our opinionated LLM framework. So Marvin is an incredibly popular project. It’s got 5,000+ stars on GitHub, and it got started in early 2023, where we had just built our internal Slack bot, and we had felt that existing options in the world didn’t allow us to write LLM workflows in a way that felt particularly natural or ergonomic to us. And as a company that more or less was like “Look, we think that writing dynamic workflows and configuring infrastructure - those are hard problems, but we should take on creating a good, ergonomic interface to them, so you don’t have to face that complexity.”

Prefect is obsessed with giving people access to sort of the complexity of the world, without having to face it themselves. They can opt into it whenever they want, but you shouldn’t have to have like a CS degree just to get a script running on another machine. And similarly, you shouldn’t have to import 1,000 agents and 1,000 different data loaders, and you shouldn’t have to learn a new common expression language just to get LLMs to do the work that you want… And so Marvin is like “You’ve got a Pydantic model? Guess what? You decorate this thing, and now this thing that was responsible for validation - you put a document into it, and now it can extract data from it.” You’ve got a typical Python enum? You decorate this thing, and now you’ve got a classifier. You’ve got a function? You write in a doc string that says “Here’s what this function is supposed to do”, and now you can get strongly-typed outputs, because we’re just reading from the function signature and its return annotation, and we’re basically trying to figure out how to make writing LLM workflows feel very Pythonic and very ergonomic. And it’s been a blast to build, and it’s just welcomed so many AI engineers into basically our ecosystem, who saw a really sane way of writing Pythonic LLM workflows that they just didn’t have in any alternatives.
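Assuming a Marvin 2.x-style API, the patterns described above look roughly like this; the Pydantic model, enum, and posting text are invented for illustration:

```python
import enum
from typing import Optional

import marvin
from pydantic import BaseModel


class Job(BaseModel):
    title: str
    location: Optional[str] = None
    salary: Optional[str] = None


class Category(enum.Enum):
    ENGINEERING = "engineering"
    DESIGN = "design"
    SALES = "sales"


posting = "Senior Data Scientist, remote (US), $150k-$180k, Python and SQL required."

# A Pydantic model becomes an extractor...
job = marvin.cast(posting, target=Job)

# ...a plain enum becomes a classifier...
category = marvin.classify(posting, labels=Category)


# ...and a typed function with a docstring becomes a strongly-typed LLM call.
@marvin.fn
def summarize(text: str) -> str:
    """Summarize this job posting in one sentence."""


print(job, category, summarize(posting))
```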

[00:44:28.06] And I guess if we take that maybe a small step beyond, I know that you also have ControlFlow?

…as another piece of this stack. Could you kind of highlight maybe the differences, overlaps, kind of the distinction there? I know it’s really interesting I think what you said before, in the sense that a lot of people are trying to build these agentic workflows, and a lot of those things are very flaky and hard to debug… That’s the main thing that I hear, and the main – which is often why I think people are decomposing their agentic workflows into maybe static kind of workflows, that aren’t really agents anymore, but they can be debugged, and they can check errors, and that sort of thing. So yeah, I would love to hear about your kind of approach on the agentic side, given the sense that you’re treating those sort of failure states as first-class citizens, that sort of thing.

Yeah, so I think one thing I want to talk about is what’s the value of agentic workflows. I feel like a lot of folks just hop into it being like “Cool. Now I can make this thing agentic. But for what reason at all?” So a little context here - I truly believe that LLM workflows… And just to make sure we have the same vocab, it is like very deterministic workflows, that say “First, I’m going to call out, I’m going to extract this data. If I see the extracted data, now I’m going to classify this piece. And then if that, now I’m going to send an email.” Those are things where the human is really writing the logic. If we were to take this workflow and we were to go back to 2016 or 2017, it would look the same, but instead of calling out to an API, we would say “Now I’m going to have some hidden Markov model that’s going to go and do my entity extraction. That’s my step one. If I detect this entity, now I go to my next thing. Now I might have some interpretable logistic regression model try and classify the output coming from it before.”

So LLM workflows are really like “How do I take traditional workflows that existed in the world of ML, but now, instead of having to go and train models, I can essentially wish models into the world through natural language?” And those workflows are amazing. Frankly, I think that they solve most business problems that you need them to; that’s like a very reductionist, broad brush that I want to paint with… And they’re incredibly easy to debug, they’re incredibly easy to observe, because the logic is something that you wrote. If your first if statement fails, or something like this, you’re able to say “Well, I knew exactly where in my logic this thing failed, and so now I can go debug this tool that I called out to.” Basically, it’s easy to debug because it’s not changing our mental model of how to build data pipelines. And so all of the tooling that we built in the last decade, around building robust evaluations, collecting data to try and figure out how often they’re succeeding - these are tried and true. So we’re able to basically do our own bit of transfer learning from decades of ML research and evaluating ML pipelines onto LLM pipelines.

Now, there was a paper that came out from AlphaCodium a few months ago that was basically like “Look, if you want to, say, build your own automated software engineer, there is only so far you can get with, say, a deterministic LLM workflow that’s built on top of GPT 3.5.”

[00:47:56.22] Even though it’s all just kind of like function calling under a big while loop, they were able to show that if you adopt this paradigm where you start turning over some of the planning logic to the LLM itself, it’s able to outperform the top-of-the-line frontier models on specific tasks.

So for things where the decision space is very large, like “What piece of code do I want to run next?”, stuff where the information space is very large, I’ve got a giant codebase to reason about, agentic workflows really tend to thrive… Because if the decision space isn’t known to you at the very beginning, agents are able to discover that dynamically, create and refine on a plan that you as a human, you would take a lot of ifs to get to the same style of performance.

So that difference between LLM workflows, which are basically prompt engineering, calling out, calling a function, single shot stuff for single shot tasks - that’s what I call LLM workflows; the stuff where you turn over behavior or orchestration logic to an LLM itself to decide its own plans, maybe even create its own tools… That’s agentic. And that’s really – that distinction in problem space is also how we distinguish between Marvin and ControlFlow. I promised I would get to ControlFlow.

So Marvin is really good for saying like - you want to call out to an LLM and extract something? We’re going to make it feel the absolute most natural that we can, and we’re going to make this so that if you know Python, you know Marvin. If you know Python, you know how to extract something super-easily. If you know Python, you can classify something super-easily. Summarize things super-easily. So Marvin is really a prompting library. It’s a way of creating dynamic prompts using classic Python functions, or Python objects. ControlFlow is really - now, when you start turning over the wheel to LLMs to formulate their own plans, this is where we can’t rely on the tools of the past around observability. We can’t rely on the tools of the past around evals. You can still build evals that say “I put in this thing at the beginning, I got this thing at the end. How often did it succeed on my test set?” But if you’re trying to figure out “How often did this thing take the specific path through reasoning space to get to this outcome?”, that’s now much harder to design tests around. And it’s harder to debug. If something fails and you didn’t write the orchestration logic, you can’t really point to a place and say “Ah, this is where I messed up in how I designed this thing.”

And so ControlFlow is really “How do you ergonomically express agentic workflows?” How do you express dependencies between tasks? How do you tell an LLM when it has the wheel, and then how do you explicitly tell it that it no longer has the wheel, and now it’s going to cede back to human-written logic? That is what we make it extremely easy for you to express in ControlFlow. And since we have a pretty opinionated developer experience of how to express this, we’re able to give folks much better observability.
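Here is a rough sketch in the spirit of ControlFlow's examples; the exact class and parameter names are an assumption on my part and worth checking against the ControlFlow docs:

```python
import controlflow as cf

# An agent the human defines up front; it only "has the wheel" inside tasks.
writer = cf.Agent(
    name="writer",
    instructions="Draft concise, friendly outreach emails.",
)


@cf.flow
def outreach(posting: str) -> str:
    # The human expresses the tasks and their dependencies; the LLM decides
    # how to complete each one (including which tools to call, if any).
    summary = cf.Task(
        "Summarize this job posting",
        context={"posting": posting},
    )
    email = cf.Task(
        "Write a short outreach email based on the summary",
        agents=[writer],
        depends_on=[summary],
    )
    return email.run()
```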

And so this is built on Prefect 3, which means that you have an orchestration library sitting behind every single action that an LLM takes. If something fails, you’ve got retries. If something takes too long, you’ve got timeouts. If something fails and you don’t want to go through all the work again, you’ve got caching. If your agent needs to create a sandboxed code environment to run untrusted code, it can now deploy its self-created function to remote infrastructure.

So it’s really how do you give agents durability and resiliency? Because those are often the biggest reasons that they fail. And so that’s how we distinguish between them. But I realize I’m monologuing a little bit, so I’ll turn it back over to you, Daniel, if there’s some stuff that you want to double-click on.

[00:51:47.15] It’s all good, I’ll jump in. We’ve covered some ground here, and I appreciate that very much. With Prefect Core, and kind of producing the open source that you’re building around, you guys have the managed workflow orchestration platform that is Prefect Cloud… We’ve talked about ControlFlow, we’ve talked about Marvin… That’s a lot.

What are you thinking about next, as you’re looking in the months and the next few years to come? You’ve accomplished so much, but as you look at what you think this space is going to turn into, because it’s evolving so fast, do a little bit of crystal ball gazing for us… Right or wrong, tell us what you think the future is going to hold, and how you’d like to play into that.

Yeah, it’s a great question, and it’s something I obviously try to think a lot about… I would say that right now there is way too much emphasis on single-machine, local LLM or agent workflows. Here’s what I mean by that. You pull up any framework in the world to run LLM things, or to build an agentic workflow, and it’s like happening in a local process, on your machine. It spins up an input field in your terminal, and then you have to type your answer to what your favorite color is, and then it goes and writes you a poem, or something like this. But for businesses who are building LLM workflows or building on agents, at the model level we see that frontier models are accounting for – sorry, frontier model providers are accounting for a lot of the core sources of failure. So structured outputs from OpenAI managed to wipe out just a whole host of your classic resiliency issues. And the resiliency issues I think are going to be around planning, and so the ability to add idempotency and transactions to LLM workflows is really what we would invest more into. You had a plan that executed ten things in a row, and you found out at the 11th step that the plan was bad. But you’ve already done ten things. How do you walk that all back? You would do that with transactions.

Where I’m really interested in the long term is - when Daniel asked earlier “Talk to me about how I as a human go from a locally running function to something that’s running elsewhere”, I think that the “me as a human” part of that is going to go away. And it’s often going to be an LLM that, in the course of solving a problem, decides that it needs to create and massively parallelize a function to call out to.

And so now you need to be able to give an LLM, on demand, the ability to create and provision infrastructure, and submit a whole bunch of jobs to it. And so while we’ve built Prefect to be something so easy a human can understand, that’s going to play into a lot of strengths of “How do we expose an API that can really be taken advantage by an LLM?”, if that’s the intended audience of how to provision infrastructure.

The third piece is - at companies now you’ll have so many teams that are basically building LLM workflows in parallel to each other. So you’ll have like the conversation team, that’s trying to build in LLMs into how it talks to customers. And then you’ll have like the platform team, which is trying to use it to give feedback on internal pull requests. You’ll have teams that are trying to use a bit more human in the loop for commenting on like a design doc, something like this. But fundamentally, you have tens of thousands of parallelized executions or calls against LLM APIs, so how do you solve that coordination problem across the programs that are trying to invoke LLMs?

And so really trying to figure out, if I have 10,000 agents all trying to access the same resource at the same time, how do you govern that in a way that doesn’t cause them all to crash? If I have 10,000 engineers that are all trying to access the same LLM API, how do I make sure that they don’t all consume the same token budget all at the same time?
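Prefect's global concurrency limits are one way to express that kind of governance today; here is a sketch, assuming a limit named "openai" has already been created in the Prefect server or Cloud workspace:

```python
from prefect import flow, task
from prefect.concurrency.sync import concurrency


@task
def call_llm(prompt: str) -> str:
    # Only N occupants of the shared "openai" limit run at once, across every
    # flow, deployment, and team that uses the same limit name.
    with concurrency("openai", occupy=1):
        return f"response to: {prompt}"  # stand-in for the real API call


@flow
def fan_out(prompts: list[str]) -> list[str]:
    # Submit many calls in parallel; the shared limit throttles them globally.
    futures = [call_llm.submit(p) for p in prompts]
    return [f.result() for f in futures]


if __name__ == "__main__":
    fan_out([f"prompt {i}" for i in range(100)])
```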

So that’s a bit more on the practical side of like – you know, future of agents, future of LLMs aside, there is still a very, very interesting, fundamental engineering question, which is “How do you get tens of thousands of concurrent API calls to OpenAI or Anthropic to all behave sanely, and not lean into a commons problem of everybody cannibalizing the same data resource?” And so that’s like the very boring thing that I like to think about. But orchestration is one of these very fun, but ultimately worst-case scenario disaster planning style things.

Well, we love the boring stuff here, which is actually not boring, and I think a lot of people want to think more about it, because they do have the desire to put these workflows into production… Which is, yeah, of course, on your minds, and what you’re digging into.

Yeah, we just really appreciate you taking time to join us, Adam. It was a great conversation, and I would encourage people – we’ll include the links in the show notes, but go check out the docs for Prefect, and Marvin, and ControlFlow… Try out some of these things on your own. Like Adam said, you can pip-install Prefect and be off to the races, so no reason not to try it out… And yeah, thank you so much, Adam, for taking time and joining us.

My pleasure. Thanks so much, guys.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
