As you start developing an AI/ML-based solution, you quickly figure out that you need to run workflows. Not only that, you might need to run those workflows across various kinds of infrastructure (including GPUs) at scale. Ville Tuulos developed Metaflow while working at Netflix to help data scientists scale their work. In this episode, Ville tells us a bit more about Metaflow, his new book on data science infrastructure, and his approach to helping scale ML/AI work.
RudderStack – The smart customer data pipeline made for developers. Connect your whole customer data stack. Warehouse-first, open source Segment alternative.
SignalWire – Build what’s next in communications with video, voice, and messaging APIs powered by elastic cloud infrastructure. Try it today at signalwire.com and use code AI for $25 in developer credit.
- “Effective Data Science Infrastructure” by Ville Tuulos
- Use code podpracticalAI19 for 40% off!
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m usually joined by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin, but he’s got the week off… But I do have a wonderful guest for today’s discussion, which I think is very timely and also very practical. Ville Tuulos is with us, who is CEO at Outerbounds, and was previously leading the data science infrastructure group at Netflix. He’s also the author of a really interesting and new book called Effective Data Science Infrastructure: How to Make Data Scientists More Productive. Well, that sounds super-practical, Ville. Welcome to the show, great to have you here.
Yeah, thanks for having me.
I know even before the show we were talking about your background and how your background has shaped how you think about data science infrastructure, but also how you think about AI and what really is new, what isn’t new, what trends have happened and developed over time… So could you just give us a little bit of an idea about your background and how you started in this field?
Yeah. Well, I wonder how far back I should go, but embarrassingly enough, I started at my first startup that worked on artificial neural networks already back in 2000. That’s kind of a long time ago. Believe it or not, this of course predates all of deep learning, but people did build artificial neural networks at the time, and this startup in particular was focusing on a peculiar kind of neural network called self-organizing maps, which were quite popular in the late 1990s and early 2000s.
[03:52] The commercial idea we had at the time was that we could help enterprises with enterprise search, intranet search, stuff like that… So it happened that – yes, companies needed help with that, but they didn’t need neural networks… But interestingly enough, ever since then I’ve been really focusing on this question, “How can we actually improve the tooling and the infrastructure for the people who build these models?”, since already back then I saw that a big challenge was that we had all these amazing researchers and data scientists whose task was to build models that were supposedly useful in practice, and it took many iterations, a lot of thinking, trying things out, experimenting… And of course, back then the tooling was quite horrible; now it’s much better. But I think it’s really interesting to see the whole trajectory, where we came from and where we’ll be going in the future. It’s scary to see that there are also many things that actually haven’t changed that much, and we have much work ahead of us…
Yeah. Could you share some of those examples? What is the same now, maybe a challenge that you’re still facing now, that was the same as back in those days? And maybe a couple of examples of things that have really dramatically changed.
Yeah. I mean, starting with some things that are the same - amazingly enough, already back in the day we were using Python… And of course, that was highly controversial at the time; Java was pretty new, and of course, everything that needed to be high-performance was written in C and C++… So that’s actually a big change between then and now; of course, we had to build all the libraries by ourselves, and it was even seen as a big differentiator and a competitive moat that we had more performant, more scalable algorithms to optimize these models than anyone else.
Nowadays, of course, everybody can just use off-the-shelf PyTorch, TensorFlow, even XGBoost; scikit-learn has just released 1.0… So the library ecosystem is way ahead of what it used to be.
Now, another big difference is that back then we actually spent a considerable amount of time just setting up the hardware. Amazingly enough, we had to rack servers ourselves at the company, and we had to set up the networking, and the storage, and the compute, and the operating systems, and all that stuff… And now with the cloud - again, that’s another thing that has massively changed. You can get infrastructure that used to be available only to the largest companies even as a startup, and you can get clusters of machines with GPUs… It’s just mind-blowing. And what’s really exciting today is that on the one hand you have these high-level modeling libraries, and on the other hand you have all the hardware available… And now I think it’s kind of mind-boggling that things still aren’t that easy. It kind of feels that everything is possible, but nothing is really easy enough… And I think that is still something that we need to work on.
And of course, one interesting challenge to solve is that there are way more people working on these problems. It used to be only a handful of people. I think we had three people at the company at the time who could actually build these models, and now it’s probably a hundred, if not a thousand times more people. So it definitely feels like a new field in that sense.
Yeah. And how did your perception of data science, and in particular the infrastructure needed for data science, shift as you led the team at Netflix? Now we’re talking of course about a much bigger scale, right? But also, these models were, I assume even at that point, a really critical piece of what makes Netflix Netflix, and of the value that’s added by these models… So how did your perception of infrastructure change as you worked on those sorts of problems?
I think one interesting thing that maybe not everybody realizes is that the few things many people know about Netflix, especially the recommendation and personalization systems, are kind of the tip of the iceberg. There are so many other areas where Netflix potentially wants to apply ML and related technologies, and where it wants to experiment with new ideas.
I think one thing that’s interesting is just the diversity of different use cases, if you think about computer vision, natural language processing, even things that are not technically machine learning, like operations research, how we can optimize schedules, stuff like that. So it’s just the fact that there are so many different types of problems, and they come in all shapes and sizes… It’s not always about scale. Also, it’s not always that everything has to be super high SLA business-critical; there are crazy experiments… But the interesting challenge is how do you manage the diversity of all these things? That’s really interesting.
[08:21] Now, when it comes to the more technical side of things, Netflix for the longest time has been 100% on AWS, and it’s a cloud-first company… And Netflix has been trailblazing many architectural patterns when it comes to storing data in the cloud, way before other companies were doing it; also microservices, chaos engineering… So it was really fascinating to be at this company that had on the engineering side some of the world’s leading cloud infrastructure.
Honestly, what makes Netflix interesting is that it all runs on AWS, in contrast to, let’s say, Google and Facebook, who have their own infrastructure, which is really kind of an island that nobody else can replicate. But Netflix is, in that sense, closer to everybody else in the world; it is the same AWS, at the end of the day, that Netflix uses, that everybody else can use as well. And they have all these practices and ways of thinking about things, and ways of building services, that make them really effective. And now when you layer something like data science and machine learning on top of it, it’s really interesting.
Of course, all those learnings are reflected in Metaflow, which is the open source library that we started building there.
Yeah, I do wanna get into Metaflow, but before I do – back in those days there was a sort of building hype around data science, a lot of people initially getting into it, companies experimenting… From then to now, how has the average data scientist’s ability to work with the different pieces of infrastructure that come across our path - the various services in AWS, whether it be EC2, or object storage, or things all the way up to Kubernetes and EKS and that sort of stuff - how has what they’re required to know, or maybe what knowledge they come in with, shifted over time?
Well, my initial reaction is that I still think we are in the early days. The fact is that maybe 5-10 years back, if you wanted to do anything in this field, you basically had to know C++, and the depth of knowledge about, let’s say, [unintelligible 00:10:29.08] was much deeper… And even, let’s say, the recommendation systems at Netflix - they ran on Spark, and many people used Scala, and that’s a bit of a different persona, a bit of a different profile than what we see now amongst data scientists who are building a new set of very diverse models using these Python-based libraries, maybe directly using the cloud, and so forth.
At the same time, I do think all these things, and especially being able to leverage the cloud, is still harder than it really should be… Just thinking about the amazing amount of computational power that you have there. Still, most companies I talk with, there’s the feeling that, well, data scientists kind of need to know about Dockerfiles, and maybe they need to know about the CI/CD systems… I think that there are different points of view, like if that’s actually a feature or a bug…
I think that there are so many questions related to the modeling itself that at the end of the day all of us, all human beings, you kind of need to manage your cognitive bandwidth. And as interesting as it might be for everybody to know about CI/CD systems, I do think that it kind of takes a bit away from the bandwidth that you should have available for thinking about the modeling problem itself. So I do think, and I do hope that we manage to raise that level of abstraction even more.
Yeah, I guess in this case abstraction could be a good thing; even though it is interesting to dive into these different systems and containerization and all that, it does take a lot of time, like you said. I remember a tweet from Erik Bernhardsson, who said “Having a data scientist learn about things like Kubernetes and Docker and Terraform is kind of like having web app developers learn about the Linux kernel. It’s so far apart.”
[12:15] It’s very tough to expect that. At the same time, I have benefitted from those times where I’ve been able to maybe push something further on my own, at least into a prototyping stage within a company, and get it in front of people to see that value, without reliance on passing something off to a software engineering organization to even create a prototype… So maybe there is some tooling that’s improving around that. I know there’s things like Streamlit, and other things where you can create something that’s very compelling very quickly in terms of like a prototype… But I don’t know – one of the things Chris and I discussed on the podcast a little while back is maybe why many data projects fail in certain cases is because people aren’t able to push a project far enough into a prototype stage for people to see the value of something and actually get buy-in from the organization…
From your perspective, I know – like, also being the CEO of a company that is trying to help people with their ML infrastructure, where do you often see, like when you first maybe engage with clients or when you’re maybe just making an observation about the industry, where do you see the problems in people not being able to get value out of machine learning and AI? Where do things get blocked most often, from your perspective?
I think it’s definitely a combination of maybe three different factors… Maybe starting from kind of the easiest one, it’s technical. There are technical hurdles still. It’s just like putting the infrastructure together, it’s just building the models… Although technically all the ingredients are there, many companies are still struggling putting the pieces of the puzzle together. But I do think that that is, in a way, the easiest one of the three.
The second one is definitely the skillset of the people involved… And it’s not that they aren’t skilled; I don’t think that’s oftentimes the problem, but rather what things they should be focusing on… Especially when we start talking about actually producing business value using machine learning, really understanding the problem domain, understanding the business needs - that takes a lot of bandwidth. Much of their time these practitioners spend either on engineering problems or maybe on modeling problems that may be fun, but ultimately might not really affect the company’s bottom line so much.
But then the last problem, I think, is really the organizational one… It’s kind of a leadership question – what you mentioned before: okay, so why do many of these projects fail? Well, they don’t get close enough to production. I think that is absolutely key, and I think that was one thing that Netflix did really, really well. They have, even at the highest level of the organization, this experimentation culture, and they have this understanding that we can potentially apply ML and AI to all aspects of the business. So I think we are entering a world where no area of life and business is totally safe from ML, in a sense. We can apply ML to all kinds of things. And I don’t mean some crazy general AI, but tiny little optimization problems here and there… And they are really everywhere, in all lines of business.
But now the problem is that you may have a thousand ideas - “Okay, we could do this and that” - but how do you know which of those work? Nobody really knows in advance, and you can’t really ask anyone. The only way to know is to really start experimenting. And not only experimenting in the sense that you hack together a prototype in a notebook - oftentimes the only way to know is to push these things to production. And production here doesn’t mean that you have a huge team working on something for six months; it means actually getting something into an A/B test, let’s say.
[15:59] Honestly, I know that this is really not that easy, but the idea that you can test these ideas, that you can test different prototypes and pipelines in production alongside whatever system you have in place today, and then compare the results - that’s hugely powerful. And then you have the understanding that you can interpret the results and decide what to do with them, and the understanding that it is by design that most of these things fail… That’s kind of the whole point of experimentation - if you knew that everything was going to succeed, you wouldn’t have to experiment. But the idea is that you can afford to make so many of these tiny experiments that you can quickly decide “Okay, this doesn’t seem worth it”, and then maybe redirect your resources to something else.
This is also a question of product management oftentimes, having product managers who really understand how to work with these ML systems… I mean, all these organizational muscles are missing at many companies.
So Ville, as we were talking about trends that you’re seeing in infrastructure and ideas around where things get stuck in production, you mentioned Metaflow a couple of times - which I know is a big piece of the puzzle in terms of how you solve these problems, but also a big part of your career in terms of what you’ve developed… So could you give us a little bit of the back-story of Metaflow, sort of the origin story, I guess?
Yeah. The nice thing is that it’s actually quite a pragmatic, quite bottom-up story, in the sense that - as I mentioned, when I joined Netflix back in 2017, this was before SageMaker, this was before MLflow, this was before Kubeflow… This idea of having any kind of, let’s say, especially open source machine learning infrastructure was quite new. Of course, there were products around – you had DataRobot, you had Domino Data Lab, you had Databricks, and Spark, and so forth… But what does the full stack for ML look like? That was quite new.
So when I joined Netflix back in the day, I saw that - okay, obviously the company had all these basic, foundational pieces of infrastructure in place. They had a large data warehouse, an S3-based data lake, they had a team managing a large-scale compute infrastructure, basically something like Kubernetes… They also had teams who had been thinking about workflow orchestration for a long time. So you had all these pieces. Again, technically everything was possible, so it didn’t seem like the challenge was “Okay, we need to invent some new piece of tech so we can do something that nobody else has done before.” That didn’t seem to be the problem. The problem was really that they had an organization of data scientists who constantly complained that getting anything done was too hard, exactly for the reasons we discussed… Like, “Okay, how do I run compute?” You go to the compute teams and they say “Oh, you just have a Docker container, and you put the container image here, and the tag here, and then maybe you’d better go to a CI/CD system”, and already at that point you had lost the data scientist… Like, “Does this make any sense?”
[20:07] The workflow systems of course needed lots and lots of YAML to define what you want to do… And even thinking about what the right patterns are - because remember, these people are not software engineers by training, so how do you actually architect software like this? It’s hard.
Really, the origin story and the idea for Metaflow was that, okay, assuming you have this foundational infrastructure available, how can you stitch it together in a manner that presents an API to the data scientist that kind of helps them build the applications they have been asked to build for the company?
And now the other interesting side of the coin was that Netflix has this culture of freedom and responsibility that meant that we didn’t want to take away all the freedom from people, saying that “Well, here’s a training API, and you can only call this one API to train your model…” It was well known that different people preferred different tools for the job. Some people preferred TensorFlow, some people preferred XGBoost; it depends on the application, of course. So the idea was that, okay, we should allow them to, at the high level, exercise that freedom and exercise that domain knowledge and expertise in choosing those modeling tools.
We kind of started with this idea that, okay, we need to be quite opinionated about the lower layers of the stack - how you do compute, how you access data, how you do orchestration… And then leave a lot of space at the top of the stack: “What kind of modeling libraries do you use? How do you do your feature engineering?”, and maybe even “What are the KPIs that matter for you when it comes to monitoring models in production?” And then we started crafting that stack. And again, at Netflix there’s no top-down anything; there’s no CTO, no VP of engineering saying that everybody must use this thing… We started solving these very practical problems, and then, in a very organic manner, Metaflow started spreading inside the organization, because people thought it was quite a no-nonsense tool that helped them solve exactly the types of problems they had been facing on a day-to-day basis.
Yeah. So there’s a whole variety of things that people have created around workflows… But then there’s also a whole set of platforms and projects out there related to MLOps and other things… There’s all sorts of things that maybe data scientists care about, from making sure that they can run their workflow not on their laptop, which is more maybe infrastructure compute related, all the way to “Hey, how do I version and control experiments? How do I access data?” So how far does Metaflow reach in terms of these different things that data scientists might wanna do? What pieces of the puzzle does it try to solve, and how can data scientists think about it in terms of those various buckets of things they’re trying to do?
Yeah, that’s a good question. Because we were faced with this great diversity of different applications, we couldn’t think – I mean, at my previous company before Netflix we were doing real-time bidding for targeted advertising… And in a context like that you know exactly the application, and you know that “Okay, maybe we build a feature store, maybe these are exactly the workflows everybody follows…” And that’s one type of challenge. Actually, it might be a great engineering challenge, but it’s a different type of challenge.
In our case, with Metaflow, the challenge was that we didn’t know exactly what type of machine learning applications people wanted to build. So we started thinking pretty much from the bottom up: what are the commonalities across all applications? And really, it starts with the question of data - disciplined ways of accessing data, and “Okay, how do we do it quickly enough?” Let’s say you have some kind of data warehouse, you have a database - how do you get the data out quickly, so you don’t have to wait 40 minutes for your SQL to execute? So that’s one thing we were thinking about, like working with Arrow; Metaflow comes with a custom S3 library, so you can get your data super-fast from S3… Small things like that.
Then on the compute side - all these models, even the ones that don’t require huge amounts of data, still require a lot of compute… You may want to do a hyperparameter search, or maybe you have a model ensemble, maybe you want to build a separate model for every country, or maybe for every customer… So you may want to be able to fan out this compute to the cloud. We definitely wanted to solve that part as well.
[24:09] Then we saw that oftentimes it’s a really sensible idea to structure these applications as a workflow… There’s a lot of confusion about these [unintelligible 00:24:15.18] and workflows these days, like “Okay, what is it giving me?”, and there are so many workflow systems… But purely as a way to express things, the idea that you structure them as a DAG makes a lot of sense. So we took that as kind of a core way of implementing things.
But then we definitely wanted to separate the idea that once you deploy these workflows to production – like, running workflows at scale in production is actually an engineering challenge of its own, and we didn’t want to claim that “Well, Metaflow is the most production-grade scalable workflow [unintelligible 00:24:43.06]”, so we integrated with other systems out there. And also to ease that path to production - that was really another thing. We discussed earlier how important it is to test these ideas as close to production as possible… We knew that we needed to provide a path all the way to the end, and that’s why we integrated with the existing systems, so we wouldn’t get resistance from engineering teams saying “Oh, you have this piece of Python code, but no way we are going to run this in production.”
Then there’s really thinking about production-grade practices, starting with very mundane questions, like dependency management. What if you need a very specific version of TensorFlow? Again, we don’t want you to write Dockerfiles by hand - it’s surprisingly hard to do in a reproducible manner… But how do we let you use the exact version of the library you need?
And then yeah, you mentioned versioning as well… There’s the idea that you should version your code, maybe using Git… But how do you version your models? How do you version your experiments? How do you even version your data? We felt that these are such foundational concerns that we should also provide an out-of-the-box solution for them. I think that definitely helps, because the data scientists don’t have to think about it too much… That’s kind of what we have been doing thus far.
So if you think about compute, data, orchestration, versioning, and all kinds of questions related to pushing things to production - then there are things at the top of the stack that we haven’t been so opinionated about. Many of our users today use other model monitoring tools; there are amazing model monitoring tools. You mentioned Streamlit; Weights & Biases, many others out there… Of course, specific tools for model explainability, if that’s important to you… And then of course feature engineering - that’s a complex topic of its own. There are some customers who use Metaflow with some feature stores, and that works… And of course, for the modeling libraries you should absolutely use the best-of-breed tools off the shelf.
Yeah. So it sounds like part of the philosophy here is people are gonna be opinionated in their own teams about like “Oh, we use Weights and Biases to do these bits of the monitoring, and experiment management…” But that’s not going to solve these scale and infrastructure problems, and the workflow running problems that you mentioned as well… So being able to pull in what you need I think is really a cool idea, and having that sort of modular nature of it is really great.
So I do wanna get into the actual workflow with Metaflow, but in terms of how it works under the hood - let’s say that you’re setting up a flow, and you’ve got a series of steps, processing steps… Eventually, something has to run on a server, and like you’re saying, maybe I’m running TensorFlow over here, and it needs a GPU… Or maybe I’m doing this pre-processing of images and I just need to crank through a bunch of stuff in a sort of batch, or maybe even parallel way on CPUs… How do things in code that is using Metaflow - how do those things eventually end up running on the servers? Is there some sort of containerization or something going on under the hood, or how have you built that abstraction layer?
[27:54] Good question. Metaflow, since day one, has been built with this cloud-first mindset. I think when it comes to things like compute and storage, we live in a bit of a post-scarcity world. It’s actually interesting when you think about it, that many of the systems that we even use today, like databases, even things like Spark and Hadoop, they were built with this idea that you have constrained resources, and really the engineering challenge is that “Okay, how do you allow people to run compute given that you have only 200 servers, or something like that, and you have to do the resource management very carefully?”
The mindset that we adopted with Metaflow, which I think is really useful, especially in the people’s productivity point of view, is that you work with the cloud, the cloud provides you at least this abstraction of having basically infinite scalability… So you can use some cloud-based platform… Again, we rely on existing systems like Kubernetes, like AWS Batch, what have you, to kind of farm out the containers to the cloud. You can specify the resources you need, if your function needs GPU… So you can say “I need GPUs in this case.” In other cases maybe you need a lot of memory.
I think the interesting challenge with machine learning - and I think this is also what sets machine learning apart from many previous data-intensive applications - is that the needs are so different. Some types of models are really I/O-sensitive; maybe you need to read tons of images, tons of videos, but the model itself is simple. In other cases it might be something super compute-intensive, but not I/O-sensitive. In some cases you absolutely need a GPU, or maybe even some custom, crazy hardware. In other cases that is cost-prohibitive, and it’s very hard to have any kind of uniform, one-size-fits-all setup. So having scalability that goes in all the different dimensions - vertical, horizontal, you name it - is super-useful. And the cloud makes it possible. So we rely on these systems, and we take care of packaging the code, sending it to the cloud, executing it in the container, handling retries… All that kind of basic plumbing.
And then yeah, the same thing like with the orchestration - the DAG execution at scale; if you have 100,000 DAGs running in parallel, it’s a thing of its own. There are some systems that do it well; we integrate with AWS Step Functions, now we are integrating with Argo, maybe one day with Airflow, what have you…
The idea is that you should be able to test these things locally. Metaflow always comes with the local mode, so you can kind of test any [unintelligible 00:30:10.13] the same way you do in a notebook… But then when you want scale, like when you want something that’s production-ready, you can use your existing production infrastructures.
So you were just getting into talking about “Hey, maybe you’re doing some experimentation locally, in a notebook”, and then eventually you go beyond that and scale up and all that… I guess first off, to set the stage, Metaflow is an open source project, and people can go ahead and try it out… And we’ll include links in our show notes to where people can find it and try it out… But let’s say I am a data scientist, I understand what we’ve been talking about so far, that “Hey, I’m gonna experiment locally, but then eventually I need to run all of this workflow, and series of processing steps on infrastructure that is in the cloud…” Could you maybe just walk through what does it look like for a practitioner to use Metaflow? Let’s say they’ve written some Python code, they’re used to working with notebooks, maybe they’re sometimes used to writing Python scripts that they maybe log into a server and they run it… What does it look like for them to install and integrate Metaflow into their workflow? What are the prerequisites and how does the integration happen?
Let me start with kind of a data scientist point of view before getting to deployment, and stuff. I can give you a timely example… Just yesterday I was actually creating an example for the book using Keras… And this was a new dataset; I was actually using the NYC taxi dataset for the example. It’s a fun dataset, publicly available. I actually started exploring the data in a notebook. Of course, notebooks are still great for visualizing and exploring data, and so forth…
[32:00] And even when I started drafting the model architecture in Keras - it’s really quite nice to be able to iterate that in a notebook quickly… I was very conscious about – I wanted to introspect that, “Okay, so what are the things that work well in notebooks? What are the things that work well on the Metaflow side?” And I kind of face the same problem that many other people face when using notebooks, is that after maybe three hours of prototyping, my notebook had this kind of a mixture of cells that I had been executing out of order, and–
Who knows what the state is…?
Yeah, exactly. It was super-convenient. I kind of felt that I’m in this garage, hacking something together, and everything is kind of on the table, and it’s kind of a messy setup… At that moment it made me super-productive. But it was absolutely 100% obvious that there was nothing in that notebook that I would dare to run in production. Even the idea that I would somehow run that… Because it was my experimentation process that was reflected in that notebook. I would like to think about production a bit differently…
So then the idea - what happened at Netflix as well, and what we recommend people do - is that by all means use notebooks for experimentation, for exploration, for building prototypes, but then at one point, when you have a rough idea of what that workflow could look like - and really, the threshold for that shouldn’t be too high - you can almost start copy-pasting the snippets… In this case, say, just the 15 lines of code that define the Keras architecture - into a step, in a file. And now if you use something like Visual Studio Code, it’s actually really easy to have both the IDE and the notebook side by side. So I can use notebooks for exploration, and then I can still have that really “proper” IDE for writing Python code.
And again, the idea with Metaflow, what we have had since the very beginning, is that it doesn’t require that you know anything more than what you would need to know in a notebook. So there are no new concepts, no new paradigms, you don’t have to change the code, it’s the same libraries, and all that stuff. So I was able to then take the best parts of my experimentation and put them in this workflow.
And now, thanks to Metaflow, I’m able to start running it at scale. Of course, in the notebook it was a rather small dataset that I was testing with, and now I could take the same concepts, the same code, and start testing that at a larger scale. I didn’t have to wait for my tiny workstation to crunch all the data; I could farm it out to the cloud.
So overall, that is quite a nice pattern. You kind of get the best of both worlds. You can use notebooks where they really shine, and then in the end you have an artifact that you really dare to run in production.
Now, I guess the other side of your question was the deployment. Indeed, the easiest way to get started is to run pip install metaflow on your laptop; it works out of the box, and there’s nothing else that needs to be done. We have also had this belief that the needs of an organization grow over time. You don’t necessarily have to have the most battle-hardened, most scalable setup on day one… You can start with something simple, and probably one of the simplest things is that you can sign up for AWS Batch, and there are three configuration values that you have to set. Or if you don’t want to do it by hand, there’s the Terraform/CloudFormation template. You go to the UI, you click the button, and it sets up the stack for you. And then you can start running compute at scale, and that’s really great. Then, if you have more than one person working on these things, you probably want to have centralized metadata tracking, so people can share their results… That’s quite convenient; it comes as part of the CloudFormation stack; not too hard. And then there’s the orchestration system. Again, part of the stack, depending on how you want to do it; you have freedom to set it up in a few different ways.
And then of course, the larger the organization – the largest organizations might care about setting up data governance rules, like lifecycle policies, and thinking about “How do we harden the service deployment so it’s highly available?”, stuff like that. But I think realistically, these infrastructure stacks need to start small, and they need to be able to grow with the organization.
I think many systems have the problem that either they are super-easy, but then they don’t scale as your company grows - you kind of outgrow them at some point - or they are way too enterprise-y, and you have to scratch your head about Kubernetes deployments and whatnot before you can get even the simplest thing done.
[36:05] Yeah, and I’m looking through your documentation, which is great, and it seems like there’s this concept in Metaflow where, like you said, in a lot of these workflow management systems you’re writing YAML, maybe you’re writing JSON, you’re writing config files and Dockerfiles to manage these various steps… Which is definitely doable if you want to get into that. But the approach you’re taking is this sort of decorator pattern in Python, where you’re defining maybe a class that’s your data flow, and some steps within that class that are decorated with Metaflow’s step decorator… And then you’re connecting those different steps within your actual Python code to create your workflow, which is then maybe farmed out to some infrastructure. Did I get some of that right?
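To make the decorator pattern being described concrete, here is a stdlib-only toy that mimics its shape: a class is the workflow, methods marked with a decorator are steps, and each step names its successor. This is not Metaflow's actual implementation - the names and mechanics here are purely illustrative - Metaflow layers versioning, artifact storage, and remote execution on top of a similar interface.

```python
def step(fn):
    """Mark a method as a workflow step."""
    fn.is_step = True
    return fn

class Flow:
    """Run steps in sequence, following each step's declared successor."""
    def run(self):
        current = self.start
        while current is not None:
            self._next = None
            current()            # a step calls self.next(...) to continue
            current = self._next

    def next(self, successor):
        self._next = successor

class HelloFlow(Flow):
    @step
    def start(self):
        self.greeting = "hello"  # state lives on the instance, similar to
        self.next(self.train)    # Metaflow's self.<artifact> pattern

    @step
    def train(self):
        self.result = self.greeting + " world"
        self.next(self.end)

    @step
    def end(self):
        self.next(None)          # no successor: the flow is done

flow = HelloFlow()
flow.run()
print(flow.result)               # -> hello world
```

The appeal of this shape over YAML is exactly what is said below: the code, the step graph, and (in real Metaflow, via decorators) the dependencies and resources all live in one Python file.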
Yeah. And the interesting thing is – I mean, there are so many kinds of systems that look like that, and I always say that the devil is in the details… Many systems - say the tons of systems that let you specify workflows in YAML - oftentimes say “Oh, you can run any Docker container, and you can run any code inside the Docker container that makes a step”, which, as a user, kind of pushes the hardest problem to you: “Okay, wait a minute… So how do I define what code runs in this container, and where do I push my container? And what are the dependencies I need inside the container? And how do I move data between these containers?” So in a way, just having a workflow - that’s kind of the easy part. That’s why I wanted to have this self-contained thing in Metaflow, where you have everything in one place; you define the code, you define the dependencies, you define the resources, you define the workflow… Because then you get [unintelligible 00:37:43.15] that you can actually run the full thing in production and it does something useful.
And by the way, one thing that I definitely want to mention, which is really important, is that this is never a waterfall. It’s never so that you prototype, and then you deploy, and then you declare mission accomplished. If the project is successful at all, what inevitably happens is that either something fails in production, in which case you have to go back and start debugging, or the business stakeholder or whoever comes to you saying that “Okay, now we want better results. Can we improve accuracy? Can we add this new dataset?” and whatnot, so you kind of have to start iterating again…
That’s really the challenge with many systems - even if you are able to do that one deployment, how do you come back from production? Everybody always talks about going to production, but how do you come back from production, and then keep iterating, and maybe start having multiple versions running in parallel? It’s all these small things that really matter a lot when you think about running ML for real.
Yeah, definitely. I totally agree with that. So we talked about how you might set up your workflow in Metaflow. You can pip install Metaflow locally, connect it to AWS cloud resources… In terms of that next step – because you talked about being in the notebook, and then kind of moving into this Metaflow workflow… But then eventually – let’s say that I create a pipeline that I really like, and I know that I need to run it, like, “Trigger when this happens”, or I need to run it every Friday, to do this or that. What does that look like? Because I assume you could run your Python code locally with these Metaflow decorators; that gets sort of farmed out to resources in the cloud, and they interact in various ways… But what does it look like to go from that to some sort of automation - something that’s running hands-off, where you’re not running Python locally?
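As a rough sketch of what this hands-off step can look like: Metaflow ships a schedule decorator and commands to deploy a flow to a production orchestrator such as AWS Step Functions. The flow, step, and file names below are hypothetical, and the details should be checked against the Metaflow docs; this assumes Metaflow is installed and the AWS-side stack discussed above is configured.

```python
from metaflow import FlowSpec, schedule, step

@schedule(weekly=True)           # run automatically once a week
class WeeklyTrainFlow(FlowSpec):

    @step
    def start(self):
        # load or locate the input data here
        self.next(self.train)

    @step
    def train(self):
        # train the model; resource/compute decorators can farm this
        # step out to the cloud backend
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    WeeklyTrainFlow()

# While iterating, you run the flow from your laptop:
#   python weekly_train_flow.py run
# To hand it off so it runs without your laptop involved, you deploy it
# once to the orchestrator, e.g.:
#   python weekly_train_flow.py step-functions create
```

After that one-time deploy, the schedule (or an upstream trigger) keeps the workflow running in production, which is the setup described next.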
Yeah. I can paint you a picture that I saw at Netflix that works really well. Now, I know that not many companies are yet at this stage. I do hope that the world will advance over the next three years or so… But I do think what’s really useful is that you have some kind of a centralized workflow scheduler. I know that many organizations are struggling with this question - should they have many different infrastructure stacks, where different departments have their own, and ML has its own, and data engineers have their own? But the fact is that ML is not an island, and especially if you want to use these things in production – I mean, producing real business value, you have to integrate with whatever is the outside reality out there. So there’s a lot of value in having a centralized system.
[40:21] What the centralized system needs to do is only take care of this seemingly simple task: you have workflows, and the system needs to keep the workflows running, farming out the compute to whatever is your compute backend - Kubernetes, or AWS Batch, or whatever.
Now, the beauty of this setup is that oftentimes these data science workflows run in tight conjunction with data engineering workflows. So you have ETL, and you can imagine that you have maybe a daily ETL that takes some raw data, takes some streaming event data and kind of massages that to new tables… And whenever that table updates, then maybe you want to update your models. And then in the best case, there is a triggering mechanism that automatically, whenever the data updates, then triggers the ML update as well. And there’s maybe some piece of information carried around, saying “Okay, these are the new partitions available” or “This is the latest hour”, or however you want to do this.
And then if you have this centralized scheduler and this triggering mechanism in place, you can start constructing almost a web of workflows that comprises both the data engineering side of the house - the ETL - and even, if you want to use something like dbt or Great Expectations, you can tie that really nicely upstream, so you have a really solid ETL where data quality is always there.
And then you have something like – let’s say the ML workflow is managed by Metaflow, and oftentimes - this is even what happened at Netflix - there’s kind of an ETL step after. It might be that you produce some batch predictions, and now you have to load those batch predictions into another place. Or in some cases, some decision support systems even want to have those predictions in Tableau, or Airtable, or something; so you have another piece that takes those results and pushes them to something else.
And now, of course, as the complexity grows, you want to layer observability tools on top of [unintelligible 00:42:06.24] that okay, what if something is late, how can you trace what’s going on? And of course, there’s a lot of additional infrastructure that you need there. And if I’m looking at the workflow orchestration landscape in the world overall, I think we are not quite there yet; many companies have maybe multiple [unintelligible 00:42:24.18] that are not connected; many companies still use cron-based scheduling, like “It always runs at 3 AM, no matter what.” It runs at the same time, which is kind of silly. Also, the observability parts are missing. But I think that vision is really great. I think it really helps a lot in [unintelligible 00:42:40.10] the ML really close to the rest of the organization, so it’s not like an island, in some walled garden somewhere.
[42:48] Yeah. Well, I appreciate that very much. I know that we’ve covered a lot of different topics today… I do want to mention, again, your book, which is “Effective Data Science Infrastructure: How to Make Data Scientists More Productive.” We’ll include a link in our show notes to that, because this is something our listeners will really get into, because it is so practical. I’m just looking at the flow of your book right now, which goes all the way from notebooks, to workflows, to Metaflow, to production, and scaling up, which I think is a super-practical book, so thank you for your work on that.
Our listeners - you’ll definitely want to check it out, because we do have a 40% off discount code for the book from Manning. You can use the code “podpracticalai19” for 40% off of the book, which is pretty cool.
Well, maybe to end - we usually like to ask our guests a future-looking question, and I think you’ve already started to go there in terms of where you would love to see infrastructure go… But in terms of data scientists and the infrastructure they’re working with, what are you hoping to see in a couple of years in terms of the data scientist’s workflow? How do you see that abstraction layer advancing and changing over that time period?
Yeah, I think that there’s work to be done at all layers of the stack. Again, as I mentioned a few times during this episode, I’m excited about the fact that we have so much compute available. I think that we can make that even easier. That’s exciting. Definitely a lot of work to be done on the orchestration side, [unintelligible 00:44:25.06] There’s just a lot of work to be done there.
Overall, I think at the higher levels, the fact is that many companies - if not most companies out there - are still struggling with the question of how to use ML to power their business, not just produce models. And I think that goes back to the organizational mindset change - the experimentation culture, and how do you divide work between engineers, data scientists, and data engineers? So I’m super-curious to see how that evolves. Already now, when I’m talking to companies, I’m always fascinated by all kinds of ideas and all kinds of business opportunities that people are coming up with. Some ideas, of course, don’t end up working so well, but there are some amazingly promising ideas out there. I’m sure this will only grow tenfold, a hundredfold over the next three years.
So I think it’s pretty much inevitable, and the parallel that I always draw is to e-commerce and the web, back in 2000… Even setting up an e-commerce store took a lot of engineering work. Today you just sign up for Shopify, or you go to Squarespace, and without writing a line of code you can get something that works amazingly well… And I think it’s inevitable that we will end up there with machine learning infrastructure as well… But maybe it will take another 5-10 years.
Yeah. Well, I definitely look forward to that time. It’ll be a good time. But thank you so much for joining us, Ville. It’s been really wonderful to chat about your projects and your thoughts on data science infrastructure. I look forward to seeing how Metaflow grows and what you do in the coming years. Thank you so much.
Our transcripts are open source on GitHub. Improvements are welcome. 💚