Practical AI – Episode #211

Serverless GPUs

with Erik Dunteman, founder of Banana


We’ve been hearing about “serverless” CPUs for some time, but it’s taken a while to get to serverless GPUs. In this episode, Erik from Banana explains why it’s taken so long, and he helps us understand how these new workflows are unlocking state-of-the-art AI for application developers. Forget about servers, but don’t forget to listen to this one!

Featuring

Erik Dunteman – founder of Banana
Daniel Whitenack – co-host
Chris Benson – co-host

Sponsors

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com

Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Changelog++ – You love our content and you want to take it to the next level by showing your support. We’ll take you closer to the metal with extended episodes, make the ads disappear, and increment your audio quality with higher bitrate mp3s. Let’s do this!

Notes & Links


Banana - Scale your machine learning inference and training on serverless GPUs.

Chapters

1 00:07 Welcome to Practical AI
2 00:43 Erik Dunteman
3 02:34 What does serverless mean to you?
4 07:17 What's the secret sauce?
5 09:30 How does serverless affect our workflows?
6 13:21 Sponsor: Changelog++
7 14:21 What languages do you prefer?
8 17:20 The Banana workflow
9 20:33 The necessary minimum skills to use Banana
10 24:51 A typical win
11 26:20 Incompatible workflows
12 30:11 Future Banana-AWS compatibility?
13 32:53 Tips to choose your GPU
14 35:23 What's the future of serverless?
15 37:46 Outro

Transcript



Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

I’m doing fine. It’s been interesting times… Though it’s not what we’re gonna be talking about today, I’ve been watching the showdown between Google and Microsoft over ChatGPT, and Bard… things are happening as we’re recording this, so…

I thought maybe you would have been too distracted with the new Harry Potter game…

Well, there is that, but… Yes, we all have our secret little things that we do to keep entertained.

Yeah, yeah. Well, I forget – one of our recent guests brought up that quote of “You don’t need to do machine learning like Google.” And you’re talking about like Google, and Bard, and all of these things… And when you think about those things, you think about “Oh, these data centers full of GPUs, and these huge supercomputers that they’ve got at their disposal to do things”, which isn’t the type of GPU infrastructure that most practitioners have access to… And that happens to be maybe the topic of what we’ll get into today a little bit…

Excellent.

…with Erik Dunteman, founder of Banana Serverless GPUs. Welcome, Erik.

Thank you. That was a beautiful lead-in. I definitely want to help people get that Google-level infrastructure without that level of effort, so… Glad to be here.

Awesome. We’re really excited to have you. I have to say, I did spin up a model in Banana leading up to this conversation, so I’m pretty excited to talk about it. But before we get into the specifics of all the cool things that you’re doing - I know that our listeners, like I say, are probably very familiar with GPUs, and why they’re important to AI and machine learning modeling… But maybe they’ve just heard of serverless as this vague cloud thing that people do, and they’ve never thought about serverless GPUs. Could you just step back for a second and describe, first off, for people that might need just a very brief intro, what do you mean when you say serverless? And then take us into serverless GPUs. Is that a new thing? Has it existed before? I’m curious to hear your perspective.

So I love your specific phrasing, “What do you mean when you say serverless?”, because serverless is one of those terms that nobody has really pinned down an exact definition for. Our working definition is this idea that when you need capacity - when you need servers to handle your requests, when you’re in periods of spikes and surges of use - you have more servers. When you have less use, you have fewer servers. And when you have no use, you have zero servers. The idea of this is to make it so that you as an engineering team and as a product don’t need to think about your compute as a fixed cost. It allows you to view it as essentially per-request, pay as you go.

Funny enough, serverless really does mean servers running under the hood, but the -less is that you just don’t need to think about it; you think about it less. Happy to dive into what the details of that mean in regards to GPUs, but…

[04:13] Serverless has been around for about 10-15 years. I don’t know my exact timelines, but it’s been a concept within CPU-based compute, serving things like websites, backends… And people have been wanting this to exist for GPUs for a long time, and nobody’s really cracked it… And that’s the challenge we’ve been working on.

I know that you talked about websites, about backends, that sort of thing. Just in general, when we’re talking about serverless GPUs, in your mind is the use case that you have mostly on like the inference side, or on the training side of what practitioners are doing, or is there a little bit of both?

The vast majority, at least from what we’ve seen, is on inference. And I think inference is where the value of serverless comes in the most. There’s other tools for training where it’s not as latency-constrained, where you could use other infrastructure orchestration tools. But for specifically inference, serverless is one of the keys to the kingdom, if you could really do serverless well. So we just as a team have chosen to focus mainly on inference, real-time inference. So if there’s a user at the other end waiting for a response, we’re the ones responsible for making that response happen quickly.

Gotcha. And why has it taken so long to get to serverless GPUs, versus serverless CPUs?

One of the biggest problems in serverless is what’s called the cold boot time. Cold boot as in, you don’t have servers running; a request comes in, and that request coming in triggers a server scale-up going from zero to one, then one to many. And the time it takes in order to get resources provisioned and ready to handle requests in CPUs can take a couple seconds on a platform like AWS Lambda; it could take multiple seconds, maybe 10 seconds for cold boot. And that’s just simply spinning up the environment, spinning up a container, or a micro VM, whatever they’re running, and getting an HTTP server ready to handle that particular call or set of calls for the user before then shutting down.

So cold boot has been a big blocker, and it’s primarily the initialization time of the application before handling jobs. On GPUs and machine learning - exponentially harder. Reason being we’re running 20-gigabyte models. Those models can’t be taking up RAM before a call comes in, because that is not serverless; then you’re just running an always-on replica. So the cold boot problem is deeply exaggerated when you get to GPUs, because not only do you need to provision the GPUs and the environment or the container, you need to load that model from disk, onto CPU, onto GPU. That process could take 10 minutes for some models, and it’s just been a pretty huge blocker for most GPU use cases. So for that reason, this product hasn’t existed before.
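
To make that cost concrete, here’s a minimal sketch (the checkpoint path is hypothetical, and it assumes a PyTorch state dict on local disk plus an attached GPU) of the two hops a GPU cold boot has to pay for: disk to CPU RAM, then CPU RAM to GPU RAM.

```python
import time

import torch

CHECKPOINT_PATH = "/models/my-large-model.pt"  # hypothetical path; tens of GB on disk

t0 = time.time()
# Hop 1: disk -> CPU RAM (read and deserialize the weights)
state_dict = torch.load(CHECKPOINT_PATH, map_location="cpu")
t1 = time.time()

# Hop 2: CPU RAM -> GPU RAM (copy every tensor onto the device)
state_dict = {name: tensor.to("cuda") for name, tensor in state_dict.items()}
torch.cuda.synchronize()
t2 = time.time()

print(f"disk -> CPU: {t1 - t0:.1f}s, CPU -> GPU: {t2 - t1:.1f}s")
```

For the largest models, those two hops together can run into minutes, which is the gap Erik is describing.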

Definitely not trying to delve into the secret sauce, if you will… But can you lay out the landscape of how you even start to think about that problem? Like, what are some of the different ways you might address it? And maybe, as you develop competition over time, different orgs will probably take different approaches… How do you even think about that landscape? Because that seems like a daunting task, when you talk about 10 minutes to get it moved over, and stuff. That’s huge. How do you even start to approach the problem?

So this is definitely one of our most prized pieces of IP, our cold boot tech, so I can’t dive too deep into the details.

No worries. Whatever works.

[07:57] What is publicly known - you’ve got to think about it in terms of constraints. Firstly, the constraint: you cannot take up GPU RAM. If you have a 40-gigabyte A100 machine and you put a model into that RAM, that portion of the RAM - or the machine entirely, if you’re not virtualizing it - is just taken. You paid for it, and if you’re not using it, it’s dead space; that’s massive GPU burn without any utilization. So the constraint is that the model can’t sit in RAM - at least not GPU RAM.

So when we go about the cold boot problem, what we’re really thinking about is, “How do we get the models, specifically the weights, as close to GPU RAM as possible, without actually occupying the more precious compute resources, like those 40 gigs of limited RAM?” That’s hard. But if you have a terabyte of storage on the machine, you can at least cache the model locally. The model can sit on disk passively between calls without sacrificing that piece of hardware, because you can fit so many more models onto the disk.

Gotcha.

And then you can start thinking about pre-caching this on the CPU, if the CPU has enough RAM. I’m not saying that’s something we do, but these are the frameworks in which you’d start thinking about it: “How do we get that model as close to GPU RAM as possible without actually taking up GPU RAM?” Because in the end, GPU RAM is where the cost goes. Once you use it, that machine is tied up, and it’s not usable for anything else.
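
As a rough illustration of that framework - not Banana’s actual implementation, and assuming the machine has plenty of CPU RAM - you can stage the weights in pinned host memory between calls, so a request only pays for the CPU-to-GPU copy rather than the full disk read:

```python
import torch


class WeightCache:
    """Stages model weights in pinned CPU RAM so a cold boot only pays for the
    CPU -> GPU copy, not the disk read and deserialization."""

    def __init__(self, checkpoint_path: str):
        # Paid once, off the request path: disk -> CPU RAM.
        self._cpu_weights = {
            name: tensor.pin_memory()  # pinned memory speeds up host -> device copies
            for name, tensor in torch.load(checkpoint_path, map_location="cpu").items()
        }

    def to_gpu(self) -> dict:
        # Paid per cold boot: CPU RAM -> GPU RAM.
        return {
            name: tensor.to("cuda", non_blocking=True)
            for name, tensor in self._cpu_weights.items()
        }
```

The trade-off is exactly the one described above: the cached copy occupies cheaper CPU RAM or disk instead of the scarce GPU RAM.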

In your experience - I mean, I know you’ve likely been talking to tons of different clients, different use cases, people really thinking about how their workflows could adapt to the serverless workflow… I’m just thinking about my own workflows; we’re running a lot of models, but none of the models on my team are receiving thousands of inferences per second, or anything like that. It’s very much in the zone where we have a burst of activity, then we’re down for a bit, not getting that much, and then maybe another burst that we need to process. So in that case, in my own use cases, I would probably be willing to put up with a somewhat longer cold start - a slower response when the model first comes up - with subsequent calls during that burst being much faster. What have you noticed with clients? What is the tolerance there? Where are you trying to get to, and what do you think is reasonable for most workflows, I guess?

I don’t have a perfect answer for you on this, in that, ideally, cold boots are zero.

Yes, that’s true, I guess. [laughs]

On a serverless platform in general, unfortunately, you do have to start thinking about the servers, because you want to avoid cold boots where you can. In the case of Banana, if you have a model that’s undergone a cold boot, it’s handled the first call and it’s ready to go, we have it configured to hang around for 10 seconds, just in case more calls come in… And that 10 seconds is completely configurable by the user. So if no calls come in, we consider it “Okay, we’ve gone through the surge, we can scale down”, and that particular replica scales itself down. If calls start coming in again, cold boots are incurred again, and it only scales up more replicas if the existing ones can’t handle the throughput.

So because we give users the ability to fine-tune their autoscaler, in a sense, or fine-tune maybe –

Configure…

[11:42] Yeah, configure; you can configure the autoscaler. So we have some users who choose to run always-on replicas, with a minimum replica count. So at any given time, maybe you have a baseline of two GPUs running. But you can surge to 20, if you need. So we have some users doing that. We have some users who have gone away from the default 10 seconds idle time to go longer, because they know they would rather pay for those GPUs to be up, and handle any traffic that may come in, than have more frequent cold boots.
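
As a hedged illustration of the knobs being described - the names below are hypothetical, not Banana’s actual configuration schema - the trade-off boils down to something like this:

```python
# Hypothetical autoscaler settings, for illustration only.
autoscaler_config = {
    "idle_timeout_seconds": 10,  # default: replicas hang around 10s after the last call
    "min_replicas": 0,           # 0 = true scale-to-zero; 2 = an always-on baseline
    "max_replicas": 20,          # ceiling for traffic surges
}

# Raising idle_timeout_seconds or min_replicas trades money for fewer cold boots;
# lowering them trades cold-boot latency for a smaller bill.
```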

The reason I give the context about Banana is that I’ve been really surprised by how few users increase their idle time. Right now, at least, the majority of the customers we’re serving are more price-sensitive than latency-sensitive - or at least they are given the general trade-off we give them, in that they can configure the idle timeout and through that tune how much they pay versus how much they wait. Most users would rather have machines shut down and then incur that cold start time. And that’s a great thing for us, because it allows us to chip away at this cold start problem and give users a strictly better experience: the faster cold starts are, the more willing users are to take those cold starts, because they’re less impactful on their inferences… And the less idle time you need to keep running on your GPUs after calls before they start shutting down, because it’s not as risky. Yeah.

So as you were describing that - it seems like a very interesting mesh of skills to do what you’re doing there… Because you obviously have to have a pretty good understanding of deep learning in general, and the AI space, and the performance characteristics around that. But you also have to go very, very deep in terms of network engineering, and architectural considerations, and such. It also brings different cultures together, for instance in terms of the choice of languages. Do you tend to go with one language for everything for simplicity’s sake, or do you go with different languages catered towards specific use cases? …by way of example, Python for deep learning-specific things, and Rust or C++ for infrastructure things? Or do you stick with one, like Python, for everything, because that way you have a simpler setup to govern? How do you make that call, strategy-wise?

So the obvious language for hosting ML model inference is Python. It’s almost a requisite, in that all of our users are running in it, so the framework we give users to build off of, which is essentially boilerplate for a server - that’s written in Python. We don’t need to maintain that too much; it’s an extremely simple HTTP wrapper. The vast majority of our work on the pipeline and infrastructure side is done in Go. So we’re probably 95% Go; we have some TypeScript for our web app, some Next.js that we’re running, and then when you get deep into the runtime, we work in C++ and CUDA as well. But only a small subset of our engineering team works at that level; the majority of us write pipelines and networking code in Go.

[16:17] I’ve gotta say, it’s kind of funny that you bring that up; Daniel and I love Go. We actually met in the Go community, because we’re both Go – we were at the time kind of like the two AI-oriented people in the Go community… So it’s just a little bit ironic to hear that.

That’s awesome. I have been so disappointed in Python… I mean, Python’s an amazing language; it’s where I learned my first bit of serious general-purpose programming. But I’m saddened that the language chosen for GPU programming, basically, is a language that has a global interpreter lock and doesn’t have great multiprocessing built in. I wish Go were the choice there. It doesn’t seem like it’s gonna happen, but I’m a huge fan of Go. I think it’s a great language to write in, and I could go on for a long time about this. In fact, one of the reasons I learned about the Changelog network was listening to the Go Time podcast.

Yeah, for sure. Shout-out.

Yeah, shout-out to that other podcast.

Yeah, definitely, definitely. It’s cool to hear about the setup of how you thought about this problem, and how you even structured the team, and that sort of thing. I’m wondering, at this point, if you could kind of just give us a sense for like, if I’m a data scientist, or even just a software engineer trying to integrate a model into my stack, what does the workflow as of now look like for me with Banana? What do I do to get a model up and going, and maybe just a couple examples of that, to give people a sense… It’s a bit hard on an audio podcast, but I’m sure you’ve done similar things in the past, so…

Yeah. Well, I’d love to give a visual demo, but going through it audio-wise, generally, the process looks like this… A lot of people are building off of standard models, say a Stable Diffusion, or Whisper; at least for this current hype wave of all these new, exciting open source models coming out.

Until next week.

Yeah, until next week, and then the next one comes out. Thankfully, we have these one-click templates that you could use on Banana. So in a single click, you could go from an open source model that somebody has published on Banana, and bring that into your own account, and start using it yourself. So within a few seconds, you could have a functioning endpoint, for popular models that have been put up by the community.

And then we see, naturally, the step beyond that: moving from effectively having an API, where you don’t really know what’s running behind the scenes, to forking that code, working on it yourself, and customizing it for your own use case. So if you’re doing some fine-tuning, or quite honestly if you want to go away from the standard, big model templates and roll it yourself - just deploy whatever deep net you’ve built - that’s where you start getting into the local dev iteration cycle. And this is where I shout out a previous guest, Nader over at Brev…

We recommend users go and have an interactive GPU environment, so that you could load your model, test it against some inference payload, shut it down, iterate… If you’re doing something like a Stable Diffusion, you want to make sure that the image transformations server-side are happening correctly. That’s where you iterate. You’re doing all of this within the Banana framework. We have an HTTP framework you could find open source online. That’s generally the building point for most users.

So you’re modifying a function within that, that is the inference function; it takes in some JSON, runs the model, returns some JSON. Do that iteratively until you have your customized model that works to the API you’re hoping for, and then you push that to GitHub. And then from there, you can go into Banana, you could select that repo, and we have a CI pipeline built in. So when you select that repo, we build the model, we deploy it; every time you push to main, we rebuild and redeploy.

[20:17] So for users shipping new, fine-tuned versions, it’s usually a matter of updating, say, a link to an S3 bucket. Then in the build pipeline, we bundle that model into the container itself, and get that deployed to the GPUs.
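
To make the shape of that workflow concrete, here’s a minimal sketch of the kind of handler being described: one function that loads the model during the cold boot, and one that takes JSON in and returns JSON out. The function names, the Hugging Face model, and the field names are illustrative; the real boilerplate lives in Banana’s open source template.

```python
from transformers import pipeline

model = None


def init():
    # Runs once per replica, during the cold boot: load the weights onto the GPU.
    global model
    model = pipeline("text-generation", model="gpt2", device=0)


def inference(model_inputs: dict) -> dict:
    # Runs per request: JSON in, run the model, JSON out.
    prompt = model_inputs.get("prompt", "")
    result = model(prompt, max_new_tokens=64)
    return {"output": result[0]["generated_text"]}
```

You iterate on the inference function locally against sample payloads, push to GitHub, and the CI pipeline rebuilds and redeploys on every push to main, as described above.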

So I’m kind of curious, and this is sort of a follow-up, largely because of the medium we’re in; since we’re audio only, and we don’t have the ability to show the process that you’re describing… Just for clarity, your typical customer/user, what skills would they typically have to productively use Banana? What are those necessary minimum skills for them to be able to really engage productively and move through things?

A lot of our users are, quite surprisingly, full-stack engineers rather than deeply experienced data and ML people. So as long as you can wrap your head around using frameworks, or abstractions like Hugging Face, for example - if you can use a pipeline like that and pull it locally, that’s something you can deploy onto Banana.

So some Python expertise in order to write the code in the first place. It’s an HTTP server, so you write that… You wrap it around, say, a Hugging Face model; you don’t need to fine-tune it, you could use the standard models, and then learn fine-tuning later. And ideally, you do have some knowledge of Docker. Ultimately, what is deployed to Banana is a Docker file. If you build within our template, generally, you don’t need to do things that are too custom, unless you choose to. But a little bit of knowledge of Docker helps. So Python, Hugging Face, Docker - that’s effectively all you need in order to get something deployed onto Banana.

I’m just on the site now and kind of looking through some of your community templates, which are pretty cool… I mean, you have all sorts of things - Codegen, T5, Santacoder, all sorts of things with a sort of one-click Deploy button to get them up and going.

One question I had - when I deploy… Because it looks like based on your docs I can call it with like the model ID from Python, for example. So I could like integrate this directly in a Python app. Can I also call it sort of like as a REST endpoint, or something like that? Or is the primary use case a client integration?

We do have public documentation for the REST endpoints.

It’s not officially supported. We try to encourage people to go through our official SDKs, which at this point are Python, TypeScript, Go and Rust. That said, for anyone who wants to go directly at the REST endpoint, there’s documentation to do so.

We like being able to boil it down to a simple banana.run function, where you just give a model key, you give whatever JSON in you want your server to process, and then you receive the JSON out from that. But our goal is to be able to give people access to levels of abstraction that they choose to run in.

For example, because we have a public REST endpoint, people have integrated Banana into their Swift applications, or into their Ruby applications. So it’s an HTTP call, in the end. People could unwrap their APIs and go at it directly; feel free.
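
Based on the shape described here - a model key plus arbitrary JSON in, JSON out - a call through the Python SDK looks roughly like the sketch below. The import name, the API key argument, and the exact signature are assumptions on my part; check the SDK docs for the canonical form.

```python
import banana_dev as banana  # assumed package/import name

api_key = "YOUR_API_KEY"      # assumption: an account-level key is also required
model_key = "YOUR_MODEL_KEY"  # the key for the model you deployed
model_inputs = {"prompt": "a photo of a banana wearing sunglasses"}

# JSON in, JSON out - the SDK wraps the underlying HTTP call.
result = banana.run(api_key, model_key, model_inputs)
print(result)
```

Because it is ultimately an HTTP call, the same request can be made from Swift, Ruby, or anything else that can hit the documented REST endpoint.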

I guess that leads right into my next question, which is - does anything stand out in terms of how people are using this serverless workflow that maybe surprised you, based on what you’re seeing?

I’ve been amazed at the quantity of fine-tunes that are deployed through Banana. If you look at the analytics of people deploying from our one-click templates versus people deploying from custom repos, 80% are custom repos. And that means that people are coming to serverless because they have a unique API that they need to run somewhere, one they can’t simply run with a standard API provider, or even an API provider with fine-tuning features. They want to own the API themselves, own the application logic itself, fine-tune it themselves, and just dockerize that up and send it on to Banana.

[24:17] So the vast majority of our users are doing custom workloads, which to me was surprising. A little Banana lore - we previously started as an ML-as-an-API company; the idea of showing up, click the model you want, and you get an API for that. And there’s a lot of pull there; especially right now with the hype. There’s so many people who want to integrate AI into their applications without touching the AI at all. So it has been surprising for us seeing how many people are running custom code on us. It’s been validating of the idea that the platform approach versus the API approach has been the way to go.

Could you kind of walk us through what a typical one might look like, where someone’s doing that kind of custom thing, just to give us a sense of what it is that you’re seeing? Whether it’s fictional, but realistic, or a real case example, whatever works for you.

So one thing users are doing, just as a very basic example - if latency is extremely sensitive for them, and cold boots are particularly painful, they’ll engineer a conditional, like a boolean in the JSON that they send in, called a warm-up. So they’ll send warm-up = true, and make it so that server-side they don’t actually perform any heavy computation; it’s just intended as a warm-up call. So if architecturally they need servers fully warmed up by the time the actual inference starts running, they engineer this into their endpoint.
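
In handler terms, that pattern looks something like the sketch below. The field name and the helper are hypothetical; the point is that the warm-up call triggers the cold boot (provisioning plus model load) without doing any real work.

```python
def run_model(model_inputs: dict) -> dict:
    # Placeholder for the real, heavy inference path (see the handler sketch above).
    ...


def inference(model_inputs: dict) -> dict:
    # Hypothetical field: the caller sends {"warmup": true} ahead of real traffic.
    if model_inputs.get("warmup"):
        # By the time we get here, the replica is provisioned and the model is
        # already loaded, so returning immediately is enough to "warm" it.
        return {"warmed_up": True}

    return run_model(model_inputs)
```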

Another thing is if people want to run fine-tunes, or run multiple models side by side, and start doing some model chaining, we see people building that into Banana as well.

And then, lastly are just – basically, state-of-the-art moves so fast right now that the second Stable Diffusion launch, for example, suddenly, there’s inpainting. And inpainting is the next thing that came out a week later, and that’s some random code people found in a GitHub, and they integrated themselves. So customization, in that sense, allows users to stay as far ahead as they possibly can, if it’s necessary for their use case.

Could you highlight something you have in your mind as maybe like a workflow that would not be appropriate for the sort of serverless GPU infrastructure? Like you say, fine-tuned models, inferencing, using these state of the art templates - is there something where you would say, “Hey, maybe that’s not fitting for the serverless use case”?

Yeah. So, inference land - if you have completely steady traffic all the time, don’t use serverless. You’ll get unnecessary cold boots, it just slows down your inference, and you’re paying effectively the same. So that’s the inference side. Training side, we’d like to think that you could currently train on Banana, though I often find that training is a more interactive experience, at least in the initial prototyping phase. Once you have pipelines built to, say, automatically collect data and batch-train, that actually does work on Banana, because you can just fire that data as the payload, train the model server-side, upload it to S3, return the call, and then the replica shuts down.
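
Sketching that batch-training flow under stated assumptions - the bucket name and the fine_tune helper are hypothetical, and it presumes the batch of data fits in a single request payload - it might look like this:

```python
import io

import boto3
import torch


def fine_tune(examples: list) -> torch.nn.Module:
    # Placeholder for a real training loop over the batch of collected examples.
    ...


s3 = boto3.client("s3")


def inference(model_inputs: dict) -> dict:
    # The payload carries the training data collected since the last batch run.
    examples = model_inputs["examples"]
    trained = fine_tune(examples)

    # Persist the artifact so it survives the replica scaling back down to zero.
    buffer = io.BytesIO()
    torch.save(trained.state_dict(), buffer)
    buffer.seek(0)
    s3.upload_fileobj(buffer, "my-model-bucket", "checkpoints/latest.pt")  # hypothetical bucket

    return {"status": "trained", "num_examples": len(examples)}
```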

But most training jobs - or most exploratory training jobs - I would not recommend doing on serverless, in part just due to the observability that you need: the tracing, setting up things like TensorBoard (this is dated tech at this point), smart visualization tools… Also, keep in mind I’m not a training expert. Perhaps there are places in training where people would see value in serverless, but generally, I’d recommend avoiding it there.

[28:01] And then lastly, if you have any jobs that are batched - as in you know exactly when they’re going to happen - it’s a bit easier to automate your own infrastructure and build it yourself. Ideally, we make serverless so good that you don’t need to think about that, but in the current state of serverless, for a lot of batch processing jobs - say you’re running an indexer across an internal database and you don’t need it running all the time - porting it into serverless may be a bit too much lift versus just doing it yourself.

I’m also looking through your website while we’re talking, and I’m in the docs, and I kind of hit the SDK area, which you kind of talked about a little bit ago, with the different SDKs, in Python, Node, Go, REST… Did you mention Rust earlier, or did I mishear that as REST? I may have misheard something.

I did mention Rust. I actually don’t know if we have it documented. We launched it two days ago, if I recall.

Gotcha. Well, so the thing that got me thinking here - that’s very leading edge; it’s very like out there. I’m kind of getting the sense that your customers are adopting more forward-leaning languages in general for what they’re doing, and that’s why they’re leaning forward into this new concept of serverless GPUs. Is that consistent with what you’re seeing? Are you really kind of targeting the types of software developers that are kind of early adopters, paving the way, versus somebody that’s maybe in some of the older, more enterprisy languages, maybe not quite as risk-taking, and such?

That’s very much in line with what we’ve been seeing. We find that a lot of our users are adamant Vercel users, as an example. So they’re in Next.js. They’ve chosen a relatively modern framework to build their frontend apps in, and they make the same decisions for their backend. They’re often TypeScript-forward. If they want to do systems level, they’ll do Rust or Go; for these reasons, we’ve chosen to offer these official SDKs.

Yeah, that’s really interesting. One of my questions in kind of thinking about this is like the different use cases that you could have, the different industries that are rapidly adopting AI, integrating it in their software stacks… Everybody’s adopting AI, right? But it’s certainly making a lot of strides in certain areas… And certain industries, let’s say healthcare or something like that, have very unique constraints around even like their own inference data leaving to go to some hosted model somewhere that’s not in their own infrastructure… But in other words, when I go to Banana, I see all I have to care about is like deploying a model, there’s my model ID, I can think about like the timeout and all of that, it’s all very functional, and I don’t even have to give a thought for where that’s running.

I can see the opposite end of that - certain industries would probably be a little uncomfortable with it - but there’s a whole lot of developers who just want to bootstrap these amazing AI-powered things very rapidly, and there are so many things coming to market like that… So I guess Banana fits in that way. Do you have any plans in the future for something like Banana serverless, but connected to my own AWS infrastructure - running in the Banana way, but on my own cloud?

Short answer? Yes. Long answer…

It’s complicated.

…it’s gonna be a long time. Yeah, it’s very complicated. One of the things that we see with serverless is that we have economies of scale from sharing everyone as tenants within our cloud, because that allows us to do more efficient bin packing, and makes it so that when you’re not using a server - when the server containers shut down - you’re not charged. If you’re running on your own cloud, you still need to have the underlying resources running.

[32:04] We’re a venture scaled business, we want to hit that million-dollar annual revenue, ideally. Or sorry, not million-dollar, a hundred-million-dollar annual revenue; ideally more. And I think getting into that, we’re eventually going to have to start thinking about how do more traditional enterprises integrate this… Though choosing our niche right now, we see significant pull that could get us to $10 million annual just from these new teams who aren’t bound by such constraints of needing to run in their own cloud.

So long answer, restated, we’ll get to it eventually. And I’m sure it’ll be a necessary part of the product, but it loses out on a lot of the magic that we’re currently providing. So we’d rather just focus on these new and upcoming startups that are running on us.

Yeah, that makes a lot of sense. It does make me wonder, because you are creating so much magic for the users, and a lot of that - like you’re saying - is thinking about what GPUs you’re spinning up, how you’re bidding on them, how you’re allocating them… You can get GPUs from a lot of places, there are a lot of different scales of pricing, there are a lot of different ways to run GPUs in the cloud… Have you found any good practices, or things that you’ve found to be useful generally, in terms of thinking about using GPUs in the cloud, that you would love to pass on to listeners?

So we use this phrase called “Skate ahead of the puck.” It’s a phrase from hockey, where - don’t go to where the puck is, go to where it’s going. So applying that to auto-scaling - auto-scaling really has two components. You’re auto-scaling the underlying nodes, the hardware that’s running the GPUs, that’s running the Kubernetes cluster, or whatever your deployment target is. And then secondly, you’re auto-scaling the deployments themselves, going from replication of zero to one, to many, within the confines of whatever nodes you have set up.

So you’re effectively auto-scaling two things: Kubernetes pods, and the nodes themselves. So my recommendation - if people are building things like this in-house, what they should absolutely do is use a platform that has an automation API for the underlying VMs. Right now, GPU cloud is sort of the Wild West; there are a lot of new players. Traditional hyperscaler clouds like Google Cloud, AWS, Azure have the automation, but the GPU prices are not as competitive as what you can get on some of these newer clouds.

So my biggest recommendation for people building mature systems would be to choose a provider where you ideally get guaranteed access to GPUs, which allows you to scale your GPUs up ahead of the demand of whatever workloads you’re running within your cluster. And the workloads you deploy don’t have to be homogenous; as long as you maintain GPU capacity to handle them, you should be good. And because you’re auto-scaling the applications within Kubernetes, you have a little more lead time for the super-slow scale-ups on the GPUs.
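
As a rough, self-contained sketch of “skating ahead of the puck” - every name and number here is made up for illustration, not a real Kubernetes controller - the slow layer (GPU nodes) is kept ahead of what the fast layer (replicas) currently needs:

```python
class Cluster:
    """Toy stand-in for the two things being auto-scaled: GPU nodes (slow to
    provision) and model replicas (fast to start, but bounded by node capacity)."""

    def __init__(self):
        self.gpu_nodes = 1
        self.replicas = 0

    def provision_gpu_nodes(self, count: int):
        # In reality this is a cloud API call that can take minutes per node.
        self.gpu_nodes += count

    def scale_replicas(self, desired: int):
        # Replicas can only run where GPU capacity already exists.
        self.replicas = min(desired, self.gpu_nodes)


HEADROOM = 2  # GPUs kept ahead of current demand - "ahead of the puck"


def reconcile(cluster: Cluster, queued_requests: int, requests_per_replica: int = 4):
    desired_replicas = -(-queued_requests // requests_per_replica)  # ceiling division

    # Slow layer first: keep node capacity ahead of replica demand.
    shortfall = desired_replicas + HEADROOM - cluster.gpu_nodes
    if shortfall > 0:
        cluster.provision_gpu_nodes(shortfall)

    # Fast layer: scale replicas within the capacity that already exists.
    cluster.scale_replicas(desired_replicas)
```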

This has been a super-instructive conversation. I’m learning a lot. I want to extend your analogy one question further, because you’re talking about skating ahead of the puck; not skating to the puck, but to where it’s gonna go. You are pioneering this field, you are out there on the front, you are leaning forward, and you are supporting other people and other organizations that are trying to lean forward as well. So I’m going to ask you, where is the puck going? Short-term, middle, long-term, how do you see the future? For those who are not in your industry, but are going to be supported by you, tell us the vision. Where is it going?

Fine-tunes are going to be huge. I think there’s two camps for where AI is going to be going. There’s the one-model-rules-them-all camp, which is there’s going to be some mega model that does everything, and then there’s the other camp, which is what we’re leaning into, which is the best model for you as a user is a model that’s trained on data from you; specifically you. And we see customers deploying fine-tunes on us, not just for their use case, but for their end user.

Imagine you are building a writing assistant app - how do you fine-tune for every single one of your end users, deploy that, and make it so that each user has a unique model as essentially a companion, almost a clone of them? Where the puck is going is that every human on Earth, just like they have a phone in their pocket, is going to have a fleet of models fine-tuned just on them. And that’s one thing we’re excited about with serverless: in order to do that viably, you’ve got to have serverless. You can’t have it running all the time. So we’re very excited in this sense. If you’re not looking into user-level fine-tunes, I think it’s a very interesting space to be in, because it gets you so much further than any application-level stuff you can do to make the experience better.

That’s awesome. Yeah, I think that’s a super-exciting way to close out the conversation. This is a really exciting time to be in this space, both in terms of what’s possible with fine-tuning and those sorts of technologies, but also like new infrastructure coming up, like what you’re building. So thanks so much for taking time to chat with us, Erik. It’s been a real pleasure.

This is awesome. I appreciate it, guys.

