Practical AI – Episode #240

Generative models: exploration to deployment

get Fully-Connected with Chris & Daniel


What is the model lifecycle like for experimenting with and then deploying generative AI models? Although there are some similarities, this lifecycle differs somewhat from previous data science practices in that models are typically not trained from scratch (or even fine-tuned). Chris and Daniel give a high-level overview of this lifecycle and discuss model optimization and serving.




Notes & Links



1 00:07 Welcome to Practical AI
2 00:43 Daniel at GopherCon & IIC
3 03:31 Local inference & TDX
4 08:23 Cloudflare Workers AI
5 09:43 Implementing new models
6 16:14 Sponsor: Neo4j
7 17:11 Navigating HuggingFace
8 20:21 Model Sizes
9 24:34 Running the model
10 30:20 Model optimization
11 34:17 Cloud vs local
12 39:26 Cloud standardization
13 43:00 Open source go-to tools
14 46:21 Keep trying!
15 48:18 Outro





Welcome to another Fully Connected episode of the Practical AI podcast. In these episodes, Chris and I keep you fully connected with a bunch of different things that are happening in the AI and machine learning community, and we talk through some things to help you level up your machine learning game. My name is Daniel Whitenack, I am founder at Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

I’m doing great today, Daniel. How’s it going?

It’s going good. This week, I was – well, it’s been an interesting couple of weeks for me, in that I was at the Intel Innovation Conference out in San Jose the week before last… And then this week, I was at the Go programming language conference called GopherCon, and taught a workshop there…

Oh, you’re so lucky…

So that was really enjoyable… So two weeks in sunny California, or mostly sunny California, I guess… That was really cool. So maybe even just highlighting a couple of cool things that are happening in those communities at Intel, there were a couple of things that were highlighted that might be of interest. One is, it seems like Intel is really diving into the idea of AI-enabled applications on your local machine, which I know is something we might talk about a little bit in this show in particular… That is like, hey, if I want to build a desktop application that people actually run on their laptop, and I want that to run Stable Diffusion as part of the application, and not reach out over the network to some API, how would I build that, and what would those sort of like - AI PCs is, I think, what they were calling them - what would those have to look like? And they’re thinking about that with some of their processors, which is interesting.

And then on the data center side, they had a bunch of things, including announcing the Intel Developer Cloud, which is cool because you can go on there, similar to other cloud environments, and spin up either a VM, or actually connect to a bare metal instance that has their latest generation of processors, including these Gaudi2 processors, which are from Habana Labs; they were acquired by Intel, I forget when… So they would be sort of on the data center side; you’re running accelerated workloads on these. And we’re actually running some of our Prediction Guard stuff on these Gaudi processors, and seeing really great performance.

So those are a couple of things highlighted from there… And yeah, I don’t know, have you heard those themes in your conversation as well, in terms of either new processors, advances in data center technology, or this kind of local inference side of things?

I have quite a bit, actually… And I’m certainly not an expert on microelectronics by any stretch, but I have friends who are, and listen to them closely when they talk… There’s a bit of an ongoing revolution on the microprocessor side. Many of us that have been in the AI world for a long time, there had been – for instance, GPUs from NVIDIA had been kind of a core to that, but there’s a lot of chip types that have been coming out by a number of different vendors to compete with that. Famously, Google was probably the first one well-known with their TPUs, Tensor Processing Units… But there’s all sorts of specialized chips and chiplets that are coming out, that are enabling these types of things. So I think Intel is definitely one of the global leaders in that, and looking forward to having – it’ll be nice when everyone’s laptops and phones and everything are all completely equipped with everything they need.

Yeah, yeah. It’s super-interesting, especially for use cases where it’s like your personal assistant, AI-enabled personal assistant, that really is tied to you personally. Applications like that - I think you’d want to run a lot of those things locally, and not be sending a lot of that data all around. So that’s kind of interesting.

They also talked a lot about confidential computing, which is an interesting topic that I think maybe some of our audience at least wouldn’t be as familiar with from what we talk about on this show, but it is very connected to the AI world in the sense that if you are running kind of secure workloads through AI models, whether you’re doing that on NVIDIA chips or other chips like we’ve talked about, there are ways and toolkits to enable you to actually secure the environments that you are running those models in and actually provide attestation to know that nothing has been tampered with inside of those kinds of secure environments.

So I’m going to surprise you… I actually know quite a lot about that; those are trusted execution environments. I’ve touched on those quite a lot…

I think Intel’s version is like TDX, Trusted – yeah, something…

[00:06:09.06] They have a couple of different versions. That’s the one that’s out in the marketplace right now. But yeah, it’s the idea of ensuring that when you – normally, if you’re running a program and it has to transit, obviously, from system to system, every system has a processor it’s processing on… And even if you’re running encryption at the application layer, you have to unwrap that encryption for the processing to happen in the chip. An adversary, if it’s on the order of a major nation-state, has the ability to steal unencrypted information that had been encrypted in transit, straight out of the processor memory. And Intel and other vendors are starting to push Trusted Execution Environments and products and services around that, which protects and guarantees the safety of that data inside the processor. Something I’ve spent some time on, actually.

Yeah, that’s super-interesting, and I think even the CTO in his talk had like a T-shirt that sort of had a Venn diagram kind of thing between like security and AI… And at the intersection of that is a lot of, you know, what he talked about, this sort of idea that “Hey, whatever hardware you’re running on, if you can combine AI workloads with these sort of trusted or confidential computing ideas”, that can be very powerful, and take care of at least some of the security and privacy concerns that people have with AI workloads in general, which is cool.

So yeah, the two are converging in a big way, because while Trusted Execution Environments, which are referred to as TEEs, have been around for years in processors, we now have large federated workflows, which is really classic on cloud-based AI jobs, where you’re distributing an AI inference or training across many, many systems, with very, very important data that you would not want to get into an adversary’s hands. That federation is really kind of pushing AI and chip providers together in that way to guarantee that. We didn’t see lots of workloads that would be falling in that category until we hit the AI space, and it’s chock-full of them.

So I keep remembering things that happened over the past couple of weeks while I’ve been traveling, and people I’ve mentioned… But one maybe other noteworthy thing for people to be aware of, more on the infrastructure side, which I think we will talk a little bit more about in this episode, is that Cloudflare announced their Workers AI, and I think this is the latest in this sort of series of serverless GPU solutions. So Workers AI is Cloudflare’s version of the serverless GPU type environment that we’ve talked about with things like Modal, or Baseten, or Banana… There’s a lot of these coming out, but I think it’s worth noting that a very large player like Cloudflare is now kind of dipping into the serverless GPU space, which I think also signals that we’ll be kind of seeing on the cloud side more and more push towards serverless GPU workloads and environments that support that.

Interesting. Very interesting.

[00:09:31.21] Well, that’s a bunch of infrastructure and confidential infrastructure and computing and security stuff that has crossed our paths in the past couple of weeks… But one of the questions that you asked me leading up to this recording was about - things are moving so fast, and I think deploying and managing an AI workload may look different now than it even looked six months ago… And it’s been a while since we talked through the kind of developer or technical team perspective on how you might, if you want to use one of these models that’s coming out all the time… So Mistral AI’s model just came out, the ones that received a huge, amazing amount of funding earlier in June, and now they have their first model out. It’s released under Apache 2, so you can download it… So the question is, let’s say you want to use one of these great models that’s coming out these days, and you want to host it in your company’s infrastructure, or even just play around with it as a developer… What does that look like currently? Because there’s also, along with these models that are coming out, new tooling that’s coming out all the time. So what does that look like these days, and what are the various options and things to consider as you’re interacting with these models and considering even hosting them yourself, or integrating them in your own infrastructure? That’s a fair question, because it’s been a while since we talked through some of the infrastructure, I think, Chris.

It has. And for what it’s worth, I’m gonna brag on you for a second, since I know that you would not do that to yourself… With Daniel being the founder of Prediction Guard, this is a topic that he is a global expert in; really, really knows what he’s doing… And as we were talking about – I’ve had so many people asking me these questions that Daniel was just talking about lately, and I was like “Well, one of my best friends is a real pro at this…” So thank you – if you can kind of start walking us through… And this is a moving topic, as you just pointed out; it has changed in the last few months, and will continue to evolve over time… But yeah, if you can start walking us through what that looks like today. We’re in the beginning of the fall of 2023; something that might help the rest of us for at least the next few months.

Maybe one note on this is – I’m also getting these questions all the time, and like you say, I’m deploying models all the time with Prediction Guard… I think a lot of people, if you’re a developer or infrastructure person, you just have that natural desire – even if you end up using a model that’s behind some API that’s hosted by someone else, it can be useful and instructive in building your own intuition even to just try deploying one of these models. See what’s involved, see how they run, that sort of thing. It’s also kind of worthwhile, from my perspective, to experiment with different models, before you, say, lock yourself into a certain model family or something. It’s relatively easy now with the tooling to get somewhat of a sense of how these different models perform, and build up that intuition for yourself, even if you end up using a model that’s behind an API.

I mentioned I was at GopherCon this week, and that was some of the questions that came up, too. I taught a workshop on generative AI, and that was a good, long discussion in there that people had a lot of questions about, was “Hey, let’s say I didn’t want to use one of these APIs. How do I pull down a model and use it?”

So yeah, let’s jump in, let’s first maybe talk about something that I know that we’ve touched on before, but just to emphasize here… Where can you get models? And let’s say that we’re putting aside for a second the kind of closed proprietary chunk of models; these would be ones from like OpenAI, Anthropic, Cohere etc. They have their own APIs, they host those models… Let’s say that we’re interested in either – an open access model, but it could be either an open and somewhat restricted model, or an open and somewhat permissively-licensed model. And we’ve talked about that on the show, too… For example, there’s models that come out that are licensed for commercial use, or non-commercial use, or research purposes only… But let’s say you want to use one of these open access models.

[00:14:11.10] The first question that might come up is where do I find these models? The best place that you can find these models is on Hugging Face. So if you go to the Hugging Face website and you click on Models, you’ll see that there’s, at the time of this recording, around 345,000 models on Hugging Face.

A few to choose from.

Yeah, yeah, a lot to choose from… And think about this, those of you that are familiar with GitHub - how many GitHub repositories are there? There’s a lot of GitHub repositories that are like someone tried something in one afternoon, and uploaded something to their GitHub repo, right? It doesn’t mean that’s the most useful thing for you to use in your workflows, although you could kind of learn from it maybe. It’s similar on Hugging Face; there’s a lot of people that might be like “Oh, I tried fine-tuning this model, and now I uploaded it to my repo on Hugging Face.” And similar to GitHub, one of the things that you want to look at just as a practitioner is how many people are downloading the model, and how many people are hearting the model, or liking the model… And you can filter by those things. So if I click on Models, I can then click on a filter like the task that I’m interested in, a computer vision task, or an NLP task, or an audio task… And then I can look at both the trending models and how many times models were downloaded, and filter by things like licenses and languages… So yeah, I think the first thing to be aware of is just the landscape of models and where you find them. And the best place for that currently, although there are other repositories, is by far Hugging Face. Go there and treat it similarly to GitHub, in that there’s going to be a lot there that might not be of interest to you, but there’s going to be some really great things there as well.

Break: [00:16:15.28]

Okay, Chris, I’m on Hugging Face, and I see a bunch of different models that are potentially available to me, and I can click on, for example, Object Detection, and see that the trending model that I’m looking at is from Facebook, DETR ResNet 50… It seems like people have used ResNet quite a bit. 603,000 downloads, and so maybe that’s a good place I want to start if I’m looking at object detection.

If I go to, let’s say, automatic speech recognition, up at the top would be OpenAI’s Whisper model, which is a great choice, and released openly, that you can use for speech transcription. If I go to, for example, text generation, which a lot of people care about these days, the trending one right now is this new Mistral 7 billion model that we mentioned earlier was just released. So let’s take those as our kind of examples. Let’s say I want to run something like OpenAI Whisper, or I want to run text generation with Mistral, 7 billion… Or there’s even a range of sizes of models; the 7 billion model from Mistral, Falcon 180 billion was released recently…

So one question that I think people have is “How do I know which model might serve my task well?” And one thing I’d like to recommend to people is even before you try to download the model yourself and run it, you can go in and click on these models; like, if I click on Mistral 7 Billion version 0.1, if you notice on the right-hand side of the Hugging Face model card for that model, a lot of these models already have a hosted interactive interface that you can just click the Compute button and see the output of the model. So it’s kind of like a playground that you can see a bit of the output of; you can do the same thing with a lot of computer vision models, or audio models… And then below that, you’ll see a little thing called “Spaces using Mistral 7 Billion”, or if you’re on Whisper, “Spaces using Whisper.” These are little demo apps that are actually hosted within Hugging Face’s infrastructure, where people have actually integrated Mistral 7 Billion. And a lot of these are kind of just a simple input/output interface.

And so even without downloading the model, if you’re just trying to get a sense for what these models do, you can click through some of these spaces that are using them, or just look at that kind of interactive playground feature, and just try – upload some of your own prompts, or upload some of your own audio or whatever that is, to see how the model operates. I think a lot of people might miss this, if they’re just scrolling through.

Let me ask you a quick question. If you’re looking and you’re trying to narrow down which model you want to pick, we’ve talked on previous episodes about some of the concerns that go with different sizes, and such… So are there some models that, unless I have a very large infrastructure available to me, many GPUs for instance, that I should probably disregard? Is there like a minimum and maximum practical threshold, that let’s say that I have some hardware, but not everything that I would dream about, that I might want to go for?

So there’s kind of an answer to this, and then a follow-up…

One is for these sorts of transformer language models, oftentimes if you go much beyond 7 billion parameters, maybe pushing it up to kind of 13 to 15 billion parameters, you’re not going to be able to run it very well, just by default, by downloading it and running it with the kind of standard tooling on anything but a single accelerated processor, like a GPU. And even then, most of the time not on a consumer GPU. However, the follow-up to that is that a lot of people have created open source tooling around model optimization, that may allow you to run these models on consumer hardware, or even on CPUs. And I’d like to talk about that here in a bit, that a lot of times you may want to consider this sort of model optimization piece of your pipeline when you’re considering how to run the model… Because sometimes the sort of default size and default precision of the model might not be best for you, both in terms of your needs, in terms of performance, or in terms of the hardware that’s available to you.
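As a rough illustration of why size and precision matter here, the sketch below estimates the memory needed for the weights alone as parameters times bytes per parameter. The precision table and the function name are assumptions made for this example, and the estimate deliberately ignores activations, caches, and framework overhead, so real usage will be higher.

```python
# Bytes needed to store one parameter at a few common precisions.
# This covers weights only; activations and runtime overhead are ignored.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Model sizes mentioned in the conversation: 7B, 13B and 180B parameters.
for name, params in [("7B", 7e9), ("13B", 13e9), ("180B", 180e9)]:
    row = ", ".join(
        f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

Under these assumptions a 7 billion parameter model needs roughly 14 GB for its weights in 16-bit precision, which is why it tends to need a data center class GPU by default, while a 180 billion parameter model does not fit on any single card without optimization or distribution.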

[00:22:10.25] But I would say, in this phase of like “What model is going to be good for me?”, go ahead and put that sort of hardware concern - although it’s important, put it a little bit to the side and focus on which model is giving me the output behavior that I want. Because you have a certain task in mind, and if you could figure out “Hey, this model kind of does what I want, and it seems like it’s giving pretty reasonable output”, and then you find out “Oh, well, I can’t run it on the GPU that I have, or I need to figure out how to run this on a CPU”, then that kind of narrows down the type of tooling that you’re going to have to use for optimization. Or you might not need to optimize at all.

So kind of start with the smaller models, and build up to something that fulfills the behavior requirements that you have by just using some of these demos, using some of these spaces… And then think about “Okay, I’ve now figured out I need Falcon 180 Billion… So what does that look like for me to run that in my own infrastructure?” Then there’s kind of a follow up series of things that we can talk about related to that.

Gotcha. Thanks. So I was kind of getting ahead of myself then a little bit in terms of worrying too much about hardware first.

Yeah, yeah. I think the question – well, maybe it’s because I come from a data science background, right? My data science experience always tells me “Start with the smaller models, and work your way up to the bigger ones until you find something that behaves in a way that will work for you.” And then figure out the kind of infrastructure requirements around that. Because if you start smaller and work to bigger, it’s going to be easier to work with that smaller model infrastructure-wise, and latency-wise, and all that. But some people do have really complicated sets of problems, where they need a really big – like, let’s say I wanna produce really, really, really, really good synthesized speech, or really, really good transcriptions from audio… I’m going to need maybe a bigger model than a really, really small OpenAI Whisper model. So it has to do with the requirements of your use case as well, I would say.

Okay. So let’s say you identify a model, and you’ve kind of picked what you want to do. Where do you go from there?

Yeah. So let’s say that you’ve picked a model, and let’s take the first case, where it’s a model that could reasonably - or you think it could reasonably fit on a single processor, a single accelerator; or by your own sort of infrastructure constraints, you need it to operate on a single accelerator. And even if you don’t have those infrastructure constraints, I think one recommendation I often give is it’s just way easier to run something on a single accelerator, or a single CPU. So I personally recommend to people, even if it’s a bit larger of a model, convince yourself that you can’t run it on a single accelerator or a single CPU before you make the jump to spin up a GPU cluster, or something like that. It’s just a lot harder to deal with, even with some good tooling around that side, which we can talk about.

So yeah, let’s say that you’ve found a model… I don’t know, let’s say it’s our Mistral 7 Billion model. You should be able to run that on a single instance with an accelerator or a GPU. I would then look at that model, and depending on the type of the model – oftentimes in the model card on Hugging Face, hopefully, if it’s a nicely maintained model on Hugging Face, then it will likely, just like a readme on GitHub, it will likely have a little code snippet that says “Hey, here’s an example of how to run this.”
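For a text generation model, the kind of snippet a model card provides often looks something like the sketch below. This is not taken from any particular model card: it assumes the torch and transformers packages are installed, the function names are made up for this example, and the heavy imports sit inside the function so the small helper can be used without them.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB for easier reading."""
    return num_bytes / 1024**3


def run_single_inference(model_id: str, prompt: str, max_new_tokens: int = 32) -> dict:
    """Load a causal language model, generate once, and report peak GPU memory.

    Assumes `torch` and `transformers` are installed; downloads the weights on
    first use, which can take a while for a multi-gigabyte model.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision on GPU roughly halves weight memory versus float32.
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    peak = torch.cuda.max_memory_allocated() if device == "cuda" else 0
    return {"text": text, "peak_gpu_gib": to_gib(peak)}
```

In a notebook this could be called as `run_single_inference("mistralai/Mistral-7B-v0.1", "Hello, my name is")`, then compared against what the notebook's resource monitor reports.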

[00:26:18.12] What I usually do in that case is I just spin up a Google Colab notebook – because I want to see how this thing runs, and how many resources it’s going to consume. So I’ll spin up a Google Colab notebook. If people aren’t familiar, Google Colab is just a hosted version of Jupyter Notebooks, with a few extra features, like you can have certain free access to GPU resources… There’s similar things from like Kaggle, and Paperspace and Deepnote and a bunch of others.

So spin up one of these hosted notebooks and just copy-paste that example code in that notebook, and try a single inference. And oftentimes, what you can do in these environments is, if you look up at the top-right corner of Google Colab, there’s a little Resources thing… And once you load your model in, you can actually look at “Oh, how much GPU memory am I taking up? How much CPU memory am I taking up?” And that gives you a good sense of “Hey, I loaded this model in, I performed an inference… If I just do nothing else - like, the most naive thing I can do - then I am consuming 12 gigabytes of GPU memory”, or something like that. And that kind of tells you, if you don’t do any optimization, then you’re going to need a GPU card that at least has 12 gigabytes of memory. And so maybe you use like an A10G, or you could use an A100; that might be a little bit overkill, in this case. But one of these with maybe 24 gigabytes of memory, you have a little bit of headroom there. Now you’ve narrowed down not only the model, but potentially the hardware – assuming you don’t do any optimization, potentially the hardware that you could use to deploy it. So as of yet, I haven’t spun up really any infrastructure. This is kind of my standard thing, where I’m like “Hey, what’s the deal with this model? How do I perform a single inference, and what kind of resources am I going to need?”
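The step from "I measured 12 gigabytes" to "which card do I need" can be sketched as a small lookup. The catalog below is a short illustrative list rather than a complete one, and the function name and the default 50% headroom are assumptions chosen for this example.

```python
from typing import Optional

# Memory per card in GB; a small illustrative catalog, not an exhaustive list.
GPU_MEMORY_GB = {"T4": 16, "A10G": 24, "A100 40GB": 40, "A100 80GB": 80}


def smallest_gpu_that_fits(measured_gb: float, headroom: float = 0.5) -> Optional[str]:
    """Return the smallest card whose memory covers the measured usage plus headroom.

    `headroom` is a fraction of the measured usage (0.5 means 50% extra).
    Returns None when nothing in the catalog is large enough.
    """
    needed_gb = measured_gb * (1.0 + headroom)
    candidates = [(mem, name) for name, mem in GPU_MEMORY_GB.items() if mem >= needed_gb]
    return min(candidates)[1] if candidates else None


# 12 GB measured after a naive single inference, as in the example above.
print(smallest_gpu_that_fits(12.0))
```

With the default headroom this picks the 24 GB card, which matches the reasoning in the conversation; with no headroom at all a 16 GB card would technically fit, but longer prompts or batching could push it over.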

It’s a nice little cheat code equivalent of finding out what you’re getting into, it sounds like…

Yeah, yeah, for sure. And if you happen to have – the other way I’ve done this in the past is if you happen to have a VM, or maybe it’s just your own personal workstation, and you have a consumer GPU card, if you have Docker running on that system, you could pull down a pre-built Hugging Face Transformers Docker image, and just run it interactively; open a Bash shell into that Docker container, and run an inference, just like I said, or spin up the model, load it into memory in Python… And then in another tab, or another terminal, just run docker stats, and it’ll tell you how much memory you’re consuming, and that sort of thing. Or run nvidia-smi, or the equivalent for other systems, or other processors, that would tell you how much GPU memory you’re using.
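If you would rather read the GPU numbers from a script than from a terminal, one option is to call nvidia-smi with its CSV query flags and parse the result. This is a minimal sketch that assumes an NVIDIA driver is installed and nvidia-smi is on the PATH; the function names are made up for this example.

```python
import subprocess
from typing import List, Tuple

# Ask nvidia-smi for used and total memory per GPU, as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse lines like '12288, 24576' into (used_mib, total_mib) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return one (used_mib, total_mib) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `read_gpu_memory()` before and after loading the model gives a rough view of how much memory the model itself takes.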

So this is kind of a next phase that I do. The first is maybe what kind of model do I want, the second is “How do I run an inference with this model?” Then there’s kind of a whole branching series of funness, which is either you go down the path of saying “I want to optimize my model in some way to run it either faster, or on fewer resources”, or I want to go down the path of saying “No, this is fine. I can run it with the resources that I figured out it needs”, and then you kind of move on to the deployment side of things.

Break: [00:30:02.04]

Okay, Chris, let’s say that we want to follow the path on our choose-your-own-adventure that you want to do some model optimization on your model.

The reason you would want to do this is one of two reasons. One is “Hey, it turns out I crashed my Google Colab trying to run Falcon 180 Billion, because I ran out of GPU memory”, and it turns out you need more GPU memory for that, or multiple GPUs. And I either don’t have access to that, or don’t want to pay a bunch of money to spin up a GPU cluster and run the model in a distributed way. Or it’s maybe even a smaller model, and you want to run it either faster, or on standard, non-accelerated hardware.

I heard a talk at GopherCon about a workflow where people were running a model at the edge in a lab to process imagery coming off of a microscope. And it was all disconnected from the public internet. So in that case, you just have a CPU - maybe you need to optimize on the CPU… So there’s gradually more and more options that are out there to do this. Some people might have seen things like llama.cpp, which is sort of an implementation of the LLaMA architecture that’s very efficient, and it allows you to run LLaMA language models on like your laptop, or on – I think a lot of people were running them on MacBooks, with M1 or M2 processors.

If you want to kind of scroll through this set of optimization stuff, if you go to the Intel Analytics BigDL repo - that’s BigDL, like Big Deep Learning… First of all, the BigDL library does a lot of this sort of optimization, or helps you run these sorts of models in an optimized way… But they also have this little note at the top, which is actually a very – I’ve found it to be a very helpful little index as well. They say “This is built on top of the excellent work of llama.cpp, GPTQ, ggml, llama-cpp-python, bitsandbytes, QLoRA, etc, etc, etc.” These are all things that people have done to run big models in a smaller way, I guess would be the right way to put it. bitsandbytes is a good example of this. Hugging Face has a bunch of blog posts about this, where they’ve run the big BLOOM model in a Google Colab notebook by loading it not in full precision, but in a quantized way. But there’s a lot of different ways to do this, and that’s kind of a good reference to see a bunch of those different ways.
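To make "loading in a quantized way" a little more concrete, here is a toy sketch of one of the simplest schemes, absmax 8-bit quantization, written in plain Python. It is meant only to show the idea of trading precision for memory; the libraries named above use more refined schemes, and the function names here are made up for this example.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)

# Each value now needs 1 byte instead of 4 (float32), at the cost of a
# rounding error of at most half the scale per weight.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"scale={scale:.4f}", f"worst error={worst_error:.5f}")
```

The same trade applies at model scale: storing one byte (or half a byte) per parameter instead of two or four is what lets a large model fit into a notebook GPU, with some loss of fidelity that has to be checked against your task.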

At some point for a future show we should come back and revisit that. That sounds really cool.

Yeah, yeah. And I think it probably deserves a show in and of itself. People might refer back to an episode that we had with Neural Magic on the podcast, where they talked about the various strategies for optimizing a model to run on commodity hardware like CPUs. But there’s a ton of different projects in this space, both from companies and open source projects, like OpenVINO, and Optimum, and bitsandbytes, and all of these. So if you are needing to take this big model and either make it smaller, or run it more optimized on certain hardware, then you might want to go through this model optimization phase. Assuming you did that, or you didn’t need to optimize your model, then we get to deployment. Now, Chris, what’s in your mind when you think of where people might want to deploy models these days?

[00:34:18.07] Yeah, it’s one of those situations where a lot of people I’m talking to are trying to decide between cloud environments - and we’re seeing some people that had dived into cloud pulling back and investing in their own… As well as starting to explore some of the other chip offerings. So people are kind of reconsidering that “go cloud when it’s too big for you” approach now, and looking at these open models in their own hardware and trying to figure out “Okay, I don’t really know how to do that at this point.” So that’s where I’m really curious, is let’s say that we go ahead and buy a reasonable GPU capability in-house, but it’s not too big. What can I make of that, if I’m willing to do a little bit of investment, but we’re not talking millions and millions of dollars kind of thing?

Yeah, yeah, so it might be good for people to kind of categorize the ways that you might want to deploy an AI model for your own application. And even before I give those categories, I think I’d also normally recommend to people that – I think still the best way to think about deploying one of these models, if you’re deploying it to support some type of application in your business, or for your own personal project, or whatever it is, any type of scale, I think you’re gonna save yourself a lot of time by thinking about the deployment of the model as a REST API, and then your application code connecting to that model. A REST API, or a gRPC API, or whatever type of API you want. But the purpose of the model server is to serve the model. And then you have your application code that connects to that. Now, that could be running on the same machine, or the same VM as your application code, or it could be running on a different one. But as soon as you make that separation a little bit… I don’t really promote that people microservice everything, but I think in terms of model serving, it’s useful because you can take care of the concerns of that model, maybe the specialized hardware it’s running on, and then take care of the concerns of your application separately. And if your application is a frontend web app, or is an API written in Go, or Rust, or whatever it is, then you don’t have to worry about “Oh, how do I run this in a different language?” or that sort of thing. You just handle that through the API contract. So that’s maybe one…

Kind of classical separation of concerns, that any developer would be doing.

Yup, yup, exactly. And then you can test each separately, all of that good stuff.
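
As a rough illustration of the model-server separation described above, here is a minimal sketch using only the Python standard library. The `/predict` route, the JSON field names, and the `fake_model` function are assumptions made for this example; they are not from the episode, and a real deployment would replace `fake_model` with actual inference.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in actual inference."""
    return prompt.upper()


class ModelHandler(BaseHTTPRequestHandler):
    """Model server side: its only job is to serve the model over HTTP."""

    def do_POST(self):
        if self.path != "/predict":  # hypothetical route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"output": fake_model(payload["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the model server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_model(host: str, port: int, prompt: str) -> str:
    """Application side: knows only the API contract, nothing about the model."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(
            "POST",
            "/predict",
            body=json.dumps({"prompt": prompt}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["output"]
    finally:
        conn.close()
```

Because the application only depends on the JSON contract, the client half could just as well be written in Go or Rust, and each side can be tested on its own.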

But if we think about the categories of how you might deploy these things, there’s the case where you would want to run this in a serverless way. We already talked about what Cloudflare just released, but there’s a whole bunch of these options, like Cloudflare, and Banana, and Baseten, and Modal, and a bunch of different places where you can spin up a GPU when you need it, and then it shuts down or scales to zero afterwards. And there are – so depending on the size of your model, and how you implement it, the sort of cold start time or the time it takes to spin up that model and have it ready for you to use might be somewhat annoying for you, but the advantage is you’re not going to pay a lot. So you can at least try that first; there’s kind of more and more offerings in that space. But a lot of them have – like, Baseten, the Cloudflare thing, whatever it is, you’re gonna be running it in someone else’s infrastructure. So if you have like your own on-prem thing, or something like that, it’s maybe a little bit harder to deploy that sort of serverless infrastructure, because they have optimized those systems for what they are. So likely, in that scenario you’re signing up for an account on one of these platforms, and you’re deploying your model there, and then you can interact with it when you want.

[00:38:22.12] A second kind of way you could do this is like a containerized model server that’s running either on a VM, or a bare metal server that has an accelerator on it; one or more accelerators on it. So you could spin up an EC2 instance with a GPU, or you could even run this as part of an auto-scaling cluster that’s like a Kubernetes cluster, or something like that. But these would be VMs that have a GPU attached, or something like that… And they would be probably up either all the time, or they would have uptime that’s different from the serverless offerings.

And so you’d just be paying for that all the time. And in those cases, maybe you could use a model packaging system. Baseten’s Truss is one that I use, but there’s other ones as well, Seldon and others, that will actually create a model package in a dockerized way that allows you to deploy your system.
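
The trade-off between the first two categories (pay per use with scale-to-zero, versus an always-on VM) can be sketched as simple arithmetic. Every price and timing below is a made-up placeholder, not a quote from any provider mentioned in the episode, and the model assumes serverless billing is purely per GPU-second.

```python
def serverless_monthly_cost(requests_per_month: float,
                            seconds_per_request: float,
                            price_per_gpu_second: float) -> float:
    """Pay only while a request is running (billed cold-start time, if any,
    should be folded into seconds_per_request)."""
    return requests_per_month * seconds_per_request * price_per_gpu_second


def always_on_monthly_cost(price_per_hour: float,
                           hours_per_month: float = 730.0) -> float:
    """Pay for the VM whether or not it is serving traffic."""
    return price_per_hour * hours_per_month


def break_even_requests(seconds_per_request: float,
                        price_per_gpu_second: float,
                        price_per_hour: float,
                        hours_per_month: float = 730.0) -> float:
    """Monthly request volume above which the always-on VM becomes cheaper."""
    per_request = seconds_per_request * price_per_gpu_second
    return always_on_monthly_cost(price_per_hour, hours_per_month) / per_request
```

With hypothetical inputs of 2 seconds per request, 0.001 per GPU-second, and 1.0 per VM-hour, `break_even_requests(2.0, 0.001, 1.0)` comes out to roughly 365,000 requests per month; below that volume the serverless option would cost less under these assumptions.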

Is there any standardization yet in that space? Or does each vendor have its own approach?

I think each vendor has its own approach. If you look at Hugging Face, they have the TGI or Text Generation Inference project, which I think is what they use a lot to serve some of their models, and that kind of is set up differently than Baseten’s Truss, which is set up differently than Seldon’s system… There is some standardization, in that if you have a general ONNX model or something like that, there are various servers that take in that format. But the way in which you set up your REST API might be different in different frameworks. So this is a very framework-dependent thing, I would say.


Yeah. And there’s also an additional layer of choice here, not only in terms of what framework you use, but also in terms of optimizations around that. So there’s certain optimizations like vLLM, which is an open source project that not only – so this doesn’t modify the model, but it modifies the inference code so that the model runs more efficiently for inference. So this is not the sort of model optimization that we talked about earlier, which is actually changing the model in terms of precision or in other ways, but this is actually a layer of optimization of how the model is called, that helps it run faster.

So yeah, there’s a lot of choices there as well… And I think once you get to that point, and you’ve chosen - let’s say you’re using Baseten’s Truss system, and you’ve deployed your model either on a VM or in a serverless environment, or whatever system you’re using, I think then it kind of gets to these additional operational concerns about like “How do I plug all this together in an automated way? So if I push my model to Hugging Face, or if I update my inference code, how does that trigger a rebuild of my server, and then redeploy that on my infrastructure?” And that gets closer then into what is more traditionally DevOpsy infrastructure automation type of things, which is its own whole land of frameworks and options and that sort of thing. But it’s more of a standardized thing that software engineers are familiar with.
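
One simple way to picture the “what triggers a rebuild?” question is a fingerprint over the two things that can change: the model revision and the inference code. The sketch below is an assumption about how such a check could look, not a description of any particular CI system; the function and field names are invented for illustration.

```python
import hashlib
import json


def deployment_fingerprint(model_revision: str, inference_code: str) -> str:
    """Combine the model revision (e.g. a commit hash from a model hub) and
    the inference code into one stable digest."""
    payload = json.dumps(
        {
            "model_revision": model_revision,
            "code_sha256": hashlib.sha256(inference_code.encode()).hexdigest(),
        },
        sort_keys=True,  # stable key order so the digest is reproducible
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def needs_redeploy(current_fingerprint: str, deployed_fingerprint: str) -> bool:
    """Rebuild and redeploy only when the model or the code has changed."""
    return current_fingerprint != deployed_fingerprint
```

An automation job could store the fingerprint of the last deployed server and compare it on every push, rebuilding the container only when `needs_redeploy` returns `True`.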

[00:42:02.12] From my perspective, if we were to just summarize, you kind of go from model selection and experimentation, which I would say don’t spin up your own infrastructure necessarily for that… And once you figure out a behavior of a model that works well for you, then decide if you need to optimize it, to run it in the environment you need to; if so, optimize it. And then once you’re ready to deploy it, think about a model server which is geared specifically to inference with your model. And that’s the separation of concerns. And you can use a framework like one of these we’ve talked about, or you could build your own FastAPI service around it, or whatever API service you like, and deploy it in a way that is ideally automated so that you can do all the nice DevOpsy things around it.

That sounds really good. So you’ve done a fantastic job of laying everything out… I think I’ve [unintelligible 00:43:06.16] at the moment, trying to cover everything.

Be careful what you ask for, Chris.

So as we are winding up for this episode, what are some of the kind of open source go-to tools that pop top of mind for you, that you tend to find yourself going to over and over again, for folks to explore?

Yeah, I think on the pulling a model down and running it for inference, just that sort of series of things, there’s really nothing in my opinion that beats the Hugging Face Transformers library. And for people that aren’t familiar, this is not just for language models and that sort of transformers, but this is general-purpose functionality that you can use also for speech models, and computer vision models, and all sorts of models, both in terms of datasets and pulling down models, and extra convenience on top of that; there’s not really anything I think that is more comprehensive than that. And Hugging Face has a great Hugging Face course, where you can – online, if you just search for “Hugging Face course”, it’ll walk you through some of that.
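
For readers who have not used it, pulling a model down and running inference with the Transformers library can look roughly like the sketch below. The `gpt2` model name is only a small, convenient assumption for illustration (no specific model is named at this point in the episode), and the `generate` wrapper is a hypothetical helper; running it requires `transformers` plus a backend such as PyTorch, and downloads the weights on first use.

```python
def generate(prompt: str, model_name: str = "gpt2", max_new_tokens: int = 20) -> str:
    """Download (on first use) and run a text-generation model.

    model_name is an illustrative default, not a recommendation.
    """
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `generate("Deploying a model is")` in a notebook is the kind of quick experiment described here; the same `pipeline` function accepts other task names for speech or vision models.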

In terms of the model optimization side of things, I would recommend checking on a few different packages. One of those is called Optimum. It’s a collaboration between a bunch of different parties, but it allows you to load models with the Hugging Face API. So similar to like how you would load them with Hugging Face, but then optimize them on the fly for various architectures, like CPUs, or Gaudi processors, or special processors.

In terms of like quantization and model optimization of the actual model, like the model parameters, you could look up bitsandbytes (which is integrated into Hugging Face), OpenVINO by Intel… This BigDL library from Intel, which I mentioned - the readme in that GitHub repo also links to other things that people have done… So it’s nice that you can kind of explore that as well. And there are other projects, like Apache TVM and others that have been around for some time and do model optimization.
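
To make “optimizing the actual model parameters” concrete, here is a toy sketch of symmetric absmax 8-bit quantization in plain Python. It is a generic textbook technique chosen for illustration and is not claimed to be what any of the libraries above do internally; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 or 1.0  # avoid dividing by zero for all-zero input
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the 8-bit values."""
    return [q * scale for q in quantized]
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight; that precision-for-memory trade is the basic idea behind running large models on smaller hardware.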

Yup, and we’ve talked about that one before.

Yeah. And then on the deployment side, there’s an increasing number – the one that I’ve used quite a bit is called Truss from Baseten. That allows kind of packaging and deployment of models. You don’t have to use their cloud environment – you can deploy to their cloud environment if you want, or you could just run it as a Docker container, but it’s really just packaging. But there’s other ones I mentioned too, like the TGI from Hugging Face, or vLLM if you’re interested in LLMs… So yeah, there’s kind of a range there. And of course, each cloud provider has their option to deploy models as well, like SageMaker in AWS, which a lot of people use also.

So I think you’ve given us plenty of homework to go out there and explore a bit.

Yeah, there is no shortage of things to try. It can be a little bit overwhelming to navigate the landscape, but I would just encourage people - you know, that first step of figuring out what model you need to use doesn’t require you to deploy a bunch of stuff; just try it in a notebook. And once you’ve figured that out, then find a way… Even just search for like - oh, you’ve found out you want to use LLaMA 2 7 billion. Just search for – the great thing now is you can search and say like “Running LLaMA 7 Billion on a CPU”, and there’ll be a few different blog posts that you can follow to figure out how people have done that.

So just follow that path and kind of follow some of the examples that are out there. It’s not like any of us that are doing this day to day don’t do the exact same thing. When we deployed recently on the Gaudi processors and Intel Developer Cloud, I just went to the Habana Labs repo, where they talk about Gaudi, and they have an example, or whatever it was called… And there’s a lot of copy and pasting that happens. So that’s okay, and that’s how development works.

Fantastic. Well, thank you for letting me pick your brain on this topic for a while…

And like I said, I think you’re almost [unintelligible 00:47:41.10] after this one. But that was a really, really good, instructional episode. I’ll actually personally be going back over it.

Cool. Well, it’s fun. Chris, thanks for letting me ramble on. I’m sure we’ll have some follow-ups on similar topics as well.

Absolutely. Alright. Well, that’ll be it for this episode. Thank you very much, Daniel, for filling both the host’s and the guest’s seat this week. Another Fully Connected episode. I’ll talk to you next week.

Alright, talk to you soon.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
