Practical AI – Episode #259

YOLOv9: Computer vision is alive and well

get fully-connected with Chris and Daniel


While everyone is super hyped about generative AI, computer vision researchers have been working in the background on significant advancements in deep learning architectures. YOLOv9 was just released with some noteworthy advancements relevant to parameter-efficient models. In this episode, Chris and Daniel dig into the details and also discuss advancements in parameter-efficient LLMs, such as Microsoft’s 1-bit LLMs and Qualcomm’s new AI Hub.



Changelog News – A podcast+newsletter combo that’s brief, entertaining & always on-point. Subscribe today.

Fly.io – The home of — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more and check out the speedrun in their docs.

Sentry – Launch week! New features and products all week long (so get comfy)! Tune in to Sentry’s YouTube and Discord daily at 9am PT to hear the latest scoop. Too busy? No problem - enter your email address to receive all the announcements (and win swag along the way). Use the code CHANGELOG when you sign up to get $100 OFF the team plan.

Typesense – Lightning fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!

Notes & Links



1 00:00 Welcome to Practical AI 00:43
2 00:43 Keeping you fully connected 01:12
3 01:56 Edge use cases in airports 02:19
4 04:14 YOLOv9 06:48
5 11:02 Programmable gradient information 02:22
6 13:24 Purpose of reversible functions 01:39
7 15:03 Sponsor: Changelog News 01:49
8 16:52 GELAN; faster and smaller models 06:38
9 23:30 Use cases for 1-bit LLMs 01:36
10 25:06 1.58-bit? 01:47
11 26:53 Qualcomm AI 01:42
12 28:35 Keeping up with the trends 01:14
13 29:48 Local or cloud? Both. 02:12
14 32:00 This is the way / There is no wrong way / Deployment strategies 03:53
15 35:54 MLOps and DevOps 04:24
16 40:18 Learning resources 01:34
17 41:52 Outro 00:54




Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another Fully Connected episode of the Practical AI podcast. In these Fully Connected episodes Chris and I keep you fully connected with everything that’s happening in the AI and machine learning world. We’ll take some time to dig into the latest news articles and releases from the AI community, and hopefully share some learning resources that will help you level up your machine learning game. My name is Daniel Whitenack, I am the founder and CEO at Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How’re you doing, Chris?

Doing great, Daniel. How’s it going?

It’s going great. I’m spending a few weeks in the UK, which is a lot of fun, and have got enough sleep to not be jet-lagged quite as much… So that’s encouraging.

Okay, so we have a transatlantic podcast going here today.

Exactly. Worldwide.

That’s right, across the pond.

Practical AI worldwide, 21st Century Incorporated.

We need rebranding.

Exactly, yeah. Yeah. Well, Chris, one of the things I was going through - I don’t know how often people are flying these days, but one of the things that stood out to me as I took my flight across the pond was now when you board at least some flights, you don’t even give them your ticket, right? You just go up and there’s a little - I guess you would call it a kiosk, a little edge device that takes a picture of your face and matches it, I assume with what was your scanned passport, which you scanned at the time of check-in, and you board your plane, of course… And it was really, really fast as well. And the same thing happened, you know, crossing into the border into the UK. As long as you have a certain passport, you just go up to the little machine, and scan your passport, and then it takes your picture… And I’m assuming - I could do a little bit of research; I’m assuming what’s happening under the hood is that it’s matching your actual facial features up with the image on your passport, and computing some score of shadiness or something like that, or risk associated with you not being the person in the – but I was amazed at how fast it was. And I’m assuming - I could be wrong, but I’m assuming maybe some of that’s running at the edge, not reliant on an internet connection to do that facial recognition… I’m not sure if you know or if you’ve had also this experience, Chris…

I don’t know what they’re using algorithmically… But I definitely partake of the technology. It’s an area that I forego privacy, and always buy my way into expeditious processing. So yes, I’m curious.

Well, I don’t know in that case if you have a choice. Maybe there is an opt out situation or something, I’m not sure. But it’s pretty cool that some of this technology is being applied at the edge, and in a very seemingly efficient way, such that you could use it on a mass scale like that, or I don’t know if you’d consider that a mass scale, but it’s definitely in use for many – you know, there’s a huge flood of people going through those stalls, and the computation happens very quickly, and reliably enough to make a judgment.

In the midst of all the hype around generative AI, one of the things that stood out to me over this last news cycle, Chris, was the release of YOLOv9. So we’re on the ninth iteration of this YOLO model. Did you happen to see any of the videos of YOLO 9 in action, Chris?

I haven’t seen the YOLO 9 one, but I’m kind of stunned. You know, when you think about it, YOLO has been around a long time, it was occurring to me… Because we actually had some conversations about YOLO back in the very first days of this podcast, which has been, you know, closing in on six years now… So v9 is a long time coming, and we haven’t really gone back and touched such models in quite a while. We’re long overdue.

Yeah, yeah. So as everyone is freaking out and enjoying the hype over large language models and other generative types of models, Sora and all the things coming out, in the background somewhere there’s these amazing computer vision people that are just really cranking and innovating actually at the architecture level of neural networks in really interesting ways. So it might be good to set a little bit of background for this…

Chris, you mentioned we’ve been kind of talking about YOLO for some time… So if people just search for YOLO object detection, you’ll see a huge set of articles, and GitHub, and everything about YOLO. YOLO actually kind of made a splash because it processed entire images in a single pass for object detection, and bounding box detection… So if you think about – if you’ve ever seen one of those videos of like a street, with a bunch of people walking around, and cars, and dogs, and shops, and scooters, and whatever…

[00:06:10.05] With their boxes around them…

Yeah, and they have their boxes around them, and they’re labeled “person”, or whatever… That’s likely YOLO. So what happens is that single image goes into the YOLO model, and then out come the bounding boxes and the actual classification of those bounding boxes… Which is interesting, because previous models - previous to YOLO, and I’m sure some models still do this - worked in a multi-stage way, which is more computationally expensive. So they actually take multiple passes through a model, or multiple models, to compute both the bounding boxes and the classes.
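[Editor’s note: to make the single-pass idea a bit more concrete - whatever detector produces the raw boxes, a classic post-processing step is non-maximum suppression over box overlap. Here is a minimal, illustrative sketch in plain Python; this is not YOLOv9’s actual code, and the box format and threshold are assumptions.]

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """Keep only the highest-scoring box among heavily overlapping detections.
    detections: list of (box, score, label) tuples."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for det in detections:
        # Keep this detection only if it doesn't overlap too much with a kept one
        if all(iou(det[0], k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Real implementations run this per class and vectorized, but the logic of collapsing many overlapping candidate boxes into one detection is the same.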

Yeah, I remember way back when we were first starting, and I was at a different employer. I was at Honeywell, leading AI there at the time… I remember just as YOLO 2 came out, we were using that for a couple of projects that we were working on way back in the day… But that’s like before dinosaurs roamed the earth by AI standards. But yeah, way back.

Yeah. And I think we even have a podcast episode maybe about Fast R-CNN, or whatever it’s called - the fast version of R-CNN…

We did. You have good memory.

Yeah, yeah. [laughs] That one’s cool. I mean, that one – I think how that one worked was you pass your image in, and then it detects the bounding boxes of objects, and then in a second pass it classifies each kind of subsection of the image as its class, which also is very effective, but it’s less efficient computationally than the YOLO kind of single-pass thing. And as you mentioned, there have been multiple versions of this… So between YOLO and now - version two, version three, all the way up to version nine - each version of these, and not just in the kind of trained-with-more-data way, has actually made very significant discoveries and improvements in neural network architecture and training methodologies, the sort of thing that has led it to be the go-to solution for at least real-time object detection in images… Which is why you see all these videos of the bounding boxes around people, and such.

They’ve at least gotten the visual bit a little bit nicer than they used to, where you had the big clunky boxes overlaying everything.

Correct. Yeah. Well, the v9 version of the project dropped – at least if the date on the arXiv article is right, that would have been the 21st of February of 2024, as we’re recording this… So not that long ago. It was developed by an open source team and kind of built on top of a codebase from Ultralytics YOLOv5, and the code, I believe, is released under the GPL-3 license. But it seems like what they focused on with YOLOv9 was a continued focus on efficiency, to where you can do real-time object detection, meaning that as the frames of a video are coming in, you can process those in real time with the model. So efficiency is really key in these types of applications.

[00:09:39.10] And then they focused on one of the fundamental challenges of deep learning models, of these deep neural network models, which is called the information bottleneck principle. If you think about a neural network, what it is is a big data transformation, right? You take a bunch of matrix data in the frontend, maybe representative of an image, and that gets processed through successive layers of processing. And then out the other end comes maybe an indication of classes, or other things.

The information bottleneck principle describes the loss of information that occurs as you process an input through the successive layers of the feed-forward pass of that neural network… Which in some ways can be addressed by having bigger networks and more data; maybe then you’re less prone to these informational problems. But it’s more of a problem when you’re dealing with these very efficient, lightweight networks, like the YOLO networks, because you have fewer layers to work with, and you don’t want to lose any information that might be relevant to the classification of the outputs.

I notice within the YOLOv9 docs they talk about reversible functions as well. Does that feed into – no pun intended… Does that feed into the ability to not lose data by reversing that feed-forward pass through a function backward? How do you see that utility?

Yeah, so the interesting way that they dealt with this, or kind of addressed this at least in this version of the model, is something that they’re calling programmable gradient information, or PGI. And the PGI portion of their research and advancement relies on a couple of things, but one of the main things is this focus on, again, improving the informational efficiency of the network. And one of the ways that they’ve done this is with what they call an auxiliary, reversible branch. And this gets to these reversible functions that you mentioned.

So the concept of a reversible function, for those that maybe that’s new to them, means that the function and the inverse of the function can transform data without the loss of information. And so again, there’s that loss of information piece there.

And so, it’s a little bit hard to describe this on the podcast without having a whiteboard or a visual, but if you think about this PGI functionality that they’ve added into the network, it’s kind of like they’re bolting on this auxiliary reversible branch, which helps deal with this information loss as gradients are calculated during the training process. And so during the training process, this reversible branch helps not lose that gradient information, as during the forward pass and during the calculation of the updates of the weights of the model. And that helps it be very efficient during the training process, but it’s called auxiliary, which is key, because you can actually unbolt it and take it off for inference, which means… I think part of the problem in the past with these reversible branches and efforts at this were they helped with the information loss, but it also decreases the efficiency in terms of computational efficiency of the model during inference.
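[Editor’s note: as a concrete illustration of a reversible transform - this is a generic additive coupling block in the spirit of RevNet-style networks, not the actual YOLOv9 auxiliary branch, and the sub-functions `f` and `g` are arbitrary stand-ins - here is a sketch showing that the inputs can be recovered exactly, with no information loss.]

```python
import numpy as np

def f(x):
    # Stand-in for an arbitrary sub-network; any function works here,
    # because reversibility comes from the coupling structure, not from f.
    return np.tanh(x)

def g(x):
    return 0.5 * x

def forward(x1, x2):
    """Additive coupling: the output pair determines the input pair exactly."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2):
    """Exact inversion: undo the two additions in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

The point is that (y1, y2) carries exactly the information of (x1, x2); nothing is discarded by the transform itself, which is why gradient information can survive such a branch during training.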

I’m gonna throw a question at you, and I realize this is not your thing, but just in case… As you’re using a reversible function in that programmable gradient information process that you’re talking about - and in a normal feed-forward network you’re maintaining the weights as they’re going through, and those are changed… Are you reversing functions to maintain that back in the same space, to where you’re actually maintaining a new weight and you’re keeping that gradient information for maybe future feed-forward passes? Or do you have any sense of what the purpose of that is?

Yeah, I think that – so definitely we’ll link some of the papers and the explanations in the show notes, so feel free to look at that for accurate information, and let us know if we get it wrong… But yeah, I think that the idea is that – and the reason why this is especially useful in the training side of what they’re trying to do, and it’s kind of unbolted during the inference side, is that during the training time it’s really crucial that as you’re calculating the updates to your weights, you can do that in a very informationally-accurate, precise manner, especially for these lightweight networks, which have fewer parameters to train… And so maintaining that information, especially as you’re calculating updates based on the gradients is really important.


Break: [00:14:55.01]

We talked a little bit about YOLO version 9’s Programmable Gradient Information. I had to remind myself - PGI, Programmable Gradient Information. The other piece of the architecture - and I think this is just really interesting… You’ve got all of this going on on the LLM side, where there are very interesting ways to fine-tune and preference-tune, and all these families of models… On the computer vision side - man, they’re really, really thinking deeply about the architectures going into these models, which has made them so, so efficient.

The other thing that kind of is a combination of things that have come in the past that they’re utilizing in this YOLOv9 is a Generalized ELAN architecture. So this is kind of a progression of a couple of things that have been in YOLO models in previous generations, but they’ve combined them in kind of a unique way. It stands for Generalized Efficient Layer Aggregation Network, or GELAN… And this combines a couple of things from previous generations of YOLO, and from things like CSPNet. This has to do with how features are aggregated, and gradients are aggregated through the model in a very efficient way… Again, leading to a very parameter-efficient model, meaning a smaller set of parameters in YOLOv9 will have similar performance to maybe models with many more parameters. So this leads to the efficiency overall.

It’s pretty interesting, they talk about being able to adapt to a much wider range of applications without sacrificing speed, or accuracy… Is that a form of fine-tuning the model, or something that they’re doing ahead of time that you’re then fine-tuning on top of that?

At least how I read some of that flexibility was yes, this is a parameter-efficient setup for fine-tuning to a variety of types of scenarios, or even training a new model from scratch in an entirely new domain, and doing that very efficiently. And some of the things that I’ve seen - you know, people have already quantized this model, using things like OpenVINO, which is very popular for these kinds of edge vision cases… And running this very efficient, real-time object detection on even desktop or laptop CPUs.

So the new architecture developments are both geared towards - yeah, that efficiency, but also squeezing every ounce of performance out of parameter-efficient models, both in terms of training and flexibility across different use cases.

Yeah, I think there’s great applications for this on the edge, where you’re not in one of the giant clouds, with essentially - if you’re willing to pay for it - infinite compute available to you, whether it be training or for inference, either way. So the fact that this can run on just about anything – I mean, back in the early days we could do YOLOv2 on smaller equipment, but it didn’t run smoothly. You’d have points where it would overwhelm the computational cycle, and so it’s nice seeing something like this has come this far. It’s quite an open source library.

Yeah. And there’s a link that we’ll add into the show notes, which includes a notebook for running YOLOv9 in a Colab Notebook, even, like I say, on CPUs. So in terms of the efficiency, one of the things that I saw was YOLOv9 operates with 42% fewer parameters, and 21% less computational demand than YOLOv7, yet it achieves comparable accuracy. So you know, it was already fairly accurate, and kind of an industry standard, but now with far fewer parameters. And I think that that is definitely a trend that we’ve been seeing not only in computer vision, but in other cases, where you see things like Ollama, or other things, Llama.cpp, that are allowing you to run large language models on a variety of hardware, including just on your local laptop… And you know, quantization type of libraries, like Bits and Bytes, and Optimum, and BigDL, and these libraries that allow you to run maybe 7 billion parameter large language models, or other generative AI models, but in lower precisions, so that you can run them on a variety of hardware, optimize them for a variety of hardware.
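[Editor’s note: as a rough sketch of what these quantization libraries are doing at the simplest level - this is generic symmetric per-tensor int8 quantization, not the internals of any particular library mentioned above.]

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float weights onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale
```

Storing the int8 codes plus one scale per tensor cuts memory roughly 4x versus float32, at the cost of a small, bounded rounding error per weight; real libraries add refinements like per-channel scales and calibration.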

[00:21:50.21] We also had Neural Magic on the show a little while back now, who has a set of libraries for optimizing models to run on CPUs… And yeah, so there’s a lot of kind of precision and quantization that can happen even on top of the use of these parameter-efficient models.

One of the interesting things also that I saw this last news cycle, which at least in the circles that I run in, with large language models, people were talking about a lot - which is this release from Microsoft, or a paper from Microsoft, that I think is titled something like “The era of 1-bit LLMs”… Which is interesting, because a lot of people have talked about going from maybe Float32 to Float16, and 8-bit, and 4-bit precision, that sort of thing… And this kind of brings in this idea of 1-bit LLMs, with this architecture BitNet… And so I found it interesting that we got both YOLOv9, but now on the LLM side comes this 1-bit architecture… And it seems like a similar thing is happening - I don’t know if you remember back when we were talking about R-CNN and some of the larger computer vision models; we’ve seen the progression to more and more parameter efficiency and flexibility across deployment scenarios… And now we’re seeing that maybe in a more rapid way with LLMs, and this 1-bit LLM, but also all the other quantization and that sort of stuff that we’ve seen on the generative side.

Do you have any sense from an application standpoint, like where you might go with these 1-bit LLMs, like what are some of the use cases that come to mind for you?

Yeah, I think it’s interesting… So this 1-bit LLM that was released - they talk about it having similar performance to a model of the same parameter size, but more computational efficiency, because of course, these parameters or bits are actually not just zero and one - and we can talk about that here in a second - but more computational efficiency. So I think that this is really interesting for cases where you do want to run maybe an LLM on an edge device, in a scenario like think about disaster relief, and you have a device out in the field that’s giving help to first responders, or something, giving them information, or processing information from training documents or something, and you’re using an LLM to provide answers… It’s likely a very spotty internet connection in that case, and so having something that could run on device in a variety of scenarios would be quite relevant.

So one scenario would be lack of connectivity. I think another scenario would be very latency-sensitive scenarios, where you want a response very quickly. You don’t want to have to rely on network overhead or things going out of a network that you’re operating in for security reasons; that sort of thing might be a good use of these.

Yep, that sounds interesting. They have a term in here that I’m curious about… Referring to BitNet, they talk about it being a 1.58-bit LLM. And in the paper on Hugging Face they note that all large language models are in 1.58 bits. Do you have any comment about that, what that means?

The reality is if you – I think they talk about this in the paper; if you go down to a truly 1-bit LLM, each weight of your model is either zero or one, right? Then yeah, you would expect to lose a lot of information that might be important… And so they make a slight compromise in here. Maybe it’s unfair to call it a compromise. They make an astute move from binary weights, in other words zero or one, to what they call ternary weights. So each weight can take one of three values: -1, 0, or 1. And because a three-valued weight carries log2(3), or about 1.58 bits of information, that’s where they get this 1.58-bit figure.

So this is also why they released this new type of architecture that processes these ternary weights… And that’s presented in the Microsoft paper. But yeah, I think this is only the latest… My prediction would be that we’ll see many more things like this, where people are trying to be parameter and compute-efficient with large language models.
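[Editor’s note: a sketch of the idea, based on the absmean quantization described in the BitNet b1.58 paper - treat the details here as an approximation rather than the exact implementation. Each weight is scaled by the mean absolute weight, rounded, and clipped to one of three values, and log2(3) ≈ 1.58 is where the name comes from.]

```python
import math
import numpy as np

def ternarize(w):
    """Absmean-style ternary quantization: scale by the mean absolute weight,
    then round and clip every weight to -1, 0, or 1."""
    gamma = np.abs(w).mean() + 1e-8  # small epsilon avoids division by zero
    return np.clip(np.round(w / gamma), -1, 1), gamma

# A three-valued weight carries log2(3) ≈ 1.58 bits of information:
BITS_PER_WEIGHT = math.log2(3)
```

The practical payoff discussed in the paper is that matrix multiplies over {-1, 0, 1} weights reduce to additions and subtractions, which is where the computational efficiency comes from.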

We’ve seen models getting more and more efficient and more compact over time… And as we’re looking at so many smaller, very capable models being used out on edge devices, do you envision something like this, where they’re really targeting efficiency in terms of being able to do that in something like small electronics? Or is that a little bit overly ambitious for where this might take us in a reasonably foreseeable future?

Yeah, it’s actually a good question, because one of the things that we saw also – I don’t know if it was this week, but recently at least - was Qualcomm’s announcement and release of a huge number of models (I forget how many) on what they’re calling the Qualcomm AI Hub - models that run on device, on their Snapdragon processors and other things, at the edge, on small devices.

So these wouldn’t be like the small devices of like a microcontroller, or something like that… There’s still a good bit of power in these processors… But it is super interesting that Qualcomm has made the effort to make these types of models, whether that be object detection, or large language models, or other things available in optimized forms to run on very small devices. And I think it’s a trend that we’ll keep seeing.

Break: [00:28:22.25]

It seems like in computer vision it took maybe the five or six years we’ve been doing this podcast, and over that time we’ve seen computer vision models shrink down and down, and become faster and more parameter-efficient… It almost seems like that’s happening much faster on the large language model side, and the generative model side… It’s like it shrunk from five years to one year where a lot of that’s coming out for on-device usage…

When we and the rest of the Changelog team are looking at what content to bring onto the show, and there are various guests, and there are all sorts of topics and advancements going out… It’s become quite challenging to narrow it down to just what we can cover in these shows… And largely that’s because of what Daniel was just saying - that tremendous acceleration in the advancement of this technology is very hard to keep up with and report on, especially trying to figure out what folks are most in need of hearing, or being pointed to. So on any given week, which of the dozens of things that are happening do you want to do…

[00:29:45.22] And I would say for those out there listening, in this episode we’ve talked a lot about parameter-efficient models, whether it be the Qualcomm AI models, or the 1-bit LLMs, or YOLO, and running these on device, and at the edge… It might be natural to think “Oh, the news cycle has totally switched to local models, running all the models locally, and that’ll solve all the problems.” And I think the reality is in the future it’s going to be kind of both/and. You’re not going to serve – let’s say that you integrate a model into some social media application, or whatever mobile application, or you’re serving a web app, and it’s got some AI integration, or something like that. It’s very unlikely, I think, that you’re gonna want to serve up millions and millions of requests using only local models… And in the same way, if you’ve got an enterprise batch use case, and you want to process 1.5 million documents through a large language model, you likely don’t want that running on your Mac M2, or something like that. That’s not the deployment strategy for that scenario. But yet, you will see a lot of models running at the edge, or locally. And I think the reality is that we’ll go into kind of a both/and sort of scenario, where yes, a lot of things you’ll be able to run locally… But the same as like – I mean, you can run a lot of software locally, but it doesn’t mean that you’re also not running software in the cloud. You know, AI is just a new layer in your kind of software stack. So we’re gonna run it locally, and we’re gonna run it in the cloud.

That’s exactly right. That was where I was gonna go; anyway, you just hit it. And that was – it’s following the maturity trend of software. And just as we have huge software systems that you can only run in the cloud, and are massive scale, and you have apps on your phone, and you have also very small micro-electronics which have even smaller software functions on them integrated in, maybe in the BIOS… All these different areas, and we’re seeing models doing the same thing.

One of the things that we’re often asked to address, and we have done repeatedly over the years, is what’s the current way to do training and deployment? And I think, to your point, Daniel, there are now – now that we’re maturing rapidly in this industry, there are many ways; there’s not one right way to do it anymore. It’s kind of figuring out your use case, figuring out what mixture of different model types need to contribute into that, and what the architecture for all those models and how they communicate through the software, and what hardware is available to them… So it’s become quite complicated. There’s no longer the way, you know, to borrow the Mandalorian saying; it’s now many ways. Do you have any thoughts on how people might approach that? How do you think about it when you’re doing things in Prediction Guard and trying to help your customers move forward?

Basically, you kind of have to split things up a little bit by the stage of your project, and also the use case that you’re considering. What I mean by stage of your project is - let’s say that I want to summarize news articles related to stocks that I want to trade, or something like that. The very best thing you can do is not jump right to “Okay, I’m gonna fine-tune a model for that, or spin up some crazy GPU infrastructure, or something like that.” The best thing you can do is just get some off-the-shelf models, and if they’re small enough, the easiest cloud way to run them is in just a Colab notebook, or a hosted notebook environment like that. That’s more than enough to figure out if they’re going to work for your use case.

[00:33:58.06] Or if you want to go the more local deployment route, there’s things - like I already mentioned - you know, of course, if you want to run YOLO, that’s easier now than ever, and there’s quantized versions of that that you can run even on a CPU. You don’t even need a special type of hardware. But then for the generative side of things, there’s things like Ollama, and LM Studio, and Llama.cpp, and these things that will allow you to prompt models and figure out if they’ll work for your use case locally.

So that’s kind of the exploration stage. Then you have to decide - okay, well, if this project is a work project, and maybe I’ve prototyped this and figured out it might work… Then you kind of have to play through the scenarios in your mind. “Oh, if this is a mobile app, and I’m processing customers’ private data, maybe it makes sense to try to run a model at the edge, in my mobile app, on their device.” A Qualcomm AI model from their AI Hub, on their mobile device - that would be really good. But if it’s a web application, and there’s not as aggressive of a security posture, probably you want to figure out how you’re going to run and host that model in a way that makes sense to you, even from a public endpoint that’s just a product, like Together AI, or Mistral, or something like that… Or you’re going to figure out how to run it in a secure local environment, with either a product that can host that model in a secure environment in your own cloud, or in your own network, or your own kind of self-deployment of that model, using things in your cloud infrastructure like SageMaker in AWS, or other things like that.

Yeah. It’s increasingly – it’s becoming part of the software and your larger architecture. We’ve seen in the recent couple of years especially the strong rise of MLOps, which kind of corresponds to DevOps in terms of deployment and all those things… Do you tend to think of it in more of an integrated way? Or do you still at this point in time, as we’re in 2024, think of it as separate approaches from the software? How do you parse those two sides of that coin?

It’s interesting, I think at least in my own mind I tend to separate them out, maybe depending on some of what’s involved in a project. So if it’s the use of a pre-trained model, I think the burden is a lot more on kind of the traditional DevOps monitoring, testing, uptime, automation deployment, that sort of thing… Because likely, you’re just interacting with the model via an API, like you would integrate any other API. Now, there’s certain things that can help you, like versioning prompts, and testing for data drift, and that sort of thing… But it’s not so dissimilar, I would say, to your traditional software development.

Whereas if you really have a unique scenario, and you’re fine-tuning a model for that scenario, you’re likely going through multiple iterations on curating your dataset, on training your model, on evaluating your model, on versioning your model, releasing it to your model servers, updating it with new data that comes in… And I think some of that specific MLOps type of software will likely appeal to the people that are doing that process, who are usually data scientists, and not software engineers. And versioning your model, versioning your data, evaluating your model - the way that those systems are set up, like Weights & Biases, or ClearML, and these types of things, are quite useful when you’re training a model like that. So I think MLOps is alive and well, but I also think that with the rise of this kind of API-driven AI development, a lot of that does, or can, fit into more of the DevOps side of things.

[00:38:18.04] Yeah. When you’re using an API that somebody else is hosting, maintaining, has fine-tuned, all that, you’re basically using it as a service like any other service that would not be AI. And so you just treat it as an API along the way.

Yeah, yeah. And where that’s maybe slightly different is you are getting kind of some variability out of that API, both in terms of performance and latency, which are maybe common across software projects… But also in terms of the performance output of the model, especially if you’re using a closed model product, like an OpenAI, or Anthropic, or something like that… They’re making improvements to their underlying model under the hood all the time, and it is really more of a product. It’s not just that you’re hitting the model; there are layers around the model, which are product layers, that can influence the behavior of that model. I mean, you just kind of look at what’s happened with Gemini over the past three or four weeks. We don’t need to get into all of the details of that. If people want to look it up, they can.

But I think a lot of those issues that that product had were actual product issues that were at the product layer surrounding the model. Not performance necessarily, or biases in the actual model, but in the filters around the model, and how things are modified in and out of the model… And so that actual product that you’re interacting with can really cause small changes in how things go into the model on the product level, and can make huge changes in the quality of the outputs of the model.
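The “product layers around the model” idea can be sketched as pre- and post-filters wrapped around a model call. The model here is a stub, and the specific filters (redaction on the way in, an appended marker on the way out) are made up for illustration – real products wire this pattern to a hosted API.

```python
# Hedged sketch: product-level filters modifying what goes into and
# comes out of a model, without touching the model itself.
from typing import Callable


def fake_model(prompt: str) -> str:
    # Stand-in for the underlying model call.
    return f"MODEL({prompt})"


def with_filters(model: Callable[[str], str],
                 pre: Callable[[str], str],
                 post: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model with product-layer input/output filters."""
    def wrapped(prompt: str) -> str:
        return post(model(pre(prompt)))
    return wrapped


# Hypothetical filters: redact a term on the way in, tag output on the way out.
guarded = with_filters(
    fake_model,
    pre=lambda p: p.replace("secret", "[redacted]"),
    post=lambda out: out + " [reviewed]",
)
```

Small changes in those `pre`/`post` layers can produce large changes in what the user sees, even when the underlying model is untouched – which is the point being made about product-layer issues.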

That sounds like some pretty good practical AI advice right there… I think for me at least that very much helps me to kind of contextualize the different things that we may be doing at work for myself, and as we’re making choices and decisions in how we’re going to tackle different problems… So I appreciate you sharing that guidance there.

Yeah. And I guess we’re talking about the MLOps side of things, and we’ve talked about practicalities of deployment schemes, and quantization, and all of that this episode… And in terms of a learning resource for people, if they want to dive into some of this, there are a lot of great ones out there. One is to follow the MLOps Community podcast, which is a podcast that Chris and I love, and have collaborated with over time. Demetrios, shout-out to the great things you’re doing.

Funniest guy in AI.

Yeah, check out everything that they’re doing over there. I also ran across this Intel MLOps professional certification from Intel. If you just search for Intel MLOps certification… This is totally free, as far as I can tell. There are seven modules and eight hands-on labs, covering software solution architectures for machine learning and AI, API and endpoint design, principles of MLOps, optimizing the full stack… So it really seems to be a good set of things to look at if you’re wanting to think more about the practicalities of these deployments and other things.

Alright, sounds good. Well, thanks for sharing your wisdom again today. Really good episode. I guess I’ll see you in the UK for the next few weeks to come.

Sounds good. Yeah. Thanks, Chris. We’ll see you soon.

See you later.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
