Practical AI – Episode #285

AI is more than GenAI

get Fully-Connected with Daniel Whitenack

GenAI is often what people think of when someone mentions AI. However, AI is much more. In this episode, Daniel breaks down a history of developments in data science, machine learning, AI, and GenAI to give listeners a better mental model. Don’t miss this one if you want to understand the AI ecosystem holistically and how models, embeddings, data, prompts, etc. all fit together.

Sponsors

Speakeasy: Production-ready, enterprise-resilient, best-in-class SDKs crafted in minutes. Speakeasy takes care of the entire SDK workflow to save you significant time, delivering SDKs to your customers in minutes with just a few clicks! Create your first SDK for free!

Chapters

1. 00:00 Welcome to Practical AI (00:48)
2. 00:48 Sponsor: Speakeasy (00:53)
3. 01:47 Show rundown (00:57)
4. 02:44 AI ≠ Generative AI (00:51)
5. 03:34 Tour of AI/ML history (01:11)
6. 04:45 1st phase of ML (01:27)
7. 06:12 Parameterized software functions (03:30)
8. 09:43 Forming the model (03:16)
9. 12:59 Foundation models (06:24)
10. 19:23 1st aside (03:26)
11. 22:49 2nd aside (02:27)
12. 25:16 Most recent phase (03:36)
13. 28:51 Current state of AI (03:15)
14. 32:06 Still efficient (01:40)
15. 33:46 Role-specific models (01:15)
16. 35:01 Combination of models (01:06)
17. 36:06 It's not fairy dust (01:13)
18. 37:19 Wrapping up (01:58)
19. 39:17 Outro (00:46)

Transcript

Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the founder and CEO at Prediction Guard.

And I would normally be recording with my co-host, Chris, but he has had to be out for a couple of weeks… And during this time I’ve been doing a bit of teaching at different conferences, and I’ve kind of homed in on a set of interesting learning materials that I thought would be a great thing to share here. We have these episodes normally called Fully Connected episodes, where Chris and I keep you connected to everything that’s happening in AI news and trends, but also try to provide some learning resources… And what I’ve found as I’ve gone to different conferences and given workshops is that sometimes there’s a bit of confusion around two things. One is the assumption that AI equals generative AI… And that’s not surprising, because there’s so much we hear about generative AI - even on this podcast we’ve of course talked about it a lot - so there’s this association that anything AI is generative AI. The other is the notion that generative AI kind of popped out of nowhere, when in reality there was a progression towards generative AI, and it fits within a landscape of AI and machine learning and data science that has been going on for some time.

So I thought it would be good in this particular Fully Connected episode or this episode where it’s actually just me - sorry to disappoint those out there that don’t want to hear more of my voice, but it’s just me today… And I’m gonna go through some of this kind of distilled set of learning resources that kind of set AI within the context of a wider set of methodologies and a history of development. And then also maybe help us understand the landscape of AI methodologies beyond generative AI, and how those things are still used quite a bit, and might even be combined with generative AI in very interesting ways.

So that’s what we’re going to embark on today, is a little bit of a tour of AI machine learning history, setting generative AI in that history, and along the way we’ll talk about kind of the practical things and the knowledge that you might want to go out and research more if you’re interested in any of these different kinds of techniques that are still used quite widely outside of generative AI.

So let’s go ahead and get going. There’s kind of a first generation or first phase, in my mind, of what I think of with AI and machine learning, and that’s roughly 2017 and prior… Or, since I’ve only been alive and working professionally for a certain period of time, I kind of think of this as the 2010 to 2017 period, which I would label the data science/statistical machine learning period. This is the time when I first started getting into data science, when I came out of academia… And this phase of data science and machine learning was really focused on kind of small-scale model building, and models that were sometimes neural networks and sometimes weren’t; they might be gradient-boosted machines, or decision trees, or other things like this.

And during this phase of data science and machine learning development, the primary role of those working in the field was to curate a set of example inputs and known outputs. We talked about this a little bit in a previous episode, “It’s all about the data” - this would be training data, where we have example inputs and known outputs. And then we also have a software function, which is parameterized. And what I mean by parameterized is - think about what a software function does. A software function is a data transformation. I have inputs to that software function, and I have outputs of that software function.

[00:06:30.27] And if you think about some software functions, they could be parameterized, or they could have parameters. So let’s take the example of object recognition. The software function that executes that data transformation of an image in and a set of labels out - or maybe just a binary output, “Is there a cat or is there not a cat in this image?” - that’s a data transformation from an image to a label. And inside of that software function, we could think about a really dumb software function that could execute this. For example, think about all of the pixel values of that image. I put that image into my software function… I could just calculate the percentage of red pixels in that image, and if that percentage is greater than or equal to a certain number, I could say it’s a cat. And if it’s less than that number, I could say not cat. That’s a really terrible function to do this data transformation, but it would operate in software.

Now, me as a developer, I could choose an appropriate value for that parameter, the parameter that is the percentage of red pixels. And that might be expert knowledge that I encode into the system, but I would hardcode it. But maybe I don’t know the exact value for that parameter. So what am I to do? Well, as a software engineer, as a developer, I could actually execute a separate process that’s iterative, and try all of the parameter values that are possible, all of the percentages that are possible. And because I have a set of example inputs and example outputs, I could try each of those parameter values against all of my example inputs and outputs, and just choose the one that gives me the best results - in other words, the most accurate results. And then I would set that as the ideal parameter, which was found or fitted or trained. And that’s what’s happening in a training process in this kind of 2010 to 2017 statistical machine learning, supervised learning way of going about things.
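To make that concrete, here’s a minimal sketch in Python of that deliberately dumb red-pixel classifier and the brute-force “training” loop just described. The array layout and names like `images` and `labels` are assumptions for illustration, not a real system:

```python
import numpy as np

def predict(image, threshold):
    """Parameterized software function: image in, label out."""
    # Assumes `image` is an HxWx3 array; channel 0 is red.
    red_fraction = (image[..., 0] > 128).mean()
    return "cat" if red_fraction >= threshold else "not cat"

def fit(images, labels):
    """The iterative 'training' process: try candidate parameter
    values and keep the one with the best accuracy on the examples."""
    best_threshold, best_accuracy = 0.0, -1.0
    for threshold in np.linspace(0.0, 1.0, 101):
        accuracy = np.mean(
            [predict(img, lab_threshold := threshold) == lab
             for img, lab in zip(images, labels)]
        )
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold  # the found / fitted / trained parameter

# Inference (prediction) on a new input, using the fitted parameter:
# threshold = fit(train_images, train_labels)
# print(predict(new_image, threshold))
```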

Now, I’m being a little bit general here… Obviously, there’s a whole variety of other methods that are maybe unsupervised methods… If you’re curious about those, you could look at clustering, and there’s kind of semi-supervised methods, and other things. You can look at those, and I’ll talk about actually some of that here in a second. But generally, that’s how the process would go. I would choose a function that would be parameterized, I would put in a set of example inputs and known outputs to a training function that’s iterative, and I would find the ideal set of those parameters. And then once I’ve found the ideal set of those parameters, then I can use that ideal set of parameters in my function to process new inputs, which is called inference, or prediction.

Now, if you think about what the model is in this case - the quote-unquote model, the machine learning or statistical model here - it would be the combination of the software function and the set of parameters that are needed to execute that software function for inference or prediction. Those together, in my mind, form the model. You need both of them. And that’s why people are sometimes confused about how to license models - some people might use licenses for code, some might use licenses for data, and others might make up their own, because they don’t know which one is right.

[00:10:23.03] So this is kind of that 2010 to 2017 era of data science, machine learning, and this is still used widely. If you’re curious about this, you might listen to the episode about Broccoli AI, AI that’s good for you… This was Bengsoon, who created this sort of model, an NLP classification model for a real use case. And this fits a variety of use cases…

So time series forecasting, where you have a time series, and then you’re trying to predict future values. Or you have images, like I said, and you’re trying to predict labels. Or you have text input, you’re trying to classify that into spam or not spam… It’s a whole variety of things around classification, which is that labeling… Regression, which is a prediction of a continuous value, like a score, something like that… And so there’s a lot of these methods that are still used very widely.
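If you want to see what that first-generation workflow looks like in code, here’s a hedged sketch with scikit-learn - a spam/not-spam text classifier of the kind just described. The training data is a made-up toy example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Curated training data: example inputs and known outputs (toy data).
texts = ["win a free prize now", "meeting moved to 3pm",
         "claim your cash reward", "lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]

# A classic parameterized function: text features + logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # the iterative training/fitting process

# Inference on a new input:
print(model.predict(["claim your free prize"]))
```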

If you want to think practically about this, most of these models, especially the smaller-scale ones, might not even need a GPU to train them or to execute the inference, at least at many scales… But as the model grows, of course, it’s harder to execute. And in particular, during this phase neural networks were trained - and neural networks have been around for a while - and a neural network is just a function that has a bunch of sub-functions in it that take input, again executing a data transformation… But the goal of this sort of neural network is to model more generally a complicated relationship from input to output, without you having to encode a lot of your expert knowledge into the architecture of that particular function from the start. It’s more generalizable, maybe.
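As a rough illustration of that “function made of sub-functions” idea, here’s a tiny neural network sketched with PyTorch; the layer sizes are arbitrary assumptions (say, a 784-pixel image in, two label scores out):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 128),  # sub-function: 784 inputs -> 128 hidden values
    nn.ReLU(),            # nonlinearity, so it can model complicated relationships
    nn.Linear(128, 2),    # sub-function: hidden values -> "cat" / "not cat" scores
)

# The parameters that a training process would fit:
print(sum(p.numel() for p in net.parameters()))  # 100,480 + 258 parameters
```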

So this is this first phase… Again, keeping things practical, in this phase there is a practitioner that’s oftentimes curating data, working with domain experts… They might be using an MLOps platform like Weights & Biases, or maybe a larger platform like Databricks, to train models often, update them often, store them in a repository, monitor them… And so there’s a lot of iterative model training, model monitoring, all of those things, that still goes on quite a bit in industry.

So that gets us to around the year 2017. Now 2017, there’s some interesting things that happen… And that has to do with foundation models and transfer learning. So in that first phase that I talked about, most of what I was talking about was training a model from scratch. In other words, you have a software function with a set of parameters. Those parameters need to be fit, you use a training process to train those… And oftentimes you might start those parameters out in that training process, either in a random sort of way, or maybe according to a specific distribution… But ultimately, you’re kind of starting from scratch. You don’t have a great place to start with those parameters.

In this foundation model or transfer learning world - if you remember, Google released BERT around this time, in 2018, and others released other models… And I think what people found was, if you’re doing these kinds of tasks generally, with maybe image inputs, like with a YOLO model, or text inputs, like with a BERT-based model, a lot of these tasks were very similar one to another.

[00:14:25.10] So if you think about doing object recognition, maybe Google trains a model that recognizes 10 different classes in images: an airplane, and a car, and a dog, and a cat, and et cetera. And you maybe want to do something specific to your domain or your company. Maybe you’re an agriculture company, and you want to take pictures of bugs on plants, and classify those to keep track of what bugs are in your field, or something like that. Well, that’s a very similar task to the general object recognition task that this model was already trained on.

And so the key piece, or one of the key shifts that we saw during this period of foundation models and transfer learning, was that a large player like Google, or a big tech company, or a set of academics actually, might train a very large-scale model - meaning they might have millions of example inputs and outputs, and the model or the function itself might have millions of parameters… And they train that model on a huge amount of data, and output an ideal set of parameters that’s a really, really good starting point for anyone downstream that wants to carry on the training, in a process called fine-tuning, to produce their own domain-specific model.

Now, this fine-tuning process that we’ve been talking about - you would need a very small amount of data to continue the training or fine-tune that model, as compared to the original dataset that trained the model from scratch. And if you want some intuition about this, think about the size of the model - one with millions of parameters, or even tens of millions, or hundreds of millions, or billions of parameters. You need a lot of data to fit each of those parameters. If you have billions of parameters and 10 data points, that’s not very much data to train on; 10 data points might not be enough to train even a linear regression model, let alone billions of parameters.

So you need a lot of data to train this larger model, which could then downstream be slightly adapted or slightly tweaked with a smaller set of data for your domain, to produce a domain-specific model or a model that’s specific to your use case.
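As a hedged sketch of what that looks like in practice - assuming a PyTorch/torchvision setup, and using the bug-classification example above as the made-up domain task:

```python
import torch.nn as nn
from torchvision import models

# Start from parameters already fitted on a huge general dataset.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Optionally freeze the general-purpose representation...
for p in model.parameters():
    p.requires_grad = False

# ...and swap in a new head for your domain, e.g. 5 classes of crop bugs.
model.fc = nn.Linear(model.fc.in_features, 5)

# Fine-tuning then trains (at least) the new head on a small domain dataset:
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...followed by a standard training loop over your small labeled dataset.
```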

But the mechanics of this are not dissimilar at runtime. You still have your new inputs come into your adapted model, even though it’s now a fine-tuned model, not a model you trained from scratch… And you process those new inputs through the data transformation to get the outputs.

There is maybe a significant difference here practically though, because these models are meant to be more general and be applied to many different use cases. Now those models are much bigger, the software functions are much bigger, but more importantly, the number of parameters that go into those software functions is very large; maybe even gigabytes large or more.

[00:17:55.05] So when you execute that many operations on that sort of input, it’s a lot of matrix multiplication, a lot of processing and data transformation… And so this is where we really start to be constrained by capacity, and to need GPUs or other specialized processors that help us do these very many calculations very fast… Because we want to recognize an image in real time, or close to real time, not waiting seconds or even minutes to process that image. And so practically, there’s a need for a different generation or set of hardware to run these larger models, which were trained as larger models to handle a general set of use cases, and then followed on with transfer learning after that.

And this is also still quite widely used. Data scientists and data analysts, or machine learning engineers, still use models like this and do things like this in practice. You’ve probably heard of fine-tuning, whether that’s fine-tuning a large language model or fine-tuning a non-generative type of model, which we’ll talk about here in a second. This process of fine-tuning is used very frequently, because it requires you to curate much less task-specific training data.

Now, I want to go over a couple of asides here that are important to note before we hop to the latest generation of models. These asides are subtle, I think, but important for a general understanding of this flow of thought, and where things fit into the big picture. The first of these asides: let’s think about why this foundation model or transfer learning process actually works.

Well, as it turns out, if you look at one of these models, like a BERT model or an object recognition model, a lot of the processing of that software function - you can think about it like a lot of the heft of that function; almost all of those parameters in those functions and sub-functions within the model are really dedicated to what’s called feature representation, or embedding, or creating an internal representation, however you want to put it. It’s about taking that data in, whatever mode that’s in - an image, or text, or audio maybe - and transforming it from its original format - maybe a set of pixel values, or a set of vocab indices in the case of text - into a really efficient, dense, good representation that can be used for the downstream task.

So if you think about kind of your model function as a big pipe, 90% or more of the flow through that pipe, through that data transformation is dedicated to the translation of whatever that input is, in whatever format, like an image or text, into this internal representation. Sometimes called features, or oftentimes called an embedding, or a hidden representation. Then a very small amount, maybe 10% or less of that processing has to do with taking that representation and then performing a downstream task, like machine translation, or speech synthesis, or object recognition.

[00:21:55.18] And so because a lot of that has to do with representing the input, that representation of the input for text or images or audio is then transferable between different text and image and audio tasks. I could have a model like BERT that embeds or represents my text input, and then I could bolt various heads onto the end of that model, the end processing of that software function, and do all sorts of things, like machine translation, or sentiment analysis, or named entity recognition, or NLI (natural language inference), or different things. So this is one of the key pieces that makes this foundation model or transfer learning approach work so well.
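Here’s a hedged sketch of that “big representation, small bolt-on head” shape, using the Hugging Face transformers library; the three-class sentiment head is a made-up example, and in a real system the head would be fine-tuned:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")  # the ~90%: representation
head = nn.Linear(encoder.config.hidden_size, 3)           # the ~10%: e.g. 3 sentiment labels

inputs = tokenizer("Practical AI is a great podcast", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # the internal representation / embedding
logits = head(hidden[:, 0, :])                # [CLS] vector -> downstream task output
```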

Now, the second aside that I would like to mention here is that one of the reasons why we were able to boost the model size and the generalizability was, one, that we understood how to make these models more generalizable, as I just mentioned, with this feature representation… But also, it requires a lot of data to train these. And one of the things that was figured out, quite interestingly, was that you could take a huge set of images or videos or text - for example, scrape the whole internet - and construct a task from it. In other words, you could construct your example inputs and your example outputs, your training data, in an unsupervised way from that large scrape of data.

And let me explain what I mean by unsupervised. In kind of the old days, 2010 to 2017, we would hand-curate “This text corresponds to this output”, and we would manually create that label. But in this new regime, where we boosted the size of these models, we would actually do this in an unsupervised way. We would scrape a whole internet’s worth of text data, for example, and then construct a meta task, quote-unquote - a set of example inputs and example outputs - by, for example, taking a sentence and removing the last word, and having the last word be the prediction. That’s an autocomplete task. Or I could take a sentence and remove or mask a word at the beginning, and have the model try to fill that in. I can construct that task automatically. I don’t need a human to do that; I can do it programmatically, as I scrape all of this data.
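Here’s a minimal sketch of constructing such a meta task programmatically - the autocomplete version, with a made-up two-sentence “corpus” standing in for a web scrape:

```python
# Raw, unlabeled text - e.g. from a large web scrape.
corpus = [
    "the quick brown fox jumps",
    "practical ai is a podcast about machine learning",
]

# Construct (input, known output) pairs with no human labeling:
examples = []
for sentence in corpus:
    words = sentence.split()
    examples.append((" ".join(words[:-1]), words[-1]))

print(examples)
# [('the quick brown fox', 'jumps'),
#  ('practical ai is a podcast about machine', 'learning')]
```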

So you might be seeing where I’m going with this, but tasks like this autocomplete were tasks that were commonly used in the training of these large foundation or base models, I think mostly with the original intent of having these be transferable to a wide variety of downstream tasks, that would be executed via this fine-tuning process.

Okay, so those were the two asides - these meta tasks and why they’re important, and this idea of feature representation or embedding; I’ll return to that embedding piece in a bit to set it in context. But let’s get to this final stage of “AI”, and this is 2022 and onwards, the most recent period, and maybe the one people are most familiar with. A lot of people hopped on the AI train during this time period, and we’ve already talked about this a little bit, but one thing that we maybe haven’t talked about on this podcast is that in this generative AI phase, people carried on curating more and more data, increasing model size, and using these meta tasks like autocomplete to train these generalizable, large foundation or base models.

[00:26:12.23] And as it turns out, if you train a huge autocomplete model on a whole internet’s worth of data, obviously the internet contains blog posts about coding, and tweets, and movie scripts, and all sorts of things. And so if I then have as input to that model a blog post about “Practical AI is:” and ask the model to autocomplete that, the probabilities associated with that prediction of those next words would actually predict maybe a quite relevant blog post.

And so now people start to realize “Well, wait a minute… Because of the scale of this training and the generalizability of this model, maybe I don’t even have to execute this fine-tuning process. Maybe I actually want to just create a large enough model, that I can prompt or instruct, and then kind of hone in the probabilities of the autocomplete to actually autocomplete what I would autocomplete maybe as a human if I followed these instructions, or if I generated this blog post, or if I translated this text into a different language.” And that’s what people started doing.

We’ve talked a little bit about this on the show before - how one of these generative models often starts with pre-training, where it’s trained on these meta tasks like autocomplete over a huge corpus of data, which is not human-curated… And then there’s a period, also from the model producers - still not us consumers - where Meta, say, will do that pre-training and then use their own curated dataset, which is still large and very expensive to create, but is a high-quality curated dataset of prompts, which might include chat examples, and tool usage, and various tasks like question answering, or machine translation, or miscellaneous instructions around summarization… And this fine-tuning dataset is then used to further train the model and produce maybe an instruct model, or a chat model, or a code completion model. And then we, if we want to consume that model, could use a chat-type prompt or a certain instruction to use the model and have it execute accordingly.
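Practically, consuming one of these instruct or chat models looks something like this - a hedged sketch assuming an OpenAI-compatible client library, with the model name as a placeholder for any chat-tuned model:

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat/instruct-tuned model
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize: GenAI is one phase in a longer ML history."},
    ],
)
print(response.choices[0].message.content)
```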

So yeah, this is how this latest phase of models fits into the history, and now we’re in a phase where we’re primarily consumers of these models. We’re not the model trainers. So practically speaking, whereas in previous generations of model usage - which, again, are still quite prevalent across industry - we might need an MLOps system to train models, monitor models, and calculate metrics, and data annotation systems to label outputs and all of those things, here the primary shift - and maybe this makes more sense now as you’re hearing things - is toward the creation of optimized prompts, the curation of those prompts, getting domain experts to write and iterate on those prompts, evaluating LLM workflows… These are the things people are thinking about a lot now. They’re not the ones training these models; they’re sort of “programming” them, or executing reasoning over them, just at the inference layer.

[00:30:05.23] Now, I mentioned a bit ago this feature representation or embedding piece of the puzzle… And we set that in the context of foundation models and transfer learning. So I want to circle back to that for just a second, as a kind of afterthought, I guess… Hopefully now you have some context for where those come from, but we of course see embeddings popping up as a very important piece of generative AI workflows. Because, as it turns out - people learned this, and then started training these models to do it on purpose - these internal representations or embeddings, where I take text and represent it in a very dense, efficient way for use in downstream tasks, can actually be used in a semantic search or retrieval fashion. If I take a sentence about eating great Chicago deep dish pizza, and I embed or represent it as this set of numbers, this vector, and then I take another sentence about pizza and do the same thing, and another sentence about flying airplanes and do the same thing, the two sentences about pizza will be closer, quote-unquote, in that vector space, according to their calculated vectors or embeddings; they will be closer than the one about airplanes. And so this allows us to semantically search over pieces of text and other things, connect those things together, and that has become a key piece, as we’ve talked about in past episodes, of things like retrieval-augmented generation, where we’re pulling information out of knowledge bases and injecting it into models.
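Here’s a hedged sketch of that pizza example, assuming the sentence-transformers library; the model name is just one common embedding model, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common embedding model

sentences = [
    "Eating great Chicago deep dish pizza",
    "I had really good pizza last night",
    "Flying airplanes requires a license",
]
vecs = model.encode(sentences)  # each sentence -> a dense vector

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # the two pizza sentences: "closer" (higher score)
print(cosine(vecs[0], vecs[2]))  # pizza vs. airplanes: farther apart
```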

Now, taking a breath - hopefully all of that was interesting as we went through those different generations. I’d like to emphasize a couple of things. You can see that all of these modeling methods have their place, and they’re still with us. It’s still very efficient and effective to utilize a time series modeling method to do forecasts, for example - that’s in that first generation of models, the statistical machine learning models. It’s still quite relevant if you’re training a new computer vision model for your factory, and that’s very specific… It doesn’t need to be general at all. It doesn’t really make sense in many cases to utilize a huge, heavyweight vision LLM, for example - an LVM, a language-vision model - to execute at very high throughput for a computer vision model that’s trying to find flaws in parts as they come off of a manufacturing line. Efficiency-wise, in how that model would have to be deployed, that would be a challenge, to say the least. So maybe you want a very specific model that’s a bit smaller and more accessible - but you also don’t need to train that computer vision model from scratch; often you would start with a foundation model. And then finally, of course, we’re all familiar with a bunch of things that people are using generative AI models for.

So that’s one thing that I’d want to emphasize, is all of these types of modeling are still with us and prevalent across industry. The second thing that I’d want to emphasize is an interesting thing that I’ve kind of noticed and even talked with someone today about, which is as you have advanced through these different generations of models, on the more machine learning statistical side, role-wise, often you had software engineers or infrastructure engineers on one side, you had business people or domain experts on the other side, and you had data scientists or data analysts or machine learning engineers in the middle, that would often translate domain-specific problems into models that they would train, which would be deployed by software engineers and infrastructure people.

[00:34:25.13] On the other end of the spectrum, in this generative AI world often there’s not this fine-tuning or training that happens. And so often domain experts or business people themselves are prompting these models to accomplish chains of reasoning, and there’s not a data scientist or an analyst in the middle. And so there’s kind of this shrinking of the zone between domain experts and business people and the direct infrastructure that’s running those models, which is an interesting development.

I think finally, the thing that I’d like to emphasize is that in an actual business scenario, the most intelligent or best solution - especially if you’re creating a customized, proprietary, or very domain-specific approach to a problem or a workflow - may involve all of these different types of models, or some combination of them. And so you can imagine a system, for example, where a user would put in a natural language query and say “Hey, could you forecast our revenue two years into the future?” And that may be processed by a generative model to create a tool call, which calls a function that uses a statistical or machine learning model to make that forecast, which returns the answer, which is then interpreted back into natural language by the generative model. And so these types of workflows are, I think, some of the most interesting ones.
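Here’s a toy sketch of that combined workflow. Everything LLM-side is faked with a simple stand-in - in a real system, a tool-calling generative model would do the routing and the final wording - and the “forecaster” is a deliberately simple linear trend fit over made-up numbers:

```python
import numpy as np

revenue = np.array([100.0, 104.0, 109.0, 115.0, 118.0, 124.0])  # toy monthly history

def forecast_revenue(months: int) -> np.ndarray:
    """The 'tool': a first-generation statistical model (linear trend fit)."""
    t = np.arange(len(revenue))
    slope, intercept = np.polyfit(t, revenue, 1)
    future = np.arange(len(revenue), len(revenue) + months)
    return slope * future + intercept

def answer(user_query: str) -> str:
    # Stand-in for the generative model deciding to emit a tool call:
    if "forecast" in user_query.lower():
        numbers = forecast_revenue(months=24)  # the classic ML model does the math
        # Stand-in for the generative model turning numbers back into language:
        return f"Projected revenue 24 months out: roughly {numbers[-1]:.0f}"
    return "No forecasting tool matched this query in this sketch."

print(answer("Could you forecast out our revenue for two years into the future?"))
```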

I guess I said that was the last thing I was going to emphasize, but I’ve already emphasized this a lot in different shows… But just to remind people: all of these things that we’re executing are software functions, and not some sort of fairy dust or sentience that comes about after you leave your laptop in the corner of a room for a while. And so a lot of the disillusionment that’s happening in the AI space is because people think doing AI or this AI transformation somehow just happens once you “have AI.” But the reality is that these are all software functions - parameterized software functions at different scales, most definitely, but software functions that need infrastructure to operate, and are often integrated into other software systems to be made useful.

And so engineering is still quite relevant. Of course, you could argue some of these can write software now if you prompt them in a specific way, but still, you sort of need to architect that system that does all of that prompting, and code generation and integration and execution… So there’s a lot of interesting things related to that.

Well, it’s been fun for me to share this little talk or workshop material with you. There’s a lot here, of course, that I haven’t covered as part of this episode, but hopefully it gives you an overall sense… I would definitely recommend, for people on the generative AI side who have yet to play around on the other side, the statistical machine learning side - if you’re technical, maybe go to something like scikit-learn, which is a Python library that, even without GPU resources or other things, will help you learn about the different types of modeling available there. There are great examples, and you can learn a lot from them.

If you’re coming from the more data science/machine learning side and you’re going to generative AI, of course, I would encourage you to learn about that and think about maybe how you can utilize the best parts of that alongside and with your statistical machine learning workflows.

If you’re non-technical in either of these - of course, there’s plenty of ways to get hands-on with generative models, but in these other cases there are still AutoML systems, or systems that allow you to do things in certain domains by uploading data and training models. If you search for AutoML… I know H2O had a thing called “Driverless AI” for a while, and there are various systems like that, that you can try, and look at interesting results.

So thank you all for bearing with me on this one, and I’m excited to have Chris back with me for next week’s episode, which you’ll hear hopefully very soon. Hope you all have a great week, and thanks for joining.

Our transcripts are open source on GitHub. Improvements are welcome. 💚
