You can’t build robust systems with inconsistent, unstructured text output from LLMs. Moreover, LLM integrations scare corporate lawyers, finance departments, and security professionals due to hallucinations, cost, lack of compliance (e.g., HIPAA), leaked IP/PII, and “injection” vulnerabilities.
In this episode, Chris interviews Daniel about his new company called Prediction Guard, which addresses these issues. They discuss some practical methodologies for getting consistent, structured output from compliant AI systems. These systems, driven by open access models and various kinds of LLM wrappers, can help you delight customers AND navigate the increasing restrictions on “GPT” models.
|Chapter Number||Chapter Start Time||Chapter Title|
|1||00:00||Welcome to Practical AI|
|2||01:02||Growing AI interests|
|4||04:17||Co-host & guest?|
|5||05:34||Pressures of AI|
|6||11:47||Unlocking more value|
|7||14:50||Sponsor: Changelog News|
|8||16:43||Open access models|
|9||20:25||Where do we place our bets?|
|10||25:35||Structured & typed output|
|11||30:33||Problems of the space|
|12||37:55||Determining your needs|
|13||41:22||Ease of use|
|14||43:03||No code world|
|15||45:27||The future question|
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist and founder of Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing well today, Daniel. It just continues to be super-interesting in this space, in the world of AI. So much change. This has been a year – I think it’s a year that’s for the history books in terms of the advances, and the fact that AI is really making deep impact into the general population, people who normally might not be listening to our podcast, as hard as it is to believe that.
Yeah, I was just gonna say, people in companies like my wife’s company, which is not a large company, it’s not a tech company, but they’re having conversations about “How do we as a company leverage AI, or leverage large language models in our content generation?” or what have you. And it’s really permeated all industries at this point, I think, and people are wrestling with the idea of “What do we do?”, not just “IF we do something in relation to AI.”
Agreed. I think that’s a huge issue right now. It is potentially more confusing about how to handle everything that’s coming at companies these days from a large language model and generative AI than it ever has been, and the problem is getting harder. So we want to talk a little bit about that today, and I want to acknowledge a couple of things with our audience. Many of you have been with us for quite a long time. We’ve been doing this show for about five years on a weekly basis - I know it’s been forever - and it’s just getting more and more interesting.
Over that course of time, something I wanted to share with our audience is I’ve gotten to know Daniel pretty well. And we didn’t know each other super-well beforehand; we had met in the Go software development community as kind of the two people looking at data concerns. But Daniel over time has demonstrated not only the fact that he is repeatedly an incredibly smart man with a lot of capability, but he’s also an incredibly good human being. And for anyone who’s followed the show for a long time, they know that the idea of just being a good person, and AI for good, and such things are a huge repeating topic on the show. And I’ve also learned to trust where he’s going, and to understand that if Daniel is doing or interested in something, it’s something that I want to know about.
So today we want to hit this large language model kind of in the world and how you manage that, but I also want to acknowledge that we’re going to have bits of the show that could be considered conflict of interest. And the reason I say that is we’re going to talk about some work Daniel has been doing. And so if that bothers anybody, this is the point where you might want to shut off this particular episode… But I’m hoping that most of you trust us, and have been with us for long enough to know that I’m not going to take you down a path that you wouldn’t want to go… And I’ve asked Daniel to talk not only about the space of large language models being brought into production, and trying to juggle all the things coming out, but talk about the work he’s doing. And so we are unabashedly going to go that direction, and if anyone has hate mail to send, please send it to me, because I have demanded, I have demanded that Daniel talk about this… And so thank you for bearing with us on that.
So Daniel, you’re kind of both the co-host today, and the guest, if you will… And if you might lay out a little bit of the landscape for us about what this looks like as an insider, as someone who spends all your time focusing on this problem, that might be a good way to start.
Yeah. Thanks, Chris, and thanks for the kind words. I’ve over time learned so much from doing this show, and it’s shaped a lot of what I think about, and certainly the things that I’ve been thinking about really since - well, this whole year, since kind of Christmas time, have been focused around these ideas of controlling large language models, guiding them, guarding them, making compliant AI systems, and a lot of that’s led into the thing that I’m building right now, which is called Prediction Guard. So that’s what you were referring to in terms of what I’m building. So I’m coming at it from that perspective, and been thinking about this a lot, been talking about this a lot publicly, and excited to do things like - upcoming, there’s an LLMs in production event put on by our friends at the MLOps community. That’s really cool. I’m giving a talk there on controlled and compliant AI applications, so that’s part of what I’ll share here today as well.
One question maybe that I have for you as we start out here, Chris, is what have you experienced in terms of the people that you’re talking to with regard to the pressure that they’re feeling, either internal to their own company, or from like market pressures… Like, jump into the AI waters - implement something, make AI part of our stack. What are you seeing there?
[00:05:59.23] So I don’t think you’ll be surprised when I say this, and we’ve alluded to this on some previous episodes… But it is a difficult business concern to navigate. I know all of us who straddle into the AI technical realm are incredibly excited, we’re trying to figure out how to do the models, and put them out there, and everything like that… But if you are not in our shoes, if you’re walking on a slightly different path, and let’s say you work for a legal department, or a compliance department, or other business concerns, and suddenly these technologies are coming at you hard and fast, week by week in 2023, and you’re trying to navigate that, and look at things like licensing on how the data that goes into models is used, and you’re looking at compliance concerns, and you’re looking at protecting your intellectual property… There’s a whole host of challenging business problems with essentially no guidance. This is a brave new world that has to be pioneered through, and so I have talked to a lot of businesspeople in various roles, including attorneys, and this stuff is scary stuff. It is problematic stuff, it is challenging to navigate. And I definitely want to take you down the path today of talking about the space, and Prediction Guard, relative to how you actually get these models out there in a productive way, in a business environment, so that people can take advantage of the technology, and understand what the pitfalls are, and such.
So that’s the big thing that I’ve been hearing… I’ve been getting an earful of it lately, like “Chris, settle down. Stop taking us down this AI thing. We’ve got to figure some things out first”, so I’m coming to you for answers, man.
Yeah, it’s so tempting, actually, to have really easy to use systems, like let’s say the OpenAI API, right? I can go to the Playground, or I can go to ChatGPT, or I can go wherever, put in my prompt and get some magical output. It’s magical, and immediately it triggers in your mind, “I can solve real business problems, and I can create actual solutions with this type of technology.” It’s so quick to make that connection. But what I’ve seen, both in sort of advising and consulting and conversations that I’ve been having is on maybe like a less stringent case, people are struggling to make that connection to how they can build robust systems out of these technologies. So it’s one thing to get text output and look at it with your eyes as a human, and say “Extract this piece of data” or “Give me a summary of this”, or something like that. But as soon as you make that programmatic and automated, then how do you know you’re getting the right output? And if you actually want to do something with that, like you’re outputting a number, a vomit of text blob out of a large language model, it doesn’t really actually do you that much good if you’re trying to implement a robust system that’s making actual business decisions on top of the output of large language models.
On the harder side of this, I’m getting feedback from people that either I know, or I’m advising, or other things that companies are actually telling them “No, there’s a full stop on using “GPT models” in this organization”, because of one of a few different reasons. Maybe that’s a risk thing, around “Hey, we’re gonna hallucinate some name out of this, or something… This person doesn’t exist, and that’s going to get us in trouble. People are going to stop trusting our product”, and that sort of thing. So there’s the hallucination or consistency of output sort of thing.
[00:10:02.10] There’s also, as you mentioned, the IP or PII type of leakage scenario. So it is actually a problem for people to sit in a company - I’m sure this would be true whether you’re at your company, or a variety of other companies that I’ve talked to, where I’m sitting there and I’m like “Oh, I could solve this problem with ChatGPT. Let me copy and paste this user data into ChatGPT, and have it summarize something, or extract something”, or whatever it might be. It’s sort of unclear and murky waters how that data is actually going to be used by OpenAI, and you’re kind of leaking IP, or company information, PII, to external systems, right? Which is a big, big no-no; regardless of how that’s used in the end, this data, it seems like it’s going to exist outside of your own systems. And so on the harder side of this problem, people are being told, “No, you have a full stop. Can’t use GPT, can’t use large language models.”
So to summarize how I would kind of think about this problem space, people are feeling the pressure that they need to or really want to implement these systems, either because they feel like they’re getting left behind, or there’s an actual market pressure for them to do something… But in practice, they don’t know how to deal with the outputs of large language models, and they might not even be able to connect to the kind of most common large language models because of these privacy, security, leaked IP type of issues.
I think that’s really, really widespread… It’s funny, you’ve kind of enumerated a whole set of risks associated with that. Yesterday, just as a thing - you know, I have a particular employer, and thinking about public information, well-known public information about the lines of business that we have - it is publicly acknowledged and multiple sources out there… I went to ChatGPT; I should have done the 4.0 model, but I forgot, and I just let it default to 3.5, and I simply asked for our 19 lines of business, which is incredibly public knowledge, and it got it wrong. It got it wrong the first time, and so I tried to steer it a little bit, and it got it wrong the second time… And had I not known better about the intellectual property concerns with licensing, had I tried to put something in that might have been out there in the public… So I run into what you just said all the time. And there’s so many risks. And yet, there’s so much value to extract from this space. And so I think putting your finger on the fact that if you can find a way to mitigate these risks in various ways, that will unlock a huge amount of value for a lot of organizations and users to do that… But it certainly, from my standpoint, feels like the Wild West right now.
Yeah, I would say that that’s true, and yet there’s these concerns… The money that people are able to save, operating costs with AI in your business are significant. So I saw this study from Accenture, estimating insurance companies saving $1.5 million per 100 full-time employees. So if you’re insurance company A, and you’re not trying to implement AI systems in your business, then you’re actually introducing a liability… Because insurance company B might be doing that, and they’re gonna slash their prices and undercut you and put you out of business. So even regardless of new features that might be implemented in like people’s products, and that sort of thing, there’s this real liability around not considering AI solutions as part of your business strategy.
I think that’s a huge point, and that’s the other side of the coin that I was just talking about. There was the risk of using, and there’s the potentially larger risk of not using at all. So we’re seeing that in all markets, in terms of the need to stay on top of what is gradually evolving over these past months, and to be able to use that to promote your business. And if you don’t do that, the risk is substantial. So the idea of navigating the licensing and the compliance concerns, and being able to productively use these outputs is really crucial to being successful in almost any industry going forward. So definitely looking forward to finding out how we might do that.
If I could summarize some of what’s been said, we kind of talked about these two large categories of problems. One was the structuring consistency and validation of the output of these models to make them useful in actual business use cases. And the second was maybe compliance concerns, privacy security concerns, which really have to do with how a model is hosted, or how you access that model. So on the one side, it’s how do you process the output of a model, and then on the other side, how do you access or host a model? Both of those things can be pretty big blockers.
To kind of dive into the latter of those, the hosting privacy security thing, I actually am quite encouraged by where things are headed recently, because we’ve seen this kind of proliferation and explosion of open access models that continue to be released day after day. The most recent one at the time of recording this - I might be missing one; they seem to come out every week. But one for example that came out recently is the MPT family of models from MosaicML, which is just really extraordinary. I think that they have up to like context length, or - you can think about that as kind of your prompt size for the model, of like 60,000 tokens… And they do quite well in various scenarios.
[00:18:18.02] So there are these increasing number of open access models, but I would say there’s two problems with using these as a business. Let’s say I wanted to host one of these and use it internally. Well, maybe three problems. It’s always good to have three points, right? Three problems. One is you still have to figure out the weird GPU hosting and scaling of that model, which is a challenge.
The second is, in reality, these open access models, at least according to most people, I think it’s generally accepted that these aren’t quite up to the standards of the larger commercial systems that OpenAI and others are putting out there; Cohere, and Anthropic, and others.
So there’s a performance concern, there’s the hosting concern… And then the third, which is the same as our other major topic here, is you still have to figure out how to use the output of them; they’re still just gonna vomit up text on you, and you have to figure out how to deal with that. This has led some people to strike up these kind of expensive deals to host OpenAI models in Azure infrastructure. That’s becoming easier over time; I hope that becomes increasingly easier. It’s still a little bit limited to Azure mainly, in my understanding, and it’s definitely not cheap, I would say, if you kind of compare all the costs and add in the engineering time to do that, and all that. So some people are solving this model hosting issue by either hosting an open access model, maybe with a hit in performance, or implementing a really expensive kind of private version of OpenAI, something like that… And if you don’t have that budget, or if you don’t know about GPUs, or how to host models, you’re kind of out of luck, in a lot of ways.
Not only do I agree with you, but I think that that’s going to proliferate in terms of the challenges across there… I know speaking for myself and another friend that I talk to a lot about this stuff, we are experiencing the fact that as model updates come out, new models come out, they have different strengths and weaknesses. There are some things that I might, for instance, go to GPT-4 on, there are other things I might go to Bard on now… And those are just two; there’s a whole bunch of open source ones that we were starting to talk about… And with the acknowledgement of - for instance, OpenAI has kind of acknowledged that there is a practical limit in terms of how much data you can feed a model, and that we need to start looking at other dimensions on that.
So with practical limits in sight, the commercial advantage, for instance, may hit that ceiling, and open source ones will gradually catch up. And so you’re seeing the relationships of utility for a user between different models changing on a regular basis, and us users having to make adjustments to that. How does that play into the landscape? Because if you’re an organization and you’re trying to make investments, like “Do we bet on OpenAI and Microsoft, do we bet on Google? Do we bet on open source options? What are the options there? What are the different capabilities that might be available to us for doing that?” And acknowledging upfront Prediction Guard may be one of those, what does the rest of the landscape look like, and how does Prediction Guard fit into that, and what are some of the pros and cons that you see?
[00:21:46.29] I’ll decouple a couple of these things and talk about the general landscape, and then Prediction Guard. So in terms of this problem of the hosting, compliance, privacy, IP leakage, that sort of thing, I think if you’re a company of a certain size, and you can afford kind of a private OpenAI setup in Azure, it’s probably a pretty reasonable solution; it will definitely work very well, but it’s going to be very, very costly. And again, it’s not going to solve this structuring and usage of the output of language models problem. So you’re going to have to put additional engineering effort into helping build layers on top of that, that work for your business use cases.
You could bet on certain open access models right now, but like you said, things are advancing so quickly, it’s hard to say “I’m going to put all of this effort into one, and hosting of the one, and build a system around it.” I do think that there’s advantages if you’re going that route to center your infrastructure around kind of model-agnostic workflows, like those in LangChain, or others, where you actually abstract away the model interface and can connect to multiple large language models with a lower switching cost than if you kind of have a one-off solution centered around a certain model. So I think there’s some things that people could be encouraged about there.
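As a rough illustration of the model-agnostic idea described here, the sketch below puts every backend behind one call signature. It is not LangChain's API; `ModelRegistry`, `Completer`, and the two fake backends are hypothetical names invented for this example.

```python
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a completion.
Completer = Callable[[str], str]


class ModelRegistry:
    """Keeps named backends behind one call signature so they can be swapped."""

    def __init__(self) -> None:
        self._backends: Dict[str, Completer] = {}

    def register(self, name: str, backend: Completer) -> None:
        self._backends[name] = backend

    def complete(self, name: str, prompt: str) -> str:
        if name not in self._backends:
            raise KeyError(f"unknown model backend: {name}")
        return self._backends[name](prompt)


# Stand-in backends; real ones would wrap a hosted or self-hosted model client.
def fake_open_model(prompt: str) -> str:
    return f"[open-model] {prompt}"


def fake_closed_model(prompt: str) -> str:
    return f"[closed-model] {prompt}"


registry = ModelRegistry()
registry.register("open", fake_open_model)
registry.register("closed", fake_closed_model)

# Application code only names a backend; switching models is a one-word change.
print(registry.complete("open", "Summarize this ticket."))
```

Because callers depend only on the `complete` signature, replacing one model with another does not touch the code that consumes the output, which is the lower switching cost mentioned above.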
In terms of that though, if you think about “Okay, now I’m going to go all-in on these open access models”, like you say, these models have different characters, so I’m going to want to host maybe multiple of them, and generate these model-agnostic workflows on top of LangChain, and other things; you start to really add up the engineering effort to make this happen. A parallel might be I could create a data visualization solution for my company by assembling a database, and hosting that, making the connection into a layer that would run Plotly plots or something like that, and then maybe some UI for my users that those are embedded in… And all of a sudden, I’m now talking about an absolute fortune in engineering costs, and support costs over time… Which is why products like Tableau, or other – I remember a long time ago, I don’t know how much people are still using it… One of the companies I was at was using Domo. This was one of these solutions where you can quickly suck in data and visualize it, and all that… There’s a reason why those products exist.
So Prediction Guard you could kind of think of as taking the best of open source models, and the best of this kind of control and structuring of output, which we haven’t talked about yet, and we can get into here in a second… And assembling those together in an easy-to-access and cost-efficient manner, so people can get quality output out of the latest large language models that’s structured and ready to be used in business use cases, and also with a guarantee if you want it around you using only specific models that are hosted in a compliant way, even compliant in a certain way, like a HIPAA compliant way, or in a data-private sort of way, where your data isn’t leaked if you’re putting data into models.
So that’s kind of how the landscape works and how Prediction Guard works as this kind of system that assembles the best of large language models, with structured and typed output, that can be deployed compliant without this whole huge engineering effort to roll your own system.
You mentioned structured and typed output. Can you go ahead and kind of talk a little bit about that? Because I think for many of us that are listening, we’re used to using the models that are out there kind of in the default interfaces on the web. You know, using ChatGPT, using Bard… And we’re not really dealing with that. We get an output, but we’re not at the level of sophistication where we’re doing APIs, and such as that. Can you talk a little bit about what structured output looks like when you’re dealing with it from an API standpoint, and how you unify that landscape?
[00:26:07.01] There’s a lot of use cases where this may come up, but let’s take one for example. Let’s say that you’re doing data extraction. You have a database with a column in it, which is basically – so this scenario has happened at every company that I’ve been with, so I know that it’s very common… There’s some database with a table in it, and there’s a column that’s like a Comments column, or something… And it’s just like text blobs in there that are like notes from people, or technician messages, or user messages… Or whatever it is, it’s not structured. And you want to run a large language model over that to extract - maybe it’s phone numbers, or prices, or certain classes of information out of this column. Well, you could run your large language model and set up a prompt that says, “Give me the sentiment of each of these pieces of text in my database.” Well, that prompt, each time you run it through a large language model, maybe once it generates an output that says “Space positive sentiment”, and the next time it creates an output that says, “Positive.” And the next time it creates an output that says “This is positive sentiment.” And you can start to see there’s a consistency problem here, like “How do I parse all of these strange outputs from my large language model?” You can do a little bit of prompt engineering to get around that, but ultimately, it doesn’t solve the problem that you could have all sorts of weird output out of your large language model.
So ultimately, what you would want in that scenario is a system that lets you constrain and control what types of output you’re going to get out of your large language model. So in the case of sentiment, maybe I want to restrict my output to only pos, neg, and neu tags for sentiment. There’s only three choices, I always want one of those three. I don’t want it to say “This is positive sentiment.” So I want to actually structure and control the output of my large language model to produce one of these outputs.
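To make the sentiment example concrete, here is a minimal sketch that maps free-form model text onto the three tags mentioned and rejects anything ambiguous. It only normalizes text after generation, which is weaker than constraining the model itself, and the keyword table and function name are assumptions made for illustration.

```python
from typing import Optional

ALLOWED_LABELS = ("pos", "neg", "neu")

# Hypothetical mapping from words a model might emit to the allowed tags.
KEYWORDS = {"positive": "pos", "negative": "neg", "neutral": "neu"}


def to_sentiment_label(raw: str) -> Optional[str]:
    """Map free-form model text to one of the allowed tags, or None if unclear."""
    text = raw.strip().lower().rstrip(".!")
    if text in ALLOWED_LABELS:
        return text
    matches = {tag for word, tag in KEYWORDS.items() if word in text}
    # Accept only an unambiguous match; anything else is rejected, not guessed.
    return matches.pop() if len(matches) == 1 else None
```

A caller would treat `None` as a signal to retry or flag the row. Simple keyword matching like this will still misread phrases such as "not positive", which is one reason to constrain the output at generation time instead.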
Another example that’s maybe a little bit more complicated would be to say, “I actually want to output a valid JSON blob out of my large language model, or valid Python code out of my large language model.” And these are structures that are very well-defined, but you could have all sorts of variability coming out of your large language model. And if you want a specific type coming out of your large language model - maybe it’s a float - that you can do like greater than, or add it to another number, you need that as a typed output. Or you need very specific structured output to actually make automated decisions in your business.
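The point about typed output can be sketched with two small parsers that either return a real float or JSON object, or fail loudly. The function names are hypothetical, and this checks output after the fact rather than showing how any particular product enforces types.

```python
import json
import math
from typing import Any, Dict


def parse_float(raw: str) -> float:
    """Return the model output as a finite float, or raise ValueError."""
    value = float(raw.strip())  # raises ValueError on text like "about 12"
    if not math.isfinite(value):
        raise ValueError(f"non-finite number: {raw!r}")
    return value


def parse_json_object(raw: str) -> Dict[str, Any]:
    """Return the model output as a JSON object, or raise ValueError."""
    parsed = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


price = parse_float(" 19.99 ")
# Because the value is a real float, comparisons and arithmetic are safe.
needs_review = price > 100.0
total = price + 5.0
```

Raising on bad input keeps an unparseable answer from flowing silently into an automated decision.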
And so with Prediction Guard, what we’re doing is we’re kind of assembling the best of the recent advances in this kind of control and structuring of output, and layering it on top of these open source large language models to allow you to say, “Here’s my prompt. I’m going to send it to these five open and/or closed” - we support OpenAI as well… So “open and/or closed models, and for each output, I want you to give me a float number.” And that’s the sort of rich output that you can get from large language models very quickly with a Prediction Guard kind of prompt, because you can control the models that you’re using, either ones that are more privacy-conserving, or the closed source options, and provide constraints around the output that allow you to actually make business decisions on that. Now, there’s additional checks that could go along with that, like factuality checks and toxicity checks, which we also implement… But I’ve vomited up a lot of information, so I’ll pause here.
[00:29:58.28] No, no, that sounds fascinating. The way I’m interpreting what you’re saying is sort of like you have these kind of software filters that are creating boundaries, if you will, on how you structure input, and what that output can be so it’s usable… Which kind of goes back to one of the points that we’re often talking about on the show, is that the AI is to some degree inseparable from the software that you’re using it within, and so you have a best-of-breed software product that’s kind of shaping and constraining what that can be, so that it’s actually usable on that.
So as we look forward at kind of where things are going, what are some of the problems that you see going in this space that we have, and what are some of the things that you would like to see Prediction Guard starting to address in the time ahead? And I don’t mean so much as the far distance, but kind of like you’re busy putting this solution together now, it works pretty darn well what you already have… What are some of the challenges when you’re in this kind of a fast-moving space? Because you’re having the world change out from under you on a week-by-week basis right now.
Yeah, I think maybe one of the things that we’re thinking about as really at the forefront of our mind is ease of use and accessibility to both data scientists and developers. So the reality is that - I think we had Kirsten Lum on the podcast talking about this… The majority of data scientists out there are super-constrained in the time that they have to put into one of these solutions. So it’s really, really important that there is an ease of use to this sort of controlled, compliant LLM output and generative AI output.
Now, what we’re seeing - and I want to acknowledge this as well - is there are an increasing number of open source projects that are doing an amazing job at digging into this problem of controlled and guarded LLM output. So these are things like Guardrails, and Guidance from Microsoft, and Matt Rickard’s ReLLM - these projects are doing amazing things at really flexible ways for you to control the output of large language models… But I see this as kind of like a double-edged sword a bit. The more flexible you become, it’s also possible to become less easy to use…
And there’s more engineering involved in it.
Yeah. So I saw Matt Rickard tweet about this related to his regex LLM project, which is that sort of famous quote about regex, which is “I have a problem, and so I decided to use regex, and now I have two problems.”
That’s an old one. That’s been around for a long time, actually.
Yeah, it’s actually – it’s so true. And some of these solutions are coming up with their own query languages to kind of deal with this structured output, which I think is great, and it’s really important, but there’s a need for this abstraction layer on top where I know kind of what I want my output to look like, so I should be able to plug that into something and have it constrain the output of my large language model in an appropriate way.
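One way to picture that abstraction layer is a set of named presets that hide the regex from the caller. This sketch validates output after generation, whereas the projects mentioned constrain it during generation; the preset names and patterns are assumptions for illustration.

```python
import re

# Hypothetical preset names mapped to the patterns they stand for.
PRESETS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)"),
    "us_phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def matches_preset(preset: str, raw: str) -> bool:
    """True if the whole (stripped) output matches the named preset."""
    return PRESETS[preset].fullmatch(raw.strip()) is not None
```

The caller asks for `"integer"` or `"us_phone"` and never sees the pattern, which is the ease-of-use trade being described.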
So with Prediction Guard, what we’ve started with is the kind of presets of structuring your output… So I want integer and float and JSON and Python or YAML, I want categorical output… These are things that we support now. Also supporting kind of these hosted models, and access in a guarded kind of controlled way to these models. But let’s say that I have a really specialized format that I want to work with; I would rather set up a solution with Prediction Guard - this is actually what we’re actively working on, where they could give examples of the structure that they want, and we actually generate the right constraints for them on the large language model output, which I think is very possible, and our initial work on this, which is kind of in a beta form, is really good.
[00:34:17.02] So let’s say that I want a specific JSON with these specific fields, or a specific CSV output with these specific columns, right? I should be able to give a few examples of that, and generate the right underlying constraints for my large language model, without the user having to think about special languages, or regex, or context-free grammar, or these things that are a little bit harder to grasp. We’ll handle that bit for you, and you just get the right structured output from your models.
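The idea of deriving constraints from a few examples can be sketched in a very reduced form: infer field names and types from example JSON objects, then check new output against them. This is an illustrative stand-in, not the approach the speaker's product uses, and both function names are hypothetical.

```python
import json
from typing import Any, Dict, List


def infer_schema(examples: List[Dict[str, Any]]) -> Dict[str, type]:
    """Infer {field: type} from examples; all examples must agree."""
    if not examples:
        raise ValueError("need at least one example")
    schema = {key: type(value) for key, value in examples[0].items()}
    for example in examples[1:]:
        if {k: type(v) for k, v in example.items()} != schema:
            raise ValueError("examples disagree on fields or types")
    return schema


def conforms(raw: str, schema: Dict[str, type]) -> bool:
    """True if raw is a JSON object with exactly the schema's fields and types."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or parsed.keys() != schema.keys():
        return False
    return all(type(parsed[key]) is schema[key] for key in schema)


examples = [
    {"name": "Widget", "price": 9.5},
    {"name": "Gadget", "price": 12.0},
]
schema = infer_schema(examples)
```

The type check is deliberately strict, so a price written as `3` (an integer) would be rejected against a schema inferred from floats. A fuller version would need to decide how to treat cases like that.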
So that’s part of where I see us headed, is leveraging these rich systems under the hood that are being produced around using context-free grammars, special query languages, regex, all of these things to structure output, and combining those in a more automated way for users, where they can just say, “Here’s my examples, here’s my query”, and they just start getting the right formatted output from their language model.
So that’s kind of thing one, is this automation of some of the problem and the constraints. I think the thing two would really be around the validation and checking of output in addition to the structuring. So right now we support factuality and toxicity checks on the output of large language models, so…
Could you talk a little bit about what each of those are?
Yeah, yeah. So let’s say that I take a big piece of text and I generate a summary, or I do a question-answer prompt and get an answer, right? It doesn’t mean the answer is factual. And we all know about the hallucination problems of these models. So the things that we have implemented in Prediction Guard are two things with respect to that. The first is a factuality checking score, which is built on these trained models under the hood that look at a reference piece of text and your text output to determine a likelihood of the answer being factual. So this is an estimate on the factuality of your output.
The other thing that we’re doing around hallucinations and factuality is making it really easy for people to do consistency checks. I kind of alluded to this earlier, but we have all of these different language models accessible under the hood. So you could combine the outputs of Camel 5 billion, MPT 7 billion, Dolly and OpenAI, and restrict the output to say “Give me the answer, but only if all of these agree on what the output is. If all of them don’t agree, then I’m going to flag that as not a reliable output.” And so you can actually gain a lot by not just leveraging one model, but ensembling these models together to do a check.
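The "only if all of these agree" check can be sketched in a few lines. This is a minimal illustration, assuming the answers from each model have already been collected into a dictionary. The function name `consensus_answer` and the simple lowercase/whitespace normalization are assumptions for the example, not how Prediction Guard implements it.

```python
def consensus_answer(answers_by_model):
    """Return the shared answer if every model agrees, otherwise None.

    answers_by_model maps a model name to that model's text answer.
    Answers are compared after trimming whitespace and lowercasing.
    """
    normalized = {answer.strip().lower() for answer in answers_by_model.values()}
    if len(normalized) == 1:
        return normalized.pop()
    return None  # disagreement (or no answers): flag as unreliable


# Usage with made-up model names and answers.
print(consensus_answer({"model_a": "Paris", "model_b": " paris "}))
print(consensus_answer({"model_a": "Paris", "model_b": "Lyon"}))
```

Exact agreement only makes sense for short, typed answers such as categories or numbers. For free text, some notion of semantic similarity would be needed instead.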
The toxicity thing is something that’s been studied for a while, and there’s models out there, state of the art models for detecting whether an output is toxic or not, or includes hate speech or not, that sort of thing… So this is another layer of check that you can have on the output. And so if you put the whole pipeline together of Prediction Guard, you’ve got models which can be deployed compliant with HIPAA, or just data privacy; then structured or typed output that you can define very easily, and then you can run additional checks on that output for factuality, toxicity, consistency, as a final sort of layer in the pipeline towards the output that’s used in a business application.
[00:37:46.27] I appreciate the explanation. It’s a very robust-sounding pipeline that you have on that. Let me ask you… And this could be whether it’s Prediction Guard, or whether it’s the larger field. One of the challenges – it’s certainly something I’ve been playing with, but I don’t have a good rhyme or reason to it yet… With the proliferation of these models coming out, and evermore coming; we know this space is going to get larger and larger. How does a user, or how would a system like Prediction Guard be able to determine which is the right way to go in terms of which model you want to choose, or which group of models? And you talked about the comparisons a moment ago… How do you structure the input and know that you’re gonna get what you need from an output by putting the right model or collection of models together, and then knowing how to evaluate them against each other? Does that make sense?
Yeah, yeah, that makes sense. Actually, early on when we were building the Prediction Guard backend, this was actually front of my mind, and has since kind of evolved a little bit. The fact that there’s all of these models and I want to choose the right one for my use case - you can very much automate that process, and it’s actually still implemented in the Prediction Guard backend, where you can give some examples, and evaluate a whole bunch of models on the backend.
I think where this is headed though, and where the Prediction Guard system is headed is making it easier for people to get output from multiple models in a typed way, because they know how to do the evaluation. They’re familiar with this sort of thing, whether you’re a developer doing sort of integration tests or unit tests, and you’re checking and you’re asserting certain values, or you’re a data scientist that’s running a larger-scale test against the test set, people kind of know what they want to do with that sort of thing. What they need is an easy way to get that typed output from multiple models. So like if I have a test set and I’m comparing two scores on the output, like float numbers, I need to get float numbers out of a whole bunch of different large language models to compare them to my baseline or to my test set. Right now that’s very difficult, because all of these different kind of structuring/guidance/control systems work not for all models, and they don’t work in the same way for all models, and you have to implement it for all of the models. And so it becomes this compounding problem to figure out how to do that.
And so how we’re approaching that with the Prediction Guard system is there’s a standardized API to all of these different models, along with the typed and structure control on the output. So I can do a query that says “Give me the float output for these 100 prompts, using these five models, and then I’ll just compare all the float outputs and figure out which is the best.” That’s not the hard problem. It’s the getting that structured output from the variety of models in a robust and consistent way; that’s actually a more difficult problem.
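Once every model returns floats for the same prompts, the comparison step really is short. The sketch below assumes the float outputs have already been gathered per model and ranks models by mean absolute error against a test set. The function names and the choice of error metric are assumptions for illustration.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predicted and target floats."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def rank_models(outputs_by_model, targets):
    """Return (model_name, error) pairs sorted from best to worst."""
    scores = {
        name: mean_absolute_error(predictions, targets)
        for name, predictions in outputs_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Usage with made-up outputs from two hypothetical models.
targets = [1.0, 2.0, 3.0]
outputs = {"model_a": [1.0, 2.5, 3.0], "model_b": [0.0, 0.0, 0.0]}
for name, error in rank_models(outputs, targets):
    print(name, round(error, 3))
```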
Gotcha. Is it fair – you know, as we’re talking about this, it sounds a lot like you’re also solving one of the bigger challenges we’ve talked about over time, which is that there’s so much domain expertise in the AI space in terms of being able to manage models, but if I’m understanding you correctly, it sounds like minimally, with some basic software skills, and knowing how to use APIs and stuff, you can probably without deep expertise in deep learning manage to get some fairly productive output through Prediction Guard, by implementing it that way. In other words, it becomes just another part of your software workflow. Is that a fair characterization, what I’m saying?
[00:41:45.07] I would say it is, in the sense that there’s still some sort of like integration, testing and integration that will have to happen regardless. But going back to my example before, of like the data visualization stack, it’s a lot harder to implement the database and the visualization layer and the frontend than it is to like log in and do the – there’s still configuration that’s needed in like a Domo type solution or Tableau, it’s just a lot more accessible.
So here, we have the language models hosted on the backend, we have the structured, guarded way to query those models via something that all developers know how to use, a REST API or a Python client - maybe there’ll be other clients over time - and have the ability to configure that in the way that you want. So I want output from these five models, I want to ensemble them together, or I want this structured output… And so there’s still configuration, and I think developers and data scientists, they want that. It’s just that it’s really hard to get all the other pieces in place, and we’re hopefully making that a lot easier.
So let me ask one final – I think this is an aspirational question, but I’m kind of curious… One of the things that we’ve seen with larger language models is the ability for people who aren’t even developers - I was saying, like, developers who aren’t even deep learning experts, but to have a certain amount of capability producing code; the kind of avenue into a no-code world has at least been started on this. It has a lot of maturing to do, obviously. Do you envision a point where someone with very limited skills can also use Prediction Guard in this way, and be able to kind of generate apps using large language models that then kind of feed into a more mature workflow like what you’ve described? Do you think that that’s attainable at some point, in the not-so-distant future?
It’s hard to say how far this kind of automation will go. I think a lot of the agents that we’ve seen produce good demos, but they have an additional layer of this sort of – additional problems around automating these various steps of the process. I think that in terms of what we’re looking at, this sort of automated structuring of output is a step in the right direction in terms of “I don’t have to define a special query language or a special specification, but I can say what sort of structure I want output, and that gets output.” I think then if you layer that on top of the agent sort of infrastructure that’s in LangChain, and the data augmentation - we just had the episode with Jerry from Llama Index, which was super-fascinating… So if you layer the kind of structured, guarded output with the chaining and agent and automation of LangChain, and maybe the data augmentation of Llama Index, I think a lot of things become possible.
I hope that some of the things that you mentioned become possible. It’s yet to be seen… But I am really encouraged that adding in this sort of type safety for outputs and structuring of outputs gives a lot more confidence maybe in some of the checks that you could do on AI agents over time, and that increases our confidence in sort of releasing AI agents on various parts of the workflows that we’d like them to work on.
[00:45:21.09] So you’ve sort of already covered some of the territory, but for our listeners - Daniel and I often when we’re talking to a guest, we’ll kind of finish with what we roughly call the future question; kind of wax poetic about where things are going. And so Daniel, since you’re knowing that there’s a – I’ve kind of hit some of that already, but what would you be asking yourself? So you’ve kind of had me throwing these questions at you from a point of somewhat ignorance compared to where you’re coming from as the expert on it… What right now would you ask yourself that you haven’t covered, that you think is worthy of getting in before the episode is over? I’m putting you on the spot.
Yeah, I’ve mentioned open access models quite a bit, and I think hopefully a lot of us are encouraged by the direction that that’s going, that these models are getting better and better… But one thing that maybe I would ask myself, or that I think is important to highlight and encourage people with, is these open access models might not quite be at the level of OpenAI, Anthropic etc. yet. But I think not only will they get there, but already in the space where we’re at now, with some of these kind of structured control elements around open access models, you can actually boost the performance of open access models to be more in line with OpenAI-level output… Because what you can do is say, “Well, I’m gonna force my output to this. If I’m not able to produce it, I can re-ask the question, or I can try a variant of my prompt.” And these kind of wrapping layers around open access models actually provide a way for you to operate in a data-private, compliant way, with open access models that boost their performance closer to what these kinds of closed and maybe more suspect in terms of IP leakage and that sort of thing systems are doing.
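The "re-ask, or try a variant of my prompt" wrapper can be sketched as a small loop. This is a minimal example under stated assumptions: `generate` stands for any callable that sends a prompt to a model and returns text, and `ask_for_float` is a hypothetical helper name. Neither is a Prediction Guard function.

```python
def ask_for_float(generate, prompt_variants):
    """Try each prompt variant in turn until a reply parses as a float.

    generate: callable taking a prompt string and returning the model's text.
    Returns the parsed float, or None if no variant produced one.
    """
    for prompt in prompt_variants:
        reply = generate(prompt)
        try:
            return float(reply.strip())
        except ValueError:
            continue  # output was not a bare number; fall through to the next variant
    return None


# Usage with a stand-in "model" that only answers cleanly on the second try.
canned_replies = iter(["I think about 4", "4.5"])
fake_generate = lambda prompt: next(canned_replies)
print(ask_for_float(fake_generate, ["Rate this 1-5.", "Reply with only a number from 1 to 5."]))
```

The same pattern works with any parser or validator in place of `float`, which is what lets a wrapper layer sit in front of any model.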
So I think that’s an encouragement that I’ve found recently, and I hope that’s encouraging to others, is we are really seeing a proliferation of these models, and they’re all going to have a little bit different character, but the ways that we wrap them and the way that we present them provides the majority of the value of those models. And I think we’ll see not only Prediction Guard, but other systems as well coming out that wrap these models and use them in really intelligent manners, that boost their performance in a way that isn’t reliant on sort of a centralized API.
I appreciate that. I think you’re right, and I am deeply appreciative of you not only telling us about Prediction Guard, but actually kind of laying out the space; even if someone is not chomping at the bit the way I am to use Prediction Guard, they hopefully kind of understand what some of the problems are that need to be addressed, whether by you or others out there. So thank you for allowing me to twist your arm, and do this episode today. I appreciate you letting me go there.
So anyway, thank you very much to my good co-host and my guest today for coming on Practical AI.
Thanks so much, Chris.
Our transcripts are open source on GitHub. Improvements are welcome. 💚