Practical AI – Episode #253
Collaboration & evaluation for LLM apps
with Raza Habib, co-founder and CEO of Humanloop
Small changes in prompts can create large changes in the output behavior of generative AI models. Add to that the confusion around proper evaluation of LLM applications, and you have a recipe for frustration. Raza and the Humanloop team have been diving into these problems, and, in this episode, Raza helps us understand how non-technical prompt engineers can productively collaborate with technical software engineers while building AI-driven apps.
Featuring
Sponsors
Read Write Own – Read, Write, Own: Building the Next Era of the Internet—a new book from entrepreneur and investor Chris Dixon—explores one possible solution to the internet’s authenticity problem: Blockchains. From AI that tracks its source material to generative programs that compensate—rather than cannibalize—creators. It’s a call to action for a more open, transparent, and democratic internet. One that opens the black box of AI, tracks the origins we see online, and much more. Order your copy of Read, Write, Own today at readwriteown.com
Changelog News – A podcast+newsletter combo that’s brief, entertaining & always on-point. Subscribe today.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:07 | Welcome to Practical AI | 00:31 |
2 | 00:43 | Origin of Humanloop | 04:40 |
3 | 05:23 | Types of designers | 02:43 |
4 | 08:06 | Tech & non-tech workflow | 03:41 |
5 | 11:47 | What am I building? | 01:32 |
6 | 13:19 | Sponsor: Read Write Own | 01:11 |
7 | 14:40 | What can Humanloop do? | 02:42 |
8 | 17:22 | In-production feedback | 01:09 |
9 | 18:32 | Fine-tuning jargon | 03:11 |
10 | 21:43 | Fine-tuning trends | 02:28 |
11 | 24:11 | Proliferation of open models | 02:39 |
12 | 26:49 | Sponsor: Changelog News | 01:31 |
13 | 28:20 | Different roles in the Humanloop system / Collaborating in Humanloop | 03:44 |
14 | 32:04 | Production framework | 02:35 |
15 | 34:40 | Importance of evaluation | 03:34 |
16 | 38:14 | Surprising use cases | 04:29 |
17 | 42:43 | Exciting things that are happening | 02:38 |
18 | 45:27 | Outro | 00:46 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am CEO and founder at Prediction Guard, and I’m really excited today to be joined by Dr. Raza Habib, who is CEO and co-founder at Humanloop. How are you doing, Raza?
Hi, Daniel. It’s a pleasure to be here. I’m doing very well. Yeah, thanks for having me on.
Yeah, I’m super-excited to talk with you. I’m mainly excited to talk with you selfishly, because I see the amazing things that Humanloop is doing, and the really critical problems that you’re thinking about… And every day of my life it’s like “How am I managing prompts? And this next model that I’m upgrading to - how do my prompts do in that model? And how am I constructing workflows around using LLMs?”, which definitely seems to be the main thrust of some of the things that you’re thinking about at Humanloop. Before we get into the specifics of those things at Humanloop, would you mind setting the context for us in terms of workflows around these LLMs, collaboration on teams? How did you start thinking about this problem, and what does that mean in reality, for those working in industry right now, maybe more generally than just at Humanloop?
Yeah, absolutely. So I guess on the question of how I came to be working on this problem, it was really something that my co-founders, Peter and Jordan, had been working on for a very long time, actually. Previously, Peter and I did PhDs together around this area, and then when we started the company, it was a little while after transfer learning had started to work in NLP for the first time, and we were mostly helping companies fine-tune smaller models. But then sometime midway through 2022 we became absolutely convinced that the rate of progress for these larger models was so high, it was going to start to eclipse essentially everything else in terms of performance… But more importantly, in terms of usability. It was the first time that instead of having to hand-annotate a new dataset for every new problem, there was this new way of customizing AI models, which was that you could write instructions in natural language, and have a reasonable expectation that the model would then do that thing. And that was unthinkable at the start of 2022, I would say, or maybe a little bit earlier.
So that was really what made us want to go work on this, because we realized that the potential impact of NLP was already there, but the accessibility had been expanded so far, and the capabilities of the models had increased so much that there was a particular moment to go do this. But at the same time, it introduces a whole bunch of new challenges, right? So I guess historically, the people who were building AI systems were machine learning experts; the way that you would do it is you would collect, annotate the data, you’d fine-tune a custom model… It was typically being used for like one specific task at a time. There was a correct answer, so it was easy to evaluate… And with LLMs, the power also brings new challenges. So the way that you customize these models is by writing these natural language instructions, which are prompts, and typically that means that the people involved don’t need to be as technical. And usually, we see actually that the best people to do prompt engineering tend to have domain expertise. So often, it’s a product manager or someone else within the company who is leading the prompt engineering efforts… But you also have this new artifact lying around, which is the prompt, and it has a similar impact to code on your end application. So it needs to be versioned, and managed, and treated with the same level of respect and rigor that you would treat normal code, but somehow you also need to have the right workflows and collaboration that lets the non-technical people work with the engineers on the product, or the less technical people.
And then the extra challenge that comes with it as well is that it’s very subjective to measure performance here. So in traditional code we’re used to writing unit tests, integration tests, regression tests… We know what good looks like and how to measure it. And even in traditional machine learning, there’s a ground truth dataset, people calculate metrics… But once you go into generative AI, it tends to be harder to say what is the correct answer. And so when that becomes difficult, then measuring performance becomes hard; if measuring performance is hard, how do you know when you make changes if you’re going to cause regressions? Or all the different design choices you have in developing an app, how do you make those design choices if you don’t have good metrics of performance?
And so those are the problems that motivated what we’ve built. And really, Humanloop exists to solve both of these problems. So to help companies with the task of finding the best prompts, managing, versioning them, dealing with collaboration, but then also helping you do the evaluation that’s needed to have confidence that the models are going to behave as you expect in production.
And as related to these things, maybe you can start with one that you would like to start with and go to the others, but… In terms of managing, versioning prompts, evaluating the performance of these models, dealing with regressions, as you’ve kind of seen people try to do this across probably a lot of different clients, a lot of different industries, how are people trying to manage this, in maybe some good ways and some bad ways?
[05:52] Yeah, I think we see a lot of companies go on a bit of a journey. So early on, people were excited about generative AI and LLMs; there’s a lot of hype around it now, so some people in the company just go try things out. And often, they’ll start off using one of the large, publicly-available models, OpenAI, or Anthropic, Cohere, one of these; they’ll prototype in their own kind of playground environment that those providers have. They’ll eyeball a few examples, maybe they’ll grab a couple of libraries that support orchestration, and they’ll put together a prototype. And the first version is fairly easy to build; it’s very quick to get to the first wow moment. And then, as people start moving towards production and they start iterating from that maybe 80% good enough version to something that they really trust, they start to run into these problems of like “Oh, I’ve got like 20 different versions of this prompt, and I’m storing it as a string in code… And actually, I want to be able to collaborate with a colleague on this, and so now we’re sharing things either via screen sharing, or –” You know, we’ve had some serious companies you would have heard of, who were sending their model configs to each other via Microsoft Teams. And obviously, you wouldn’t send someone an important piece of code through Slack or Teams or something like this. But because the collaboration software isn’t there to bridge this technical/non-technical divide, those are the kinds of problems we see.
And so at this point, typically a year ago people would start building their own solution. So more often than not, this was when people would start building in-house tools. Increasingly, because there are companies like Humanloop around, that’s usually when someone books a demo with us, and they say “Hey, we’ve reached this point where actually managing these artifacts has become cumbersome. We’re worried about the quality of what we’re producing. Do you have a solution to help?” And the way that Humanloop helps, at least on the prompt management side, is we have this interactive environment; it’s a little bit like the OpenAI playground, or the Anthropic playground, but a lot more fully featured and designed for actual development. So it’s collaborative, it has history built in, you can connect variables, and datasets… And so it becomes like a development environment for your sort of LLM application. You can prototype the application, interact with it, try out a few things… And then people progress from that development environment into production through evaluation and monitoring.
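To make the “prompt as a versioned artifact” idea concrete, here is a minimal sketch, assuming you serialize each prompt plus its model settings to a file with a content hash rather than keeping it as a string in code or pasting it into a chat tool. The format and field names are illustrative, not Humanloop’s.

```python
import hashlib
import json
import os
from dataclasses import dataclass, asdict

@dataclass
class PromptConfig:
    """One versionable 'model config': a prompt template plus generation settings."""
    name: str
    template: str        # instruction text, with {placeholders} for inputs or retrieved context
    model: str           # any provider/model identifier, e.g. "gpt-4"
    temperature: float = 0.7

    def version_id(self) -> str:
        # Hash the exact serialized config, so even a whitespace edit produces a new version.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

config = PromptConfig(
    name="support-answer",
    template="You are a support assistant.\nContext:\n{context}\n\nQuestion: {question}",
    model="gpt-4",
)

# Store it next to the application code (or in a prompt-management tool), not in a chat message.
os.makedirs("prompts", exist_ok=True)
with open(f"prompts/{config.name}-{config.version_id()}.json", "w") as f:
    json.dump(asdict(config), f, indent=2)
```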
You mentioned this kind of in passing, and I’d love to dig into it a little bit more. You mentioned kind of the types of people that are coming at the table in designing these systems, and oftentimes domain experts – you know, previously, in working as a data scientist, it was always kind of assumed “Oh, you need to talk to the domain experts.” But it’s sort of like – at least for many years, it was like data scientists talk to the domain experts, and then go off and build their thing. The domain experts were not involved in the sort of building of the system. And even then, the data scientists were maybe building things that were kind of foreign to software engineers. And what I’m hearing you say is you kind of got like these multiple layers; you have like domain experts, who might not be that technical, you’ve got maybe AI and data people, who are using this kind of unique set of tools, maybe even they’re hosting their own models… And then you’ve got like product software engineering people; it seems like a much more complicated landscape of interactions. How have you seen this kind of play out in reality in terms of non-technical people and technical people, both working together on something that is ultimately something implemented in code and run as an application?
I actually think one of the most exciting things about LLMs and the progress in AI in general is that product managers and subject matter experts can for the first time be very directly involved in implementing these applications. So I think it’s always been the case that the PM or someone like that is the person who distills the problem, speaks to the customers, produces the spec… But there’s this translation step where they sort of produce that PRD document, and then someone else goes off and implements it. And because we’re now able to program at least some of the application in natural language, actually it’s accessible to those people very directly. And it’s worth maybe having a concrete example.
[10:02] So I use an AI notetaker for a lot of my sales calls. And it records the call, and then I get a summary afterwards. And the app actually allows you to choose a lot of different types of summary. So you can say, “Hey, I’m a salesperson. I want a summary that will extract budget, and authority, and a timeline.” Versus you can say “Oh, actually, I had a product interview, and I want a different type of summary.” And if you think about developing that application, the person who has the knowledge that’s needed to say what a good summary is, and to write the prompt for the model, it’s the person who has that domain expertise. It’s not the software engineer.
But obviously, the prompt is only one piece of the application. If you’ve got a question answering system, there’s usually retrieval as part of this; there may be other components… Usually, the LLM is a block in a wider application. So you obviously still need the software engineers around, because they’re implementing the bulk of the application, but the product managers can be much more directly involved. And then, actually, we see increasingly less involvement from machine learning or AI experts, and less people are fine-tuning their own models. So for the majority of product teams we’re seeing, there is an AI platform team that maybe facilitates setting things up, but the bulk of the work is led by the product managers, and then the engineers.
And one interesting example of this on the extreme end is one of our customers that’s a very large ad tech company, they actually do not let their engineers edit the prompts. So they have a team of linguists who do prompt development. The linguists finalize the prompts, they’re saved in a serialized format and they go to production, but it’s a one-way transfer. So the engineers can’t edit them, because they’re not considered able to assess the actual outputs, even though they are responsible for the rest of the application.
Just thinking about how teams interact and who’s doing what, it seems like the problems that you’ve laid out are, I think, very clear and worth solving, but it’s probably hard to think about “Well, am I building a developer tool? Or am I building something that these non-technical people interact with? Or is it both?” How did you think about that as you kind of entered into the stages of bringing Humanloop into existence?
I think it has to be both… And the honest answer is it evolved kind of organically by going to customers, speaking to them about their problems, and trying to figure out what the best version of a solution looked like. So we didn’t set out to build a tool that needed to do both of these things, but I think the reality is, given the problems that people face, you do need both.
An analogy to think about might be something like Figma. Figma is somewhere where multiple different stakeholders come together to iterate on things, and to develop them, and provide feedback… And I think you need something analogous to that for gen AI… Although it’s not an exact analogy, because we also need to attach the evaluation to this. So it’s almost by necessity that we’ve had to do that… But I also think that it’s very exciting. And the reason I think it’s exciting is because it is expanding who can be involved in developing these applications.
Break: [13:05]
You mentioned how this environment of domain experts coming together, and technical teams coming together in a collaborative environment opens up new possibilities for both collaboration and innovation. I’m wondering if at this point you could kind of just lay out… We’ve talked about the problems, we’ve talked about those involved and the kinds of people that would use such a system or a platform to enable these kinds of workflows… Could you describe a little bit more what Humanloop is specifically, in terms of both what it can do, and kind of how these different personas engage with the system?
Yeah. So I guess in terms of what it can do, concretely, it’s firstly helping you with prompt iteration, versioning and management, and then with evaluation and monitoring. And the way it does that - there’s a web app, and in that web UI is an interactive, playground-like environment where people can try out different prompts, compare them side by side with different models, and try them with different inputs. When they find versions that they think are good, they save them. And then those can be deployed from that environment to production, or even to a development or staging environment. So that’s the kind of development stage.
And then once you have something that’s developed, what’s very typical is people then want to put in evaluation steps into place. So you can define gold standard test sets, and then you can define evaluators within Humanloop. And evaluators are ways of scoring the outputs of a model or a sequence of models, because oftentimes the LLM is part of a wider application.
And so the way that scoring works is there are very traditional metrics that you would have in code for any machine learning system. So precision, recall, ROUGE, BLEU, these kinds of scores that anyone from a machine learning background would already be familiar with. But what’s new in the kind of LLM space is also things that help when things are more subjective. So we have the ability to do model-as-judge, where you might actually prompt another LLM to score the output in some way… And this can be particularly useful when you’re trying to measure things like hallucination. So a very common thing to do is to ask the model “Is the final answer contained within the retrieved context?” Or “Is it possible to infer the answer from the retrieved context?” And you can calculate those scores.
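As a rough illustration of the model-as-judge idea, the sketch below scores a generated answer for groundedness against the retrieved context. The `call_llm` function is a stand-in for whichever provider you use; the judging prompt and the score parsing are illustrative, not Humanloop’s built-in evaluator.

```python
JUDGE_PROMPT = """You are grading a question-answering system for hallucination.
Retrieved context:
{context}

Generated answer:
{answer}

Can the answer be inferred from the retrieved context alone?
Reply with a single word: YES or NO."""

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM provider you use as the judge."""
    raise NotImplementedError("wire this to your provider's chat/completion API")

def groundedness_evaluator(context: str, answer: str) -> float:
    """Returns 1.0 if the judge thinks the answer is supported by the context, else 0.0."""
    verdict = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```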
And then the final way is we also support human evaluation. So in some cases, you really do want either feedback from an end user, or from an internal annotator involved as well. And so we allow you to gather that feedback, either from your live production application, and have it logged against your data, or you can queue internal annotation tasks for a team. And I can maybe tell you a little bit more about sort of in-production feedback, because that’s something that – that’s actually where we started.
Yeah, yeah. Go ahead, I would love to hear more.
Yeah, so I think that because it’s so subjective for a lot of the applications that people are building, whether it be email generation, question answering, a language learning app - there isn’t a “correct answer.” And so people want to measure how things are actually performing with their end users. And so Humanloop makes it very easy to capture different sources of end user feedback. And that might be explicit feedback, things like thumbs up/thumbs down votes that you see in ChatGPT, but it can also be more implicit signals. So how did the user behave after they were shown some generated content? Did they progress to the next stage of the application? Did they send the generated email? Did they edit the text? And all of that feedback data becomes useful, both for debugging, and also for fine-tuning the model later on. So that evaluation data becomes this rich resource that allows you to continuously improve your application over time.
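A rough sketch of what capturing that kind of feedback might look like on the application side. This is hypothetical, not the Humanloop SDK, and it assumes each generation was logged with an ID when it was created (a logging wrapper along those lines appears later in this transcript).

```python
import json
import time

FEEDBACK_LOG = "feedback.jsonl"  # stand-in for a real datastore or logging API

def log_feedback(generation_id: str, kind: str, value) -> None:
    """Attach an explicit or implicit feedback signal to a previously logged generation."""
    record = {"generation_id": generation_id, "feedback": {kind: value}, "ts": time.time()}
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

# Explicit feedback: the user clicked thumbs-down on a generated email.
log_feedback("gen-123", "vote", "down")
# Implicit feedback: the user edited the draft, then sent it.
log_feedback("gen-123", "edited", True)
log_feedback("gen-123", "sent", True)
```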
[18:23] Yeah, that’s awesome. And I know that that fits in… So maybe you could talk a little bit about how you’re – one of the things that you mentioned earlier is you’re seeing fewer people do fine-tuning… Which - I see this very commonly as a… It’s not an irrelevant point, but it’s maybe a misconception, where a lot of teams come into this space and they just assume they’re gonna be fine-tuning their models… And often, what they end up doing is fine-tuning their workflows or their language model chains, or the data that they’re retrieving, or their prompt formats, or templates, or that sort of thing. They’re not really fine-tuning. I think there’s this really blurred line right now for many teams that are adopting AI into their organization, where they’ll frequently just use the term “Oh, I’m training the AI to do this, and now it’s better”, but all they’ve really done is just inject some data into their prompts, or something like that.
So could you maybe help clarify that distinction? And also, in reality, what you’re seeing people do with this capability of evaluation, both online and offline, and how that’s filtering back into upgrades to the system, or actual fine-tunes of models?
Yeah. So I guess you’re right, there’s a lot of jargon involved… And especially for people who are new to the field, the word “fine-tuning” has a colloquial meaning, and then it has a technical meaning in machine learning, and the two end up being blurred. So fine-tuning in a machine learning context usually means doing some extra training on the base model, where you’re actually changing the weights of the model, given some sets of example pairs of inputs/outputs that you want. And then obviously, there’s prompt engineering and maybe context engineering, where you’re changing the instructions to the language model, or you’re changing the data that’s fed into the context, or how an agent system might be set up… And both are really important. Typically, the advice we give the majority of our customers and what we see play out in practice is that people should first push the limits of prompt engineering. Because it’s very fast, it’s easy to do, and it can have very high impact, especially around changing the sort of outputs, and also in helping the model have the right data that’s needed to answer the question.
So prompt engineering is kind of usually where most people start, and sometimes where people finish as well. And fine-tuning tends to be useful either if people are trying to improve latency or cost, or if they have like a particular tone of voice or output constraint that they want to enforce. So if people want the model to output valid JSON, then fine-tuning might be a great way to achieve that. Or if they want to use a local private model, because it needs to run on an edge device, or something like this, then fine-tuning I think is a great candidate.
And it can also let you reduce costs, because oftentimes you can fine-tune a smaller model to get similar performance. The analogy I like to use is fine-tuning is a bit like compilation. If you’ve already sort of built your first version in an interpreted language, when you want to optimize it, you might move to a compiled language, and you’ve got a kind of compiled binary. I think there was a second part to your question, but just remind me, because I’ve lost the second part.
Yeah… Basically, you mentioned that maybe fewer people are doing fine-tunes… Maybe you could comment on – I don’t know if you have a sense of why that is, or how you would see that sort of progressing into this year, as more and more people adopt this technology, and maybe get better tooling around the - let’s not call it fine-tuning, so we don’t mix all the jargon, but the iterative development of these systems. Do you see that trend continuing, or how do you see that kind of going into maybe larger or wider adoption in 2024?
[22:21] Yeah, so I think that we’ve definitely seen less fine-tuning than we thought we would see when we launched this version of Humanloop back in 2022. And I think that’s been true of others as well. I’ve spoken to friends at OpenAI… And OpenAI is expecting there will be more fine-tuning in the future, but they’ve been surprised that there wasn’t more initially. I think some of that is because prompt engineering has turned out to be remarkably powerful, and also because some of the changes that people want to do to these models are more about getting factual context into the model. So one of the downsides of LLMs today is they’re obviously trained on the public Internet, so they don’t necessarily know private information about your company; they tend not to know information past the training date of the model. And one way you might have thought you could overcome that is “I’m going to fine-tune the model on my company’s data.” But I think in practice, what people are finding is a better solution to that is to use a hybrid system of search or information retrieval, plus generation. So what’s come to be known as RAG, or retrieval-augmented generation, has turned out to be a really good solution to this problem.
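A bare-bones sketch of that retrieval-plus-generation pattern, with a toy keyword retriever standing in for a real vector store and `call_llm` standing in for your model provider; all names here are illustrative.

```python
DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm UTC.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap. A real system would use a vector DB."""
    q_words = set(question.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

PROMPT_TEMPLATE = """Answer the question using only the context below.
Context:
{context}

Question: {question}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    return call_llm(PROMPT_TEMPLATE.format(context=context, question=question))
```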
And so the main reasons to fine-tune now are more about optimizing costs and latency, and maybe a little bit of tone of voice, but they’re not needed so much to adapt the model to a specific use case. And fine-tuning is a heavier-duty operation, because it takes longer… You know, you can edit a prompt very quickly and then see what the impact is. Fine-tuning - you need to have the dataset that you want to fine-tune on, and then you need to run a training job and then evaluate that job afterwards.
So there are certainly circumstances where it’s going to make sense. I think especially anyone who wants to use a private open source model will likely find themselves wanting to do more fine-tuning… But the quality of prompt engineering and the distance you can go with it I think took a lot of people by surprise.
And on that note, you mentioned the closed proprietary model ecosystem versus open models that people might host in their own environment, and/or fine-tune on their own data… I know that Humanloop - you explicitly say that you kind of have all of the models, you’re integrating these sort of closed models, and integrate with open models… Why and how did you decide to include all of those? And in terms of the mix of what you’re seeing with people’s implementations, how do you see this sort of proliferation of open models impacting the workflows that you’re supporting in the future?
So the reason for supporting them, again, is largely customer pull, right? What we’re finding is that many of our customers were using a mixture of models for different use cases, either because the large proprietary ones had slightly different performance trade-offs, or because there were use cases where they cared about privacy, or they cared about latency, and so they couldn’t use a public model for those instances. And so we had to support all of them. It really wouldn’t be a useful product to our customers if they could only use it for one particular model.
And the way we’ve got around this is that we tried to integrate all of the publicly-available ones, but we also make it easy for people to connect their own models. So they don’t necessarily need us. As long as they expose the appropriate APIs, you can plug in any model to Humanloop.
That would be a matter of hosting the model and making sure that the model server - maybe one that someone’s running in their own AWS, or wherever - fulfills the API contract you’re expecting in terms of responses.
That’s exactly right. Yeah. And in terms of the proliferation of open source and how that’s going, I think there’s still a performance gap at the moment between the very best closed models, so between the GPT-4, or some of the better models from Anthropic, and the best open source… But it is closing. So the latest models from, say, Mistral have proved to be very good, LLaMA 2 was very good… Increasingly, you’re not paying as big a performance gap, although there is still one, but you need to have high volumes for it to be economically competitive to host your own model. So the main reasons we see people doing it are related to data privacy. Companies that, for whatever reason, cannot or don’t want to send data to a third party end up using open source… And then also, anyone who’s doing things on edge, and who wants sort of real time or very low latency ends up using open source.
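On the “expose the appropriate APIs” point: the exact contract isn’t spelled out in the conversation, so the sketch below just mimics the common chat-completions request/response shape for a self-hosted model endpoint. It is a generic, hypothetical example, not the specific contract Humanloop requires.

```python
# A hypothetical self-hosted model endpoint mimicking a chat-completions-style contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    temperature: float = 0.7

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Stand-in for running inference on your own hosted model (e.g. in your own AWS account).
    reply = f"(echo from {req.model}) " + req.messages[-1].content
    return {
        "model": req.model,
        "choices": [{"message": {"role": "assistant", "content": reply}}],
    }

# Run with: uvicorn server:app --port 8000
```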
Well, Raza, I’d love for you to maybe describe, if you can… We’ve kind of talked about the problems that you’re addressing, we’ve talked about the sort of workflows that you’re enabling, the evaluation, and some trends that you’re seeing… But I’d love for you to describe if you can maybe for like a non-technical persona, like a domain expert who’s engaging with the Humanloop system, and maybe for a more technical person who’s integrating data sources or other things, what does it look like to use the Humanloop system? Maybe describe the roles in which these people are – like what they’re trying to do from each perspective. Because I think that might be instructive for people that are trying to engage domain experts and technical people in a collaboration around these problems.
Absolutely. So maybe it might be helpful to have a kind of imagined concrete example. So a very common example we see is people building some kind of question answering system. Maybe it’s for their internal customer service staff, maybe they want to replace an FAQ, or that kind of thing. So there’s a set of documents, a question is going to come in, there’ll be a retrieval step, and then they want to generate an answer. So typically, the PMs or the domain experts will be figuring out what the requirements of the system are, “What does good look like? What do we want to build?” And the engineers will be building the retrieval part, orchestrating all the model calls in code, integrating the Humanloop APIs into their system… And also, usually they lead on setting up evaluation. So maybe once it’s set up, the domain experts might continue to do the evaluation themselves, but the engineers tend to set it up the first time.
So if you’re the domain expert, typically you would start off in our playground environment, where you can just try things out. So the engineers might connect a database to Humanloop for you. So maybe they’ll store the data in a vector database, and connect that to Humanloop. And then once you’re in that environment, you could try different prompts with the models; you could try them with GPT-4, with Cohere, with an open source model, see what impact that has, see if you’re getting answers that you like… Oftentimes early on it’s not in the right tone of voice, or the retrieval system is not quite right, and so the model is not giving factually correct answers… So it takes a certain amount of iteration to get to the point where even when you eyeball it, it’s looking appropriate. And usually, at that point people then move to doing a little bit more of a rigorous evaluation.
So they might generate either automatically or internally a set of test cases, and they’ll also come up with a set of evaluation criteria that matter to them in their context. They’ll set up that evaluation, run it, and then usually at that point they might deploy to production.
So that’s the point at which things would end up with real users, they start gathering user feedback… And usually, the situation is not finished at that point, because people then look at the production logs, or they look at the real usage data, and they will filter based on the evaluation criteria. And they might say “Hey, show me the ones that didn’t result in a good outcome”, and then they’ll try and debug them in some way, maybe make a change to a prompt, rerun the evaluation and submit it.
And so the engineers are doing the orchestration of the code. They’re typically making the model calls, they’ll add logging calls to Humanloop… So the way that works - there’s a couple of ways of doing the integration, but you can imagine every time you call the model, you’re effectively also logging back to Humanloop what the inputs and outputs were, as well as any user feedback data. And then the domain experts are typically looking at the data, analyzing it, debugging, making decisions about how to improve things, and they’re able to actually take some of those actions themselves in the UI.
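To picture that integration pattern, here is a minimal, hypothetical wrapper (not Humanloop’s actual API) that logs the inputs and outputs of every model call against a prompt version and hands back an ID that later feedback can reference.

```python
import json
import time
import uuid

CALL_LOG = "calls.jsonl"  # stand-in for logging back to an observability/prompt platform

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

def logged_call(prompt: str, prompt_version: str) -> tuple[str, str]:
    """Call the model, log inputs/outputs against the prompt version, return (id, output)."""
    output = call_model(prompt)
    generation_id = str(uuid.uuid4())
    record = {
        "id": generation_id,
        "prompt_version": prompt_version,
        "inputs": prompt,
        "output": output,
        "ts": time.time(),
    }
    with open(CALL_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return generation_id, output
```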
[32:03] Yeah. So if I just kind of abstract that a bit to maybe give people a frame of thinking, it sounds like there’s kind of this framework setup where there’s data sources, there’s maybe logging calls within a version of an application… If you’re using a hosted model or if you’re using a proprietary API, you decide that… And so it’s kind of set up, and then there’s maybe an evaluation or a prototyping phase, let’s call it, where the domain experts try their prompting… Eventually, they find prompts that they think will work well for these various steps in a workflow, or something like that… Those are pushed, as you said, I think, one way into the actual code or application, such that the domain experts are in charge of the prompting, to some degree. And as you’re logging feedback into the system, the domain experts are able to iterate on their prompts, which hopefully then improve the system, and those are then pushed back into the production system, maybe after an evaluation or something. Is that a fair representation?
Yeah, I think that’s a great representation. Thanks for articulating it so clearly. And the kinds of things that the evaluation becomes useful for is avoiding regression, say. So people might notice one type of problem. They go in and they change a prompt, or they change the retrieval system, and they want to make sure they don’t break what was already working. And so having good evaluation in place helps with that.
And then maybe it’s also worth – because I think we didn’t sort of do this at the beginning… Just thinking about what are the components of these LLM applications. So I think you’re exactly right, we sort of think of the blocks of LLM apps being composed of a base model. So that might be a private fine-tuned model, or one of these large public ones… A prompt template, which is usually an instruction to the model that might have gaps in it for retrieved data or context, a data collection strategy, and then that whole thing of like data collection, prompt template and model might be chained together in a loop, or might be repeated one after another… And there’s an extra complexity, which is the models might also be allowed to call tools or APIs. But I think those pieces, taken together, more or less comprehensively cover things. So tools, data retrieval, prompt template and base model are the main components. But then within each of those you have a lot of design choices and freedom. So you have a combinatorially large number of decisions to get right when building one of these applications.
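One way to picture those blocks (base model, prompt template, retrieval strategy, tools) is as a single application config whose fields are the design choices you have to get right. The names below are illustrative, not a real schema.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalConfig:
    source: str = "vector-db"   # where context comes from
    top_k: int = 4
    chunk_size: int = 512

@dataclass
class AppConfig:
    """The main decision points in a typical LLM application block."""
    base_model: str = "gpt-4"                              # or a private fine-tuned model
    prompt_template: str = "Use {context} to answer {question}"
    retrieval: RetrievalConfig = field(default_factory=RetrievalConfig)
    tools: list[str] = field(default_factory=list)         # APIs the model is allowed to call
    temperature: float = 0.2

# Each field is a dial you can turn, which is why evaluation matters:
# changing any one of them can shift end-to-end behavior.
config = AppConfig(tools=["search_tickets", "create_ticket"])
```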
One of the things that you mentioned is this evaluation phase of what goes on as helping prevent regressions, because in sort of testing behaviorally the output of the models you might make one change on a small set of examples, that looks like it’s improving things, but has sort of different behavior across a wide range of examples… I’m wondering also, I could imagine two scenarios… You know, models are being released all the time, whether it’s upgrading from this version of a GPT model to the next version, or this Mistral fine-tune to this one over here… I’m thinking even in the past few days we’ve been using the Neural Chat model from Intel a good bit, and there’s a version of that that Neural Magic released, that’s a sparsified version of that, where they pruned out some of the weights and the layers to make it more efficient, and to run on better – or not better hardware, but more commodity hardware, that’s more widely available… And so one of the questions that we were discussing is “Well, we could flip the version of this model to the sparse one, but we have to decide on how to evaluate that over the use cases that we care about.” Because you could look at the output for like a few test prompts, and it might look similar, or good, or even better, but on a wider scale might be quite different in ways that you don’t expect. So I could see the evaluation also being used for that, but I could also see where if you’re upgrading to a new model, it could just throw everything up in the air in terms of like “Oh, this is an entirely different prompt format”, right? Or “This is a whole new behavior from this new model, that is distinct from an old model.” So how are you seeing people navigate that landscape of model upgrades?
[36:33] I think you should just view it as a change as you would to any other part of the system. And hopefully, the desired behavior of the model is not changing. So even if the model is changed, you still want to run your regression test and say “Okay, are we meeting a minimum threshold that we had on these gold standard test set before?”
In general, I think evaluation - we see it happening in sort of three different stages during development. There’s during this interactive stage very early on, when you’re prototyping, you want fast feedback, you’re just looking to get a sense of “Is this even working appropriately?” At that stage, eyeballing examples, and looking at things side by side, in a very interactive way can be helpful.
And interactive testing can also be helpful for adversarial testing. So a fixed test set doesn’t tell you what will happen when a user who actually wants to break the system comes in. So a concrete example of this - you know, one of our customers has children as their end users, and they want to make sure that things are age-appropriate, so they have guardrails in place. But when they come to test the system, they don’t want to just test it against an input that’s benign. They want to see like, if we try, if we really red-team this, can we break it? And there, interactive testing can be very helpful.
And then the next place where you kind of want testing in place is this regression testing, where you have a fixed set of evaluators on a test set, and you want to know “When I make a change, does it get worse?” And the final place we see people using it is actually from monitoring. So okay, I’m in production now; there’s new data flowing through. I may not have the ground truth answer, but I can still set up different forms of evaluator, and I want to be alerted if the performance drops below some threshold.
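A compact sketch of that regression-testing stage: run your evaluators over a fixed gold-standard test set for a candidate config (or a new model version) and fail the change if the score drops below the previous baseline. The `run_app` and `evaluator` functions are placeholders for your own pipeline and scoring logic.

```python
def run_app(question: str, config_version: str) -> str:
    raise NotImplementedError("call your LLM pipeline with the given config version")

def evaluator(question: str, answer: str, expected: str) -> float:
    """Any scoring function: exact match, ROUGE, or a model-as-judge call."""
    return 1.0 if expected.lower() in answer.lower() else 0.0

GOLD_SET = [
    {"question": "How long is the refund window?", "expected": "30 days"},
    {"question": "When is support available?", "expected": "Monday to Friday"},
]

def regression_check(candidate_version: str, baseline_score: float) -> bool:
    scores = []
    for case in GOLD_SET:
        answer = run_app(case["question"], candidate_version)
        scores.append(evaluator(case["question"], answer, case["expected"]))
    mean = sum(scores) / len(scores)
    print(f"{candidate_version}: {mean:.2f} vs baseline {baseline_score:.2f}")
    return mean >= baseline_score  # block the deploy (or alert in production) if it got worse
```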
So one of the things that I’ve been thinking about throughout our conversation here, and that’s I think highlighted by what you just mentioned in sort of the upgrades to one’s workflow, and the various levels at which such a platform can benefit teams… And it made me think of [unintelligible 00:38:31.06] I have a background in physics, and there were plenty of physics teams or collaborators that we worked with - you know, we were writing code - and not doing great sort of version control practices… And not everyone was using GitHub, and there was sort of collaboration challenges associated with that, which are obviously solved by great code collaboration systems of various forms, that have been developed over time… And I think there’s probably a parallel here with some of the collaboration systems that are being built around both playgrounds, and prompts, and evaluation. I’m wondering if there’s any examples from clients that you’ve worked with, or maybe it’s just interesting use cases of surprising things they’ve been able to do when going from sort of doing things ad hoc, and maybe versioning prompts in spreadsheets, or whatever it might be, to actually being able to work in a more seamless way between domain experts and technical staff. Are there any clients, or use cases, or surprising stories that come to mind?
[39:46] Yeah, it’s a good question. I’m kind of thinking through them to see what the more interesting examples might be. I think that, fundamentally, it’s not necessarily enabling completely new behavior, but it’s making the old behavior significantly faster, and less error-prone. Certainly, fewer mistakes and less time spent – okay, so a surprising example… Publicly-listed company, and they told me that one of the issues they were having is because they were sharing these prompt configs in Teams, they were having differences in behavior based on whitespace being copied. Someone was like playing around with the OpenAI playground, they copy-pasted into Teams… That person would copy-paste from Teams into code… And there were small whitespace differences, and you wouldn’t think it would affect the models, but it actually did. And so they would then get performance differences they couldn’t explain. And actually, it just turned out that you shouldn’t be sharing your code via Teams, right?
[laughs]
So I guess that’s one surprising example. I think another thing as well is the complexity of apps that people are now beginning to be able to build. So increasingly, I think people are building simple agents; I think more complex agents are still not super-reliable, but a trend that we’ve been hearing a lot about from our customers recently is people trying to build systems that can use their existing software. An example of this is - you know, Ironclad is a company that’s added a lot of LLM-based features to their product… And they actually are able to automate a lot of workflows that were previously being done by humans, because the models can use the APIs that exist within the Ironclad software. So they’re actually able to leverage their existing infrastructure. But to get that to work, they had to innovate quite a lot in tooling. And in fact - you know, this isn’t a plug for Humanloop. Ironclad in this case built a system called Rivet, which is their own open source prompt engineering and iteration framework. But I think it’s a good example of, you know, in order to achieve the complexity of that use case - this happened to be before tools like Humanloop were around - they had to build something themselves. And it’s quite sophisticated tooling. I actually think Rivet’s great, so people should check that out as well. It’s an open source library; anyone can go and get the tool.
So yeah, I think the surprising things are like how error-prone things are without good tooling, and the crazy ways in which people are solving problems. Another example of a mistake that we saw someone do is two different people triggered exactly the same annotation job. So they had annotations in spreadsheets, and they both outsourced the same job to different annotation teams… Which was obviously an expensive mistake to make. So very error-prone. And then I think also just impossible to scale to more complex agentic use cases.
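The whitespace anecdote above is a good argument for treating prompts as content-addressed artifacts rather than copy-pasted text; a tiny illustrative check (the fingerprinting scheme is hypothetical, not any particular tool’s behavior):

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Exact hash of the prompt text, so even an invisible whitespace edit shows up as a change."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

a = "Summarize the call.\nExtract budget, authority, and timeline."
b = "Summarize the call. \nExtract budget, authority, and timeline."  # one stray space

print(prompt_fingerprint(a), prompt_fingerprint(b))  # different fingerprints: the drift is visible
```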
Well, you already kind of alluded to some trends that you’re seeing moving forward… As we kind of draw to a close here, I’d love to know from someone who’s seeing a lot of different use cases being enabled through Humanloop, and your platform, what’s exciting for you as we move into this next year in terms of - maybe it’s things that are happening in AI more broadly, or things that are being enabled by Humanloop, or things that are on your roadmap, that you can’t wait for them to go live… As you’re lying in bed at night and getting excited for the next day of AI stuff, what’s on your mind?
So AI more broadly, I just feel the rate of progress of capabilities is both exciting and scary. It’s extremely fast; multimodal models, better generative models, models with increased reasoning… I think the range of possible applications is expanding very quickly, as the capabilities of the models expand.
I think people have been excited about agent use cases for a while; systems that can act on their own and go off and achieve something for you. But in practice, we’ve not seen that many people succeed in production with those. There are a couple of examples, Ironclad being a good one… But it feels like we’re still at the very beginning of that, and I think I’m excited about seeing more people get to success with that. I’d say that the most common, successful applications we’ve seen today are mostly either retrieval-augmented applications, or more simple LLM applications. But increasingly, I’m excited about seeing agents in production, and also multimodal models in production.
In terms of things that I’m particularly excited about from Humanloop, I think it’s us becoming a proactive rather than a passive platform. So today, the product managers and the engineers drive the changes on Humanloop. But I think that something that we’re going to hopefully release later this year is actually this system – you know, Humanloop itself can start proactively suggesting improvements to your application. Because we have the evaluation data, because we have all the prompts, we can start saying things to you, like “Hey, we have a new prompt for this application. It’s a lot shorter than the one you have. It scores similarly on eval data. If you upgrade, we think we can cut your costs by 40%.” And allowing people to then accept that change. And so going from a system that is observing, to a system that’s actually intervening.
That’s awesome. Yeah, well, I definitely look forward to seeing how that rolls out, and I really appreciate the work that you and the team at Humanloop are doing to help us upgrade our workflows, and enable these sort of more complicated use cases. So thank you so much for taking time out of that work to join us. It’s been a pleasure. I really enjoyed the conversation.
Thanks so much for having me, Daniel.
Our transcripts are open source on GitHub. Improvements are welcome. 💚