Daniel & Chris explore the state of the art in prompt engineering with Jared Zoneraich, the founder of PromptLayer. PromptLayer is the first platform built specifically for prompt engineering. It can visually manage prompts, evaluate models, log LLM requests, search usage history, and help your organization collaborate as a team. Jared provides expert guidance on how to implement prompt engineering, but also illustrates how we got here, and where we’re likely to go next.
Featuring
Sponsors
Shopify – Sign up for a $1/month trial period at shopify.com/practicalai
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:07 | Welcome to Practical AI | 00:36 |
2 | 00:43 | Prompt master Jared Zoneraich 👀 | 00:38 |
3 | 01:21 | What is prompt engineering? | 03:38 |
4 | 04:59 | Different models, different approaches | 02:10 |
5 | 07:10 | Struggles in prompt engineering | 02:53 |
6 | 10:03 | What it's like to use an API | 02:38 |
7 | 12:41 | Shift in users | 02:39 |
8 | 15:20 | Building on probabilistic technology | 03:22 |
9 | 18:42 | Handling non-deterministic models | 02:57 |
10 | 21:39 | Implications of changes in prompts | 06:10 |
11 | 27:59 | Sponsor: Shopify | 02:20 |
12 | 30:35 | Optimizing prompts | 04:50 |
13 | 35:25 | Logging and monitoring best practices | 03:08 |
14 | 38:33 | Providing users feedback | 02:22 |
15 | 40:55 | Future of prompt engineering | 03:35 |
16 | 44:30 | Thank you Jared! | 00:43 |
17 | 45:12 | Outro | 00:46 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am founder and CEO at Prediction Guard, and I am joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
I’m doing fine. How’s it going today, Daniel?
It’s going great. I’m pretty excited to prompt our guest today and hear what he has to say… We’re joined today by the prompt master, Jared Zoneraich, who is founder at PromptLayer. How are you doing, Jared?
I am doing well, excited for this.
We’re excited to have you.
It seems like maybe from my perspective there was kind of the release of all of this generative AI stuff, and then there was this realization that there’s kind of a new skill needed around this thing called prompt engineering… And then it seems like some people have kind of - I don’t know if they’ve moved past the term, or they’ve tried to develop other terms… I see other terms being developed around like AI engineering and such, as related to generative AI… So could you just give us a sense of, from your perspective, who is obviously building things for prompt engineering - maybe to start out kind of what is prompt engineering? And from your perspective, how have you seen it develop as a skill over the past year, since people have been thinking a lot about prompting generative models?
Yeah. The last year, since the word was invented, or maybe a little more than a year ago… [laughter] But yeah, I think it’s a really good question. I think this question of what does prompt engineering even mean is also a good question. And I’ll tell you how we think about it. So honestly, for PromptLayer, we consider ourselves a prompt engineering platform. So we’ve really steered into this term that is kind of overloaded; we’ve embraced it, for sure. And it was half by accident, and I would say half realizing it was kind of beneficial to us… Because the fact that it has no definition means anyone who’s kind of getting into LLMs, and getting into prompt engineering says “Oh, there’s a prompt engineering platform. Maybe I need that”, and they don’t even necessarily know what it is. So it kind of helps us a little bit there.
I guess prompt engineering - I first started hearing that term around the GPT-3 days, a little bit before ChatGPT, maybe GPT-2, back when everyone was just using the OpenAI Playground. And you’d start to hear a little bit about prompt engineering, and stuff like that… And it was kind of cool, but that was, I guess, the days before it clicked for everyone. Maybe you can call it that. The days before ChatGPT came out, and before people really realized how much potential this technology has. So that was when I think prompt engineering first became a word. And then it kind of got even more – you know, Scale AI, I think, is famous for being maybe the first one to publicly hire someone with the role of prompt engineer.
And from there, we call our platform a prompt engineering platform. At the beginning, that brought a lot of people to us who had no idea what prompt engineering is… And kind of the definition we’ve started to roll with is, as a company, we consider prompt engineering to be the tuning of the inputs to the LLM. So the prompt is the main input, but it also includes “What model are you using? What are your [unintelligible 00:04:16.07] what are your other hyperparameters?” But the whole process of prompt engineering to us is what goes in, and what comes out. And that specifically is – and I could talk a little bit more about this if it’s interesting, but specifically a little bit different to the MLOps definition, or the standard machine learning definitions of hyperparameter tuning, and standard traditional ML, and - specifically, we call ourselves a prompt engineering platform, and not an LLM Ops platform for that reason, because I do think there’s slightly a difference. So I don’t know if I answered the question fully, but those are my thoughts on prompt engineering.
As we all dived into it – you talked about the OpenAI Playground originally, that I think everybody kind of dips their toe into first, at least before the ChatGPT release days… With that, one of the things I discovered as other models started coming out was that some of the skills I was developing for prompting did not always translate as I expected across to other models. Have you seen that - where every model had its own kind of variations of what seemed to work from a productivity and output standpoint? Any thoughts around that? Like, how do you guys see that with all these different models coming out, and little variations across them in terms of how they respond?
Yeah, that’s for sure a good observation. I think it’s become clear that each model – I mean, each model is made a little bit differently, so you have to talk to it a little bit differently. I think these differences are going to maybe get a little less significant in the future. When ChatGPT came out, the big thing was that if you were nicer to the model, you were gonna get a better answer - maybe because Stack Overflow questions that are nicer get better answers, or something like that.
[00:06:09.27] And I think now you have people talking about “Oh, my grandma’s gonna die. You need to answer this”, or “I’m going to tip you $100 if you answer this.” These are all, I think, tricks that work today; they’re not going to last forever. These are little things people threw out. But the part you mentioned, of talking to different models differently, I think is not going away. A lot of these models are made very differently. We think it’s pretty conclusive now that we’re not going to live in a world with an OpenAI monopoly of language models. I think just a week or two ago I saw a good tweet… Now we have Mistral, Claude, GPT-4, and a few others that are all really good, and they’re all made somewhat differently, and there are intricacies. And I think our philosophy and what we talk to our users about regarding prompt engineering is think about it as a black box. It’s almost helpful to be kind of a little bit naive and stupid here, and not try to understand how an LLM works, and just try to track the inputs to the outputs, if that makes sense.
Have you found any difference from people that are coming maybe from a deep sort of like data sciency background, where they are over-analyzing everything, and then other people who are coming maybe from a non-technical background, but they’re maybe domain experts, and they’re getting into developing these prompts? Have you seen different struggles on each side of that spectrum, where – because you have this very interesting mix of people that are trying to be “prompt engineers”, some of which I’ve kind of seen are very much just like non-technical domain experts who are really good at even like psychology and writing narrative instructions, being articulate… And then you kind of have this other side, which is the data science side, and they’re really into modeling, and wanting to analyze all of the outputs and that sort of thing… So have you seen different struggles on both sides of that in terms of being effective prompt engineers?
I think you put it well, that there’s kind of these two groups coming at LLMs. Traditionally – I mean, this is what makes LLMs so cool to me, is that traditional machine learning, standard, mathy machine learning kind of needs a PhD. Maybe you don’t strictly need it, but a PhD is very helpful. It’s intense how you’re building these models, doing a lot by hand still; you’re doing a lot of this tweaking by hand in traditional ML. And then OpenAI came out with this amazing API, and I’ve done a lot with Dev Rel in the past, and hackathons and stuff like that, and helping companies make APIs, and in my opinion, OpenAI’s API - it’s like the best docs I’ve ever read in my life. It’s so simple; you just give it text and you get text out. And like you said, this kind of opened up this completely new technology that’s just so much better than everything else to non-technical people. You don’t need a PhD to be able to understand how to communicate… And I think it brings about this new skill set, which is prompt engineering… Which in my opinion is kind of a mixture of communication and being able to write succinctly, but also being able to think algorithmically.
So I don’t know the exact word for thinking algorithmically… I’ve heard – Stephen Wolfram used that word. I like that way to talk about it, but… Kind of just the scientific method. Do you know how to think in terms of creating a hypothesis, trying it out, tracking it? Are you strategic about this? And I think it’s the same challenge on both sides. I think some people try to overcomplicate it. If you’re coming from an ML background, and you’re trying to understand why a certain token gives you an output… And I almost think these things are getting more complicated, not less. You kind of just need to take the naive approach and say “Hey, I’m just going to talk to it, I’m going to try to get the output I want. I’m going to keep trying stuff till it works.”
[00:10:02.00] As someone who’s used the APIs a lot, and you’re talking about Open AI’s being so good - so many folks listening to this may have only used things like the normal chat interfaces on each of the models when they’ve gone and tried them out, or paid for a subscription for the top end, and stuff, and have never touched the APIs at all. Could you take a moment and just talk about what an API experience is to someone who has done a lot? I’m kind of taking advantage of you as an expert in the area; share that a little bit with the listeners just so they kind of get the other side of that, because not everybody – probably the vast majority of them don’t.
I’ll explain, and then just to sidetrack myself for a sentence… I think the fact that you used the word “expert” is so funny, because it’s such a new field; it’s almost amazing. I tell everybody, you could become an expert in this thing very easily. Like, nobody really knows what’s going on. You kind of just need to study for a weekend and you’re an expert, which is a very unique place to be. But anyway… [laughs]
It makes it fun to be able to dive into something and get as deep as the leaders in the field. So…
Yeah, 100%. And just to be on the cutting edge, and know that nobody really knows what they’re doing here. Some people do, but there’s very few of them. So regarding the APIs, ChatGPT, the way I think about it, and the way I usually explain it is it’s basically a very thin wrapper on top of the API. Every time you talk to ChatGPT, behind the scenes it’s using this LLM technology, it’s sending your message, and a little bit of a preamble before your message. And the preamble is basically - we can call it the prompt, and it’s basically saying “Hey, you’re an AI assistant. Make sure to be helpful to the user. Maybe don’t be controversial. You can use a calculator if you need”, stuff like that, and then giving the user messages. And that’s all it is. And the process of prompt engineering is “How do you tweak that preamble? How do you get it to respond in the way you want to respond by telling it what to do?”
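To make the “thin wrapper” point concrete, here’s a minimal sketch of that kind of call using the openai Python client: a system-message preamble plus the user’s message. The model name and preamble wording are purely illustrative - this is not ChatGPT’s actual prompt.

```python
# A minimal sketch of the "thin wrapper" idea: a system-message preamble
# plus the user's message, sent to a chat completions endpoint.
# Requires `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

preamble = (
    "You are a helpful AI assistant. Be concise, avoid controversial "
    "statements, and ask for clarification when a request is ambiguous."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": preamble},      # the "preamble" / prompt
        {"role": "user", "content": "Summarize what prompt engineering is."},
    ],
)

print(response.choices[0].message.content)
```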
When I talk about the API, I’m talking about the things OpenAI has exposed to let you build your own apps and your own products on top of it. And I just think OpenAI has done – if you want to get started and you haven’t touched these APIs today, the best, best thing to do is just go to OpenAI’s docs and just read the Getting Started tutorial. It’s really well done, and I think - as someone who’s running a developer tools company, I’ll tell you how hard it is to write good docs. I think our docs – from my perspective, I think they should be much better. Other people say they’re great, but… Well, there’s always room for improvement, and it’s just a very hard thing to do. So that’s where you should go.
From your perspective, over the last year, as people have gotten into doing this practice of prompt engineering, have you generally seen people that are engaging with your platform be sort of more informed coming in? Like, they’ve done that experimentation with ChatGPT, or the APIs, and they are using words like few-shot, whatever, blah, blah, blah, and like this stuff? Or is it still kind of like people coming in that say “I know, I think I need to have a prompt engineering platform, but I kind of don’t know where to start”? Have you seen that shift even over this last year?
I would say we’re still very early in this, and there’s a few leaders, but everything’s up for grabs in terms of AI products. I don’t buy the whole notion of “It’s only the incumbents who are going to be able to use AI.” There’s so much – the AI products people are making today are just scratching the surface. Having said that, I have seen a bit of a change since we launched our product. We launched our product in January of last year, so January of 2023. I guess that was a few months after ChatGPT came out; these things were – I think we were the first prompt engineering platform; maybe there’s some argument there… But we were one of the first, at least. And when we launched our product, we had a lot of indie hackers and individual hobbyists using our platform. That was the whole community back then. That lasted for a little. Then we kind of moved on to AI-first startups. So like one or two-person startups, these are the really cutting edge ones. We used one of them - it’s a great company - to actually refactor our whole codebase into TypeScript… A lot of really cool stuff.
[00:14:22.08] And then from there, I would say starting in the fall - and I’ve heard this from a few other founders in the space - it felt like there started to become a real shift, where real companies… And when I say real, I mean maybe companies that actually make money; they started actually getting serious about AI, and getting serious about LLMs. And I think we’re still seeing that maturation continuing, where these real teams are building AI products that they care about, and not just – like, Twitter demos are one thing. PromptLayer is interesting for a Twitter demo, but PromptLayer becomes really useful for a team that is serious about building their product, and has multiple stakeholders, and wants to collaborate… So I am seeing that this shift is still happening; more and more companies are getting serious about their LLM products and getting value and revenue from them. But we’re still at the very, very beginning of the curve. So a lot more to come.
I think that was a great sort of intro into prompt engineering and the state of prompt engineering. I’m wondering if you could help us maybe understand - yes, it’s good if people get hands-on with these models, kind of gain some intuition about how they behave, and different ways that you prompt them… What is it about this discipline of prompt engineering that needs sort of systematic ways of managing your prompting methodologies? And how is that different from, or the same as, other sorts of engineering in the past? We’ve always had version control, and that sort of thing… What’s kind of unique and not unique about this discipline of prompt engineering in terms of how you need to approach it systematically?
Just starting from first principles. There’s one fundamental thing that’s changed, and it’s that we’re now building upon a probabilistic technology. So we’re now building on a technology that sometimes gives us one answer, sometimes gives us another answer, and is trending toward being more confusing as to why it gives one answer and not another. And confusing… I mean, yes, theoretically, it is deterministic. You can really dive into the weights and maybe figure it out. But nobody’s really practically going to do that. It’s virtually too hard for 99.9% of use cases with models. So we’re at a place where you’re working with a black box now. And yes, some servers, some architectures become black boxes because of bad code. But that’s a different type of black box. That’s not a real black box. But we’re building technology –
That’s the code I write.
[laughs] Me too. That’s why I’m gonna get banned from our repo soon, but… [laughs] Yeah, so you’re building technology on this black box, and you need to think about it differently. And I think this is a big philosophy we have at PromptLayer. We have the philosophy of – we built a lot of great stuff in traditional software, and traditional machine learning, and had a lot of great learnings. Git is a fantastic tool, version control is important, access controls are important, test-driven development is important… But do we necessarily want to take everything one-to-one? Not really. I think the biggest difference between LLM-based development and building AI applications versus building standard software is who are the stakeholders? Like we were talking about earlier - now you can have subject matter experts who are not necessarily software engineers, but rather these prompt engineers, these AI whisperers, these people who are able to talk to the black box, and able to communicate with the AI, we could call it, in a little sci-fi sense…
[00:18:01.29] But we have this new stakeholder in the process of software engineering, who is not going to jump into the code and not going to jump into Git. And that’s why, at least for PromptLayer, we’ve taken a very first principles ground-up approach where we’re saying “Hey, what can we learn from normal software and how people work together in software, and do version control and collaborate? And how can we take that into LLMs and bring in new stakeholders and bring in new collaborators and let people actually build on this black box technology in a systematic way?” So hopefully, that makes sense.
It does. I’m curious, you said some things there that really kind of piqued my interest… You kind of were contrasting it with deterministic programming, that we’ve all kind of grown up with, and now we’re in this new age, and we have these non-deterministic things… And you can give it the same prompt and it may or may not on any given day give you the same answer back… So how has that kind of fundamentally changed software development in general? And I’m encompassing all the things when I say software development - both the AI and the systems around it to feed it… Because we’re dealing with that potentially unexpected return in that non-deterministic blackbox that you’re talking about. How do people handle that when they’re trying to devise and say “Hey, I want to use a model in the thing that I’m building”? How did they change in the way they think about that?
I don’t know how people think, but I can tell you a symptom of how they think, which is - we work with a lot of LLM teams, of course, because they’re using our platform. And we talk to them, we’re trying to – we’re always trying to talk to people, trying to figure out what to improve… And one of the big theses we’ve come to with PromptLayer is that the iteration cycle of prompt engineering is different than software engineering. So what I mean by that is your code deploys and continuous integration, and that whole sort of thing is happening at a different cadence almost always in mature software than prompt engineering, because of a lot of reasons. Maybe you’re updating the prompt frequently. Maybe, again, different stakeholders are updating the prompt. That’s why we encourage people to have their prompts in a CMS, or - we call it a prompt registry. You could put it in a Postgres database, something like that. But you don’t want to block your prompt engineering cycle on code deployments. And that’s, I think, the symptom of this new thought pattern of how you’re building let’s call it black box software, and how you’re kind of thinking through these problems where - they are different problem sets, also. You’re not solving the same type of software problems, you’re solving - let’s call it language problems or things that can employ this new type of software.
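As a rough illustration of that “prompt registry” idea - prompts living outside the codebase so they can change without a code deploy - here’s a small sketch. It uses SQLite to stay self-contained (Jared mentions Postgres, which would work the same way), and the table and function names are made up for the example.

```python
# A rough sketch of a "prompt registry": prompt templates live in a database,
# not in the codebase, so editing a prompt doesn't require a code deploy.
# SQLite keeps the sketch self-contained; schema and names are hypothetical.
import sqlite3

conn = sqlite3.connect("prompt_registry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS prompts (
        name     TEXT,
        version  INTEGER,
        label    TEXT,       -- e.g. 'prod', 'staging'
        template TEXT,
        PRIMARY KEY (name, version)
    )
""")

def publish_prompt(name: str, version: int, template: str, label: str = "staging") -> None:
    """Insert a new prompt version; promoting it to 'prod' is just a row update, no redeploy."""
    conn.execute(
        "INSERT OR REPLACE INTO prompts (name, version, label, template) VALUES (?, ?, ?, ?)",
        (name, version, label, template),
    )
    conn.commit()

def get_prompt(name: str, label: str = "prod") -> str:
    """Fetch the latest prompt version carrying the given release label."""
    row = conn.execute(
        "SELECT template FROM prompts WHERE name = ? AND label = ? ORDER BY version DESC LIMIT 1",
        (name, label),
    ).fetchone()
    if row is None:
        raise KeyError(f"No prompt named {name!r} with label {label!r}")
    return row[0]

publish_prompt("summarizer", 3, "Summarize the following text in one sentence:\n{text}", label="prod")
template = get_prompt("summarizer")  # swap to label="staging" to try a new version
print(template.format(text="PromptLayer stores prompts outside the codebase."))
```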
So it’s not just a different way to build, but it’s also a different way to think about it. And then we can also get into how do you code now, how that’s different, and how - I think maybe this is maybe a more controversial… I don’t like to predict things, because I think it’s kind of a fool’s errand… But if I had to predict one thing, I think as a lazy programmer, there’s something really nice about just having a little block of LLM do something for me. So for example, parsing strings, or parsing chunks of text and reordering it. You can do that deterministically and it will probably be better, but there’s probably a world where models are gonna get cheap and quick enough where maybe it’s worth the engineering time not to build it well, and just outsource it to AI. But maybe that’s a whole different tangent.
[00:21:38.01] Yeah, and maybe you could get into a little bit of the implications of these prompts in a registry… Because you could have these sort of random tasks that - I don’t know if you all saw the Devin thing that’s been going around, where it’s like a junior engineer agent that can write scripts, and interact with software documentation… And today I was thinking about that, similar to what you were saying, Jared; it’s like, hey, well, I could go and take all of these strings and go and interact with the API to translate them into another language, which is what the task was that I was doing… Or I could just say “Hey, write a script that does this, and do it for me.” So there’s those random things. But then there’s another type of prompt, which as soon as you’re starting to expose a system to end users, then small changes in the prompt, on the backend, could produce very different changes in the behavior to your actual end users, and cause actual problems. So could you talk a little bit about maybe the implications of changes in your prompts, and like best practices that you’ve found around – I know we’re talking a lot about prompt versioning and registries around those prompts, and I know that PromptLayer is thinking about more than that, around evaluation and other things… But yeah, maybe before we go to those other things, could you talk about what you’ve seen in terms of how people are managing these different prompts, some that have very low risk in terms of changing them very often, and maybe some that have actually a large risk in terms of maybe small changes in the prompt?
Yes. I think there’s a lifecycle here for how you care about your prompt. And I think every prompt, in any mature product, at any – if you have a product, an AI product that’s making your company $10 million a year or something like that, you’re gonna care about any change to it. And I think there’s that cycle. So let’s say that’s the end stage. At the beginning, you probably are just going to ship any prompt. You’re just gonna write a prompt, it’s gonna kind of work, you’re gonna try it once or twice, and say “Alright, good enough. Let’s get that MVP out.” In that case, maybe the prompt is in your code; maybe you don’t care so much about what you were just saying, about having a breaking change. “Alright, whatever. I’ve gotta get it out. Let me get five people using it.”
Alright, so you’ve done that… What’s next? You probably now have like five different prompts on your system, they’re scattered everywhere… I was just talking to a founder who was visiting our office today, who actually had this exact same story. And then he moved all his prompts to a text file. Or actually, in his case, it was just like a .ts file on his system. So now you have your prompts in one place. That’s the next step. But it’s still in your codebase, still linked to code deploys. And like you said, you still have no way of knowing if you push a new one, what happened? Is it breaking 20% of our use cases? Not great. But I would call this stage… I like this word for it, like “vibe-based prompt engineering.” So this is the vibe-based prompt engineering stage, where you’re kind of writing a prompt, you’re testing it in the playground, you’re doing it once or twice, and you’re just judging… You’re just looking, “Okay, yeah. That’s pretty good.”
This lasts a little bit. The time when this is no longer good enough is usually when your product’s either getting to a greater level of maturity, maybe you’re rolling it out to GA, or you’re adding more stakeholders to the team; maybe you’re adding a PM, maybe you’re adding a content writer, or a subject matter expert, like we were talking about earlier - a psychologist, or a lawyer, or something like that. And you need more people involved. So now you have a non-technical person writing your prompts, who isn’t really capable of building out the whole dev workflow to test it out. You really want to make sure it doesn’t break everything.
So there’s a few things I’ve seen people – a few strategies I’ve seen people employ here. One is, of course, traditional software - let’s borrow some stuff: A/B testing. Let’s release it for some people. If we’re monitoring user feedback, maybe if users are giving us a thumbs up, thumbs down, we should be able to see it pretty quickly. Also, having prod staging, dev staging, different – we call them release labels; a lot of words for them. Let’s call that category slow releases. So that’s one way. Then there’s two other big ways of solving it.
[00:26:08.06] Second way - regression tests. So again, another concept borrowed from software engineering. Let’s find cases where it’s failing, and see if we succeed in this test case. Third, which is also kind of like regression tests, I guess, is backtesting. Let’s just run it on old examples and see if it changes. I think in a lot of LLM use cases the really hard part about this problem is that you don’t know what the ground truth is. Let’s say we’re making a summary. There’s no correct answer to a summary. There’s a good summary, a bad summary… It’s almost very hard to understand if it is good or bad. We actually – there’s a user of ours that is doing this exactly, and they’re trying to figure this out, and they’re using a combination of human graders, and whatnot… But we can talk more about that in a second. But in this case, often the best thing to do is just rerun it on old responses, and see how much changed. And “Oh, I updated the prompt and 50% of my responses changed. Maybe I should look into those.” Oh, only like one out of 1,000 changed… Probably good enough. So it’s all about trade-offs in this world. And everything’s – again, it’s a new way of thinking; it’s non-deterministic. So how do you trade off how much it’s changing versus how much you need to make sure it doesn’t change? Maybe you want to force specific output that is deterministically-graded. So for example, you’re giving a JSON or a boolean output. So there’s a lot of strategies here, but if I were to kind of give it a one-line answer, you should be deciding at what cadence you update prompts by what stage your product is at, and how bad these issues are going to be.
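Here’s a hedged sketch of that backtesting idea: rerun a candidate prompt over previously logged inputs and report what fraction of outputs changed. The run_prompt function is a hypothetical stand-in for whatever LLM call you actually make.

```python
# A sketch of backtesting a prompt change: rerun the candidate prompt over
# previously logged inputs and report what fraction of outputs changed.
# `run_prompt` is a hypothetical stand-in for your real LLM call.
from typing import Callable

def run_prompt(template: str, variables: dict) -> str:
    """Placeholder for your actual LLM call (e.g. a chat completion)."""
    raise NotImplementedError

def backtest(old_logs: list[dict], new_template: str,
             runner: Callable[[str, dict], str] = run_prompt) -> float:
    """old_logs: [{'variables': {...}, 'output': '...'}, ...] from production logging."""
    changed = 0
    for record in old_logs:
        new_output = runner(new_template, record["variables"])
        if new_output.strip() != record["output"].strip():
            changed += 1
    return changed / len(old_logs) if old_logs else 0.0

# change_rate = backtest(logged_requests, new_summarizer_template)
# A rate near 0 suggests a safe tweak; a high rate means "go look at the diffs".
```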
Break: [00:27:50.10]
If you’re looking at a large, out on the edge system of systems, in the sense of you have a number of models deployed, and they all do specific things. And some of them are generative, and some are not. And for the generative ones, they may be trying to address very specific functions that they’re doing… With a system like that, you’ve got it in production, and maybe you’ve done that kind of that minimal viable product approach on getting it up, but when you get things to where they’re kind of stable in production, you are starting to kind of address some of that. I know that I’m grappling in my own head with how to think about being able to make those tweaks and changes to prompts in any given model in a system, and detect that. Is there like a best place for me to start? Because I’m still trying to kind of grapple with the larger picture and really understand it… And so if I want to change something, but I don’t want to impact the larger stable system, what would you kind of be – if it was you in that position, like 1-2-3, try this, try that, try that, just to give me a good hands-on takeaway from that… And I apologize for the selfish nature of the question, but… It’s hard to do that.
No, I think it’s good to have the selfish type of question here, because one thing that I actually think a lot of people get wrong in this space is that every prompt is kind of different. And the answer to this question is really very unique to what you’re actually trying to do, and the task you’re trying to solve. There isn’t really – I mean, a lot of people are trying to sell it. I like to say that maybe I’ll change my opinion in a month, or a week, or a day.
I do that all the time as I learn more. So no worries, you’re allowed to.
It’s good to change your opinion. But right now, I think a lot of these eval sets that people produce are not that useful for building real products, because you’re trying to evaluate your prompt for your real application, not for some pie in the sky financial dataset, or something like that. Having said that, I think the first question to ask - and maybe we’ll use this example… So I would modularize it, I would think about it on the prompt level. So I guess there’s two ways to think about it. We could think about modular tests, and then end-to-end tests. We should be doing both. For this case, “Do we have a ground truth?” I think is the first question I’d ask. Is there an ability to make a dataset with ground truths that we can compare to? Or is it like a summary type example where there’s no answer?
In the case that I’m looking at I think you could establish ground truth. I don’t know that it would be easy, but you probably could, because you could bypass what you’re seeing, and you could have a human kind of assess what the generative AI model was trying to assess as well. So you could get a ground truth that is a human analysis of it as a proxy.
Excellent. So your life is now ten times easier.
That’s always good.
It is great, yes. Step one is, whether you do it yourself, whether you hire some people on MTurk, or a QA, or however you do it, step one is build – it doesn’t have to be big; build a small dataset and try to get into this method of test-driven prompting, or eval-driven prompt engineering. I don’t know if we have a word for it yet, maybe we need to define one. But try to build some sort of metric we can evaluate our test on, so you don’t have to just try these examples out one by one.
[00:34:07.00] So what I mean by this is build a dataset of, let’s say, 10, let’s say 15, let’s say 100 - whatever it is, depending on the use case - input variables to your prompt; so your prompt probably says “Hey, your task is to do this. Here’s this data”, blah, blah. So the data is an input variable in that case. And then get a human to give you the output, and then start – every time you test the prompt, “Let’s run it on that.” Over time, we can – I can talk about over time how to make that better, but does that make sense?
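A minimal sketch of that eval-driven loop, assuming you already have human-written expected outputs: a small dataset of input variables plus ground truths, run on every prompt change. The grader here is a naive substring check; a real task would swap in whatever comparison fits it (or a human or LLM grader).

```python
# A minimal sketch of eval-driven prompt engineering: a small dataset of
# input variables plus human-written expected outputs, run on every prompt change.
# The grader is a naive substring check; real tasks usually need a
# task-specific comparison.
eval_set = [
    {"variables": {"text": "The meeting moved from 3pm to 4pm on Friday."},
     "expected": "4pm"},
    {"variables": {"text": "Refunds are processed within 5 business days."},
     "expected": "5 business days"},
    # ... 10, 15, 100 cases, whatever the use case needs
]

def grade(output: str, expected: str) -> bool:
    return expected.lower() in output.lower()

def run_evals(template: str, runner) -> float:
    """Run the prompt over every case and return the pass rate."""
    passed = 0
    for case in eval_set:
        output = runner(template, case["variables"])
        passed += grade(output, case["expected"])
    return passed / len(eval_set)

# score = run_evals(candidate_template, run_prompt)   # run_prompt: your LLM call
# print(f"{score:.0%} of eval cases passed")
```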
It does, it does.
The thing I’d add is, over time how you make that better is you start connecting that back with real data, and how your users use it. So again, you can make your life 10x easier if you have good user feedback, and a way to know if the production inference, if the production LLM run actually worked or not. So user feedback’s a way to do that. Say a user gives you a thumbs up, thumbs down - now you can take all those thumbs up, thumbs down, make a new dataset out of that. And now you’re really going. Now you’re building this whole feedback loop. And I think – I’ll say, our biggest goal with PromptLayer, our MO is to shorten the prompt engineering feedback loop. I think that’s what everything boils down to in this world.
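One hedged way to wire up that feedback loop: attach thumbs up/down to each logged request, then promote the thumbs-down cases into new eval or regression cases. The in-memory store and field names here are invented for the sketch; any log store would do.

```python
# A sketch of closing the feedback loop: attach user thumbs up/down to logged
# requests, then promote the failures into new eval cases. Field names are
# hypothetical; any log store (Postgres, a logging platform, etc.) works.
request_log: dict[str, dict] = {}   # request_id -> {"variables", "output", "feedback"}

def record_feedback(request_id: str, thumbs_up: bool) -> None:
    """Called when a user clicks thumbs up/down on a response they received."""
    request_log[request_id]["feedback"] = thumbs_up

def failures_to_eval_cases() -> list[dict]:
    """Thumbs-down runs become regression cases (expected output filled in later by a human)."""
    return [
        {"variables": rec["variables"], "bad_output": rec["output"], "expected": None}
        for rec in request_log.values()
        if rec.get("feedback") is False
    ]
```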
Maybe along with that element of feedback, I’m wondering if you can talk a little bit – because we’ve talked a lot about evaluation, prompt versioning… There’s the other element of this, which I know you all are thinking about deeply, which is sort of logging and monitoring… And there’s certainly cases where “Oh, I have this chain of LLM processing”, or even loops that could happen… Like LLM-as-a-judge, or critic kind of elements of LLM prompts, that actually could loop until something happens, or for a certain number of times. And the way that you develop your prompts, both in terms of their length, in terms of how effective they are, could drastically impact your latency of processing, it could impact your cost in terms of how much text you’re putting into models… Especially if they’re charging you for how much text you’re putting in. So yeah, could you talk a little bit about maybe the highlights of some kind of best practices around logging and monitoring, and how you think about that at PromptLayer?
Yes. So I think – not to sound like a broken record here, but the thing I go back to a lot is everything is use case-dependent. So you brought up that some people are very concerned about latency, some are very concerned about costs. I know teams that aren’t concerned about either of those, and their only concern is “Are we getting the right answer?” For example, code generation type startups. And a lot of times latency doesn’t matter; you’re giving them a task… And again, in our case, at PromptLayer, “Hey, can you –” We worked with a company called Grit, where we said “Can you move our whole codebase into TypeScript? It could take a week, I don’t care.” So latency and cost don’t matter to them. And then there’s a lot of other cases where - in most cases, probably latency and costs matter.
So in that case, or in either case, why logging is important here is just debugging. Honestly – I’ll be honest, logging and observability is kind of the most boring part of our platform. Not because it’s not useful, but because it’s obvious - I think we started with observability. This is table stakes to shortening that feedback loop and that high-level goal. This is table stakes, because you need to collect the data; you need to collect the data to build these evals, to see when it’s not working, to be able to triage issues… Say one of your users tells you “Hey, I’ve got a weird error.”
[00:37:57.08] This happened to me, actually… I was using Superhuman AI, and it didn’t work. And I told them about it, I said “I got like a weird output. Maybe you guys want to debug it.” And they asked me what prompt I gave it to do the output. “I don’t remember… Don’t you have a logging system? Maybe you should use PromptLayer.” [laughter]
But yeah, you should be able to figure out why something broke; you should be able to step through, step by step, into the change, see which version of the prompt it used. Maybe you have multiple versions in production. Our logging just logs each request, and lets you integrate it with metadata like user information.
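Here’s a generic sketch of that kind of request logging - each call recorded with its prompt version, latency, and user metadata, so a report like “I got a weird output” can be traced back to the exact request. This is an illustration, not PromptLayer’s actual SDK.

```python
# A generic sketch of LLM request logging (not PromptLayer's actual SDK):
# wrap the call, record inputs, output, prompt version, latency, and metadata,
# so any user-reported "weird output" can be traced back to an exact request.
import time
import uuid

request_log: dict[str, dict] = {}   # in-memory stand-in for a real log store

def logged_call(runner, template: str, variables: dict,
                prompt_name: str, prompt_version: int, user_id: str) -> str:
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    output = runner(template, variables)        # your actual LLM call
    latency_s = time.perf_counter() - start
    request_log[request_id] = {
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,       # which version was live for this request
        "variables": variables,
        "output": output,
        "latency_s": latency_s,
        "user_id": user_id,                     # metadata for later filtering/debugging
    }
    return output
```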
It’s one thing to log a bunch of things, but let’s say that I want to improve latency, or cost, or something like that… And I maybe have like 17 different prompts that I’m using across my system. How have you learned how to present that information to users so that they can kind of – especially from a… You mentioned kind of the skill of prompt engineering having this algorithmic thinking kind of piece to it… But there’s also a lot of people - you know, they’re coming in and maybe that’s the part of their brain that they’re building up, and they’re bringing in these other skills with them… So how have you found it useful to present this sort of information to people to give them the right sorts of feedback along their kind of journey of optimizing things?
So I think latency and cost are the easiest things to figure out how to – they’re the easiest metrics you get out of the box when you’re doing prompt engineering, because you’re always getting latency and cost. The harder metrics are “Is my answer correct? Is it rude to me? Is it mean? Does it have AGI?” I don’t know. [laughter] [unintelligible 00:39:44.02]
We did a prompt engineering tournament last night. The first round was “Can you avoid a PR disaster, like Microsoft Bing?” But that’s the hard part - making sure your answer doesn’t go off the rails, or isn’t wrong. But latency and cost, and those type of base level logging specific things - those are trackable metrics. So we give you latency and cost for each prompt template, it’s broken down based on version, and then we also have a full analytics page that actually we revamped the other week for one of our customers, who… They were going to have to build out a whole [unintelligible 00:40:24.21] dashboard, because I think something about either the founders of the company or the investors were worried about some users spending too many credits. So we kind of revamped our analytics page to just save them some time there. So you can use our analytics page to see which prompt templates are costing you the most, which users are costing you the most, maybe if you’re segmenting things to prod, and maybe you’re segmenting based on geo… And just kind of filtering down based on that sort of thing.
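As a rough sketch of what that analytics breakdown boils down to: aggregate the logged requests by prompt template and version, and report request count, average latency, and estimated cost. The per-token price here is a placeholder, not any provider’s real pricing.

```python
# A sketch of the analytics breakdown: aggregate logged requests by prompt
# template/version and report count, average latency, and estimated cost.
# The per-token price is a placeholder, not any provider's real pricing.
from collections import defaultdict

PRICE_PER_1K_TOKENS = 0.001  # placeholder pricing

def analytics(records: list[dict]) -> dict:
    buckets: dict[tuple, dict] = defaultdict(lambda: {"count": 0, "latency": 0.0, "tokens": 0})
    for rec in records:
        key = (rec["prompt_name"], rec["prompt_version"])
        b = buckets[key]
        b["count"] += 1
        b["latency"] += rec["latency_s"]
        b["tokens"] += rec.get("total_tokens", 0)
    return {
        key: {
            "requests": b["count"],
            "avg_latency_s": b["latency"] / b["count"],
            "est_cost_usd": b["tokens"] / 1000 * PRICE_PER_1K_TOKENS,
        }
        for key, b in buckets.items()
    }

# for (name, version), stats in analytics(list(request_log.values())).items():
#     print(name, version, stats)
```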
I want to ask a question… As you have kind of pioneered this whole space, jumping into prompt engineering, quite honestly, before anyone really knew what it was, as you pointed out earlier, and you’ve been building out this capability… As you look to the future, and the future is changing so rapidly right now, and we’re all in this massive acceleration of things coming out… Daniel and I every week are trying to figure out, of all the things happening, what do we actually talk about? It’s getting harder, whereas it used to be there was something that happened last week, and we’ll talk about it. Now there’s so many… As you’re operating a business in this kind of intense, increasing environment, where do you think this is going from a prompt engineering standpoint? What will prompt engineering become as we become increasingly multimodal, and all the fantastic things that are happening on a weekly basis?
I would imagine that it would be fairly hard to try to plan ahead on where the industry is going, and the technology, and where to put your business… How do you see the future? What’s the next one-year, two-year, five years in your head look like?
[00:42:00.28] That’s a billion-dollar question, right? I think we try to do two things. We try to not predict the future, because it’s too hard, and we try to build something useful that is built on first principles that make sense. And I think that’s how we try to stay ahead of the curve there a little bit. I can give you some examples.
So for example, kind of just the whole process of iterating, of testing… For evals - we procrastinated a little bit on building that part of our platform. We always knew we needed it. It’s been the buzzword in the industry for like six months now. But we really wanted to know how to build that correctly. And I think we spent a lot of time talking to a lot of teams and saying “How do you do evals today?” And every team we spoke to did it in their own way; Google Spreadsheet, building out some weird, unique thing… So our eval product, if you try it out, it looks like a spreadsheet for that reason. And it’s very much inspired by the robustness of a Microsoft Excel type product, where it seems very simple, but you can take it in a lot of different ways.
So I think we are trying to become future-proof by avoiding taking strong, opinionated stances. We want to support best practices and build best practices for the community, especially in a space like this, but we want to do it without pigeonholing people into different ways of doing things. And I think it’s been funny seeing how the hive mind has changed their opinion on what the future is… I remember a year ago we were talking to investors - obviously not the investors that are on our team right now - and the investors were like “Oh, prompt engineering – AGI is just going to take over. We’re not going to have any infrastructure anymore.” And [unintelligible 00:43:49.17] I don’t think many have that opinion anymore, let’s just say… And I think it’s very obvious to us, and it’s been obvious to us that prompt engineering is the process of giving inputs to the LLM, and choosing which model you’re using… And even with the most advanced LLM ever - let’s say the LLM is as advanced as a human - you still have to tell a human what you want. You still have to tell the intern what task you want him to do. And that’s prompt engineering. And there’s always going to be a process of inputs there. So that’s how we think about the future, and lack thereof. [laughs]
Jared, thank you so much for coming on to share your insights today. We definitely appreciate that you and the team at PromptLayer are thinking deeply about these things, and building really good tools to support the community. And I encourage everyone in the audience to check out the show notes, follow the links, find out more about PromptLayer and the cool stuff that they’re doing… And I hope we can have you back on the show in another year, when I’m sure prompt engineering will look very different than it does now. But thank you so much for joining, Jared. It’s been a pleasure.
Yes. Thank you for having me. This has been fun.
Our transcripts are open source on GitHub. Improvements are welcome. 💚