Practical AI – Episode #269
Full-stack approach for effective AI agents
with Josh Albrecht, co-founder & CTO at Imbue
There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach, from hardware innovations through to the user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.
Featuring
Sponsors
Neo4j – Is your code getting dragged down by JOINs and long query times? The problem might be your database… Try simplifying the complex with graphs. Stop asking relational databases to do more than they were made for. Graphs work well for use cases with lots of data connections like supply chain, fraud detection, real-time analytics, and genAI. With Neo4j, you can code in your favorite programming language and against any driver. Plus, it’s easy to integrate into your tech stack.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome to Practical AI | 00:43 |
2 | 00:43 | Josh Albrecht from Imbue 👀 | 04:07 |
3 | 04:50 | Categorizing agents | 03:38 |
4 | 08:28 | Making agents work | 03:19 |
5 | 11:47 | Use cases for agents | 02:09 |
6 | 13:56 | Full stack approach | 04:15 |
7 | 18:11 | Research first | 02:44 |
8 | 20:55 | Focusing on environments | 03:17 |
9 | 24:11 | Fundamental laws of deep learning | 04:35 |
10 | 28:54 | Sponsor: Neo4j | 00:58 |
11 | 30:02 | Engineering trust | 03:09 |
12 | 33:11 | New interfaces | 01:57 |
13 | 35:07 | Reviewable outputs | 01:51 |
14 | 36:58 | Day to day coding with an agent | 03:40 |
15 | 40:39 | Language context shifting | 03:02 |
16 | 43:41 | Things to come | 02:08 |
17 | 45:50 | Thanks for joining us! | 00:26 |
18 | 46:15 | Outro | 00:46 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Well, welcome to another episode of Practical AI. This is Daniel Whitenack. I am the CEO and founder at Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
I’m doing very well today, Daniel. I’m hoping that we can kind of Imbue today’s show with a sense of wonder and exploration.
Yes. Well, thankfully, we have an agent on the show with us that’s going to be very helpful in that. Today we have Josh Albrecht, who is CTO and co-founder at Imbue. Welcome, Josh.
Thanks. It’s great to be here.
Yeah, well, we sort of in a not very funny way teed up a couple of things to talk about there as related to agents, but… Could you give us a little bit of background? You talk with Imbue about the dream of personal computing, the dream of agents doing work for us in the real world, kind of your approach to that; we’ll dig into a lot of those things, but could you give us just a little bit of background in terms of how you as founders of Imbue came to these problems around agents, and accomplishing more kind of complete or complicated tasks with agents?
Yeah, I mean, AI is definitely something that I’ve always been interested in and excited by. I remember a long time ago my friend read some book in middle school, I think, like maybe Ray Kurzweil’s “The Singularity is Near”, and I was like “Oh, wow, there’s AI. So exciting!” And did all that come through? I don’t know necessarily… But it seemed like an interesting thing, and I have always been interested in thinking, and logic, and AI, and neuroscience… And when I went to school, I was originally going to do cognitive neuroscience, but the professor was a little bit too boring, so I did AI research instead. And so ever since then, I’ve kind of – so I’ve published a bunch of papers and things, but it felt like it wasn’t really gonna have a big impact on the world, so I went off to do startups… But all the time when I was in startups, I was always looking back and looking and saying “Oh, is now the time to get back into more fundamental AI research? Does this stuff work yet?” And eventually, it came a point where I was like “Yeah, this stuff is working.”
What I’ve always wanted to do with AI systems is make better tools for us. There’s so much work that we have to do in the real world that is just not that fun, not that interesting, and not really moving things forward… And so all my time at startups and the things that I’ve been working on, they’ve all been very practical, very applied versions of machine learning. And so I’ve always wanted to – we are an AI research company, but it’s not AI research for research’s sake; it’s AI research to actually make tools that are useful. And so what we’re doing at Imbue is we’re trying to make tools – even just starting for ourselves, like “Can we make robust coding agents for ourselves, that can really help accelerate us and help kind of take over some of the boring tasks that we don’t necessarily want to do?” And that’s why it sort of gets into agents.
Agents are AI systems that are acting on your behalf. Tools like chatbots etc. are really cool; it’s great to be able to answer questions, it’s great to be able to generate text. But if I have to copy and paste that text every time over into some other thing, and do all the work myself, it can only save you so much time. It’s like a better version of Google at the end of the day, or a better version of like a search engine, or something like that; or a book. And so I think the real promise of AI is in systems that actually take actions. But in order to get that to work, we still have a lot of work to do on the capability side.
When you’re talking about taking actions in the real world, there’s a lot more risks and a lot more kind of downsides that come from that… And you need to be careful about like – you know, you don’t want to empty the user’s bank account. That’s gonna be a really bad product experience. So how do you make systems that the user can actually trust, systems that are robust, systems that you can know are actually correct, and that flag for you “Hey, I’m not really sure about this”?
So this is kind of why we always talked about coding and reasoning, is we’re talking about the ability to kind of understand the outputs that are actually being created, and understand “Is this correct? Is this actually going to be useful for people?” And really like thinking it through more like a person, instead of just “Hey, here’s a generation. Good luck.” So that’s kind of how we got to agents, is like we want to make practical systems, we care about making these systems actually robust and useful for people, and that’s what a lot of our research is focused around.
When it comes to agents and sort of where we’re at with them now - so we’re recording this in May of 2024, for those that are listening back… How would you kind of categorize in your mind? Because you can download Langchain, you can create what is an agent maybe for this purpose or that purpose, that searches the web, or does this thing or that thing… And there’s certainly, even in my own experience, a lot of fun to be had in that, for sure, but there’s a lot of challenges in making this sort of - at least in the enterprise setting, making this a reality for solving problems, much less in those random times in my personal life where I need to do things. So how do you categorize, as of now, the state that we’re in now - of course, everything’s changing - the sort of main sets of challenges where people are hitting blockers when they’re trying to create these agents?
[00:05:48.17] Yeah, that’s something that we actually played around with a lot last year. We interviewed a whole bunch of founders of different agent companies, both on our podcast, and our Thursday Nights in AI events, and also just in-person, kind of off the record, a bunch of friends, friends of friends, people starting companies, really trying to understand what are the problems that people are running into when they’re trying to make agents. And the thing that we kept coming back to is there are all these tools like Langchain and all these other bits of infrastructure out there, there are ways of testing things, like Scorecard AI, or all these different libraries… But the problem that people really had was what you really want as a software developer is “But does it actually work? Does it actually answer the question correctly? And can I get these things to do what I want as a product designer or as an engineer, without having to specify all of the details myself?” That’s sort of the promise of AI. And right now, they’re really great for getting like a first-pass version of this system working, where it’s like “Oh, cool”, you ask it a thing and 60%, 70% of the time it’s right. That’s great. That’s so amazing. Wow, it’s getting this really complicated question right some of the time. But 60%, 70%, 80% isn’t really enough for like deploying this. And going from that 80%, to 90%, to 95%, to 99%, to 99.99% - that’s actually a lot of work.
And so people have all sorts of techniques, for RAG, or for kind of other types of ways of conditioning the answers to kind of make them better and better… But the things that work today are kind of the more constrained versions, where you’re sort of – you’re asking like a very simple question, or you’re in a very narrow domain. And so the programmers, the product designers can make sure that everything works out within these rails. Once you are in the more general assistant kind of category, it’s a lot, lot harder. I think we’ve seen a lot less stuff be successful there.
But I think in terms of categories, and in terms of kind of the problems that people are running into, I would say the main one I would summarize is robustness, like correctness. Can you actually get these things to be robust all the time? I think that’s what really distinguishes agents. If we think about agents in the real world, a dog is an agent. I’m an agent. A robot’s an agent. A dog is actually extremely good at not dying for a really long time. It’s not that 90% of the time when it walks across the road it doesn’t get hit by a car. Like, 100% - almost 100% - most of the time; it’s like pretty safe. Usually, as agents, we’re being very, very conservative, very cautious, so that we take correct actions. And there’s a lot of like heuristics and intelligence that goes into being conservative, being risk-averse, being able to take a long chain of actions without going wrong, unless something else is horribly going wrong. And our agents don’t have that kind of common sense and that kind of reasoning right now. I think if they did, it would make it a lot easier for people that are building agents.
As we were kind of going through the last couple of questions, talking about kind of the problems that people run into when they’re trying to make agents work, and what can they do to ensure that it has a good outcome, I also run into people all the time who I think really struggle to understand within the context of all the hype and the boom of generative AI, what can you use an agent for productively in enterprises in 2024? They’re used to going to these web interfaces that are becoming ubiquitous for us all… But the notion of saying “Okay, I’m going to–” Going back to what you said earlier on, kind of getting it out of that web interface… Can you kind of paint a picture about how people out there who are trying to bring this productively into their organization, as an agent versus a web interface, how they might even conceive of how to approach problems that they might want to solve with the technology?
There’s a lot more work to be done today to make agents work for a system. I think if you approach it as a more holistic system, then it’s more likely to work. So if you think like “Okay, where are the places that it could go wrong? What’s the confidence that I’m getting back from the system? Can I flag that for human reviewers? Can I have like a bunch of different checks in place that are both like in-domain – like, for programming, does it pass a linter? Or does it pass this style guide? Or does it at least type check? Or is the syntax correct?” Like, there’s a lot of checks that you can do kind of in-domain that can help out, and they’re different in different industries as well… And then there’s sort of – you can use the LLM to score an aspect, “Is this particular thing wrong? Is that particular thing wrong with it?”
[00:10:01.28] So as you start to build up more kind of like safeguards and guardrails around these, then you can start to get them to a level of robustness where like maybe for the easy cases it’s okay for your application for it to fail, and you know where that failure rate is, and you’ve done a lot of work to understand how much can we tolerate.
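[Editor’s note: to make the “in-domain checks” idea above concrete, here is a minimal sketch of the kind of cheap, automatic guardrails you can run over generated Python before a human or a more expensive check ever sees it. This is not Imbue’s pipeline; the deny-list and the check order are illustrative assumptions.]

```python
import ast

def check_generated_python(source: str) -> list[str]:
    """Run cheap, in-domain checks over LLM-generated Python.

    Returns a list of human-readable problems; an empty list means "passed
    the guardrails", not "correct" -- these only catch the easy failures.
    """
    problems = []

    # 1. Syntax check: does it even parse?
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error at line {e.lineno}: {e.msg}"]

    # 2. Compile check: catches a few more issues the parser alone does not
    #    (e.g. a 'return' at module level).
    try:
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError) as e:
        problems.append(f"compile error: {e}")

    # 3. Simple policy scan: flag calls we never want an agent to emit unreviewed.
    banned = {"eval", "exec", "system"}  # illustrative deny-list, not a real policy
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = getattr(node.func, "id", getattr(node.func, "attr", ""))
            if name in banned:
                problems.append(f"line {node.lineno}: call to '{name}' needs human review")

    return problems


if __name__ == "__main__":
    snippet = "import os\nos.system('rm -rf /tmp/cache')\n"
    for issue in check_generated_python(snippet):
        print(issue)
```

None of this says the generated code is right – it only filters out the easy failures before the more expensive layers (tests, LLM-based scoring, human review) have to run.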
One of the things that we’ve done a lot internally is working on our own evaluations. This is a really critical thing for anyone who’s like trying to build real systems - you have to get really into the weeds of what does it mean for the system to be right. We’ve actually taken all of the open source NLP benchmarks and made our own internal versions of these systems, to make sure that they’re not contaminated by the training data, and to make sure that the questions are actually correct.
So one of the things that we’ll have coming out in the not too distant future actually is I think hopefully being able to contribute back some of that evaluation work that we’ve done, of like cleaning up these existing benchmarks, but we also have a bunch of our own internal ones as well. I think it’s kind of critical for anyone making these systems to make them yourself, by hand, at least like 100; look at them, “Is this the right answer? Okay. What did it get? Okay, is that right?” Like, getting to a place where as humans you agree on this, you’re getting a machine system to calibrate well to this… Then you’re checking “Okay, are the things that we’re getting as inputs in production, are they from the same distribution? Does this test actually make sense for this? They’re not drifting?” If you have adversarial systems, like fraud or something - much, much more difficult. If you have something where you’re getting the same kind of a query every time, then it can be possible to get something where you can trust it enough to say “Okay, cool, this is getting us 99%. That’s acceptable. We have some guardrails here, we can check how well it’s doing over time, we have people looking at these and auditing some of them.” That’s kind of the way to make this really useful, as you have to be like really getting into the weeds and into the details of “How do we evaluate this? What does success look like?” etc.
And for the use cases out there, the most successful use cases that you’ve seen - I don’t know if you have good examples of those, either internally or externally… But when you think of those, I like what you’re saying about digging into the details. I’m wondering also how much sort of specific domain expertise is actually factoring into how you handle those details. So if you’re building an agent to help people process data in a healthcare scenario, or data in a financial services scenario, or in a coding assistance scenario, there’s kind of this view – like, if I just download Langchain, if I go and kind of have the zero-shot approach, where this agent might be expected to do anything, my impression is that the most successful agentic type of workflows out there so far have been very much driven by people with high degrees of domain expertise in an area, that are able to work through those details. Is that impression correct? Do you have any thoughts on that?
Yeah, that seems pretty much right. I think there’s this promise of AI that like someday you’ll be able to just ask it to do anything. And the interface sort of affords that. It looks like “Oh, there’s this textbox. I can just ask it to do whatever, and it will give me back a response.” And “Wow, it even sounds so confident and so correct. Wow, that’s great. This can do anything.” Maybe it even succeeded in that case.
One example that I love from a little while ago was we were trying to see how well existing LLMs would do at detecting bugs. And so we asked – the first thing that I did was I looked, “Okay, is there a bug on this line?” I found a function that had a bug, I asked it, and “Yes, there’s a bug in this line.” It’s like “Oh, wow, it’s so good at this.” It’s like, wait a second… How about this other line that definitely does not have a bug? “Oh, yeah, there’s a bug on this line. This doesn’t work.” I’m like “Wait, wait, you’re just always saying yes. This is not quite right.”
So yeah, it seems to promise that, but you have to really dig into the detail, use few-shot examples, and retrieval, and all these other kinds of techniques to kind of get into the weeds… And the more domain expertise that you can bring to bear, the dramatically better I think the outcome is going to be.
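[Editor’s note: the “it always says yes” failure in Josh’s bug-detection anecdote is easy to miss if you only test on positive examples. A minimal way to catch it – with the detector left abstract, since the actual model call is whatever API you use – is to score it on lines with and without bugs and look at both rates.]

```python
def score_bug_detector(detector, buggy_lines, clean_lines):
    """detector(line) -> True if it claims the line has a bug.

    A detector that always answers "yes" gets perfect recall on buggy
    lines but also flags every clean line -- which is what we want to see.
    """
    true_pos = sum(detector(line) for line in buggy_lines)
    false_pos = sum(detector(line) for line in clean_lines)
    return {
        "recall_on_buggy": true_pos / max(len(buggy_lines), 1),
        "false_alarm_rate": false_pos / max(len(clean_lines), 1),
    }


def always_yes(line: str) -> bool:
    # Stand-in for the failure mode in the anecdote.
    return True

print(score_bug_detector(always_yes,
                         buggy_lines=["return x +- 1"],
                         clean_lines=["return x + 1", "y = x * 2"]))
# {'recall_on_buggy': 1.0, 'false_alarm_rate': 1.0}
```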
[00:13:55.21] So Josh, I’m really intrigued by sort of the statement, what you put online in terms of Imbue’s thinking about building a robust foundation for AI agents as being a full-stack approach. I like that because it sort of reminds me - I don’t know, Chris, if you remember quite a while ago when we were still talking about data science… I guess data science is still a thing, but we were talking about it a lot more years ago… And there was this – I forget, I think it came up a few times, this discussion about being a full-stack data scientist. And oftentimes, those are the most productive, where you have an understanding of how data preprocessing happens, and building your application, how the model is embedded in software, and deployed, and all of this stuff… And so I love that sort of thinking in that respect, and I’m wondering, from Imbue’s perspective, how you think about taking a full-stack approach when it comes to agents.
Yeah, we take it, I think, to a slightly more extreme degree than most people, in that we do everything, from setting up our own hardware, building our own infrastructure, pre-training, fine-tuning, RL evaluations, data generation, cleaning, UI, user experience… The whole thing. And the thinking there is that at each one of these places, you can tweak some things to make the overall thing kind of work better together. So you can kind of change the training data that you’ve used in your system in order to make it more like the kind of thing that you actually need for your product. And then in RL, you can kind of set objectives that are related to the things your user actually cares about. And then on the UI, you can use the capabilities that you have to kind of help highlight places where this particular system fails.
So I think we’re really interested in kind of the full stack approach and the ability to tweak things at each one of these levels. And for us, it comes from our like history as a research company - one thing that we’ve always really focused on is being able to deeply understand the technologies that we’re working with. So for us, pre-training, fine-tuning, doing RL - it’s not just a black box. We want to open these things up and understand what’s actually happening inside of there.
We have a paper club every Friday, where we’re looking at the state of the art stuff that’s coming out, reading through this, and trying to really understand “What are neural networks really learning? How is this language model actually learning? Where does it fail?”
There are really interesting papers that show particular logic puzzles where this thing doesn’t work… And it’s like, okay, it’s not really doing logic; it’s not really doing addition, it’s doing this other thing. But if you tweak it in this way, “Oh, now you can get it to learn a simpler form of addition, that is more general. Okay, that’s really interesting.” So what is a transformer really good at learning? What things in the data actually matter? And how do you evaluate these things?
Another thing that we’ve also thought about a lot, one of the things that we’ve set up that has been super-useful is looking at not just the accuracy of our systems, but their perplexity on multiple choice question answering datasets specifically - not perplexity over all the tokens, but perplexity specifically for the multiple-choice answers. This gives you a much more fine-grained understanding of “Is this actually being right or not?” It gives you a really precise metric for this.
And this idea came from a paper, which was about – I think something like “Are emergent properties of language models a mirage?” or something like that, was the title of that paper. Their point was a year or two ago people were like “Oh, look, these language models have these emergent behaviors. They’re suddenly learning to reason”, or whatever. It’s like “Oh, well, they’re suddenly getting so smart.” But when you really dig into it, it turns out that if you look at the performance on a log scale, it’s linear. So what was really happening - it’s just our metric was not very good. We weren’t really asking the right questions, we weren’t deeply understanding what was happening; it was just always on a log scale, just always getting better, and you just couldn’t see it in the metric.
And so for us, this is a good example - if you want to deeply understand what’s going on here, we don’t want to just treat these as magical entities, but rather, they’re just technologies. They’re just really bags of features at the end of the day that we can use to do actual work in the real world. And so I think that that’s kind of our approach, is to take the full stack approach, understand everything from “Okay, how does the InfiniBand network work? How does that fit into our performance optimizations? How does the data work? How does the network work? How are all of these things adding up?”, to give us some final error, or some final user experience that’s really good.
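[Editor’s note: to make the answer-restricted perplexity metric from this answer concrete: instead of perplexity over every token, you score only the tokens of each candidate answer and compare. The sketch assumes you already have per-token log-probabilities for each choice from whatever model or API you use; nothing here is Imbue’s internal code.]

```python
import math

def answer_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity restricted to the answer's own tokens:
    exp(-mean log p(token)). Lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_choice(choice_logprobs: dict[str, list[float]]) -> str:
    """Pick the multiple-choice answer whose tokens the model finds least surprising."""
    return min(choice_logprobs, key=lambda c: answer_perplexity(choice_logprobs[c]))


# Hypothetical per-token log-probs for each answer, conditioned on the question:
choices = {
    "A": [-0.2, -0.4, -0.1],   # the model is fairly confident in these tokens
    "B": [-1.3, -2.0, -0.9],
    "C": [-2.5, -1.8, -2.2],
}
print(pick_choice(choices))                       # "A"
print(round(answer_perplexity(choices["A"]), 3))  # ~1.263
```

Because the perplexity keeps moving even while top-1 accuracy sits flat, it gives you the smooth signal that the “mirage” paper argues hard accuracy metrics hide.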
[00:18:10.05] You’re kind of really fascinating me with that statement… So many people do take kind of that black box approach, and they don’t necessarily have that kind of research-first orientation that you’re describing. As a company, as a business, how does that research orientation where you are rejecting the blackbox perspective, and saying “We’re going to open it up. We’re going to tinker, we’re going to understand the specifics of how small changes affect that”, how does that affect how you approach this compared to whoever you would perceive as your competition, or something? What does it mean for you as a company to take that kind of research-first approach?
Yeah, I think there are trade-offs to it. One trade-off is that it takes a little bit more time and effort to do this, to really deeply understand things, rather than just like hack it together, and throw it out there. But I think the benefit is, in the long term, when we do really deeply understand these systems, it makes it a lot easier to make modifications, and to make changes, and to know how to improve things.
These systems are very expensive to train. There’s a lot of effort that goes into this, and it can be very expensive to just try a whole bunch of things. And if you don’t really know what you’re doing, it’s easy to waste a lot of time. And so I think for us, we would rather take a step back and say “Okay, what’s actually going on here? Can we make robust systems? Can we make a robust baselines? Can we get this working in a way that we can trust our results, that we can understand what’s going on, and build on top of those?”
Another thing that we’ve built internally that has been really useful kind of along these lines is CARBS. Cost Aware Pareto Region Bayesian Search, or something like that. But basically, it’s a hyperparameter tuner that is cost-aware. So we can take any system that we have and say “Hey, you have all these 10 or 20 different hyperparameters, these different knobs you can fiddle. How do you get – I have a system that works, but how do I make it way better?” We can take this, just throw it in there, come back the next day, and it’s tried hundreds of experiments, at different scales. So it tries at a really small scale, and it sees “Okay, for a really small scale, this is the best way to do it.” And then as we get higher and higher and spend more and more time and resources and money on it, like “This is kind of how these hyperparameters change, how things change as we scale.”
And just understanding that there are these scaling laws; there are scaling laws for different parameters. How can we back those out and learn, for any given architecture, any given problem - having an automated system to do this allows us to kind of like quickly develop this. And it took some time to make this system, right? But it really pays off to have that kind of deep understanding of the systems that we’re working with. So I think for us, it’s kind of like taking a long-term view. I think in the long term it’s much better to actually understand what’s going on. And it does take a little bit of upfront work. That’s why we don’t necessarily have a product yet; we’re working on it. I think we’ll get there, and I have confidence that we’ll get some really cool… But it does take a little bit longer. And that’s okay. I think we’ll end up with something much cooler as a result.
As someone who’s working both on the – all the way up the stack, even up to interfaces and all of that, but you’re also training these foundation models, certainly both the market and the technology and the options around foundation models have just sort of blossomed, and then these have proliferated over the past year especially… What’s it been like internally – we’ve had a couple people on the show, and I find this interesting… From the perspective of someone inside a company that is training their own foundation models, how do you go about maintaining focus within this sort of environment, where eventually you’re going to have to spend a significant amount of time investing in specific model architectures, specific datasets, that sort of thing… But things are shifting all the time. You mentioned reading papers, and trying to keep up, but… Yeah, how do you maintain that focus, and what’s life like in the midst of being a foundation model builder in May of 2024?
[00:22:11.04] Yeah, I would not necessarily characterize this as a foundation model builder. Part of what we do is train models. But that’s not the only thing that we do. And the reason that we do it is not necessarily to make the biggest, bestest foundation model ever. I think there’s a lot of money going into other companies spending huge amounts on these, and –
General purpose.
…on general purpose versions of these systems. And I think for us, the more interesting thing is “Can we make the more specialized ones? Can we take these, can we adopt them? Can we make them more specialized? Can we find ways to have them work together, to pull different things together and make a model that’s kind of better at doing that, that sort of synthesis, and kind of like pulling these things together, and better at the particular tasks that we care about?”
We’ve seen really good results from this. We’ll have some blog posts in the next few weeks about this, but… I think we’ve seen some really good results on much, much smaller models. And so I think if you look at like DeepSeek Coder, for example, I think that model still significantly outperforms LLaMA, a model that’s the same size, and even much larger models. And this is because it’s really trained on a lot of code, and so to generate code is something it’s very familiar with… As opposed to being a pretty small part of its distribution.
So I think, again, this comes back to the fundamental understanding part… Because we know these are just bags of features, that yes, having a bigger bag of features is definitely better… But then your inference time goes up as well. And if you want better bags of features, you need to give it better data – like, the really important thing here is the quality of data that you’re giving it, less so the absolutely massive size, I think, for practical uses.
So our focus is “Can we make these really specialized, and very useful for ourselves, for our own purposes?” And we’re pretty happy to see people out there competing, making better technologies, driving the cost of these things down, making the huge context windows, giving them away for free in many cases… That’s great. We’re happy to see more competition there, because I think the part that we’re more interested in is how do we actually use these things at the end of the day, and put it all together to be really useful.
I love that you mentioned DeepSeek. That’s a favorite of ours as well at Prediction Guard, and generating SQL to do data analysis and code, and in our chat interface… Yeah, we love that. And so yeah, I totally agree, there’s a lot that can be done with that sort of thinking.
You also mentioned in your work kind of more the – and I do want to get to kind of more the frontend interface side. But before we get there, you mentioned kind of pursuing fundamental laws behind deep learning in order to, again, understand and create this foundation for the agents that you’re building. What have been some of the things that you pursued in that area as kind of the theoretical underpinnings for this progression towards robust agents?
There’s a bunch of things that are still in progress that I can’t speak to directly, but we’re definitely interested in, say, how do you initialize things properly… Like [unintelligible 00:25:10.15] etc. One of our researchers is a collaborator of his, and like working on kind of understanding exactly what the right way is to parameterize these language models in a theoretical sense, but for a practical reason. So if theoretically this is the right way to parameterize them, then the practical implication is you no longer need to tune the learning rate as you scale them up. This is super-helpful, because it’s one of the key factors; and so to remove some of these hyper parameters makes it much more efficient to kind of explore this space.
So that’s like an example, a very concrete, simple example of a place where sort of the theoretical understanding can help you. Other places where this can help are not as easy to point out like the exact theory, and sort of more informed by that… Or more like physics. Physics didn’t start with like perfect theories of everything, right? We kind of did some experiments, and had a more experimental understanding of the world before we had perfect theory about why everything worked. I think we’re at that phase with machine learning as well.
[00:26:08.12] There’s an interesting work by one of our researchers, Jamie Simon, on kind of what’s actually happening in the fundamentals, when we’re learning things. There’s this notion from one of his papers about learnability. A network of a fixed size can only learn so many things, and it’s very precise. Or we’ve had another paper about like self-supervised learning, where you can see “Oh, there’s a sort of like stepwise nature, as it learns each piece of the thing.”
So each of these little theoretical things is telling you something about how they work. We don’t have a full picture, and the real ones are quite complicated, and a little bit more complicated than the smaller examples… But each piece is giving you a sense for what’s going on, and allowing you to operate in this space without having to guess and check quite so much. It’s not as much of a black box, it’s more like a machine where - yeah, you don’t know the exact internals, but you know like “Don’t make it too hot, or it’ll explode. Don’t make your learning rate too high, or it’s not gonna work.”
So you can see not just learning rate, but other sorts of precursors earlier. You can look at areas like norms, or other quantities to understand “Is this getting too large? Is this growing large over time? Is this something that’s actually too small, and we can actually up the learning rate later? Or do we need to apply more regularization of a particular type?” You can kind of get a sense for these things, even if we don’t have kind of perfect laws yet.
We also get some laws out of the CARBS hyperparameter optimizer that I mentioned before. We can see things like “How do these parameters change with scale?” and understand not just how do the learning rate and data and parameters change, but how do very specific hyperparameters change? What is the depth versus width that you should have for this particular type of regularization? How much exactly should you have, and how is that changing? And that goes back and kind of informs “Okay, what is actually happening under here? It’s weird that this particular trend holds over scale. It seems like it needs less and less of this. That’s kind of interesting. Why is that?”
And sometimes we’ll see a paper that’s like “Oh, that fits in. I see what’s going on there. That’s nice.” So we’re getting more and more – I think collectively as a machine learning community we’re also starting to understand these things a lot more. I think when people point at neural networks or language models as like black boxes, like “Oh, nobody understands”, I think that’s quite a mischaracterization of it. There are a lot of people that have a lot of very good ideas about how these things work… And nobody on this call probably knows exactly how a car works. I don’t think you can make a car from scratch. I certainly couldn’t… Especially modern cars, that are quite complicated. But we can use cars to go wherever we need. I mean, we roughly know how they work. So it’d be weird to say “Oh, we don’t know how cars work.” I think machine learning and neural networks are a lot more like that than most people [unintelligible 00:28:43.18]
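[Editor’s note: the “how does this hyperparameter change with scale” question from a couple of answers back usually comes down to fitting a straight line in log-log space. Here is a self-contained sketch with made-up numbers; any resemblance to real scaling coefficients is accidental.]

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly)) / \
        sum((x - mean_x) ** 2 for x in lx)
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Made-up "best learning rate at each model size" measurements:
params = [1e7, 1e8, 1e9, 1e10]
best_lr = [6e-3, 2.1e-3, 7e-4, 2.4e-4]

a, b = fit_power_law(params, best_lr)
print(f"lr ~ {a:.3g} * N^{b:.2f}")          # the exponent b is the interesting part
print(f"extrapolated lr at 1e11 params: {a * 1e11**b:.2e}")
```

When a trend like this holds cleanly across scales, you can pick hyperparameters for the expensive run without paying for a sweep at that scale – and when it doesn’t hold, that’s usually where the interesting research question is.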
Break: [00:28:47.19]
So Josh, going into the break you had a really good analogy there about the sophistication of cars - it means that while we all use them all the time, we may not understand every aspect of them… And I wanted to go back for a moment, because I’ve been kind of percolating on some of the things that you said earlier… And you’ve been talking about kind of the trust and robust systems and all, but I was wondering - I know in my own life I’m very involved in the trustworthiness of models. And you talked a bit about getting good outcomes, and being able to detect that. Do you have any guidance on what it means to engineer trust into model training? So many organizations that I’ve seen kind of tag the trustworthiness of models on at the end, as that “Oh, yes, we have to do that, too.” And you have such an insightful and deep way of approaching the engineering, rejecting the black box approach… Any guidance you have on how you engineer trust in it from up front, so that as you get through the training lifecycle, you come out with something that you have a high degree of confidence is what you’re intending it to be?
I think a lot of people are trying to do this, and there is good work to be done there, and we can do things to improve the models and make them more trustworthy during training. And that’s great. But I think by far the largest place that we should be focusing is actually after training. We don’t trust people because like “Oh, I looked at their schooling, and they seem real trustworthy up to this point. I’m gonna give them my credit card, and I’m gonna give them my bank account.” No… We’re gonna be looking at “What is this person doing?” Okay, we’re gonna be checking things afterwards… There’s a lot of other stuff that needs to happen post training, and in deployment, where we can actually trust things.
So I think for me, it’s actually a lot more about “What is happening when you’re actually using the model? What kind of auditing or real-time verification or user interaction or other sorts of checks or things that you have – can you have other systems that are checking the behavior of this?” For an agent, maybe you’d want to predict “Is this action potentially going to have negative consequences?” Or “Is this going to be potentially dangerous?” Or “Will this be something that the user might not want?” And those seem like good things to have as totally separate systems, that are completely unrelated to the development of your original model. You would not want the original model to be responsible or connected to this at all. You’d want to have a totally separate thing that’s looking at this.
So I think trust is better thought of as a set of different types of data that can give you confidence that things are going well, that have gone well, and will continue to go well. And so you can only get so much trust up ahead by kind of designing the system in a particular way, and you have to understand “What is that model good at? What distribution was it trained on? Have we shifted from that distribution? Have we shifted from the tasks that it’s good at? How well has it done over time? Is it likely to go wrong in this new example?” So I think it’s more of a post-training, more of a practical kind of a problem, and the idea that we can solve this all by making safer, trustworthy models is a little bit – it’s going to be difficult to succeed at that task.
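[Editor’s note: one way to picture the “totally separate system that checks the behavior” idea is a gate that sits between the agent and the world: the agent proposes an action, and an independent checker – here just a deny-list and an amount threshold, purely illustrative – decides whether it runs, gets flagged for a human, or is refused. Nothing below reflects Imbue’s actual design.]

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    REFUSE = "refuse"

@dataclass
class ProposedAction:
    kind: str          # e.g. "send_email", "transfer_funds", "delete_file"
    detail: dict

def check_action(action: ProposedAction) -> Verdict:
    """Independent checker: knows nothing about how the agent was trained."""
    irreversible = {"transfer_funds", "delete_file", "send_email"}
    if action.kind not in irreversible:
        return Verdict.ALLOW
    # Illustrative policy: risky-but-small actions go to a human;
    # anything over a threshold is refused outright.
    if action.kind == "transfer_funds" and action.detail.get("amount", 0) > 100:
        return Verdict.REFUSE
    return Verdict.NEEDS_HUMAN


proposal = ProposedAction("transfer_funds", {"amount": 5000, "to": "unknown-account"})
print(check_action(proposal))   # Verdict.REFUSE -- don't empty the user's bank account
```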
Maybe this ties in to the trust element, certainly the kind of collaborative approach with agents. But you do talk also a lot about some of the thinking that you’re doing around interfaces as well… And it sounds like you’ve also been utilizing or trying to utilize some of what you’re developing internally for coding and other things… So what are you thinking about in terms of interfaces, and how are you dogfooding some of those things internally to kind of learn about interfaces beyond the kind of AI chat interfaces that we’re all familiar with?
Yeah. So I think the learning internally from using our own kind of prototypes and internal products and demos has been – there’s been quite a lot of that. Without actually using it, it’s hard to kind of get this learning about “Okay, is this trustworthy or not? Does this actually work? What UI do I want to use for this?”
[00:34:05.11] I made some prototype, it generates a bunch of code, and very quickly I started to realize “Hm… That’s great, but it’s really annoying to review this much code.” I see a lot of products out there that are like “Oh look, it’ll like make a PR for you.” Yeah, I mean, how fun is it to review a PR of a few hundred lines if there’s a few lines that are wrong? You have to search through for this bug, and it doesn’t really tell you anything about where it is… This is just a really awful user experience.
So I think instead if we approach it from the perspective of “Okay, what do I want as the user here?” What I want is for this to be pretty interactive, and for this to tell me “Okay, maybe there is a bug here.” Or “Yeah, you asked me to make this PR, but your ask was kind of ambiguous, and I needed to make some assumptions. Here’s the assumptions I made. Here’s how confident I am in them. Do you want to change them?” “I guess I do.” “Okay…” Once it’s more interactive, once you’re going back and forth with the user and trying to flag places of ambiguity, uncertainty, risk etc. to the extent that you can be correct about those, it can make the user experience feel a lot, lot better.
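[Editor’s note: the interaction Josh sketches – “here are the assumptions I made, here’s how confident I am, do you want to change them?” – is partly a data-structure decision: have the agent return its assumptions as first-class objects instead of burying them in the generated code. A hypothetical shape for that, with field names invented here:]

```python
from dataclasses import dataclass, field

@dataclass
class Assumption:
    description: str      # e.g. "treat an empty list as 'no results', not an error"
    confidence: float     # 0..1, the agent's own estimate
    alternatives: list[str] = field(default_factory=list)

@dataclass
class AgentResult:
    code: str
    assumptions: list[Assumption]

    def needs_review(self, threshold: float = 0.8) -> list[Assumption]:
        """Assumptions the UI should surface before the code is accepted."""
        return [a for a in self.assumptions if a.confidence < threshold]


result = AgentResult(
    code="def top_results(items): return sorted(items)[:10]",
    assumptions=[
        Assumption("'top' means ascending sort order", 0.55,
                   alternatives=["descending by score"]),
        Assumption("return at most 10 items", 0.9),
    ],
)
for a in result.needs_review():
    print(f"Please confirm: {a.description} (confidence {a.confidence:.0%})")
```

The review burden then scales with the number of low-confidence assumptions rather than the number of generated lines, which is the difference between reviewing a diff and reviewing a decision.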
Any anecdotes from your own sort of internal experiences with these, or things that you’ve tried either on the positive or negative side?
One thing that I really like about Copilot, just as an example, is that it keeps it short. So it’s easy to review. I think that’s why Copilot-style things normally don’t make these huge generations - because it’s kind of hard to review them and to trust them. But I’m imagining that people are probably going to get to a world where they realize “Oh, okay, this is kind of annoying.” Maybe you could point out places where there are potential bugs. “Can you just tell me what lines seem like the most suspect?”
We for example made some internal error checkers and linters that’ll sort of highlight “Okay, yeah, this thing’s not even important.” Your editor does this for you. But you can also highlight things like “Hey, this spec doesn’t look like it was actually properly implemented here.” Or “This function specification is kind of ambiguous for these edge cases. Do you want to take a look at that?”
A lot of the work that we’ve done for our evaluations is related to this as well… So when we look at evaluation data, most of the time when systems fail, it’s actually from under specification, and not from the model messing it up fundamentally. It’s more like as a user I didn’t really decide what I wanted.
So I think one thing that’s really interesting to me is that coding is not really about pure correctness in like this abstract mathematical form, where there’s a perfectly correct version of this. The version of the function that you want, and that I want, are actually subtly different. And what I want in the moment might change from moment to moment as well. So the user really needs to be connected to that.
And as it happens, I also learn about things where I’m like “Yes, you did exactly what I wanted… But that turned out to be not a good idea.” And so I think the user needs to be there and able to learn and refine what they even want, and what’s even possible in the world.
You’ve piqued my interest again there… As a coder myself, who makes all sorts of errors in my code, constantly, as you’re doing that, and you’re kind of changing the workflow over time of how the coder is spending their time, and then ultimately potentially how they’re thinking about coding as they adjust to the new approach that your tools are doing, how does that look for the coder going forward, in terms of how does it change their day to day experience of coding? Are you able to rescue me from spending 90% of my time coding errors and forever trying to get myself back out of that hole?
[00:37:39.20] That’s really like the vision for Imbue and for the company and for the work that we’re doing, is can we get to a place where people, not just coders, but other even non-technical people can effectively write higher-level pseudocode or code, or intent, and actually have this translated into real code, and into something that actually makes your computer do what you want? That’s why when we’re talking about making a new personal computer etc. we’re really – at the end of the day, the thing that is missing is the ability to robustly write the software. And we can as software engineers get down in the details, and get everything there, and we spend a lot of time fixing our own bugs etc. And our goal is to make it so that as a user you can keep working at a higher and higher level of abstraction, and feel confident in that. Right now you can work at a super-high level of abstraction to say like “Make this whole thing for me.” It doesn’t work, and so that’s not very fun, because [unintelligible 00:38:29.28] and now how do you get into the details, etc? So how can we make it robust enough so that you can work at a higher level of abstraction and trust that this part was actually correct, and be able to have that dialogue back and forth when like “Okay, maybe it’s not quite working like I want it, or maybe it’s not possible to do this thing. Or not as easy to do it in the way that I wanted to do it, etc.” So how do you have a dialogue and help educate the person about what is possible, what isn’t working, what might not be working, [unintelligible 00:38:54.19] So it changes the workflow, and I think we’re interested in how do you change this workflow in a slightly more incremental way. You could, just say “Oh, we’re gonna have the AI system do everything for you, and magically try and figure it out”, but I think from our previous experience, we don’t think that these types of products are nearly as like good to use as the user experience.
Trying to fully automate something kind of is disempowering to people, and also results in kind of a worse experience and a worse product. So we’re more interested in this interactive, dialogue-like tool, that as a person I’m trying – maybe you can just write a line of pseudocode, you get a big block out, it tells you one line that is potentially problematic for you to look at… Or maybe it just gets it right. Okay, great. You move on to the next one.
So that’s one way that you can think about writing code, as like writing pseudocode. But there’s other ways you could write it. You might also write a command, change the file to add lots of log statements. Or you might also say like “Make this function more robust.” There’s lots of different ways that you can interact with this, and how can we give people more tools, more like paint brushes for being able to change code, and ultimately, make their computer do what they want?
I think the thing that’s really exciting about this is that when you can robustly write software, what you’re really doing is being able to create agents that can do a huge swath of tasks. If you’re not able to write robust software, then the only way your agent can interact with your computer is with things that we have already programmed as actions. Like “Okay, we’ve programmed it to go to a website, and like click a button, and that’s it.” But if they can write software, now they can do some huge set of things, and even things that you never intended or programmed in the first place. So for us, agents and writing code and reasoning are all like intimately connected.
I have one more tiny follow-up to that, that’s a personal thing I run into all the time, and having someone with your expertise, I want to throw it at you. Does it make a difference, as most software developers, including people in the AI space doing models and stuff - you know, they write in Python, they write in usually a variety of different languages. And as I shift from one to the other, I find that some of the capabilities that are currently out there, they are great on Python, because everyone on the planet is writing Python… But if I’m writing on something that’s slightly more obscure, maybe even something big like Rust, it struggles to do the exact same thing that it can do flawlessly on the Python side. Do you anticipate a time where that context shifting no longer applies very well, and that they’re all high-fidelity in terms of what they can do? Are we always going to be dogged a bit with the obscurity issue of certain languages?
It might go the other way. It might be that because it’s so much more robust in Python, we should only ever write in Python. And so what we do is we just write in Python, and we make a Python to Rust converter. Or we make a thing that assembles Python to Assembly, or whatever. It might be that it’s sort of better to like double down on a really small set of things that we’ve made tons of data for, and works really robustly… Because you get a better kind of user experience.
[00:42:01.03] One of the things that a lot of these models struggle with now is like you have different versions of like NumPy, or Python, or Ubuntu, or whatever. Things are different. How is it supposed to know what version you’re using? And so there’s this [unintelligible 00:42:10.28] explosion of complexity that comes from all these different possibilities… And so an alternative way to do this would be to say “You know what? Let’s not do that.” Let’s just say you’ve got Ubuntu 22.04, you’ve got this library version, you’ve got that one… If you do this, I think it might work a lot better.
So it could go actually in the other direction - instead of making it more robust on all these niche things, we might say “You know what, [unintelligible 00:42:31.20] Let’s not worry about what language it writes.” Maybe we only write at this higher level, and we never even look at that code anymore, so we don’t care if it’s in Rust or Python. I think once that happens, once we’ve sort of abstracted it up a level, then you might be able to come back and say “Why are we writing this in Python? This is not a typesafe language. This is really slow. Why don’t we change it to be a language that fits better for language models?” And that might be an even better future thing. But that will require generating a ton of data to make this actually work… So I see that as like maybe probably a future thing, not a thing to focus on right now… But that’s my guess as to how it’ll evolve.
But also, an alternative world would be - you know what? It gets really cheap to just generate all this data, so we just make a converter from all of our Python pretraining data, to just make it do it in JavaScript and Rust and Elixir and whatever all the time anyway. So fine. We just like train it to be good in all these. I don’t know, we’ll see which way it goes.
Yeah, well, Chris will be happy if anything stays in Rust. I’m sure.
I wasn’t saying that… [laughs]
You’ll be happy. We just started working on our official Rust client for Prediction Guard, Chris, so you can be a beta user.
There you go.
It’s been great to talk through… Again, I love this concept of this sort of full-stack approach that you’re taking, and triggering things in my own mind to think through in my own work… But as you look forward, either you personally, or you at Imbue look forward to kind of the things that are happening this year, either in the community as a whole, or at Imbue, what’s kind of most exciting for you that you see as a possibility kind of coming into the future? …whether that be multimodal stuff, or new types of agents, or products, or directions that the community’s going, or the research is going? What kind of stands out to you about that as you look to the future?
I think the thing that is going to be most exciting over the next year or two, at least for us internally, and probably for other providers externally, is I think we’re going to make really good progress on what we have been talking about today, on actually reasoning, on robustness. I think once you can get to a place where you ask this question and you get back an answer that is really correct, and like robust and grounded… It’s not just “Oh, it said yes”, but it has all the right reasons, and it kind of like understands the nuance of “Okay, yeah”, it’s like “Yes, -ish…”, but like there’s a little bit of complexity here, and you can ask follow-up questions, and those are also right and robust… That ability to robustly reason and answer questions is going to unlock some huge amount of work that I think people are not really anticipating.
Once we really have the ability to robustly reason through scenarios, now we’re talking about a lot more like labor displacement and disruption than we were before, and there’s a lot of jobs that all of us can pretty easily put together, like “Well, first I do this, then I do that, then I think about this.” Like, okay, it only takes like one person to do that when you have these tools that are that powerful.
So I think there’s going to be a lot more change in this area than people are really expecting right now. It’s not to say that all jobs disappear or something, but the nature of work might change pretty dramatically, and we might have much more powerful tools than I think people are anticipating.
Great. Yeah, well, we’re really happy that Imbue is thinking deeply about those things as we look to the future, and in a really practical and useful way. So thank you for doing that, thank you for your research and for taking time to join us. This has been great.
Yeah, it’s fun. It’s been great. Thanks a bunch, guys.
Our transcripts are open source on GitHub. Improvements are welcome. 💚