Practical AI – Episode #262

AI vs software devs

with conversations from JS Party, Go Time & The Changelog


Daniel and Chris are out this week, so we’re bringing you conversations all about AI’s complicated relationship to software developers from other Changelog pods: JS Party, Go Time & The Changelog.


Sponsors

Neo4j – Is your code getting dragged down by JOINs and long query times? The problem might be your database… Try simplifying the complex with graphs. Stop asking relational databases to do more than they were made for. Graphs work well for use cases with lots of data connections like supply chain, fraud detection, real-time analytics, and genAI. With Neo4j, you can code in your favorite programming language and against any driver. Plus, it’s easy to integrate into your tech stack. Visit Neo4j.com/developer to get started.

The Hacker Mindset – “The Hacker Mindset” written by Garrett Gee, a seasoned white hat hacker with over 20 years of experience, is available for pre-order now. This book reveals the secrets of white hat hacking and how you can apply them to overcome obstacles and achieve your goals. In a world where hacking often gets a bad rap, this book shows you the white hat side – the side focused on innovation, problem-solving, and ethical principles.

Fly.io – The home of Changelog.com. Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.


Chapters

1. 00:00 Welcome to Practical AI (00:38)
2. 00:38 Not your typical fare (02:03)
3. 02:41 JS Party #317 (18:10)
4. 20:50 Sponsor: Neo4j (00:54)
5. 22:02 Go Time #306 (20:36)
6. 42:38 Sponsor: The Hacker Mindset (01:00)
7. 44:00 Changelog & Friends #28 (12:13)
8. 56:13 Outro (00:47)

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Hello. Jerod Santo here, Practical AI’s producer and managing editor of all the shows here at Changelog. Daniel and Chris took this week off, but we didn’t wanna leave you hanging without anything to listen to, so today’s episode is going to be a little different than our usual fare. AI is permeating the entire software industry, so we’ve found ourselves talking about its impact, sometimes in practical ways, other times in less practical ways, on many of our pods. So today we’re serving you a sampler platter. You’ll hear a segment from this week’s JS Party podcast, where my two co-hosts, Kball and Nick Nisi, and I discuss the recently-announced Devin project, which is making waves in developer land. You’ll hear a segment from a recent Go Time episode called “How long until I lose my job to AI?”, where Johnny Boursiquot and his experienced panel of friends discuss using code gen AI to augment your dev skills instead of replacing you.

And finally, you’ll hear a segment from the Changelog, where my co-host Adam Stacoviak and I talk to José Valim. José is the creator of the Elixir programming language, and he’s been a guest here on Practical AI in the past, talking about Elixir AI tooling. Today you’ll hear us question him regarding Elixir’s place in a world increasingly influenced by large language models, and how he thinks about it as a language author and promoter. Hopefully, there’s a little something for everyone in this episode, and if it’s not approaching AI from a perspective that’s compelling to you, don’t worry, your regularly scheduled programming with Chris and Daniel will be back next week.

Okay, first up is JS Party and Devin.

Break: [02:23]

There’s some news that’s good. There’s also some news that’s maybe bad, maybe good… I don’t know. The thing that everybody’s talking about, this week at least as we record, and last week as well, is Devin, the first AI software engineer, according to the makers of Devin, which is Cognition Labs, a new company which raised a series A led by Founders Fund, headed up by Scott Wu, who seems to be a very intelligent person, even from a young age, if you watched that video of him doing math very quickly, at ages when it seems like you shouldn’t know math very quickly… And they got a demo out there on this new AI software engineer. So I could say more; I’ll stop right there. You all have probably seen the demo, Kball and Nick, or at least heard about what’s going on… And this is a new tool, which can start from scratch, and do some cool stuff. I’ll just leave it there for now, and we can talk about the details.

I mean, if you’re excited, you too can pay for the right to have a software engineer that can only fix one in seven of your tickets, and spin up lots of new ways for AWS to charge you money without your oversight.

Sounds like an intern. I’m just kidding. Sounds nice. And what are you referring to? Or are these some specific things that Devin’s been up to?

So high level, there’s a couple of things that I’m referring to here. So one is they’re pumping up the market, “This is a standalone – why get a coding assistant? Get something that can go and do your software.” And they published some data on it, and it does do better than the state of the art in terms of tackling going from a GitHub issue, to “Okay, I’m gonna actually solve this [unintelligible 00:04:29.17] to happen.” But the number they published, I think, was 13.86% of issues resolved. So that’s about one in seven. So you point it at a list of issues, and it can independently go and solve one in seven. And first off, to me, I’m like “That is not an independent software developer.” And furthermore, I find myself asking “If its success rate is one in seven, how do you know which one?” Are the other six ones where it just got stuck? Or has it submitted something broken? Because if it submits something broken, that doesn’t actually solve the issue, not only do you have it only actually solving one in seven, but you’ve added load, because you have to go and debug and figure out which things are broken. You have a whole bunch of additional load. So I think the marketing stance there is a little over the top relative to what’s being delivered.
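To put rough numbers on that added-load worry, here’s a back-of-the-envelope sketch in Python. Only the 13.86% resolve rate comes from the conversation; the review and fix costs are made-up assumptions, so read it as an illustration of the reasoning, not a measurement.

```python
# Back-of-the-envelope model of the point above: an agent that resolves
# roughly one in seven issues doesn't save one seventh of your time
# unless verifying its attempts is cheap. Only the 13.86% figure comes
# from the episode; the cost numbers below are illustrative assumptions.

RESOLVE_RATE = 0.1386   # fraction of issues the agent actually fixes
REVIEW_COST = 0.5       # hours to review/verify one agent attempt (assumed)
HUMAN_FIX_COST = 4.0    # hours for a human to fix an issue unaided (assumed)

def hours_per_issue_with_agent() -> float:
    """Expected human hours per issue when every attempt must be reviewed."""
    review = REVIEW_COST                            # you check every attempt
    residual = (1 - RESOLVE_RATE) * HUMAN_FIX_COST  # ~6 in 7 still need a human
    return review + residual

print(f"with agent:    {hours_per_issue_with_agent():.2f} h/issue")
print(f"without agent: {HUMAN_FIX_COST:.2f} h/issue")
```

With these particular assumptions the agent barely breaks even, which is exactly the “how do you know which one?” problem: the win depends on review being much cheaper than fixing.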

The other thing - and this is around… I think a part of what they do is “Oh, it can spin up resources for you.” And they showed this cool demo of “You point it at this thing and it allocates a bunch of different production resources for you.” And the person in me who’s handled DevOps before, and now the engineering leader who has to sign off on our DigitalOcean, or AWS, or Google Cloud or whatever expenditures you might have, looks at that and is terrified by “I’m gonna give an LLM, which is known for hallucination, which is…” These things are not – you have to design an application… And I’m building applications with an LLM, but you have to design around their unpredictability and their willingness to lie… And I’m going to give that raw access to spinning up resources in my cloud? Like, that sounds – well, it sounds like something I would not sign up for, I’ll say that.

Kball, let he whose success rate at issues is greater than one in seven cast the first stone…

Yeah, I was wondering what Nick’s ratio is over there. One in seven sounds about like what I would do. I’d pick off the easiest one first. Does Devin know what the easy tickets look like? Because that’s a skill right there.

I’m over here counting on my fingers trying to see if I’m within that ratio…

[laughs]

But do you know when you fail? Or do you just throw out broken code and you’re like “Yeah, here you go.”

It’s more of a question of “Do I know when I succeed?” I guess… Which is - yeah, same thing.

You think you’ve succeeded, until you find out later that actually you’ve failed? That’s been my experience. Or you succeeded under the constraints that you put yourself under, right? Or that was actually specified in the ticket itself. But you actually failed at some other unnamed, unlisted constraints, that were unknown at the time, but are obviously clearly there, in production. And so in that context, you’ve failed. It’s not easy. It’s not easy to succeed in this world. Well, what about – Kball, can’t you point Devin at like a $5 a month DigitalOcean and say “Deploy to this?” Can’t you cap your risk, I guess, on the DevOps side?

Probably. You probably can. And I do want – so I’m taking a hard skeptic stance on particularly the claim that this is an AI software engineer. Like “Don’t hire a person, use this thing.”

And this is their claim. So I think it’s fair for you to be that harsh on them, because they say “Meet Devin, the world’s first fully autonomous AI software engineer.” That’s a very bold claim. So I think it’s fair that you’re being that harsh. Go ahead.

Yes. They’re showing some cool stuff. It looks like a pretty interesting tool to put in the hands of someone who knows what they’re doing, and is able to validate it, and is able to say “Okay, go and solve this relatively well-constrained problem, where I can easily validate the correctness of your output. Go at the sandbox, where I know that you’re not spinning up massive amounts of resources in a way that I’m going to regret”, or even “Go at this non-sandbox situation, but I have the knowledge to check what you did, look at the logs and be like ‘Yeah, that’s okay.’” Those are really cool things. That could be really valuable. That could dramatically increase somebody’s productivity. And those are so far from being something that I would trust independently to replace a software developer that they’re not even in the same country; maybe not even in the same world. These are just completely different claims.

Yeah. I think that the sensationalism of this comes from not what it can do now, but what it represents, and the progress that it’s made when compared to other things - whatever it was comparing that 13% to; other AI chat things that can do things. It’s way better than all of those. It still sucks compared to a human, but it’s made monumental progress in terms of AI. And I guess the question is, “Does that continue?” Can it get further than that? Or will it reach some kind of limit? And then the other piece of it, I think, just from a marketing thing - and I’ll be honest, the only thing I’ve seen on it really is a Fireship video - is that it’s already doing some work on Upwork. So in a way, that’s a marketing claim, that it competes against real humans for jobs.

Yes. Truth. According to them. I haven’t confirmed it, but what you said is true, that they say that, yes.

So this is a struggle with all of the LLM world right now, and all the AI world… Because on the one hand, we have been in a place where we’re in the rapid part of an S curve. There have been some very rapid advancements in the core capabilities of these things. And they are super-freaking cool. Like, really cool. And also, they have a lot of limitations. A lot of those limitations are baked into the architecture that’s being used.

[10:07] And so you get kind of a situation where there’s a bunch of people doing really cool stuff with this, and trying to figure out what it’s good for… But it demos way better than it does anything reliably in production. Because you can get a really cool outcome 40% of the time, in some situations 70% of the time… And like you show that, and people are like “Oh my gosh, this is gonna take over the world!” And I would not trust a, for example, AI software engineer, even that could handle 70% of my tickets, but 30% of the time spins up millions of dollars of cost for me, right? Or like other things.

And once again, I’m not trying to take away from the technology, but I don’t think these hyperbolic claims actually serve anyone, except for getting attention. They get attention - okay, great. And you’re gonna get a whole bunch of people who buy this thing and are disappointed. If it costs them a bunch of money, they’ll sue your ass off. I’m like “Why would you do that to yourself?”

It’s somewhat similar to generative AI in the image - let’s just stick with the static image world - where everything you see is impressive results… And it’d be like “This new Midjourney 7 is off the charts amazing. Here’s nine examples that will blow your mind.” Right? And if you click through on that, they’re all going to be very impressive; those are amazing things. But then you have to stop and think “Well, Midjourney didn’t create nine examples that blew my mind. Midjourney probably created 40, 50, maybe 500 examples, and then you, human, decided which ones were amazing, and you cherry-picked those out as the examples.” And that’s great teamwork, guys… Right? Computers plus humans equals better results. And so there’s the cherry pick, and that’s what code review on these things will be, that’s what happens when you tell Copilot “No, I do not want that function.” Right? It’s all, as HipsterBrown calls it in the chat room, “Human in the loop”, and that’s exactly what is necessary. And I think the reason why you call them hyperbolic claims, Kball, is because they’re saying it’s a fully autonomous AI software engineer. Human out of the loop. Let it rip! And maybe fans of the Bear will like to say let it rip, but those of us who aren’t fans of Devin are thinking “Let’s not let it rip too much, because it might just tear the whole thing down.” Now, I’m being hyperbolic. Nick, you’re nodding along… Do you agree with me?
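That cherry-picking workflow is easy to picture in code. Here’s a minimal best-of-N sketch; `generate` and `score` are hypothetical stand-ins, and in the real workflow the scorer is the human in the loop.

```python
import random

# Tiny sketch of the "generate many, cherry-pick few" workflow:
# produce n candidates, rank them, keep the best handful.
# `generate` and `score` are stand-ins; in practice the scoring
# is usually a human eyeballing the outputs.

def generate(prompt: str) -> str:
    return f"{prompt} -> candidate #{random.randint(0, 9999)}"

def score(candidate: str) -> float:
    return random.random()   # stand-in for human judgment

def best_of_n(prompt: str, n: int = 50, keep: int = 3) -> list:
    candidates = [generate(prompt) for _ in range(n)]
    return sorted(candidates, key=score, reverse=True)[:keep]

print(best_of_n("a lighthouse in a storm"))
```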

Somewhat, yeah. Yeah, it’s humans who are deciding what is good out of that, and kind of helping to train that going forward. But I was trying to think and trying to relate this to another article I saw, that wasn’t about Devin specifically, but it was about like prompt engineering as a “profession” being taken over already by AI, because an AI can iterate and more quickly come up with a way to answer the questions that you want, by appending exactly what it wants to hear at the end of a string. And I think the example that I heard from that was like “We want you to answer this question”, and the AI is “incentivized” to answer it a little bit better if you put it into a scenario that it likes. So the AI is Captain Kirk on the Enterprise, and it has to answer this question to save the planet from whatever. And the question could be “What’s two plus two?”, or something really simple. And by putting in all of these extra prompt words that the AI is coming up with on its own, it’s making better results overall. And I’m just wondering how that marries to the idea of humans being the ones who curate the good ideas that come out of it.

[14:00] Well, prompt engineering – I’ve been convinced by Swyx that it’s a code smell. At first, I was convinced this is the new thing that everybody needs to learn. And I think it’s just a leaky abstraction that we’re currently dealing with as humans, because the tooling is not good enough, so we have to engineer the prompts. I mean, Google’s search box is prompt engineering. Knowing how to google – it’s the exact same thing, it’s just way harder, and it’s like way more magical now to tell it the magical incantations to get the best results back out. And so the fact that it knows what results are better to me is not intelligence or anything; we just need that to go away.

And I think Devin is actually an example of where they’ve productized and hidden a lot of the innards that we’ve currently been exposed to, in order to make the tool work better for an inexperienced user than it otherwise would. Like, they’ve actually turned it into a product. And I think that’s great. I think it’s one step on a long line of iterative improvements that will make it so that prompt engineering – I mean, you’re just going to basically talk to it in layman’s terms, and it will know how to feed itself the correct prompt, so to speak, in order to get the goodness out. But I don’t know… Kball, back to you.
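As a rough picture of what “productizing the prompt” can mean in practice, here’s a minimal sketch; the template wording and the `complete` function are hypothetical stand-ins, not any vendor’s actual API.

```python
# Minimal sketch of hiding prompt construction inside the product:
# the user types a plain request, and the tool assembles the
# structured prompt behind the scenes. The template and `complete`
# are invented stand-ins, not a real vendor API.

TEMPLATE = """You are a careful senior software engineer.
If the request is ambiguous, list the questions you would ask first.

Task: {task}
Respond with a plan, then code, then how to verify it."""

def complete(prompt: str) -> str:
    # Stub so the sketch runs without a model; swap in a real client.
    return f"[model response to {len(prompt)} chars of prompt]"

def ask(task: str) -> str:
    """The caller never sees or writes the prompt built below."""
    return complete(TEMPLATE.format(task=task))

print(ask("Add pagination to the /users endpoint"))
```

The point is only that the “engineering” can live in the template rather than in the user’s head.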

Yeah. I mean, I think – so high level on all this AI stuff is there’s really cool stuff there. We’re figuring out how to use it, and the current state is clearly intermediate. However, the thing I want to keep coming back to with this is there are things where it’s like “Okay, this technology is immature, and we’re going to evolve around it” - and figuring out how we handle prompts, and managing prompts, and what’s generating them, and whatever - that fits well in that bucket. And there are things that are fundamental pieces of the way the technology is designed. LLMs, machine learning models in general, are statistical, probabilistic. They’re very different than most things you think about in software, where you’re trying to make something that is logical, consistent… Like, you put A in, you get B out. And that is not there with these things.

And so you can design applications around that, and there are things that you can do to sort of pin that down, to add validation that is outside of the LLM, and do other things, and maybe Devin is doing that… But I think the more we start looking at these sort of places that require judgment, places that require precision, places where - like, if you just make some random s**t up, it can cause a lot of problems… There’s a fundamental thing about what the technology does that means it’s not necessarily going to be a good building block for that. And so making hyperbolic promises about where it’s going to develop, that depend on it being a fundamentally different technology than what it is… I feel like you’re setting yourself up for a lot of heartbreak.
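Here’s one minimal sketch of what “validation outside the LLM” can look like, assuming a stubbed model call; the schema check and retry count are arbitrary choices for illustration, not a prescribed pattern.

```python
import json

# Sketch of wrapping deterministic checks around a probabilistic
# component. `llm` is a stub standing in for any text-completion call.

def llm(prompt: str) -> str:
    return '{"service": "checkout", "action": "restart"}'  # canned output

def ask_for_json(prompt: str, required_keys: set, retries: int = 3) -> dict:
    """Retry until the output parses as JSON and has the expected keys."""
    for _ in range(retries):
        raw = llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue                        # malformed output: try again
        if required_keys <= data.keys():    # shape check, outside the model
            return data
    raise ValueError("model never produced valid output")

print(ask_for_json("Propose one safe ops action as JSON.",
                   {"service", "action"}))
```

The correctness guarantees live in the deterministic code around the model, which is the design being described here.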

What about the job market? Do you think it’s fundamentally affected by tools like Devin as they progress over the next three to five years? Because we’re not talking about humans out of the loop. I think we’re all in agreement here that that’s not feasible, or smart, at least, in today’s technology plateau of LLMs… But fewer humans in the loop - that seems very feasible if these tools continue to iterate and make not revolutionary, but evolutionary advancements from here.

Yeah, if it makes me three to five times faster, do we need three to five times fewer engineers…?

Yeah. So this is a technology that has the potential to dramatically impact the productivity of software engineers. And I think there’s a couple of different things around that, as we think – so short-term, that can create some disruption. Short-term, that means that a company that had been running on, say, five engineers, and might have needed to hire and expand to 15, now they don’t have to expand nearly as soon, and things like that.

[18:00] So I think there is the potential for relatively short-term disruption. I will say, the history of economics broadly and software in particular is that every time we make it easier to code, we discover there’s a whole world now that we can address and build software around, that we couldn’t before. So if, for example – and this, actually… There’s a particular example of this that I think is interesting to dive into. So one of the big economic challenges in the tech industry in the last four or five years is that we had these massive tech companies, with incredibly high revenue per employee: Google, Meta, Netflix… The FAANGs, mostly. And so they were able to set the salary bar that was super-high; they were paying ridiculous amounts of money - that’s a technical term, ridiculous - for software engineers.

And then when we had very low interest rates, and a ton of VC money flowing into the industry, there were lots of companies whose fundamental business economics do not support that level of salary per software engineer, who were nevertheless paying that amount of salary per software engineer based on VC capital. And sort of this thesis that “Okay, we’ll be able to scale out of this, and we’ll get whatever.” And I think that caused a lot of distortions and problems in the field.

Now, if suddenly software engineers are three to five times more productive, the range of businesses that could use software, but previously could not afford to compete with the FAANGs etc. of the world - there’s a whole set of business models in there that become viable, because it’s that much cheaper to develop software. And so I can imagine this actually dramatically expanding the number of viable software businesses, or businesses that are non-tech but would like to include software, or could have custom software - dramatically expanding the number of those.

So long-term I don’t think it’s a negative impact on the software engineering career path. I think that what it means to be a software engineer looks a little bit different when you have different types of tooling. That has been true as long as I’ve been around. JavaScript land - I remember when jQuery was a revelation. Oh, my gosh, this is gonna make me much more productive. And it did make me so much more productive, along with all these other different things. And now the level of tooling that we have there that supports our productivity building things on the frontend is astronomical. And has that taken away from the number of people writing JavaScript?

Speaking of astro-nomical, Astro has a new database…

Up next we have a Go Time podcast fireside chat between long-time programmers Kent Quirk, Sharon DiOrio, Steven Pyle and host Johnny Boursiquot.

It’s one thing to have gen AI pump out snippets of code that are part of a larger whole, right? …whereby I’m the engineer, I’m engineering a solution; I’m not just a code monkey just clocking out syntax. I’m trying to fix a problem, I’m engineering a solution to a business problem. Now, I could go as high level as I can, and I can just open up Copilot in chat mode and say “Hey, this is what I’m trying to accomplish. Start spittin’ out files.” Now, maybe today it can build somewhat trivial apps. I’ve seen YouTube videos and clips and things of it spitting out entire working React apps and all these things, and that’s great. And I think over time it’s gonna become even better at doing those things. But I have a hard time trying to sort of correlate that, or trying to replace solution building. Because to me, solutions aren’t static, right? When a business comes to me and says “Hey, I need you to build a solution to this problem”, I build it, they take it into production, they do stuff with it, and they come back and say “Hey, you know what? This is great. Now, I need to change it in this way. Or I need to account for this exception”, or “I need to account for this particular use case, or this specific customer, where 90% of the time it works this way for every customer of this type, but for this customer of that type –”

“But on alternate Thursdays during a full moon it does this completely different thing.” [laughter]

Exactly. So how are we supposed to treat those entirely made up solutions that – am I just feeding that back into the system, and saying “Hey, so now account for these alternative approaches”?

Is it gonna be like it was when the first generated code frameworks started hitting the scene? And you’d go in and there’d be all this code, and it was like - yes, super-fast. If you had to do an ORM, it wrote all the code for you, et cetera. And then you needed to change something, and all of a sudden it was like – you know, change management in some of those days was…

Yeah. Or regenerate the whole thing from scratch and “Oh, sorry about all your customizations…” [laughter] Yeah.

So that’ll be another big test that AI has not yet proven it can do…

Well, so let’s talk about art for a second though, because this is, again, a similar thing. Everybody’s really excited, like “Look at the images I can generate with Midjourney”, or whatever.

With stolen art.

Right. But the point is, again, it’s going out and giving you the average solution. It’s going out and going “Here are the things that look most like what you described, that somebody else has created already”, and kind of cobbles pieces of that together.

Or “Here’s an opinion formed by the loudest voices out there, that I sucked up as source data.”

Hell yeah. I sat in a meeting today where an artist went over her design, basically her design process for a big design project… Like, “Here’s the resources I looked at. Here’s the feeling I was going for. Here are the things I considered. I looked at these typefaces; this typeface reminded me of this building architecture, which is relevant to the site we’re –” And then that artist proceeded to churn out over the course of a couple of months 200 pieces of support art for an event. That was a brilliant design exercise by somebody deeply steeped in art and creation, who then studied the event and what the event needed and integrated all that.

And yes, some random person could have sat down with Midjourney and said “Make me this stuff”, and it would have been much less good. But people who don’t know the difference would have been “Sure, it looks fine.” I mean, we’ve all seen that, right? My document has 37 fonts and 12 colors, but… [laughter]

“It looks fine to me…”

[26:04] But there’s a big difference between something crafted and something just slapped together. And yeah, I guess I think that AI is going to make it easier to slap *bleep* together. But…

But for most people though, would you argue – so here’s what I’m not saying. I’m not saying that these things generated by AI – like, if you’re a connoisseur of a particular art, you’re an architect of a particular kind of application, or solution, or thing, whatever it is, you can critique the output of Gen AI, as it stands today; again, arguably, it’s going to get better at what it does. But you can critique the output today and be like “This is subpar. This is not as good as what I could have come up with.” But for most people, it’s good enough.

It depends on what they’re using it for. Again, if you’re just doing something for yourself, who the hell cares? I mean, yeah, I’ll slap something together out of two-by-fours if I’m building it for my garage. I don’t care. But if I’m going to sell it, if I’m going to make a business around it, that’s the part where I’m saying, I don’t think the AI stuff is there. If you’re just doing a hacky project for personal use, yeah; I mean, maybe you would have had to pay somebody to come in and slap that shelf together in your garage if you didn’t have the skills to do it yourself. And so now there’s this kind of – yes, there’s a few things that I couldn’t do before, that now I can do today, for myself. Design that invitation for my kid’s birthday party. Hell yeah. I can’t draw, but I can use an AI. There’s nothing wrong with that. So yeah, there’s probably some – you know, you’re not paying the kid next door 20 bucks to do that for you.

But that’s now. What happens in the future as AI evolves, and improves? Do we get to this uncanny valley level of “Oh, it’s not just good enough, it’s like–”

It’s like the standard. Now it’s building your whole kitchen. [laughs]

Do we need to worry about that?

And how much coding out there is most of – you know, how many CRUDs have we ever created in our lives? How many CRUDs are still being created every day? Yeah, okay, so that’s a problem that’s largely been solved, to a greater or lesser degree. But –

Yeah, I mean, it’s white box, right? You make white box easy to do.

Yeah. And tech has always been taking things that had gates around them, or limited availability, and making them more available. Artisanal things used to be something only certain artisans could do; only musical artists who had a studio and an audio engineer could record… And now they can go and create their own tracks with one app at home. For me, that’s the part that’s like “Oh, I can’t complain.” I can’t be the curmudgeon complaining about this latest thing that might make something that I do more accessible to other people. Like, I’ve benefited from these other things that came along. It’s time to share the wealth. I don’t want to; I’m really rooting against it… [laughter]

I don’t think it’s about gatekeeping. I feel like a big chunk of my career has been spent trying to help people learn to program. And so I’m not thinking that the reason I’m skeptical is because I don’t want other people to do what I do. I think it’s more because I feel like the hype and the reality are distinct. That what the reality is producing is mostly devoid of creativity. People are confusing knowing what to look up with being creative. And I think knowing what to look up is a skill, and a lot of us have it, and the better engineers I think are better at it. And so yes, AI helps to ease that problem. But knowing what to even ask about, or looking at a new solution to a problem - that’s something that I think is well beyond what AIs are capable of now, or in the reasonably near future… The LLM model I think is fundamentally non-creative. That’s my take on it.

Spicy. We’re not at the –

[30:13] We’re not at that point yet?

We’re not at the unpopular opinions yet. So one thing you mentioned, the whole teaching – y’all remember when… Maybe it was during the first or second Obama term or something, but there was this giant push to teach everybody how to code, right? It was everywhere. It was in the media, it was in newspapers… “We need to teach our young how to program.” Now I’m looking at a clip from NVIDIA’s CEO, like three or four days ago or something, saying “Hey, people shouldn’t learn how to program. You should now let – the new programming language is the human language.” And I’m thinking “Man, you are sitting here – you stand to gain billions, a bajillion dollars if your wish comes true, because you’re producing chips and stuff for these things. Of course you’re gonna say that.” But I’m definitely not in the “Don’t teach people how to program” camp.

No, no. I’m gonna take a slightly spicy take here. I don’t think he’s completely off. Now, in the time we’ve all been engineers, we’ve seen waves of different things that are going to come and take our jobs. Offshoring, and code generation. Now AI. And they haven’t. And my theory is that the key thing that an engineer has is the ability to communicate. And the people on the other side you’re supposed to be communicating with - product, the business, whatever affectionate term you use for them - aren’t always as good at that… Although it should be part of their job. But having somebody who can think back and forth - there will, I think, always be a need for those people… Because every CEO thinks they have the answer to every question.

That’s what they’re paid to do.

Right? But they really shouldn’t. If they have a business that’s big enough to grow, their biggest skill is finding the people and putting them in the right place. So if your job right now is like doing CRUDs for a company that can’t even explain what they want, I wouldn’t worry, because they’re not going to be able to explain what they want to AI.

Yeah, it’s the thinking logic – yeah, it’s about breaking something down into steps and thinking logically. I once did have a client, very early in my career, who was a pretty good businessperson, who really wanted to automate his business, and he was able to sit down and explain it to me. Like, if he had had the tools to program, he could have written his own code, because he thought about it really logically. And it was just my job to basically take dictation and turn it into Pascal for him back in the day. But that’s few and far between, quite honestly. Most people who specialize in business aren’t specializing in thinking logically. They specialize in thinking about people, and like you said, about communications.

So does AI then make you more what you already are? If you’re a logical thinker, you’ll benefit, and if you’re not, you’ll still struggle?

And who gets to train the agent? [laughter]

I want to be on the training side. I want to be the one doing the building of the things that you use… [laughs]

It’s a good question.

I think at a point it becomes like a personal AI, where it’s tuned to you, your data doesn’t get shared… Then it becomes like a superpower. It’s like your co-pilot, right?

But how would that work? If it’s only got your data, it’s basically replicating, to a point, you…

There’s a generic – it’s trained on the universe, and then specialized for you…

Right.

…is the way that we’re seeing all this.

It’s like creating your own GPT, but it’s based on a larger model.

[34:04] Yeah. Exactly.

My company, we’ve built – we have a query engine that looks like SQL. And then we built an AI where we train that AI as part of a prompt. We basically can go out and get your data, and all the names of your fields, and the data types of your fields, and we can plug them into the AI query so that when you say “Show me my slowest service”, it can go “Alright, what are the fields that are named according to time, duration? And what are the things that look like a service name? And now I can write a query for you.” So it knows how to query Honeycomb, and then it can write that query for you from your inept prompt, because it’s been specialized for that particular type of application. And I think that’s a really cool use of AI.
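Concretely, that schema-in-the-prompt trick looks something like the sketch below. The field names, the query format, and the `complete` stub are invented for illustration; this is not Honeycomb’s actual API or query language.

```python
# Rough sketch of the idea: fetch the customer's field names and types,
# inline them into the prompt, and ask the model for a query. Schema,
# query syntax, and the `complete` stub are all invented placeholders.

SCHEMA = {
    "duration_ms": "float",
    "service.name": "string",
    "http.status_code": "int",
}

def build_prompt(question: str) -> str:
    fields = "\n".join(f"- {name}: {typ}" for name, typ in SCHEMA.items())
    return ("You translate questions into queries.\n"
            f"Available fields:\n{fields}\n"
            f"Question: {question}\n"
            "Answer with a single SQL-like query, nothing else.")

def complete(prompt: str) -> str:   # stub for a real model call
    return ("SELECT service.name, AVG(duration_ms) GROUP BY service.name "
            "ORDER BY AVG(duration_ms) DESC LIMIT 1")

print(complete(build_prompt("Show me my slowest service")))
```

The specialization lives entirely in the prompt assembly: the model never has to guess what the user’s fields are called.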

That is a productivity boost for people who are already technical. Full disclosure, my startup is a Honeycomb customer. So I will go in the dashboard, and I can formulate those queries. But I’m already a highly technical user, who knows how to use these kinds of tools and knows exactly what kind of data I’m looking for and when I’ve found it. Now, for the layperson who doesn’t know… The more I think about it, the more I’m thinking “Okay, if I’m a–” On one end of the spectrum you have the complete layperson, who is using perhaps ChatGPT or something like it to maybe generate copy, and not hiring a copywriter like you might traditionally do back in the day. I’m sure the copywriters of the world are suffering right now, because –

Content creators.

Yeah, content creators are suffering, because this stuff is now being generated –

So are all our Google searches.

Right. So if that was your job, absolutely you’re impacted, and the layperson can now bypass you and get to something that – again, the good enough; they can get something good enough to achieve some means. On the complete opposite end of the spectrum you have people who engineer software, and given the context of the conversation, we’re talking about how safe our jobs are… So when I’m asking this question, I’m not asking “Is the layperson going to find ways of reducing their reliance on sort of–” I don’t wanna say lower skilled; it’s just a different kind of skilled… I’m thinking, for people like us, as software engineers, who presumably will be impacted by this to some degree - and we already are, right? …for us there’s also the micro-spectrum whereby if you’re on sort of the lower end of that spectrum, and the only thing you were doing is generating CRUD - well, I’m sorry, your job is indeed in jeopardy, if that’s the only thing you’ve been doing with your career.

On the opposite side of it is the highly-specialized person who understands a business problem, has to debug and troubleshoot, and talk to people, and integrate different things, and –

Institutional knowledge…

Right. All that stuff. I mean, I don’t see that skill. I don’t see that being replaced by AI anytime soon. Am I wrong here?

I don’t think so, personally…

How many of those people do we need?

And that’s the thing, right?

Is this a game of musical chairs? Should we be looking for our chair now?

No, any business is thinking “Do I need 1000 engineers, when 500 will do?”

I mean, I think, as we’ve all said, it makes us more productive today… So I’m writing more lines of code per day than I was five years ago.

Right. Right.

[37:52] So that’s good. But we all kind of expect productivity to continue to rise… So this is a productivity tool.

Mm-hm. Not a replacement tool.

We’re also using languages that are more expressive than they were. The code I write in Go is probably one third the length of the same code I used to write in C++ back in the day… So that’s also a productivity boost at some level. At least if you believe the old metric that you can basically write the same number of lines of code per day, no matter what language you write in…

But I think actually knowing how to use it is a skill to put on your resume [unintelligible 00:38:33.10] Even if I didn’t want to use it, I would, because it’s gonna get to a point where it’s like “I use Copilot all the time.” “Oh, good. [unintelligible 00:38:46.02] point for you to get the job.”

Is prompt engineering already on your LinkedIn profile?

It’s gonna be soon… [laughter]

No, but if you put it in your interests as AI, it shows up in the keyword searches.

Oh, there you go. Right?

Nice, nice…

I mean, that’s a good point… But to me, it’s like – back to my woodworking thing. It’s like, I know how to use a power saw, I know how to use a drill press, I know how to use a lathe… Those are kind of expected today. If I’m going to do woodworking and I say I only use hand tools, people are going to look at me like “I don’t have time for you.” And the same is true of – like, if you’re not using Copilot, what am I paying you per hour?

Right. What are you doing? Why are you not as productive as you could be? Yeah.

I think at this point the only people who really aren’t using it are people who are doing very arcane languages, or people whose businesses don’t allow it, whose company doesn’t allow it. I think everybody else has at least tried it.

I mean, if you’re a company that doesn’t allow such things… I understand not bringing sort of open source code into your organization. That might be the wrong licensing model for you, or something like that. You don’t want to be in some hot water. All you have to do is look at Oracle and Google over the whole Java thing. I think those were the companies involved. But if you allow your engineers to use a model where you can control the kinds of things that were used in the model for the training, and you can have maybe – you can run your own internal Gen AI for code generation, or whatever it is, I think if you’re an organization that is afraid of these things, you should at least follow that route… As opposed to saying “Hey, nobody can use any gen AI coding tools whatsoever”, because I think you’re gonna lose people if you do that. Because I’m gonna look at my peers that get to use these things, and they’re learning those skills… And then now I’m falling behind, because everybody’s using some sort of code generation tool, and I’m not. I mean…

This is where I hope somebody who hears this reaches out and can answer that question of like “Can you have a copy of Copilot that you train on a specified set of repos, and only those repos?”

Private repos.

Well, I mean, as we joked… I mean, if I’m a Go developer, I’m training it on Johnny’s code. Because if I don’t like it, I can deal with Johnny… [laughter]

Nice, nice.

You know what I mean? There’s people out there –

One neck to choke. I get it, I get it. [laughs]

It doesn’t matter what language it is, there’s people out there whose code you respect, and you’d be like “Yes, I would like my code to be more like this. I wish it would learn that that’s what I was thinking, or that’s the way this problem should be thought about.” I mean, yes, there’s a precious few of those people, and those people will probably never lose their jobs… But the rest of us mere mortals are going to have to work with the tools that are out there. And I would love it if this were a possibility now - that you could train Copilot on what you’d tell it to train on, and not all of GitHub.

I think there are companies that are working on that product. I feel like I’ve even seen a product announcement like it… But yeah, the thing about it is you can take one of these LLMs, and you can essentially subset it, and you can make a tiny, compact LLM that will run in a box that you can actually stand up on your desktop… And then you can further train that with new information. So that’s exactly what you want to do here: you want to take a coding-centric LLM, like a Copilot, and create the mini version of it, and then train it on your repositories, and now it knows how to write your code, and it’s also not talking out to the cloud while doing it. So there’s got to be businesses like that, if there aren’t any yet.
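As a sketch of that “only the repos you choose” idea, here’s a retrieval-based approximation; naive keyword counting stands in for real embeddings or fine-tuning, and `local_llm` plus the repo path are hypothetical.

```python
from pathlib import Path

# Approximate "train it only on repos you trust" with retrieval:
# index chosen repos, pull the most relevant files into the prompt,
# and send that to a locally hosted model so nothing leaves the box.
# Keyword counting stands in for embeddings; `local_llm` is a stub.

def index_repo(root: str) -> dict:
    """Map file path -> contents for every Go file under root."""
    return {str(p): p.read_text(errors="ignore")
            for p in Path(root).rglob("*.go")}

def retrieve(corpus: dict, query: str, k: int = 2) -> list:
    terms = query.lower().split()
    ranked = sorted(corpus.values(),
                    key=lambda text: -sum(text.lower().count(t) for t in terms))
    return ranked[:k]

def local_llm(prompt: str) -> str:   # stub: swap in an on-box model
    return "[completion grounded in your own code]"

corpus = index_repo("./johnnys-repos")   # hypothetical path
context = "\n---\n".join(retrieve(corpus, "http handler middleware"))
print(local_llm(f"Using this codebase:\n{context}\nWrite a logging middleware."))
```

Actual fine-tuning would bake the chosen code into the weights instead; retrieval is just the cheaper way to draw the same boundary.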

And this is where we redact all of this and put it into our business plan, right? [laughter]

Last up we have José Valim, creator of the Elixir programming language, on Changelog & Friends. That is the talk show flavor of the Changelog podcast.

To Jerod’s point earlier, AI as it’s known today - GPT, ChatGPT and others - is not that good at assisting with Elixir programming. And so I guess the question is “What does it take to make it good?” You mentioned embeddings earlier, you mentioned documentation being more readily available… What does it take from, I guess, a leader in the Elixir world to enable LLMs to be better? What role do you play in that journey for them to better consume the documentation, and better know how to do programming in Elixir, to help folks like Jerod and myself, or our team, or others to really become better and more professional in Elixir? Versus just like anytime Jerod asks ChatGPT for assistance, it’s just like “No, it’s not good. So just quit.”

You know, so if I got the question right, I think we did our work correctly, at least from the language point of view, in the sense that documentation was always first class. So documentation is very easy to access. So if what you want to do is to configure an LLM, it’s actually very easy to access that programmatically, send that, extract information… And we talked about – one of the things that you also have to do is try to get understanding from the source code, so you can find “Oh, this code is using those modules; it’s importing those things.” And those are things that you can do relatively easily in Elixir. We can most likely prove that.

So I feel like we have the knife and the cheese… It’s just a matter of somebody going in and cutting the cheese. Yeah, I feel like the foundation is there in terms of having this information structured, but somebody needs to feed it somewhere. But again, we can go back – like, maybe it’s the corpus size. Maybe ChatGPT indexed hex.pm already. Not sure. Maybe it has done that. I don’t know. I don’t know if I can send a letter to somebody, “Hey, please index my website.” Or maybe it’s a matter of – so one of the things is that RedMonk, they release twice a year kind of a graph plotting GitHub against Stack Overflow. I think they rank the most popular languages according to GitHub and Stack Overflow, and then there’s a linear thing in the middle… And it’s very funny, because Elixir is high on the GitHub side, but quite low on the Stack Overflow side. And one of the reasons for that is because we have always had the Elixir forum, so that may be one of the things that –

Right. Where’s the knowledge? Where’s the back and forth from the community?

Yeah, the knowledge is in the forum. Is that thing being indexed? Because we know Stack Overflow is, right? So ironically, that’s one of the reasons - I think; I may be misquoting - that RedMonk is considering removing Stack Overflow from their plots, because I think it has been losing relevance in recent years… But you know, maybe in the effort of trying to have a closer community, where everybody can engage with each other, where I’m active in the forum - and I would probably not have this patience if I was dealing with Stack Overflow… We created a special place for our community, but it’s not known.

So yeah, I think there are still too many unknowns. But I think at the core, we unwittingly did a good job, because we were worried about documentation being accessible, documentation being first class… So we did that, and we encourage people to write documentation; lots of documentation. So there is a lot there. And yeah, maybe RAG is going to be the thing that is going to be enough. That’s one of the hopes, going back to – you know, we want everybody to be able to use this. If RAG is good enough, then a lot of people would be able to augment their ecosystems without depending on OpenAI, or whatever. But we are still evaluating.
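As a rough picture of the RAG hope José describes - chunk the already-structured docs, embed them, retrieve the right pieces at question time - here’s a toy sketch. The doc snippets, the `embed` scoring, and the `llm` call are all placeholders, not a real HexDocs pipeline.

```python
# Toy RAG sketch: retrieve the most relevant doc chunk for a question
# and ground the model's answer in it. Everything here is a placeholder
# for a real pipeline over programmatically fetched Elixir docs.

DOCS = [
    "Enum.map/2 applies a function to every element of an enumerable.",
    "GenServer is a behaviour for implementing stateful server processes.",
    "Task.async/1 starts a process linked to the caller to compute a value.",
]

def embed(text: str) -> list:   # placeholder for a real embedding model
    return [float(text.lower().count(w)) for w in ("enum", "genserver", "task")]

def similarity(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))   # unnormalized dot product

def llm(prompt: str) -> str:    # placeholder model call
    return "[answer grounded in the retrieved Elixir docs]"

def answer(question: str) -> str:
    q = embed(question)
    best = max(DOCS, key=lambda d: similarity(embed(d), q))
    return llm(f"Docs:\n{best}\n\nQuestion: {question}\nAnswer from the docs.")

print(answer("How do I map over a list with Enum?"))
```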

[48:10] When we talked about sort of the long-term future of Elixir, artificial intelligence, and that larger topic of how long it will be relevant, and can AI generate it well - that whole conversation makes me think of this necessity to not have a black box that is whatever AI is. Because just like you said, “Who do I send a letter to to index my stuff, so that my very relevant language today remains relevant tomorrow? …because tomorrow says AI will continue to be more and more relevant to developers in their journey to develop.” Right? So who do we send the letter to? How do we know? Well, currently the status quo of AI is for the most part a black box. Obviously, open source LLMs and indexes have been pushed more and more because of this challenge, but I think this illustrates and highlights the long-term challenge, because even you can’t say for sure why what was indexed was indexed for the Elixir corpus - whether that’s the forums, whether that’s the documentation through Hex, or whatever. It’s unclear even to someone like you how to enable ChatGPT or the likes to better support Elixir assistance for the developers using this tooling. And that’s just not cool. Because long-term, we need to have inroads into those places, so that we can be part of the future if AI is predicting how we’ll get to the future.

Yeah. And I think - yeah, it’s too early. I think we’re going to improve a lot. I was listening to a podcast today where Sam Altman was saying they improved GPT-3 by about 40x in terms of size, performance, and things like that, since they started. I think 10x for GPT-3.5. And I think open source is going to catch up; I think that’s the hope. But yeah, it’s also - we go back to when we were thinking about Livebook, because what I want is for open source to win. But when I’m building a feature for Livebook, I need to build the best feature for the users. And I can use GPT-4, and I can immediately see the results. And they are really, really good. I can use other tools off the shelf - they are not as good.

So we are a small company, we are doing open source… So my option, if I have to choose for my users, is going to be GPT-4, because it gives me the best result for the least amount of effort. It’s just there. And this is – so we’re back to my indecision about investing in this stuff; that’s because I want things to be open source, but right now the quickest return on investment is GPT. And then I am in this contradiction space.

But yeah, I think it’s just patience. We have to be patient. And I think probably in one year – and the whole thing, it’s crazy to think about… This thing has been happening for a year only. It feels like it has been out for so long… But it’s a year. And I think if I’m back on the show in a year, we may potentially be having a very different conversation. So yeah, we’ll see.

Do you have any fear about this? Even as you responded to that, you sort of had some – I wouldn’t say trepidation in your voice, but you sort of had some uncertainty. Do you have any fear and uncertainty and doubt, the FUD that people sort of pass around - do you have any fear about this?

No, not really, in the sense that I consider myself very lucky, very fortunate, or whatever; or blessed, however you want to say it. I’m not being overconfident here, but more like thankful that I think whatever happens to me, it’s going to be fine. I truly believe that what’s going to make Elixir survive is the community, more than whatever technological changes… Unless there’s something very drastic.

[52:18] I talked to my father about this, about investments. Back when Bitcoin wasn’t that crazy yet, my father was like “Oh, have you heard about this thing, that if you put your money there, people got this huge return?” And I always told him “Father, if we got to know about it, it’s because it’s too late.” Or if something happens, it’s like “Oh, father, if something goes this bad, it’s because it’s going to be bad for everybody. So don’t try to fight it.”

So again, unless there’s a very major change, I think I’ll be fine. So I’m not worried about me, in the sense I always think more about – it’s more about ideals. Again, I like to say, well, me 10 years ago - that’s where my trepidation is if things go closed source… And those things, they happen by – we don’t see the results. I think another polemic topic about this, it’s like “Hey, I use Chrome. As soon as Chrome came out, I immediately –” Today I don’t use Chrome anymore. But as soon as Chrome came out, I immediately swapped to Chrome. And if I had known that this would lead to a point where Google is in this position where it has a lot of control over the browser, over the web, and over how we use the internet, 10 years ago I would probably not have used Chrome, if I could have seen it. So I think that’s where my trepidation comes from, of things being closed source. The developer experience…

Another example today – Elixir was the first programming language where GitHub’s new code navigation was provided by the community. So there were some programming languages, and there still are, where they have very good navigation and exploration in the GitHub UI. And the path to get that feature, to get that behavior - and I’m very thankful that the GitHub team discussed it with us and allowed us to do that… But that’s closed source. And GitHub plays a major role in how developers work.

So it all comes back to this idea of: if you want to provide a good experience for your users, how much of that is behind something closed source that you have no control over, where you are depending on somebody paying attention to you, or you having a contact, or me having a name because I was very active in the Rails community - which GitHub uses - like 10 years ago? Those are the things that – I feel lucky, but it worries me. How much is being closed? How much is going to be out of our control? And then the trepidation, I guess, is “What does that mean for the small José out there, who wants to start building his thing today, and won’t be able to?”

Well, you killed the vibe there, José…

Oh, thank you.

[laughs]

That’s me at parties, you know?

Not invited. Just kidding.

Funny. Alright, well, let’s… Should we try to close on an up note, on a high note?

[HighNote 01:37:35.25] Was that…? [laughs]

Wow. I had no idea.

I think we should end it right there, Adam. Don’t you think? We end it on a high note.

That’s the cheese and knife tactic there. I like it.

That was.

Do you want higher? I don’t think it will be good for the listeners.

I think that was plenty high enough for me. Adam, were you satisfied with that?

That was a high note, literally. And I dig it.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
