It was an amazing week in AI news. Among other things, there is a new NeRF and a new LLaMA in town! Zip-NeRF can create stunning 3D scenes from a set of 2D images, and LLaMA 2 from Meta promises to shake up the LLM landscape. Chris and Daniel dive into both, and compare some of the recently released OpenAI functionality to Anthropic’s Claude 2.
Featuring
Sponsors
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Typesense – Lightning fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Intro | 00:38 |
2 | 00:43 | It's a different world | 01:34 |
3 | 02:16 | Zip-NeRF | 05:02 |
4 | 07:20 | Uses in eCommerce | 02:57 |
5 | 10:16 | Industrial use cases | 01:36 |
6 | 11:53 | Military ops? | 01:00 |
7 | 12:54 | Everyone can benefit | 01:34 |
8 | 14:28 | Connect the dots | 00:50 |
9 | 15:33 | LLaMA 2 | 03:15 |
10 | 18:48 | Parameter limits | 03:00 |
11 | 21:48 | What do you need? | 02:11 |
12 | 23:59 | Why the jump? | 03:46 |
13 | 28:01 | Mark's anti-competitiveness | 03:50 |
14 | 31:51 | Walled garden? | 00:43 |
15 | 32:34 | Claude 2 | 04:25 |
16 | 36:59 | Using different models | 02:55 |
17 | 39:54 | OpenAI vs Anthropic | 05:57 |
18 | 45:52 | Dig in! | 01:10 |
19 | 47:02 | Goodbye | 00:15 |
20 | 47:17 | Outro | 00:53 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another Fully Connected episode of Practical AI. In these episodes Chris and I keep you fully connected with everything that’s happening in the AI community. We’re gonna take some time to discuss the latest AI news, and then we’ll share some learning resources to help you level up your machine learning game. This is Daniel Whitenack. I’m a founder and data scientist at Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing cool. I’m trying to figure out how we survived before all these great new models and stuff. Like, it’s changed my –
Yeah, it’s been crazy. I’ve just created a post for LinkedIn, and I was grabbing text, putting it into ChatGPT, getting nice rephrasing, and then I’m like “Oh, I need an image.” And in particular - we’ll talk about it a little bit in this episode, but I was like “Oh, there’s this FreeWilly model from Stability AI, which is like whale-themed”, and then I’ve got the LLaMA thing… So I just went to Stable Diffusion XL on Clipdrop and said, “Hey, generate me an image with a whale and a LLaMA 2gether… And you know, how did I even post to LinkedIn before without these things? It’s like a different world.
Yeah. 2023 versus 2022 is totally different. The content generation, the way you code… It’s a different world.
Yeah. And this week, as most weeks are, it seems like, in 2023, had some pretty groundbreaking announcements and releases, which we’re going to dive into a bunch of those things. There’s just a huge amount to update on, and I think it’s a good time for one of these episodes between you and I to just parse through some of the new stuff that is hitting our feeds.
Well, I mentioned LLaMA… One of the big things this week was LLaMA 2, and it was maybe the main thing dominating at least my world this week. But before we jump into it, it might be worth taking a little bit of time to highlight something outside of this stream of large language models, which also crossed my desk this week, and which I thought was really cool… It’s this latest version of NeRF. This is work from Google, presented at ICCV 2023, called Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields.
That’s quite a name right there.
It is quite a name. It stands for neural radiance field. So NeRF, it’s like camel-cased, capital N, small e, and then capital RF, NeRF… These are fully-connected neural networks that create unique, novel views of complicated 3D scenes based on a set of images that are input. So I don’t know if you’ve seen that video yet…
I’m looking at it as we are talking… And when you say “the video”, I know which video you’re talking about, because it’s amazing. I’ve just left it on.
It’s pretty spectacular. This is a podcast, so it’s hard to express some of this for people… If you just search for Zip-NeRF, you can go to the page for this paper, which is a great summary. But there’s a video on the page, and just to describe what it is - imagine this kind of complicated house, with a bunch of different rooms, and an outdoor patio, sort of garden area… And the video is almost like a drone flythrough of the house and then the outdoor area. If you imagine a drone flying through a house - there’s hats, and coats, and toys, and couches, and plants, and all sorts of things everywhere… But the video is extremely seamless, and it’s not generated by a drone. It’s actually generated by interpolating between a whole bunch of 2D images, and synthesizing the 3D scene from that. So yeah, I don’t know, what are your impressions, Chris?
First of all, from the perspective - the drone flight, if you will, that you have as a perspective viewing it, it’s like the best drone operator in the history of the world.
Yeah, it would probably be hard to get one to do that.
Yeah, you’re not gonna get a real drone operator that could fly that amazingly, and get those things. It’s just phenomenal. And the house is like – for a moment, you look at it, and I mean, it looks real. But I have noticed, it’s cluttery, but it’s immaculately clean at the same time as well. The clutter is cleanly distributed, and stuff. I wish when my house was cluttered, it looked as beautiful as this house. It doesn’t.
But yeah, I mean, if you weren’t listening to the Practical AI podcast and just stumbled upon it without that background, you’d think it was a drone video. You’d go, “Oh my God, this is just really cool. I wonder what they’re doing here.” It’s indistinguishable from real life, for all practical purposes.
[00:06:09.20] Yeah. So it’s based on 2D images, and then there are these generated interpolations, which maybe gets to – there was something that we were talking about prior to hitting the Record button, which was this whole field of generative AI is sometimes conflated with large language models, or ChatGPT… But there’s a whole lot going on in generative AI that’s not language-related, or maybe even based on language-related prompts. So I mentioned that image that I generated for my LinkedIn post… That was still in a text prompt into a model that generated an image. But here, what we’re seeing is we’ve got static 2D images that are input to a model that’s actually generating a whole bunch of different perspectives that are synthesized in a 3D scene. So this is, I would say, still fitting into our current landscape and world of generative AI, but it’s not a text in/text out, or text in/image out model.
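For listeners who want to peek under the hood, here’s a minimal sketch of the core computation behind a vanilla NeRF - not Zip-NeRF itself, which adds anti-aliased grid-based features on top, but the basic idea of querying a network along camera rays and alpha-compositing the results into a pixel. The tiny “MLP” here is a hypothetical stand-in; the real network is trained on the input photos.

```python
import numpy as np

def positional_encoding(x, n_freqs=6):
    # Map 3D coordinates to sin/cos features so an MLP can represent
    # high-frequency detail in the scene
    feats = [x]
    for i in range(n_freqs):
        feats.append(np.sin(2.0 ** i * np.pi * x))
        feats.append(np.cos(2.0 ** i * np.pi * x))
    return np.concatenate(feats, axis=-1)

def fake_mlp(encoded_points):
    # Hypothetical stand-in for the trained network: returns a density
    # and an RGB color for each query point along the ray
    n = encoded_points.shape[0]
    rng = np.random.default_rng(0)
    return rng.uniform(0, 1, n), rng.uniform(0, 1, (n, 3))

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    # Sample points along the ray, query the network, then alpha-composite
    ts = np.linspace(near, far, n_samples)
    points = origin + ts[:, None] * direction
    density, color = fake_mlp(positional_encoding(points))
    deltas = np.diff(ts, append=far)
    alphas = 1.0 - np.exp(-density * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return ((alphas * trans)[:, None] * color).sum(axis=0)  # final RGB pixel

pixel = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]))
```

Repeating that per-pixel for every camera ray is what produces those smooth novel views from only 2D training photos.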
Right. And there’s so much coming at people right now. We keep talking about this - in the five years we’ve been doing this podcast, we’ve never had a stretch like the last few months, where new things have been coming at people so fast. New terms, new models… So I think it’s pretty fair that people are trying to make sense of how they relate together. The idea of generative models and the idea of large language models overlap in a lot of areas - you have models that are both, and you have models that are just one. But I think it’s a brave new world right now in terms of the sheer volume; every show, we’re just trying to figure out what matters right now, because there’s a lot we’re not hitting.
Yeah. And this side of things - the 3D or video or image-based side - I know has its own set of transformative use cases that are popping out. I remember a little while ago there was some technology, I think from Shopify, though others have done this as well, where maybe you have a room in your house and you want to see how you could transform it with new furniture, which of course you could buy… That’s a real e-commerce or retail use case for scene technology of a different kind. If you think of technology that can take 2D images and create these 3D scenes, there are certainly use cases within game development, for example, but also in places where AI has never impacted the process as much, like real estate… How expensive is it to literally have a person come out with specialized camera gear to capture that kind of 3D walkthrough - essentially the Street View walkthrough of your house - and map it onto an actual schematic of your house? And here, imagine I’m now selling my house myself, without a real estate agent, and I can potentially take an app, go through my house just taking 2D images, and create this really cool, interactive fly-around 3D view. That’s a really powerful, transformative change for a number of different industries.
I came across a company called Luma AI in one of the posts about this technology… I don’t know if they’re even using the Zip-NeRF stuff specifically, but certainly something related to NeRF to take these 2D images, and they have an app that will create 3D views… It’s pretty cool to see some of this hit actual real users.
[00:10:16.08] We keep talking about the fact that we’ve hit this inflection point where you don’t have to be in the AI world for this to have a big impact. So it’s very easy, looking at the Zip-NeRF video, to imagine walking around with your cell phone and an app… You’re just walking around, and the app takes care of whether it’s video, or still images, or whatever; it uploads that and produces this amazing result. So it’s not just your walkaround footage you end up with; it takes that as raw input, but then produces this super-high-quality thing. So yeah, I think this is another case where there’s one technology with thousands of use case possibilities, where it just changes everything.
Yeah. And maybe also in the – I’d be curious to know your reaction to this also, with respect to kind of the industrial use cases, where –
Oh, I’ve been thinking about it…
Of course, capturing 3D scenes is very important, for example, for simulated environments where you’re trying to train an agent, or even for industrial training scenarios for humans, where you want to take someone into an environment that it’s physically hard to bring a lot of people into…
Yeah. Or there could be safety issues, and such.
Yeah, safety issues… I don’t know if that sparks things in your mind. I think in the industrial sense, this could have a more B2B sort of impact than just a consumer app.
Sure. I mean, a simple thing - and I’m making something up in the next thing I’ll say. It’s very easy for me to imagine intelligence agencies that are – if you go back some years to when Osama bin Laden was found, and they had various imagery and stuff, but with stuff like this they might take all those images that they’re getting from various sources and produce a high –
Like a flyover –
Yeah, a flyover, and very photorealistic, of certain parts of the compound with that kind of imagery… And that can be used in a military operation subsequently. Now, I’m making that up, so nobody should take that as a thing. But it’s not hard to imagine that. It’s not hard to imagine a lot of factory uses and other industrial things where you have safety issues, you have limited access kind of concerns, where you’re trying to convey that… But there’s a lot of mundane things, there’s a lot of home-based things and small business things; as you pointed out, the real estate one earlier. So this is just one technology that we’re talking about so far.
Yeah. And I think what you’re saying - it illustrates how this is impacting very large organizations, all the way down to small organizations.
Yeah, sole proprietorships.
Yeah. And it’s interesting - if we just take this use case, for example, these kind of 3D scenes: there are large-scale organizations whose bread and butter was either the compute associated with rendering videos and 3D scenes, or who are hardware providers creating specialized 3D equipment… They’ve got to be thinking - much like organizations dealing with language-related problems are thinking about LLMs - that there’s a fundamental shift in how their businesses will operate. But at the same time, it provides an opportunity for small to medium businesses to embrace this technology very quickly, make innovative products that can be widely adopted, and actually compete within an established market. There’s an established market for 3D things, and it has been quite expensive over time, in terms of access to that technology… So now that whole market is going to change, and I think a lot of the players will be these small to medium-sized businesses.
I agree. I think there’s a moment here, kind of ironically, because people are so worried about the impact on human creativity from all these models… But on a more positive note, there’s this huge opportunity that you’re alluding to: if you can connect the dots as things come out, and you can stay on top of it, it’s a great equalizer. It will clearly change many, many markets and many, many industries, and so there are huge opportunities for those who want to surge ahead at this moment and take advantage of that. The message we tend to see in the media is a little doom and gloom on this, but that discounts the fact that change isn’t always a bad thing. People are afraid of it, but there are huge opportunities here as well, if people choose to go find them.
Break: [00:15:22.13]
Well, Chris, there is a new LLaMA in town.
LLaMA 2…!
LLaMA 2. Basically, it destroyed all of my feeds and concentration this week when it was released, because it is - to me - an encouraging thing, but also another transformative step in what we’re doing. So LLaMA 2, for those that maybe lack the context here… Meta, or Facebook, or however you want to refer to it - Meta had released a large language model called LLaMA, which was extremely useful. It was a model where you could host it yourself, as opposed to something like OpenAI’s models; you could get the weights and host it yourself. But the original LLaMA had a very restrictive licensing and access pattern. Even though you could download the weights from maybe a BitTorrent link or something like that, and those propagated, technically if you got those weights you were still restricted by a license that specifically prevented commercial use cases.
And now with LLaMA 2, Meta has released the follow-on to LLaMA, and we can talk through what the differences are, what it is, and some of what went into it. But I think one of the biggest things, which is going to create a huge ripple effect throughout the industry, is that they’ve released it with a commercial license. As long as on the day that LLaMA 2 was released you as a commercial entity didn’t have greater than 700 million monthly active users, you can use it for commercial purposes. So even if my company later on has 700 million monthly active users - which would be great; probably never, but…
There’ll be something past LLaMA 2 by then though.
Yes. Even then, though, I could still actually use it, because the threshold only applies on the release date. So on the release date, which was this week, as long as you didn’t have greater than 700 million monthly active users, you can use this in your business for commercial use cases, and I think that’s going to have a huge ripple effect downstream. And we can talk about the model itself here in a second, but maybe I’ll pause there to get your reaction, Chris.
It made me smile when I heard that, because it’s kind of like saying, “So long as you don’t compete with us at Meta, you can use this commercially.”
Oh, it’s totally true. Yeah. Like, who is that? So that’s Snapchat?
Yes.
TikTok… You can think of who this is. And I guess one way to put this is it’s not totally open source, quote-unquote. We wouldn’t call this maybe open source in the kind of official definition of open source. But it’s certainly commercially available to a very wide set of people.
Yup. You know, one of the first things I noticed when this came out on their page - and I’m diving into the specifics of the model here - is we had an episode not too long ago where you were describing the… I believe it was the 7 billion parameter limit in terms of hardware accessibility. And having been taught that by you, I immediately locked in on the smallest size being 7 billion, and I thought, “Ah, this is what Daniel taught all of us about that limitation on accessibility and who can do it.” So it has the 13 billion and the 70 billion sizes, but I definitely picked up on the 7 billion, which I’m assuming goes back to what you were teaching us a few episodes back.
Yeah. And so just to fill in a little bit on that… So the LLaMA 2 release includes three sizes. So again, thinking back to what are the kind of characteristics of large language models that kind of matter as you’re considering using them. One is license. We’ve already talked about that a little bit here. We might revisit it here in a second. Another is size, because that influences both the hardware that you need to run it, and then also its kind of ease of deployment.
[00:20:03.20] So LLaMA 2 was released in 7 billion parameter, 13 billion parameter and 70 billion parameter sizes. And then there’s also, of course, the training data and that sort of thing that’s related to this, and how it’s fine-tuned or instruction-tuned. So LLaMA 2 was released in these three sizes, both as a base large language model, and a chat fine-tuned model. So there’s the 7 billion, 13, and 70 billion LLaMA 2s, and then there’s the 7, 13 and 70 billion LLaMA 2 chat models… Which we can talk about that fine-tuning here in a second.
But yes, you’re right, Chris, in that the 7 billion - I could reasonably pull that into a Colab notebook. Maybe with a few tricks, but certainly with the great tooling from Hugging Face, including ways to load it in even 4-bit or other quantizations, I can run that on a T4, for example, in Google Colab. So no need to have a huge cluster.
The 70 billion - even with that, that’s kind of another limit where using some of these tricks, I’ve definitely seen people running the 70-billion parameter model on an A100; again, loading in 4-bit, with some of the quantization stuff and all that. But 70 billion is certainly going to be more difficult to run; it might require multiple GPUs. But that’s kind of that sizing range for people to have in mind in how accessible things are.
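As a concrete reference point, here’s a minimal sketch of what that 4-bit loading looks like with the Hugging Face tooling Daniel mentions (transformers, bitsandbytes, accelerate). It assumes you’ve accepted Meta’s license and been granted access to the gated weights on the Hub, and that you’re logged in with a Hub token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Gated repo: accept Meta's license on the Hugging Face Hub first,
# and make sure bitsandbytes + accelerate are installed
model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # keep matmuls in fp16 on a T4
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the available GPU
)

inputs = tokenizer("Explain what a neural radiance field is.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```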
I’m just curious, if you’re looking at these, you’re a business out there, or a data scientist… Can you make up a couple of use cases that you might target with each of these, where you might say, “Oh, I want to go 13 on this. Not 7, not 70 for something like this.” Can you imagine something like this? I’m putting you on the spot.
Yeah, I think – I mean, there’s certainly innumerable use cases… But I think maybe two distinctions that people could have in their mind is if you want like your own private ChatGPT… Or another way to think about it is a very general-purpose model. You could do anything with this model. Any specific prompt, whatever. You’re probably going to look towards that higher end, the 70-billion parameter model for that kind of almost ChatGPT-like performance; you’re going to have to go much higher.
But as we’ve talked about on the show before, most businesses don’t need a general-purpose model. They need a model to do a thing, or a task, or a set of tasks. And so in that case, because this is open and commercially-licensed, businesses could take those 7 and 13-billion parameter models and fine-tune them for a task in their business - which also increasingly has amazing tooling around it, again from Hugging Face and others, with the PEFT library (parameter-efficient fine-tuning) and the LoRA technique, which is low-rank adaptation; it basically attaches small adapters to an existing model rather than retraining a bunch of the original model… This opens up fine-tuning possibilities in these smaller models, where that fine-tune for an organization is probably going to perform better than any general-purpose model out there. And because it’s that smaller size, you can run it on a reasonable set of hardware that’s not going to require you to buy your own GPU cluster to host the thing. So that’s maybe a range of use cases that people could have in mind.
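To make that concrete, here’s a small sketch of attaching LoRA adapters to the 7B base model with the PEFT library. The ranks and target modules here are typical illustrative choices, not a prescription - and in practice you’d likely combine this with 4-bit loading (QLoRA-style) to fit training on a single GPU:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
)

model = get_peft_model(base, lora_config)
# Typically well under 1% of parameters are trainable, which is what
# makes fine-tuning on modest hardware feasible
model.print_trainable_parameters()
```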
[00:24:00.00] I have one more question for you before we abandon this. 7 billion to 70 billion being an order of magnitude jump on that - why would you have something fairly close to that, at 13 billion parameters? What’s the difference in 7 and 13, when the next step is all the way up to 70? What’s the rationale, do you think?
Yeah, so it is interesting, actually… If I’m understanding right from some of the sources that I’ve been reading, there was actually a - I forget if it was 30 or 34-billion parameter model that they also had in pre-release, and were tuning… So there was another one that would have fit in that slot, filling the gap you’re talking about. If you think of MPT - MPT has a 30-billion parameter model; that fits in that kind of gap.
My understanding - and our listeners can correct me if I’m wrong; please do - is that they actually did test that size of model and found it did not pass their safety bar around potentially harmful or untruthful output, that sort of thing. So they decided to hold that back.
So it’s possible that as they instruction-tune and potentially run more iterations of reinforcement learning from human feedback, there may be a model that they release in that parameter range. So that was one thing that happened, I think.
It is interesting - several different things here that are unique about this model specifically, or maybe the release as well, other than the license, is they were fairly vague on the data that went into the pre-training. So they talked specifically about some very intense data cleaning and filtering that they did on public datasets. And it was trained on more data than the original LLaMA, but they were fairly vague on the mix of that data, and all of that. So that may be related to feedback they got on the datasets that were used in the first LLaMA, I don’t know, but the technical paper was mostly related to the modeling and fine-tuning trickery and methodologies that they used, which was interesting.
And one of the interesting elements of the way they fine-tuned this model was, I think, the reward modeling. If you remember, with the GPT family of models, MPT, Falcon - one of the things that is often done with these models is this process of reinforcement learning from human feedback, which we covered on a previous episode that we can link in the show notes… It means actually using human preferences to score the output of a model, and then using reinforcement learning to correct the model to better align with those human preferences, or human feedback.
They actually used two separate reward models in this fine-tuning of the chat-based model: one related to helpfulness, and the other related to safety. And one of the interesting things they talked about in the paper was how those two objectives can sometimes work against each other if you’re trying to optimize both at the same time. So they separated the reward models used for the chat fine-tuning into these two reward models, one for helpfulness and one for safety, which is quite interesting, I think.
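Just as a toy illustration of why you might keep the two signals separate - this gating rule is made up for illustration, not Meta’s actual combination logic, which is detailed in the paper - you can think of it as something like:

```python
def combined_reward(helpfulness: float, safety: float,
                    threshold: float = 0.5) -> float:
    # Toy gating rule (illustrative only): if a response looks unsafe,
    # let the safety reward dominate; otherwise optimize helpfulness.
    # A single blended score could silently trade one off against the other.
    return safety if safety < threshold else helpfulness
```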
Break: [00:27:47.20]
So Chris, maybe just a couple other things related to LLaMA, and then I want to see your feedback on the code interpreter as well, because we haven’t talked about that yet on the show. And maybe Claude 2, if we can get to it.
Yeah, we’ve got to mention Claude 2 as well, because they were both big releases.
Yeah. So just one other note, which I find quite interesting - and actually, I’d love to hear our previous guest Damien’s thoughts on this; he was in our last episode, about the legal implications of generative AI… One of the interesting things about the LLaMA 2 license, in addition to it allowing this commercial usage, is that there is technically a restriction that says you will not use the LLaMA Materials - which includes the model weights, etc. - “or any output or results of the LLaMA Materials to improve any other large language model (excluding LLaMA 2 or derivative works thereof).” So essentially, what this means is if you’re using LLaMA 2 and you want to fine-tune a model off of LLaMA 2 outputs, you’re stuck with LLaMA 2. Basically, LLaMA 2 is your model, and you’re going to stick with LLaMA 2. You couldn’t, for example, technically take outputs from LLaMA 2 and fine-tune, say, Dolly 3B. That would not be allowed by the license, and of course, that’s something people are doing all over the place - taking outputs from GPT-4 and fine-tuning a different model, or taking outputs from a large model, like maybe LLaMA 2 70 billion now, and fine-tuning another, smaller model based on a certain type of prompt or something. So this restricts the family of models you’re allowed to do that sort of thing with, which is the first time I’ve seen that, and I think it’s kind of interesting.
Yes, it strikes me as another Mark Zuckerberg anti-competitiveness thing… Which he’s fairly famous for. I mean, that’s kind of – even before this.
Yeah. And how could you enforce such a thing? [laughs]
That was my next question to you - is there any possible way that you could conceive of to actually know that from an enforceability standpoint?
I have no idea.
I don’t either. So it seems like it’s a license thing, and it will concern the lawyers… But it’s hard to imagine. I mean, going back to our conversation last week, once you have output, and that output is input to more output, there’s a point where it becomes very, very, very difficult to know what the sourcing really was.
Yeah. And the fine-tunes are already appearing off of LLaMA 2. The most notable probably is FreeWilly, which is from Stability AI, and is a fine-tune of the largest, 70-billion model. But there’s other ones coming out as well. And so I think we’re about to see just a huge explosion of these LLaMA 2-based models for a whole variety of purposes. And who knows how they will fit into that licensing restriction, or how open people will be about that… But it’s about to start. The fine-tunes are already coming.
Yeah. Well, to your point earlier, they weren’t terribly clear about the data that they were sourcing from their own standpoint… And I find it interesting, a little ironic.
It’s a bit of a double standard maybe…
Yeah, a little bit of a double standard right there, in terms of like “We’re not going to tell you everything about how we’re doing input, but by the way, you’d better not use our output.”
Yeah.
So yeah, a little interesting. Do you think there’s any risk of a walled garden kind of concept happening in large language models, if others were to follow this lead on anti-competitiveness?
[00:32:03.04] Yeah, it will be interesting… I think it is a notable trend that the first LLaMA from Meta was not open for commercial use at all, and now they’re opening it up for commercial purposes. And maybe there’s a separate trend that will happen with some of these use-based restrictions that people are incorporating into their licenses, and how useful those things are over time; that may shift, and we’ll see those restrictions die off. Or maybe if they’re enforced, and there’s precedent, we’ll see things go the other way. I’m not sure.
But speaking of models whose output you might use to train other models - that is, these large-scale proprietary closed models from people like OpenAI, and Anthropic, and others - we’ve got a couple of things that we haven’t talked about on the show yet, which people should probably have on their radar. One of those is Claude 2. What do you think about Claude 2, from Anthropic?
Yeah, I’ve been playing around with it a lot in the last week, and I kind of have a set of things that I try over and over again; they’re kind of my standard tasks as new models come out. And some of them are coding, and some of them are content generation, which are kind of the two big things that I use most often. It was interesting, the input size for Claude 2 is much larger than the others.
Like, much, much larger.
Much, much, much larger.
So 100,000 tokens.
Yeah. And so it’s had me change the way I’m approaching it. By contrast, with ChatGPT you’re trying to figure out, within the limits you have on both input and output, how to prompt-engineer your way to where you’re trying to go… Which has become this whole skill set we’ve been talking about in recent months. And Claude 2 almost wipes that out a little bit - in some ways, not in all ways - in that you can hit it with a much larger input space… So it’s changing how I’m thinking about getting to the output that I want. And the output is a bit different. It’s not the same. I’m getting different outputs from all the models. They’re definitely not all the same.
I think my biggest thing is with all these new releases - I’m trying to figure out how do I use each one. I’m trying to develop my own strategy on “When do I go to ChatGPT by default? When is that the right thing?” And that’s changing as we’ll talk about with things like plugins and stuff; that’s evolving. But then Claude 2 comes out, and then you have on the open source side, as we just talked about, LLaMA 2.
So I think trying to understand all the tools in the toolbox in relation to each other has been interesting. With Claude 2, right now I’m focused primarily on large-content work - that’s kind of where I’ve landed on it.
And the 100k context length of Claude 2 is something I find really compelling as well. There was also a significant paper that came out, which caused a lot of waves in terms of thinking about context length. It showed that as you increase context length, the middle of that context loses significance - the beginning and the end matter more for the quality of the model’s output, however you measure that. So we’ll link to that paper in the show notes as well.
But I’ve tried some things… I mean, I don’t know all of the details - again, Claude is one of these closed models, so I don’t know exactly how they’re doing things. And because it’s sitting behind an API, it’s hard to know how those things evolve over time. But for example, one of the things I did with Claude 2 is I took one of our complete podcast transcripts - a full episode, so 45 minutes of audio transcript… I took episode 225, which I really enjoyed, talking a lot about the things I’m working on right now with Prediction Guard… And I just asked it to give me a summary of the main takeaways. I pasted in the whole thing, and it gave fairly good, comprehensive takeaways, like “Many companies banned usage of certain LLMs”, blah, blah, blah; Prediction Guard is trying to provide easy access, structuring, validation, and compliance features for LLMs; making LLM usage easier… It gives these great takeaways.
[00:36:28.11] And then I asked, “Hey, suggest a few future episodes that we could do, that maybe cover related topics, but things that weren’t covered in this episode.” Pretty good. Some of them are kind of generic… A look at the current state of AI agents and automation, how close are we to no-code AI app generation, blah, blah, blah. So all of that, off of this large transcript context as input, was quite interesting.
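For anyone who wants to reproduce that kind of long-context summarization programmatically, here’s a minimal sketch against Anthropic’s Python SDK as it existed around this episode. The transcript filename is made up; the pattern is just “stuff the whole transcript into one prompt”:

```python
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A ~45-minute episode transcript fits comfortably in the 100k-token window
transcript = open("episode-225-transcript.txt").read()  # hypothetical file

completion = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=500,
    prompt=f"{HUMAN_PROMPT} Here is a podcast transcript:\n\n{transcript}\n\n"
           f"Give me a summary of the main takeaways.{AI_PROMPT}",
)
print(completion.completion)
```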
I’m curious - I’m gonna put you on the spot also. As someone who’s working on your own product - and I know this is not a Prediction Guard episode, but I’m asking on my own behalf and on behalf of the listener… How do you as someone who is looking at these different models, how do you think of those different models? How do you kind of structure them in your mind in terms of what you’re offering? You’ve been evolving rapidly over the last few months, and I’m always curious to see kind of where your head’s at on this now, as you’re looking at them?
Yeah, I think the thing I’m consistently seeing - I made a post on LinkedIn about this as well - is that even in my own LLM-based applications that I’m building, having access to multiple models rather than a single model is a really nice usage pattern. And there are other people doing this as well. In Prediction Guard you can query a whole bunch of models concurrently… There are other systems that will let you look at that output too - nat.dev, and some of the toolbar stuff that Swyx is doing… We had a collaboration with him on the Latent Space podcast…
So the more you can tie these things together and look at the output or automatically analyze the output of multiple models at the same time, I think that’s really useful. Because it’s hard to generally evaluate these models until you start evaluating them for your use case, and building intuition about them for your own use case. So I think the pitfall that people maybe fall into is saying, “Oh, I’m going to use this model”, before they’ve even tested that for their use case.
Try creating a set of evaluation examples for your own use case, and then try out a bunch of different models for that. And also try out the things that are becoming more standard kind of operating procedures for building LLM applications, like looking at the consistency of outputs, running a post-generation validity or factuality check on the output. So checking a language model with a language model. Doing input filtering, and all these sorts of more engineering-related things. So those are some of the things that I’m seeing… But having access to a bunch of models at the same time I think is something that can really boost your productivity.
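A minimal sketch of that fan-out pattern might look like the following - `query_model` here is a hypothetical stub, to be swapped for your actual provider clients (OpenAI, Anthropic, a self-hosted LLaMA 2 endpoint, etc.):

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(name: str, prompt: str) -> str:
    # Hypothetical stub -- replace with real client calls per provider
    return f"[{name}] response to: {prompt[:40]}..."

def fan_out(prompt, models=("gpt-3.5-turbo", "claude-2", "llama-2-70b-chat")):
    # Query all models concurrently so latency is roughly the slowest model,
    # not the sum of all of them
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(query_model, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

answers = fan_out("Summarize the main takeaways from this transcript: ...")
for model, text in answers.items():
    # This is where you'd plug in your own evaluation examples, consistency
    # checks, or a post-generation factuality check
    print(model, "->", text)
```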
I appreciate that. And to our listeners, we’re not making it a Prediction Guard show or episode, but as a co-host, Daniel’s excursion through this in his professional career has made him, in my view, one of the world’s true experts in how to look at all these together. And since we have the benefit of him co-hosting the podcast, I’m going to continue to take advantage of that expertise for all of us.
Thanks, Chris.
Sorry about that, Daniel. Sorry for putting you on the spot.
Yeah, no worries. I think the other thing maybe to highlight with Claude 2, and something that you were talking about in chat before we jumped into this episode was Claude 2, or maybe Anthropic and their offerings, versus Open AI. How do we understand that? How do we categorize these things? I think one of the interesting things with Claude 2 – so we’ve seen both Anthropic and their Claude models, and OpenAI and their GPT models increase context size over time. GPT models not quite as far as Claude, but both have increased.
[00:40:28.09] They’ve also both added in some of this functionality, which I think is very interesting… Claude 2, I think, first, if I’m not wrong - the ability to add in your own data. So in Claude 2 there’s a little attachment button, and you can upload PDFs or text files or CSVs and have that inserted into the context of your prompt… Which I think is, of course, extremely powerful. We’ve talked about adding in external data into generative models and grounding models in the past; it’s very powerful.
Now, OpenAI is doing this in a slightly different way, and I think this is worth calling out on the podcast. With their new Code Interpreter beta feature within ChatGPT you can upload data, but it’s processed in a different way than what Claude is doing. We all know that ChatGPT and GPT models can generate really good code, and specifically good Python code… And so what OpenAI has done for their kind of data processing agent within ChatGPT is say, “Well, let’s just have our model generate Python code, and then we’ll hook up the ChatGPT interface to a Python interpreter, and go ahead and execute that code for you over your data, and then give you the output.” So this is a distinction people can have in their mind: with Claude 2 you can upload a huge amount of context - you can upload files and have them inserted into the prompt - and as far as I know, they’re not running any kind of code interpreter under the hood.
ChatGPT might not be inserting all of that into the prompt. Instead they’re saying, “Well, what if we decompose what you want me to do with this external data into something that can be executed in a sort of agent-type workflow?” You upload your data and ask it to do some analysis over it; the language model generates some code, that code is actually executed in the background, it returns a result, and that result is fed back through the model to give you generated output in the interface. So it’s actually a multi-stage thing happening in OpenAI’s Code Interpreter.
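A very rough sketch of the shape of that multi-stage loop - with a hypothetical `llm()` helper standing in for the model call, and none of the sandboxing a real system would need - might look like:

```python
import contextlib
import io

def llm(prompt: str) -> str:
    # Hypothetical model call -- stands in for whatever LLM you're using
    raise NotImplementedError

def code_interpreter_step(user_request: str, data_preview: str) -> str:
    # Stage 1: ask the model to write Python for the request
    code = llm(f"Write Python to answer: {user_request}\n"
               f"Data preview:\n{data_preview}")
    # Stage 2: execute the generated code and capture what it prints
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # a real system would sandbox this, of course
    result = buf.getvalue()
    # Stage 3: feed the execution result back for a natural-language answer
    return llm(f"The code produced:\n{result}\nSummarize this for the user.")
```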
It effectively produces a no-code solution, where you get an output while skipping the whole coding step… Instead of using the language model to generate your own code, to be your code assist and all that, and then still doing the work yourself - it skips that whole step right there.
Yeah. And I can give an example I actually ran prior to this show. I had Claude and the OpenAI Code Interpreter open side by side, and I uploaded a file with a bunch of Yorùbá - a language spoken in West Africa - transcriptions of audio, which are from the BibleTTS project that we worked on with Coqui and Masakhane… So I uploaded this file, which includes this Yorùbá text in a CSV format. OpenAI said, “Great, you’ve uploaded this file. Let’s start by loading and examining the contents.” And then it has this Show Work button, and you can see the actual code that it generated, which is Pandas code to import the CSV and then output some examples. So you can expand that and actually see the code that it ran under the hood, and the conclusions that the agent came to.
[00:44:05.06] Then I asked it, “Okay, well, plot the distribution of the transcript lengths. Are there any anomalies?” And again, it says, “Hey, Show Work.” You can see it’s importing matplotlib, it’s taking in the CSV, it’s actually creating the plot - it generates an image from the transcripts - and it says, “I didn’t find any anomalies. They’re all within the same distribution.” Then I asked, “Can you translate all the Yorùbá to English?”, and that’s where it stopped, because it said, “No, I’m not good at doing that.” And Claude actually stopped there as well and said, “No, I’m not going to do that.”
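For a sense of what sits behind that Show Work button, the generated code was roughly of this shape. The file and column names here are hypothetical - the actual CSV schema wasn’t shown on air:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical file/column names standing in for the uploaded CSV
df = pd.read_csv("yoruba_alignments.csv")
print(df.head())  # examine a few example rows

# Plot the distribution of transcript lengths to eyeball anomalies
df["transcript_length"] = df["transcript"].str.len()
df["transcript_length"].plot(kind="hist", bins=20,
                             title="Transcript length distribution")
plt.show()
```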
I also uploaded the Yorùbá alignments to Claude, and it said, “Hey, sure, let me analyze these transcripts”, and it just output some general takeaways, like “There are 50 audio links. The transcript links–” There’s no Python code there; it just gave me some takeaways. And then I said, “Are there any anomalies?” And it said, “I checked and I can’t find any.” And “Could you translate it?” - “Unfortunately, I can’t.” So it’s all still a chat-based thing.
So you can see kind of different approaches to this complicated workflow of having almost an assistant agent executing code for you, versus putting more context in the language model and having it reason over that context.
So they almost each have their own strengths, with different approaches to problems. Would that be fair?
Yeah.
So that’s another way of thinking about it, is you start understanding how the different large language models approach a problem, and the tooling that might be better or worse for a given use case; that also will help you kind of pick which way you want to go, in addition to maybe just using multiple models, as you’ve talked about earlier.
Yeah, exactly. And there’s so much to dive into on all these topics we’ve covered today… I’m going to make sure we include some really good learning resources in the show notes, so make sure to click on some of those. There’s a guide from DataGen on the Neural Radiance Field stuff, the NeRF stuff, where you can learn a bit more about that… There’s a Hugging Face post and a Philipp Schmid post on LLaMA 2 that are both really practical - how do you run it, how do you fine-tune it, what does it mean?
And then there’s a nice post from Ethan Mollick’s One Useful Thing blog/newsletter about Code Interpreter - how to get it set up, and some things to try. So we’ll link that in our show notes, and I think people should dig in. Get hands-on with this stuff. Things are updating quickly, and the only way to really build intuition is to dive in and get hands-on.
It is. It’s the most interesting moment we’ve had in the AI revolution of recent years. Just so much cool stuff right now. Anyway, thank you for taking us through all the understanding and explanation of these things.
Yeah, definitely. It was a good time. Hopefully, people enjoy the rest of their week, and maybe go see Oppenheimer, or Barbie, depending on which of those is most interesting to you… But we’ll see you next time, Chris.
See you later. Thanks.
Our transcripts are open source on GitHub. Improvements are welcome. 💚