Large Language Models (LLMs) continue to amaze us with their capabilities. However, the utilization of LLMs in production AI applications requires the integration of private data. Join us as we have a captivating conversation with Jerry Liu from LlamaIndex, where he provides valuable insights into the process of data ingestion, indexing, and query specifically tailored for LLM applications. Delving into the topic, we uncover different query patterns and venture beyond the realm of vector databases.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Typesense – Lightning fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!
|Chapter Number||Chapter Start Time||Chapter Title|
|1||00:00||Welcome to Practical AI|
|4||08:12||What do I get?|
|5||11:23||More power less work|
|6||13:26||Fitting the pieces together|
|7||16:49||3 Levels of integrating|
|8||19:13||How to think about indexing|
|10||23:09||Index vs Vector storage|
|12||30:41||Query scheme workflow|
|14||40:03||Awesome new stuff|
|15||43:23||Links and show notes|
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist and founder of a company called Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing very well, enjoying this fine springtime weather of LLMs…
Yes, the spring LLM bloom, I guess…
Well, I don’t even think we can use the word bloom, because that’s loaded now…
Yeah, I was gonna say, that has a whole different meaning.
There’s no word that’s not loaded with some sort of AI meaning at this point…
Yeah… We should just go straight to our guest. Yeah, including llamas, which we’re excited today to have with us Jerry Liu, who is co-founder and creator of Llama Index. Welcome, Jerry.
Yeah, thanks Daniel and Chris for having me. Super-excited to be here.
Yeah, I’m really excited, because we’ve had a few conversations in the past, and I’ve used Llama Index in some of my own work, and also kind of tried some integration stuff with various data sources… So I’m really excited to hear a little bit more of the story and kind of the vision behind the project. If I’m just reading from the docs, Llama Index is about connecting LLMs or large language models with external data. So maybe a first question, kind of a general question, not specific to Llama Index necessarily, is “Why would one want to connect large language models with external data?”
Yeah, it’s a good question. And so for those of you who are already in the space of LLM application development, this might sound obvious to you, but for those of you who might be still somewhat unfamiliar, large language models have a lot of different sorts of capabilities. They’re really good at answering questions, doing tasks, being able to summarize stuff… Basically, anything you throw at it, trying to write a short story, write a poem, it can do. And the default mode of interacting with a language model like ChatGPT is that you would write stuff to it in a chat interface, this query would hit the model, and you’d get back some output.
I think one of the next questions that people will get into, especially as they’re trying to explore building applications on top of large language models, is “How can this language model understand my own private data?” …whether you’re kind of a single person, or you’re an entire organization. And these days, there’s a lot of different ways for actually trying to incorporate new knowledge into a language model. The models themselves are trained on just a giant corpus of data, and so if you’re an ML researcher, your default mode is just “How can I train this model on more data, so that it can try to memorize this knowledge?” And the algorithm there is basically some sort of gradient descent on the weights, or RLHF, or any sort of fancy ML algorithm that actually includes the knowledge in the weights of the model itself.
I think one interesting thing about large language models these days is that instead of training the model, you can actually take the model as is, and just figure out how to have it reason over new information. So for instance, use that input prompt as the cache space to feed in new information, tell it to reason over that data, and to answer questions over that data. And I think that’s very interesting, because you can take the model itself, which has been trained on a variety of data, but doesn’t necessarily have inherent knowledge about you as a person, or your organization data… But then you can tell it, “Hey, here are some new data that I have. Now, given this data, how can I answer the following questions?” And this is part of the stack that a lot of people are discovering these days, where you can actually just use the language model itself as a pre-trained service, and then wrap that in this overall software system to incorporate your data with the language model.
Cool. Yeah. And your project is called Llama Index. Now, before the past few months, or six months, or whatever, when I was thinking about indices or an index, one of the things that first came to my mind was “Oh, I have a database maybe, and there’s an index that I use to query over that database…” And some of that is a little bit fuzzy magic to me, in terms of how that actually works at the lower level in a database, but what is this idea of an index, or indexing in the context of Llama Index, or in the context of data augmentation for large language models?
It’s kind of funny… I think when we first started the name, it was a bit more of a casual naming convention. It used to be called GPT Index, and I kind of made up that name because it sounded roughly relevant to what I was building at the time… But I think over time, especially as it’s morphed into more of a project that people are actually using, this concept of an index has become a bit more concrete, and so I can articulate that a bit better.
The idea of Llama Index is - you know, just to step back and talk about the overall purpose of the project… It’s to make it really easy and powerful and fast and cheap to connect your language models with your own private data. And we have a few constructs to do so within Llama Index. So part of the way you can think about Llama Index is how can we build some sort of stateful service around your private data, around something that at the moment is somewhat stateless? Like, the language model call is a stateless service, because you feed in some input and you get back some output. So how can we wrap that in a stateful service around your own data sources, so that if you want to ask a question, or tell the LLM to do something, it can reference that state that you have stored.
[06:12] And so if you think about any sort of data system, there is the raw data that’s stored somewhere in some storage system, there might be indexes or views, similar to a database analogy, where you can kind of look at the data in different ways; and I can talk a little bit about how that works. And then there’s usually some sort of query interface that you can actually query and retrieve the data.
So if you look at a SQL database, you have the raw data stored in some sort of tables, you define different indexes over different columns, and then the query interface is a SQL interface. You run SQL, and then it’ll be able to execute the query against your database. And there’s a lot of kind of roughly similar concepts that apply to thinking about the Llama Index itself as this toolset… Because if we’re going to build this stateful service on top, that can integrate with large language models - by the way, to clarify, we’re not really solving the storage part, where we integrate with a ton of different vector storage providers, we integrate with other databases, too… But if you even think about us as some sort of data interface or orchestration, there’s the raw data, which needs to be stored somewhere. And so if you have a bunch of text documents, you need to store that in a vector database, or a MongoDB, or S3, all those types of things. And then you can define these different indexes on top of this data.
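The database analogy above (raw data in tables, an index over a column, and SQL as the query interface) can be sketched with Python's built-in sqlite3 module. This is only an illustration of the analogy, not anything from Llama Index; the table name, column names, and rows are made up for the example.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# 1. Raw data stored in a table (hypothetical schema and values).
conn.execute("CREATE TABLE filings (company TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO filings VALUES (?, ?, ?)",
    [("Acme", 2021, 10.0), ("Acme", 2022, 12.5), ("Globex", 2022, 7.0)],
)

# 2. An index defined over one column, giving the engine a faster lookup path.
conn.execute("CREATE INDEX idx_filings_year ON filings (year)")

# 3. The query interface: plain SQL executed against the stored data.
rows = conn.execute(
    "SELECT company, revenue FROM filings WHERE year = ? ORDER BY company",
    (2022,),
).fetchall()

for company, revenue in rows:
    print(company, revenue)
```

The three numbered steps mirror the three pieces named in the conversation: storage, indexes, and a query interface.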
And the way we think about indexes is “How do we structure your data in the right way, so that you can retrieve it later for use with LLMs?” And so then I can talk a little bit how this works, but the set of indexes that you can define is actually pretty interesting. Basically, the set of data structures that offers a view of your data in different ways. And you wrap that in this overall query interface that can use these indexes on top of your data, to do retrieval and LLM synthesis and give you back a final answer.
And so I would look at this in terms of the components of the overall system, there’s just - if you’re building this stateful service, there’s these three components: how do you ingest and store the raw data, index it and then query it?
So I want to actually pull you back for just a moment as we’re kind of learning this… If you’re an app developer and you’re interested in creating a stateful service, and you’ve started kind of going down the path about like, well, there’s kind of the old school way of going and doing a SQL query, and all that, and now we’re using LLM models and adding our data to it… I know that we’ve kind of gone beyond that just a little bit, but if you can back up and talk a little bit about what are you getting? …if you’re the app developer and you’re listening to this and you’re trying to understand, “Why would I go down that path? I sense that there’s value there”, but we haven’t talked about it. Versus a robust set of SQL queries on your own data. Why would you bring in that large language model in the beginning? What is it bringing to bear that’s worth all of that effort? Could you talk a little bit about that baseline value-add to it?
Yeah, that’s a really good question. And I think I might have jumped the gun a little bit, so I appreciate you bringing me back.
No worries. It’s because you’re excited, as are we. But I also want to make sure that people listening have a chance to truly understand it the same way that you are.
Definitely. I think one thing about language models that’s very powerful is their ability to just comprehend unstructured text and also natural language. And so this matters in both ways, in terms of how you can store the data, as well as query the data. Because now let’s say you’re the end user, you can just type in a natural language English question, ideally into this interface, and get back a response. And so the setup is way easier than having to learn SQL over some source of data, or having to even code up this very complex pipeline to try to parse the data in different ways. Because you could treat the language model itself as a black box - feed it something, get something out. And so I think that by itself is a very, very powerful tool, and I think these days people are trying to figure out what you can do with that tool.
[10:05] So another kind of illustrative example of the power of language models, using them as an intelligent natural language interface, is that you actually don’t have to do a ton of data parsing when you feed in the data. So for instance, let’s say you have a PDF document, or any sort of Microsoft Word document, or even an HTML webpage - just copy and paste that entire thing. Just extract the text from it, dump it into the input prompt, and then just tell the LLM, “Hey, here’s just this giant blob of text I copied over. Now, given this text, can you please answer this following question?” And the crazy thing is the language model can actually do that, assuming it fits within the prompt space. And that’s also very powerful because this kind of affects the way you do ETL and data pipelining. In the traditional sense, if you had a bunch of this unstructured text, you’d have to spend either manual effort or write a complicated program to pull out the relevant bits from this text, parse it into some table, store it, and then you’d run SQL or some other query over this text. Whereas here, with the power of language models, you can store this text in a bit more of a messy, unstructured format as like raw natural language, and then still figure out a way to pull out this unstructured text, just dump it into the input prompt, and ask a question over it.
Is it conceivable with what you’re saying, if I’m thinking as an app developer about diving into this, that I’m hearing you say you’re going to do this, which is an additional thing to learn, and be able to go – it’s an additional skill set that you’re adding on. But I also hear you talking about other things that I used to have to do, that maybe I don’t have to do anymore. And to some degree, is it realistic to say, from an effort standpoint, it becomes a wash once you have the skills a little bit? Or maybe even you’re gaining more power and doing less work along the way to do it, so that it’s kind of like of course you would do it going forward? Is that a fair way of thinking about it?
Yeah, so it’s an interesting way of thinking about it, because I think the high-level question is just what parts have become easier, and what parts have gotten harder once you have this language model technology? Because on one hand, things have gotten a bit easier and powerful to build these expressive question answering systems, with less effort. You take in this giant blob of unstructured text, you figure out how to store it, you feed it into the language model, and then all of a sudden you can ask these questions over these files that you couldn’t really do before with more kind of traditional AI technologies, or just manual programming.
That said, I think this new paradigm kind of involves its own set of challenges, that I’m happy to talk about. I think there’s a lot of stacks emerging about how to make the best use of language models on top of your data, and there’s some very basic stuff that’s happening these days, but there’s also kind of more advanced stuff that we’re working on. And I do think it’s very interesting to think about what are the technological challenges that are preventing us from unlocking the full capabilities of language models. Because again, with a very basic stack - and again, you can see this if you just play around with ChatGPT, you can already get a ton of value from your data by just doing some very basic processing on top of it, and you can start asking questions that you couldn’t really ask before. But with some more advanced capabilities, and some – once you’re solving some more interesting technical problems, what are kind of the additional queries that you can ask on top of your data that you also couldn’t do before?
Before we jump into – so I want to kind of dive into the weeds about the two things that you talked about, like how do I index my data, how do I query my data, all the goodness around that in Llama Index… Before we do that, maybe just also to set the stage for some people that are coming into this and maybe parsing some of the jargon that’s thrown around… So one of the other things that people are really kind of diving into is thinking about “How do I engineer my prompts? How do I chain prompts together?” and all of that sort of thing. Could you highlight – because at least the way I would phrase it, those two things are complementary with the things that you’re doing with Llama Index. But could you kind of help people understand how do those pieces fit together in terms of architecting one of these systems?
[14:12] I guess in the end LLM application development, to put it in a very over-simplified view, is just some fancy form of prompt engineering and prompt chaining. It’s actually not super-different with how we’re thinking about building this interface with data. And so just as a very basic example, if you’re kind of coming into this space fresh, a very basic prompt that you could put into a language model is something like the following: “Here is my question”, and then you put the question here, and then you put in “Here’s some context.” And in this context variable you just dump all the context that could be relevant to the question. You copy and paste a blog post, you copy and paste the API documentation… Just copy and paste it into the input prompt space. And then now at the bottom say “Given this context, give me an answer to this question.” And you send it to a language model, and then you get back an answer.
So that’s like the most basic question-answering prompt that you could use to kind of perform some sort of question answering over your data. It really is just prompting, because you’re putting stuff into the prompts, you have this overall prompt template, and then you have variables that you want to fill in.
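As a rough sketch of the prompt template just described, the snippet below fills a question variable and a context variable into a fixed template. The exact wording of the template and the names QA_PROMPT_TEMPLATE and build_qa_prompt are assumptions made for illustration; they are not taken from Llama Index.

```python
# Hypothetical template following the spoken description:
# question first, then context, then the instruction at the bottom.
QA_PROMPT_TEMPLATE = (
    "Here is my question:\n{question}\n\n"
    "Here is some context:\n{context}\n\n"
    "Given this context, give me an answer to this question."
)


def build_qa_prompt(question: str, context: str) -> str:
    """Fill the two template variables and return the final prompt string."""
    return QA_PROMPT_TEMPLATE.format(question=question, context=context)


# Example usage with made-up context text.
prompt = build_qa_prompt(
    question="What does the service cost per month?",
    context="Pasted documentation: the basic plan costs 5 dollars per month.",
)
print(prompt)
```

The resulting string is what would be sent to the language model; everything else in a larger system exists to decide what goes into the context variable.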
I think one interesting kind of challenge that arises is how can you feed in context that exceeds the prompt window? Because for GPT-3 it’s 4,000 tokens, for Anthropic I guess it’s like 100,000 tokens. But if you look at an Uber SEC 10-K filing, it’s like 160,000 tokens. So if you want to ask a question like “What’s a summary of this entire document?” or “What are the risk factors in this very specific section of the document?”, how do you feed that entire thing so that you can basically answer the following question? And I think that’s where things get a little bit more interesting, because you can basically do one or more of the following things. One is you could have some external model, like something separate from the language model prompt, that’s actually doing retrieval over your data to figure out what exactly is the best context to actually fit within this prompt space. Two is you can do some sort of synthesis strategies to synthesize an answer over long context, even if that context doesn’t actually fit into the prompt. For instance, you could chain repeated LLM calls over sequential chunks of data, and then combine the answers together to actually give you back a final answer. That’s one example.
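The second strategy mentioned here, chaining LLM calls over sequential chunks and then combining the answers, might look roughly like the sketch below. The llm argument is a hypothetical callable that takes a prompt string and returns a string; fake_llm is a stand-in so the example runs without any model, and the prompt wording is an assumption.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"[answer from a prompt of {len(prompt.split())} words]"


def answer_over_long_context(
    question: str, chunks: List[str], llm: Callable[[str], str]
) -> str:
    """Ask the question against each chunk, then combine the partial answers."""
    # One call per chunk, so no single prompt has to hold the whole document.
    partial_answers = [
        llm(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks
    ]
    # A final call that sees only the partial answers, not the raw chunks.
    combined = "\n".join(f"- {answer}" for answer in partial_answers)
    return llm(
        f"Partial answers:\n{combined}\n\n"
        f"Combine these into one final answer to: {question}"
    )


final_answer = answer_over_long_context(
    "What are the risk factors?",
    ["chunk one text", "chunk two text", "chunk three text"],
    fake_llm,
)
print(final_answer)
```

This costs one model call per chunk plus one more to combine, which is the trade-off for handling context that does not fit in a single prompt.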
In the end, all this architecture is just kind of designed around being able to feed in some input to LLM and get back some output, and the core of that really is the prompting. So part of this is just developing an overall system around the prompting.
Well, Jerry, you had mentioned kind of these three levels of integrating external data into your LLM applications or sort of data ingestion, and there’s like indexing and query. I’m assuming data ingestion has to do with like “Oh, I’m going to connect to the Google Docs API and pull all the data over, and then indexing and query build on top of that.” But before we dive into those second two phases, which is where I think a lot of the cool stuff that you’re doing is found, what should we know about sort of data in the data ingestion layer in terms of relevance to how Llama Index builds on that, and other things?
The data ingestion side is just the entrypoint to building a language model application on top of your own data. Say I think LLMs are cool, and I want to use them on top of some existing services. What are those services, and how can I load in data from those services? One component of Llama Index is this kind of community-driven hub of data loaders called LlamaHub, where we just offer a variety of different data connectors to a lot of different services. I think we have over 90-something different data connectors now. And these include connections to the file formats. So for instance like PDF files, HTML files, PowerPoints, images even… They can include connectors to APIs like Notion, Slack, Discord, Salesforce… Sorry, we actually don’t have Salesforce yet. That’s something that we want.
Yeah, it’d be very useful. If you’re interested in contributing a Salesforce loader, please. I would love that. And then the next part is just being able to connect to kind of different sorts of multimodal formats, like audio, images, which I think I’ve already mentioned.
So the idea here is you have all this data, it’s stored in some format, it’s unstructured… It could be text, or it could even be images, or some other format. How do you just load in this data in a pretty simple manner, and just wrap it with some overall document abstraction? So there’s not a ton of tech going on here, and the reason – it’s more just a convenience utility for developers to just easily load in a bunch of data. And again, going back to the earlier point, the reason there’s not too much tech is LLMs are very good at reasoning over unstructured information, so you actually don’t need to do a ton of parsing on top of this data that you load to basically get some decent results from the language model. And so once you actually load in this data in a lightweight container, you can then use it for some of the downstream tasks, like indexing and querying.
Awesome. Yeah. And this is, I think, where things get super-interesting, like I mentioned. So in Llama Index - I’m in the docs right now; you mention list, and table, and tree, and vector store, and structured store, and knowledge graph, and empty indices… Could you describe, generally, how to think about an index within Llama Index, and then why are there multiple of these, and what categories generally do they fit in?
One way of thinking about this is just taking a step back at a high level, what exactly does the data pipeline look like if you’re building an LLM application? So we started with data ingestion, where you load in a document from some data source, like a PDF document or an API. Now you have this unstructured document. The next step, typically, is you want to chunk up the text into text chunks. So let’s say - naively, let’s say you have just a giant blob of text from a PDF; you can split it every 4000 words or so, or every 500 words, into some set of text chunks. This just allows you to store this text in units that are easier to feed into the language model, and a lot of this is a function of the fact that the language model itself has limited prompt space. So you want to be able to chunk up a longer document into a set of smaller chunks.
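The naive chunking step described above, splitting a blob of text every fixed number of words, can be sketched as follows. The function name chunk_words and the default of 500 words are illustrative choices; real pipelines often split on tokens or sentences instead.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Example: a made-up 1,200-word "document" becomes three chunks.
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_words(document, chunk_size=500)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk is then small enough to fit in the prompt alongside the question, which is the whole point of the step.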
Now you have these chunks that are stored somewhere. They can be stored, for instance, within a vector database, for instance like a Pinecone, Weaviate, Chroma; they could also be stored, for instance, in a document store, like a MongoDB, or you could store it in a filesystem on your local disk… Now they’re stored. The next part is how do you actually want to define some sort of structure over this data? A basic way of defining some sort of structure over this data - and this is where we get into indices - is just adding an embedding to each chunk. So if you have a set of texts, how do you define an embedding for each set of texts? And that in itself could be treated as an index. An index is just a lightweight view over your data. The vector index is just adding an embedding to each piece of text.
There’s other sorts of indexes that you could define to define this view over your data. There’s a keyword table that we have, where you just have a mapping from keywords to the underlying text. You could have a flat list, where you just basically store a subset of node IDs as its own index.
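To make the keyword table and the flat list concrete, here is a minimal sketch of both as plain Python data structures. This is not the Llama Index implementation; the node IDs, texts, stopword list, and function names are all made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical text chunks keyed by node ID.
nodes = {
    "node-1": "The American Revolution began in 1775.",
    "node-2": "The singer released her first album in 1999.",
    "node-3": "The revolution in recording technology changed music.",
}

# Tiny illustrative stopword list so common words are not indexed.
STOPWORDS = {"the", "in", "her", "a", "an", "of"}


def build_keyword_table(node_texts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of node IDs whose text contains it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for node_id, text in node_texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOPWORDS:
                table[word].add(node_id)
    return table


def build_list_index(node_texts: Dict[str, str]) -> List[str]:
    """A flat list index is just the node IDs kept in order."""
    return list(node_texts)


keyword_table = build_keyword_table(nodes)
list_index = build_list_index(nodes)

print(sorted(keyword_table["revolution"]))
print(list_index)
```

Both structures are lightweight views over the same stored chunks: the table answers "which chunks mention this word", while the list simply enumerates the chunks.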
Before I get into the technicals of the indexes and what they actually do, one thing to maybe think about is just what are the end questions that you want to ask, and what are some of the use cases that you’d want to solve?
Before you dive into that, I was gonna ask you really quick - could you define what an embedding is, for those people who are learning large language models at this point, just so they’ll understand what it is when you say you’re defining that as the index?
[21:55] Embeddings are part of a very common stack that’s emerging these days around this LLM data system. So an embedding is just a vector of numbers, usually like floating point numbers. You could have like 100 of them, 1000 of them… It depends on the specific embedding model. And the way an embedding works - just think about this list of numbers as a condensed representation of the piece of content that you have. If you can somehow, in a very abstract manner, take in some piece of content… Let’s say this paragraph is about the biography of a famous singer. And then you get an embedding from that. It’s a string of numbers. The embedding has certain properties such that this string of numbers is closer to other strings of numbers that are about semantically similar content, and farther away from strings of numbers representing texts that are farther away in terms of semantic content.
So for instance, if you look at the biography of a singer, it’s going to be pretty close to the biography of another singer. Versus if it’s about - I don’t know, like the American Revolution or something - that embedding will probably be a little bit further away. And so it’s a way of condensing a piece of text into some vector of numbers that has some mathematical properties where you can measure similarity between different pieces of content.
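One common way to measure that closeness is cosine similarity. The sketch below uses three tiny hand-written vectors as stand-ins for real embeddings; the numbers are invented purely to illustrate the idea, and a real embedding model would produce hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (not output from any real model).
singer_bio_a = [0.9, 0.8, 0.1]
singer_bio_b = [0.8, 0.9, 0.2]
american_revolution = [0.1, 0.2, 0.9]

similarity_between_singers = cosine_similarity(singer_bio_a, singer_bio_b)
similarity_to_revolution = cosine_similarity(singer_bio_a, american_revolution)

# The two singer biographies should score higher than singer vs. revolution.
print(similarity_between_singers > similarity_to_revolution)
```

Other distance measures, such as Euclidean distance or dot product, are also used in practice; which one applies depends on the embedding model and the vector store.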
Maybe this is another point of distinction… And I get all these questions very often, so I think it’s useful to discuss them on the show. Last week at ODSC I got a lot of these sorts of questions. So we’re talking about bringing in data, creating an index to access that data; that index might involve a vector store, or embeddings… But Llama Index is sort of not a vector store. It’s cool to be a vector database company right now, but Llama Index is something different, and again, these are two things that are complementary, I think. Could you draw out that distinction a little bit, just to help people kind of formulate those compartments in their mind?
I think these days there’s a lot of vector store providers, and they handle a lot of the underlying storage components. So if you look at like a Pinecone, or Weaviate, they’re actually dealing with the storage of these unstructured documents. And one thing that we want to do is leverage these existing kind of storage systems, and expose query interfaces, I guess a broader range of query interfaces, beyond the ones that are just directly offered by a vector store.
So for instance, a vector store will offer a query interface where you can typically query the set of documents with an embedding, plus a set of metadata filters, plus maybe some additional parameters. But we’re really trying to build this broader set of abstractions and tools through our indices, our query interfaces, plus other abstractions that we have under the hood, to basically perform more interesting and advanced operations, and manage the interaction between your language model and your data, and almost be a data orchestrator on top of your existing storage solutions.
So we do see ourselves as separate, because we’re not trying to build the underlying storage solutions. We’re more trying to provide a lot of this advanced query interface capability to the end user, using the power of language models on top of your data.
I think we got off-track a little bit, but I think it was good… So kind of circling back to the indices that are available in Llama Index… And you’ve talked about this pipeline of processing, and potentially one index being a vector store… And maybe listeners are a little bit more familiar with kind of vector search, or semantic search, or that sort of thing with everything that’s going on… But you have much more than that; these other patterns and these other indices that enable other patterns. Could you describe some of those alternatives or additions to vector store index, and when and how they might come into play?
Yeah, that’s a good question. And maybe just to kind of frame this with a bit of context - I think it’s useful to think about certain use cases for each index. So the thing about vector index, or being able to use a vector store, is that they’re typically well-suited for applications where you want to ask kind of fact-based questions. And so if you want to ask a question about specific facts in your knowledge corpus, using a vector store tends to be pretty effective.
[26:13] For instance, let’s say your knowledge corpus is about American history, or something, and your question is, “Hey, what happened in the year of 1780?” That type of question tends to lend itself well to using a vector store, because the way the overall system works is you would take this query, you would generate an embedding for the query, you would first do retrieval from the vector store in order to fetch back the most relevant chunks to the query, and then you would put this into the input prompt of the language model.
So the set of retrieved items that you would get would be those that are most semantically similar to your query through embedding distance. So again, going back to embeddings - the closer the embeddings of your query and your context are, the more relevant that context is, and the farther apart they are, the less relevant. So you get back the most relevant context for your query, feed it to a language model, get back an answer.
There are other settings where standard Top-K embedding-based lookup - and I can dive into this in as much technical depth as you guys would want - but there are settings where really standard Top-K embedding-based retrieval doesn’t work well. And one example where it doesn’t typically work well - and this is a very basic example - is if you just want to get a summary of an entire document or an entire set of documents. Let’s say instead of asking a question about a specific fact, like “What happened in 1776?” maybe you just want to ask the language model “Can you just give me an entire summary of American history in the 1800s?” That type of question tends to not lend itself well to embedding-based lookup, because you typically fix a Top-K value when you do embedding-based lookup, and you would get back very specific context. But sometimes you really want the language model to go through all the different contexts within your data.
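The retrieve-then-prompt flow described above can be sketched end to end. Here embed is a toy stand-in that just counts words, so it only captures word overlap, not meaning the way a real embedding model would; the function names, sample chunks, and prompt wording are all assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(
        chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True
    )
    return ranked[:k]


# Made-up knowledge corpus.
corpus = [
    "In 1780 the British captured Charleston during the Revolutionary War.",
    "The Constitution was ratified in 1788.",
    "Jazz emerged in New Orleans in the early 1900s.",
]

question = "What happened in the year 1780?"
top_chunks = retrieve_top_k(question, corpus, k=1)

# The retrieved context is what gets placed into the model's input prompt.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
print(prompt)
```

In a real system the vectors would be computed once at indexing time and stored in the vector store, rather than recomputed for every query as this sketch does.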
So a vector index, storing it with embeddings would create a query interface where you can only fetch the k most relevant nodes. If you store it, for instance, with like a list index, you could store the items in a way such that it’s just like a flat list. So when you query this list index, you actually get back all the relevant items within this list, and then you’d feed it to our synthesis module to synthesize the final answer. So the way you do retrieval over different indices actually depends on the nature of these indices.
Another very basic example is that we also have a keyword table index, where you can kind of look up specific items by keywords, instead of through embedding-based essence. Keywords, for instance, are typically good for stuff that requires high precision, and a little bit lower recall. So you really want to fetch specific items that match exactly to the keywords. This has the advantage of actually allowing you to retrieve a bit more precise context than something that factor-based embedding lookup doesn’t.
The way I think about this is a lot of what Llama Index wants to provide is this overall query interface over your data. Given any class of queries that you might want to ask, whether it’s like a fact-based question, whether it’s a summary question, or whether it’s some more interesting questions, we want to provide the tool sets so that you can answer those questions. And indices, defining the right structure of your data is just one step of this overall process, and helping us achieve this vision of a very journalizable query interface over your data.
Some examples of different types of queries that we support - there’s the fact-based question lookup, which is semantic search using vector embeddings. Then you can ask summarization questions using our list index. You could actually run a structured query, so if you have a SQL database, you could actually run structured analytics over your database, and do text-to-SQL. You can do compare and contrast type queries, where you can actually look at different documents within your collection, and then look at the differences between them. You could even look at temporal queries, where you can reason about time, and then go forwards and backwards, and basically kind of say “Hey, this event actually happened after this event. Here’s the right answer to this question that you’re asking about.”
And so a lot of what Llama Index does provide is a set of tools, the indices, the data ingesters, the query interface to solve any of these queries that you might want to answer.
So Jerry, you really got me thinking a lot about this, the possibilities of the query schemes is pretty darn cool. We kind of started with ingest and moved into kind of indexing, and now we’re talking about queries… Could you kind of give me an example with the tool at a little bit more of a practical level? Because you kind of hit the concepts about what the possibilities are… But as someone who hasn’t used the tool myself, I’m trying to get a sense of what that workflow is like… Pick what would probably be a really common query scheme that you’re doing and dive into that just a little bit to give us a sense of a hands-on, practical, fingers-on-keyboard sense of it… Because I’m trying to get a sense of where I’m gonna go for playing after we get done with the episode, so… I want to try it.
A hundred percent. I think one thing that has popped up pretty extensively after talking to a variety of different users is actually financial analysis. I think looking at SEC 10-Ks tends to be a pretty popular example. If you look at the Anthropic Claude example, they also use SEC 10-Ks… And my guess - the reason it’s popular is one, there’s just a ton of text, and so it’s just very hard to parse, to read it as a human. Two, it’s a useful thing for people in financial institutions, like consulting, because you want to compare and contrast the performance of different businesses, and look at the performance across years.
Believe it or not, I actually read 10-Ks a lot, and that would be a really useful example for me. Believe me, I’m not kidding you.
As a result, we’ve actually been playing around with it a decent amount, too. Some of the cool things that we’re showing that Llama Index can do on top of your 10-Ks is, for instance, let’s say you have two companies, let’s say Uber and Lyft, for the year 2021. You can actually ask a question like “Can you actually compare and contrast the risk factors for Uber and Lyft? Or their quarterly earnings across these two documents?” One is the Uber 10-K, one is the Lyft 10-K. This is actually an example where if you do just Top-K embedding-based lookup, the query fails, because if you ask the question “Compare and contrast Uber and Lyft” and don’t do anything to it, and let’s say your Uber and Lyft documents are just in one vector index, you don’t really have a guarantee you’re gonna fetch the relevant context to this question to be able to answer this thoroughly. And then the model might hallucinate, you’ll get back the wrong answer, and then it’s just not a good experience.
I think what you typically want to do is have some sort of nicer abstraction layer on top of this query, that can actually kind of map that query to some plan that would roughly be how a human would think about executing or answering this question.
Let’s say you want to compare and contrast the financial performance of Uber and Lyft in the year 2021. Well, first, okay, what was the financial performance of Uber in 2021? What was the financial performance of Lyft? You break it down into those two questions, and then for each question, use it to kind of look over your respective index. Let’s say you have an index corresponding to Uber and an index corresponding to Lyft. Get back the answer, get back the actual revenue, for instance, for Uber and Lyft, and then synthesize both of them at the top level again; be able to pull in the individual components you extracted from each document, and then synthesize a final response that’s able to compare the two. So that’s an example of something that we can actually do pretty well with Llama Index, and we have a variety of toolsets for allowing you to do that. And that’s an example of a query that’s kind of more advanced, because it requires comparisons beyond just kind of asking stuff over a single document.
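The decomposition described here can be sketched roughly as follows. This is plain Python rather than the Llama Index API: `decompose`, `answer_over_index`, `synthesize`, and `compare_and_contrast` are hypothetical names, the two stubs stand in for retrieval and LLM calls, and the index contents are placeholders rather than real 10-K text.

```python
def decompose(question_template, entities):
    """Build one sub-question per entity from a template with an {entity} slot."""
    return {name: question_template.format(entity=name) for name in entities}

def answer_over_index(index, sub_question, k=1):
    """Stub for retrieval plus an LLM call; here it just returns the first k chunks."""
    return " ".join(index[:k])

def synthesize(question, partial_answers):
    """Stub for the top-level LLM call that combines the per-entity answers."""
    lines = [f"{name}: {answer}" for name, answer in partial_answers.items()]
    return f"Q: {question}\n" + "\n".join(lines)

def compare_and_contrast(question, template, indices):
    """Route each sub-question to its own index, then synthesize at the top level."""
    sub_questions = decompose(template, indices)
    partial = {
        name: answer_over_index(indices[name], sub_q)
        for name, sub_q in sub_questions.items()
    }
    return synthesize(question, partial)

# One index per document; the chunk text is a placeholder, not real filing data.
indices = {
    "Uber": ["<chunk from the Uber 10-K>"],
    "Lyft": ["<chunk from the Lyft 10-K>"],
}
report = compare_and_contrast(
    "Compare the financial performance of Uber and Lyft in 2021.",
    "What was the financial performance of {entity} in 2021?",
    indices,
)
print(report)
```

Keeping one index per document is what guarantees that each sub-question only sees context from the company it is about, which a single shared Top-K lookup cannot promise.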
[34:08] Another example to just kind of take the 10-K analogy further is let’s say you have the yearly reports of the same company across different years; let’s say from 2018 to 2022. You can ask a question like “Did revenue decline, go up or down in the last three years?” And then you can actually do a very similar process, where given the query interface that we provide, break this question down into sub-questions over each year, pull out the revenue, and then basically at the end, do a comparison step to see whether or not it increased or declined.
Just as an aside, to any listeners out there wondering why on Earth somebody would read 10-Ks, especially considering that our audience is focused on data and such… If you want to learn about another technology company and really understand what it does and be able to compare it, this is an example where you can gain tremendous intelligence on another company with publicly-available information, and by comparing multi-year 10-Ks, like you just said, you’ll learn way more about that company than its own employees know about it. So anyway, just thought I’d mention that as an aside.
Yeah, I look forward to hearing your success with speeding up your workflows around reading the 10-Ks, Chris, with Llama Index.
I’m excited about this. This is gonna save me a lot of time.
Jerry, one of the things that we talked a little bit about in one of our previous conversations, which I know you’ve also thought very deeply about, and even have a portion of the docs and functionality in Llama Index devoted to, is evaluation; like query-response evaluation, like “How do I know my large language model barfed up an answer?” based on some query, and I pulled in some external data, and I inserted some context, and maybe I strung a few things together… How am I to evaluate the output of that? Could you give us maybe a high-level, from your perspective how you think about this evaluation problem, and then maybe go into a little bit of some of the things that you’re exploring in that space?
Yeah, totally. Just to preface this - we are super-interested in evaluation, particularly evaluation more tailored towards this interface of like your data with LLMs. I can dive into that a bit more. And we have some initial evaluation capabilities, but we’re super-community-oriented. We’d love to just kind of chat with – there’s a lot of different toolsets out there that allow you to do different types of evals over your data, and building nice interfaces for doing so… And so I think this is an area of active exploration and interest for us as well.
And so just kind of thinking about this a little bit more deeply - evaluation is very interesting, because there’s the evaluation of each language model call itself, and then there’s the evaluation of the overall system. And so diving into this a bit more - at a very basic level, if you have a language model, you have an input, and then you get back some output, you can try to validate whether or not that output is correct, given a single language model call. Did the model actually give you the correct answer given the input? Did it spit out garbage, did it hallucinate? That type of thing.
The interesting thing about a lot of systems that are emerging these days is that they’re really systems around a repeated sequence of language model calls. And this applies whether or not you’re dealing with a more agent-based framework, where you ask a question and it can just repeatedly do ReAct chain-of-thought prompting, or be able to pick a tool, but the end result is it’s able to give you back a response. Another example of this is AutoGPT, where you just let it run for five minutes and it just keeps on doing stuff over and over again, until it gives you back something… Or even in the case of retrieval-augmented generation, which is just a fancy name for roughly what we’re doing with Llama Index, which is a query interface over your data.
[37:48] Even within our system, there could be a sequence of repeated LLM calls. But the end result is that you send in some input into the system, and you get back some output. Given this high-level system, how do you evaluate the input and output properly? I think in traditional machine learning, typically what you want to have is you want to have ground truth labels for every input that you send in. So if you, for instance, ask a question, you want to know the ground truth answer, and you want to compare the predicted answer to the ground truth answer, and see how well the predicted answer matches up.
This is still something that people are exploring these days. Even in the space of generative AI and LLMs, you have ground truth text, and then you have predicted text, and you want some way of scoring how close the predicted text is to the ground truth text.
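One simple way to score predicted text against ground truth text is token-level F1, a metric commonly used in question-answering benchmarks. The conversation does not name a specific metric, so this choice and the function name `token_f1` are assumptions; the sketch also uses naive whitespace tokenization.

```python
from collections import Counter

def token_f1(predicted, ground_truth):
    """Token-overlap F1 between a predicted answer and a ground truth answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("revenue went up", "revenue went up"))
print(token_f1("revenue went up sharply", "revenue went up"))
```

A metric like this only works when a labeled answer exists for every question, which is the labeling cost that label-free evaluation tries to avoid.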
I think the core set of eval modules that we have within Llama Index actually are ground truth-free or label-free. And that part in itself is actually very interesting, because you have this input, you ask a question, you get back this predicted response, you also get back the retrieved sources, like the documents themselves. What we’ve found is that you can actually make another LLM call to just compare the sources against the response, and then also compare the query against the sources and the response to see how well all two or three of these components match up. And this doesn’t require you to actually specify what the ground truth answer is. You just look at the predicted answer, see if it matches up to the query or the context in a separate LLM call. And it’s interesting, because one, it makes use of LLM-based evaluation, which is kind of an interesting way to think about it, basically using the language model to evaluate itself. I’m sure there’s downsides which we can get into, but a lot of people are doing it these days, too.
And then the second part is it doesn’t require any ground truth, because you’re using the language model to evaluate the capabilities of its own answer, plus context. You don’t actually need to, as a human, feed in the actual answer. And the benefit of this is it just saves a bunch of time and cost. You don’t actually need to label your entire dataset to run the evals.
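A label-free check along these lines might look like the sketch below. It is not the Llama Index evaluation module: the prompt wording, the function names, and the `llm` parameter (any callable that takes a prompt string and returns text) are assumptions, and a stub is used in place of a real model call.

```python
def build_eval_prompt(query, response, sources):
    """Assemble a judging prompt from the query, the response, and the sources."""
    context = "\n".join(f"- {source}" for source in sources)
    return (
        "You are checking an answer against retrieved context.\n"
        f"Question: {query}\n"
        f"Answer: {response}\n"
        f"Context:\n{context}\n"
        "Is the answer supported by the context and relevant to the question? "
        "Reply YES or NO."
    )

def evaluate_response(query, response, sources, llm):
    """Return True when the judging LLM says the response matches the sources.

    No ground truth answer is needed; `llm` is a hypothetical callable that
    takes a prompt string and returns the model's text reply.
    """
    verdict = llm(build_eval_prompt(query, response, sources))
    return verdict.strip().upper().startswith("YES")

# Stub judge so the sketch runs offline; a real setup would call a model here.
def always_yes(prompt):
    return "YES"

print(evaluate_response(
    "What happened in 1776?",
    "The Declaration of Independence was signed.",
    ["The Declaration of Independence was signed in 1776."],
    llm=always_yes,
))
```

Each evaluated response costs one extra model call, which is where the latency and cost concerns around LLM-based evals come from.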
I still think this overall space is probably relatively early. I think there’s still some big questions around latency and cost, if you’re trying to really do LLM-based evals more fully. Using the LLM to evaluate a large dataset takes a lot of time and costs a lot of money, and so this is generally kind of like an area that we’re still kind of actively thinking about.
Yeah, that’s awesome. As we kind of get nearer to the end here, I know things are progressing so quickly, I can’t keep up with all of your tweet threads about awesome new stuff that’s happening in Llama Index… But I know there’s a lot – as you look to kind of the next year and where your mind is at, what you really want to dive into, and also what’s really exciting to you about the field… There’s a lot of people excited about a lot of different things, but from your perspective, having been in the trenches, building large language model applications, interacting with users of Llama Index, what is it that really excites you moving into this next year in terms of the possibilities, the real practical possibilities on the horizon, and how kind of our development workflows will be changing in the near future?
Yeah, totally. I think there’s a few related components to this I’m both excited for, as far as the challenges that we’re gonna solve. Probably the first component is just being able to build this automated query interface over your data. If you’re looking at all the query use cases that we solve, one of the key questions that we keep going back to is “Here’s a new use case on top of this data, and here’s a new question that you’d want to ask. How do we figure out how to best fulfill that query request?” And I think, especially as your data sources become more complicated, then it’s just like “How do you think about how to index and structure the data properly? How do you think about interesting automated interactions that happen at the query layer between the language model and your data?” How do we basically just make sure we solve this request?
And then the second component of this is, we want to make sure that we build this interface that can handle any type of query you throw at it, and how do we do this in a way that is cheap, fast and easy to use? For a lot of users, once they move on beyond the initial prototyping phase, they think about starting to minimize cost and latency, and they think about “How do you pick the best model for the job?” There’s the OpenAI API, which works well, generally. It’s probably the best model out there. But it can also be quite slow. And then there’s also these open source LLMs popping up, probably like a few new ones every week, and then how do users make the best decisions whether to use that over their data.
And then I think the next part to this is - you know, a lot of LLM development is moving in this overall trend of automated reasoning. If you look at agents and tools, and AutoGPT, and all this stuff, it’s just like “How do you make automated decisions over your data?” And then I think as a consequence, there’s always going to be this trade-off between how few constraints can we give it, and how many? Like, should we give it more constraints or fewer constraints? Because the fewer constraints you give to something like this, it has more flexibility to potentially do way more things, but then also just error. It will just make mistakes, and then there’s really no way to correct it easily, and then you can’t really trust the decisions. Whereas if you kind of constrain the outputs of these automated decision-makers or agents, then you can potentially get more interpretable outputs, maybe at the cost of a little bit of flexibility in terms of functionality. And I think we’ve been thinking a lot about that with respect to the data retrieval and synthesis space, too. Like, how can we give you back results that are expressive, but also perform well, and aren’t going to make mistakes a ton of the time?
Awesome. Yeah. Well, I’m really happy that we had a chance to talk through all the great Llama Index things, and make sure - if you’re not following Jerry on Twitter, find him there. He posts a lot of great stuff that they’re working on. And of course, you can find Llama Index, if you just search for Llama Index; there’s a great set of docs, and all those things. We’ll make sure we include those links in our show notes for people to find out about that, and get linked to their docs and their blog and all the good things. So check out Llama Index, and thank you so much for joining us, Jerry. It’s been awesome.
Our transcripts are open source on GitHub. Improvements are welcome. 💚