Seems like we're hearing a lot about GraphRAG these days, but there are lots of questions: what is it, is it hype, and what's actually practical? One of our all-time favorite podcast friends, Prashanth Rao, joins us to dig into this topic beyond the hype. Prashanth gives us some background and practical use cases for GraphRAG and graph data.
Sponsors
Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.
Assembly AI – Turn voice data into summaries with AssemblyAI’s leading Speech AI models. Built by AI experts, their Speech AI models include accurate speech-to-text for voice data (such as calls, virtual meetings, and podcasts), speaker detection, sentiment analysis, chapter detection, PII redaction, and more.
Speakeasy – Production-ready, enterprise-resilient, best-in-class SDKs crafted in minutes. Speakeasy takes care of the entire SDK workflow to save you significant time, delivering SDKs to your customers in minutes with just a few clicks! Create your first SDK for free!
Notes & Links
- Kùzu: A highly scalable, extremely fast, easy-to-use, embeddable, open source graph database: GitHub repo
- The goals and vision of Kùzu: Blog post
- Kùzu YouTube channel
- Graph RAG strategies with Kùzu: GitHub repo
- Prashanth Rao’s blog: thedataquarry.com
Chapters
| Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
|---|---|---|---|
| 1 | 00:00 | Welcome to Practical AI | 00:35 |
| 2 | 00:35 | Sponsor: Fly.io | 03:06 |
| 3 | 03:49 | Welcome back Prashanth | 04:52 |
| 4 | 08:41 | What are graphs? | 01:24 |
| 5 | 10:05 | DB vs graph experience | 03:26 |
| 6 | 13:31 | Data examples | 04:31 |
| 7 | 18:15 | Sponsor: Assembly AI | 02:21 |
| 8 | 20:45 | What is a RAG workflow? | 03:22 |
| 9 | 24:07 | GraphRAG/NaiveRAG | 04:47 |
| 10 | 28:54 | Hallucination risks | 01:59 |
| 11 | 30:54 | Building the data side | 04:33 |
| 12 | 35:35 | Sponsor: Speakeasy | 00:53 |
| 13 | 36:38 | Practical uses | 05:07 |
| 14 | 41:44 | Graph construction | 05:42 |
| 15 | 47:26 | Where is this headed? | 05:00 |
| 16 | 52:27 | Thanks for joining us! | 01:27 |
| 17 | 53:57 | Outro | 01:05 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack, I am CEO and founder at Prediction Guard, and I’m joined, as always, by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
Doing very well today, Daniel. How’s it going?
It’s going extra wonderful today, because if I think back over the past couple of years at the way these shows we produce together have impacted my day-to-day work, one of the ones that has impacted me a lot, and also, I think, been a super-beneficial connection, is the one that we did about vector databases, beyond the hype… And I'm super-happy to have our guest from that show back, who is now an AI engineer at Kùzu - Prashanth Rao. Great to have you back on the show.
Oh, yeah. Hi, Daniel. Hi, Chris. Very good to be back on the show again.
Yeah. I’ve taught a lot of workshops over the past couple of years, I’ve done trainings with our customers, I’ve built things with our customers… And always, when it gets down to vector databases, I’m like “Hey, anything I say - you can maybe learn some things from me, but if you really want to learn about vector databases and get the right sort of intuition, you really need to go to this series of blog posts on the Data Quarry”, which is your blog, and which was the first thing I saw that originally brought you on the show. We talked through vector databases, the trade-offs between different types of vector databases, and all of those things.
Since that time, of course, you’ve moved on to a company, a data company, working in a different type of data, which we’ll talk about today, but also is related to some of those things… So yeah, how has the year been? Lots of updates and fun things it sounds like you’re working on.
Oh, absolutely. I think I can summarize very briefly where I’m coming from since the last time we spoke, which was I guess just about a year ago. As we discussed in our previous episode on vector databases, in 2023 I was working at the Royal Bank of Canada. I spent most of that year thinking about vector search and the differences between the various database options out there. During that time, I was also simultaneously working with graphs and graph databases. And I’ll talk a little bit more about what they are and how they’re useful. But as I was working on them - and I think I mentioned this even in the conversation we had last year - I was very interested in understanding how these two worlds come together. Because knowledge graphs and graphs have been around for a long time in their own space, and there was huge hype around vector search and semantic search around this time last year. So there was a lot of scope, I guess, to understand how these two systems and these two methods could work better together, in a way that lets us build more advanced retrieval systems.
So while I was thinking about those topics, I discovered a new open source embedded graph database, Kùzu, which is where I now work. And I happened to meet the CEO, Semih Salihoglu, at a local meetup in Toronto. And that got me really interested in what this tool and technology is all about, because I’d been using other graph databases. And when I discovered Kùzu, the fact that it’s embedded and open source - which I’ll talk a little bit more about - was very attractive to start with.
So I spent my spare time outside of work, experimenting with it and engaging with the developer team, as well as working with other people in the community in terms of understanding how they’re using graph databases and graph tooling.
So long story short, I realized at the end of last year that this was a great and fun opportunity for me to really go deep into the world of graphs and graph databases. And at the beginning of this year, I moved over and joined Kùzu as an AI engineer. Currently, I’m leading the developer relations efforts and engaging with the growing user community of Kùzu, which is really, really fun in itself. And of course, I’m working with use cases that are taking this whole space forward, and working with the community to understand what they’re using these tools for.
[00:08:04.14] And I should also add, I’m still very plugged into the vector database ecosystem. I’ve not left that behind, and I’m actively experimenting with some of those in my spare time as well. And I still very much enjoy using LanceDB, which is the tool I mentioned last year in our conversation as well.
So again, it’s really exciting to see how these different kinds of tools are coming together… And yeah, I’d love to go deeper into some of these topics in our conversation.
Fantastic. Yeah, the episode that we did last time, with Lance and the other vector databases - that was a great episode. And if anyone isn’t familiar with that, they definitely should go back and listen to that one as well. But just to lay a little bit of foundation here, some groundwork, for those who may not be intimately familiar with graphs and how they’re useful and applicable and such… Could you kind of start us off with just what is a graph, and why should we care about it, and what’s it going to do for us?
So graphs are sometimes also known as knowledge graphs, and they’re a great way to represent structured data via nodes and edges. The nodes are basically entities in the real world, like a person or a company. And the edges are relationships between these entities, like “worked with” or “is CEO of”, and so on.
Now, more formally, I also want to make the point here that the term knowledge graph is actually used to mean something slightly more involved than what I just described. A knowledge graph, in theory, can be used to express data that’s quite hard to tabulate and store as records, and it supports some logical reasoning over graphs as well. This is kind of what you would do when you store Wikipedia data, for example. But that’s outside the scope of what we should go into for the purposes of this discussion. For most people in a developer audience, the terms graph and knowledge graph can be used interchangeably, so that’s how I’ll use them for this discussion.
It’s kind of like distinguishing machine learning and AI, to some degree. The edges - pun intended - don’t matter that much in terms of their actual use.
Good one.
I want to redirect you for one second before we get past the point… You mentioned graphs, and you briefly mentioned them in relation to a relational database… Relational databases have been around so long, and they’re where a lot of people are starting from. You talked about the edges of the graph describing those relationships, and being able to do that in some ways much better than a relational database can… Could you talk for a moment, so people can make the jump from where they’re at into this new tool, and maybe think next time “I’m going to leave whatever database I’m in” - Postgres, you name it - and actually try a graph approach? Could you talk for a second about what the difference is, and bring people along on that path?
So I think that’s getting into the description of what a graph database is. Before I go into the database aspect of things, I just want to point out what a graph is even useful for. Because most people first need to understand what a graph is for, and then think about where to store it.
Great point.
So I think I want to highlight that the benefits of graphs become obvious when you’re looking at data itself that’s highly interconnected. So for example, in medicine you can have relationships between patients and symptoms of diseases, drugs that treat those diseases, and their side effects. And these all have complex, interweaving relationships with one another.
Similarly, in finance you can have chains of direct or indirect transactions, let’s say between onshore and offshore accounts. And these ultimately connect to given individuals anywhere in the world. So you could choose to model these using tables in a relational database, or you could choose to model this as a graph in a graph database. And in either case, there are pros and cons. But specifically when you’re analyzing patterns, and you want to actually understand these complex relationships in an analytical way, using a graph database can actually be very powerful. It’s a lot more intuitive, it’s much easier to construct the queries that can answer these questions… And this is where the idea of a graph and how you model the data as a graph becomes very, very powerful.
[00:12:09.29] But going back to the idea of databases. So a graph database can be thought of as a specialized database that allows you to scalably manage and query data that’s organized as a graph. And the performance aspects of these graph databases come from specialized data structures and operators that allow you to express complex joins, and efficiently traverse paths in your data.
Now, a lot of listeners may have heard of graph databases already. And the most popular graph database model is called the property graph data model, which was invented and popularized by Neo4j. And I’ve been a user of Neo4j myself in the past. Today you have many other graph databases, like Kùzu, that also implement the property graph data model. But in general, you start off with a data model, which is more conceptual, and which describes how to store and query your data. And a graph database is basically the underlying system that implements the graph data model of your choice. And the reason it’s more intuitive than a relational database for certain kinds of queries involving connected, interrelated data is that it allows you to express those queries in a much more concise manner, using the query languages that graph databases offer… as well as benefiting from the performance of traversing paths in the data.
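To make that conciseness claim tangible, here's a minimal sketch using Kùzu's embedded Python API and its Cypher dialect, modeled on the finance example above. The schema, account IDs, countries, and amounts are all invented for illustration, and exact Cypher details may vary across Kùzu versions:

```python
import kuzu

# Embedded, on-disk database; path and schema are illustrative.
db = kuzu.Database("./finance_db")
conn = kuzu.Connection(db)

# Property-graph schema: Account nodes, Transfer edges.
conn.execute("CREATE NODE TABLE Account(id STRING, country STRING, PRIMARY KEY (id))")
conn.execute("CREATE REL TABLE Transfer(FROM Account TO Account, amount DOUBLE)")

# A small chain: onshore account -> intermediary -> offshore account.
conn.execute("CREATE (:Account {id: 'a1', country: 'CA'})")
conn.execute("CREATE (:Account {id: 'a2', country: 'CH'})")
conn.execute("CREATE (:Account {id: 'a3', country: 'KY'})")
conn.execute("""
    MATCH (s:Account {id: 'a1'}), (t:Account {id: 'a2'})
    CREATE (s)-[:Transfer {amount: 9500.0}]->(t)
""")
conn.execute("""
    MATCH (s:Account {id: 'a2'}), (t:Account {id: 'a3'})
    CREATE (s)-[:Transfer {amount: 9400.0}]->(t)
""")

# One clause finds direct OR indirect transfer chains, one to three hops deep.
result = conn.execute("""
    MATCH (src:Account)-[t:Transfer*1..3]->(dst:Account)
    WHERE src.country = 'CA' AND dst.country = 'KY'
    RETURN src.id, dst.id
""")
while result.has_next():
    print(result.get_next())
```

The equivalent SQL would need a recursive CTE or one self-join per hop; the `*1..3` variable-length pattern expresses "direct or indirect, up to three hops" in a single clause.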
And just to sort of home in on this in a concrete way… Are you able to give a few examples? There’s sort of the typical personnel-related things that you already mentioned… Like, you know, Daniel is a node, and he’s CEO of Prediction Guard, and that’s another node, and there’s organizations… Maybe for people that can’t connect how such a structure would be relevant to data that they have in their enterprise, could you give just a couple of other examples of data that could be represented as a graph, maybe in different verticals, or that you’ve run across in your use cases with Kùzu? …sort of data that’s beyond that kind of social network type of data, I guess.
Absolutely. I think social networks get a little excessive credit for being the original graphs that we know about… Of course, they’re very powerful, and of course, graphs are heavily used in social networks. But yeah, I think I mentioned an example of the medicine scenario which I described, which is you have not just individual persons or patients in a network, but you also have the drugs and the different symptoms of the diseases that are being treated by those drugs. Each of these can branch out into much more complex relationships themselves. And a lot of this data that you might imagine in a healthcare scenario or a biomedical scenario - this data pre-exists in structured form in many different sources. You might actually have medical records of people in a relational database, or a data lake somewhere.
A lot of existing workflows that work with these kinds of datasets tend to just stick with the database that is already used as a primary store, and many of those happen to be relational databases… Which is fine, because they’re the most, I guess, efficient and convenient way to store this kind of data… When you have records of people, let’s say what drugs they’ve been taking and what symptoms those drugs may cause, and so on.
So I guess the idea here is that you can actually think of that very same data that exists in a relational database as a graph, and store that in a graph database, and apply your graph query logic in a way that allows you to answer specific questions that might have been quite complicated to answer using SQL in the relational database.
So that’s one example of, let’s say, a healthcare scenario. I’ve already mentioned the financial transaction scenario, where you have a transaction graph between individuals, the merchants they’ve interacted with, the money transfers they’ve made between their accounts, how these accounts are connected… So financial institutions make heavy use of graphs to answer these kinds of questions.
[00:16:05.10] Traffic networks is another common use case. If you’re working with city authorities and you want to understand the flow of traffic and the numbers of people moving between locations, this is something that can actually be well modeled in a graph. And the kinds of questions you can answer can also change based on whether you choose to model this as tables or as a graph. And there are many, many other examples.
I will highlight one thing, though… The example of Wikipedia that I gave earlier - the idea of Wikipedia as a graph - reinforces the idea that knowledge graph is a term to be used, I guess, with a bit of caution. In general, a graph is just a representation of how nodes are connected to other nodes in a network. A knowledge graph is something you can think of as the collection of all knowledge that is available in a given domain.
In the case of Wikipedia, you can imagine that there are certain scenarios where it’s quite hard to tabulate every bit of information. As an example, if you have - I don’t know, the current president of the US is Joe Biden. So based on the current political structure and the different parties that Joe Biden represents, and all the other people represented in those parties, and then the relationship between a party and a state, the relationship between the state and the country - you can imagine that this becomes a very complicated branching sort of structure. And not all of it renders very well in a table, because you can’t imagine one table that connects to another table when you have this kind of complicated data.
So there are certain scenarios where actually the data model and the way you build your graph can actually really have a big impact on the kinds of questions you can ask of your data. So that’s why I tend to use the term “knowledge graph” in a bit more specialized way.
In general, I would say that you can think of your tabular data or records as a graph very conveniently using the property graph model, and model things like transaction networks and social networks and drug interaction networks and so on.
Well, Prashanth, before we get into one thing that I’m really excited to talk about on the show, which I’ve been telling Chris we need to talk about for a while - graph RAG… Before we get there, we’ve talked a bit about graphs. I think it would be useful to give people a reminder of what most people are referring to when they talk about a RAG workflow, an AI RAG workflow.
I was at a CIO dinner last night, and one of the speakers is like “We all hear about the RAG, and the raggedy RAG, ragged RAG RAGs…” And everybody’s hearing about these things, but maybe it’s worth just a quick 30-second, one-minute reminder. So when most people are referring to the RAGs, or the RAG workflow, what are they referring to?
Yeah, this is a very fascinating topic. I think about RAG a lot. So let me take a step back and describe the term itself. The term RAG itself is really interesting to me. As we know, it stands for retrieval-augmented generation. The key thing to note here is that the term RAG emerged prior to the term LLM. It came about as a result of improvements in generative language models toward the end of 2019 and early 2020. And there were two papers that came out in early 2020 - one by Google, and another by Facebook Research - that introduced the idea of retrieval-augmented generation.
Now, note that retrieval itself is not new. We know information retrieval is a field that’s been around for decades. And we’ve also had systems that can do keyword-based information retrieval for decades. So what is new right now is the fact that generative models have become much better than they used to be. So the generation capability is what’s new.
So when you look at the acronym RAG, “retrieval-augmented” comes before “generation”. I hope that makes it clear that the generation part is what we’re stressing as the novel aspect of it.
Now, in 2020 the way RAG was done was to combine a sequence-to-sequence language model, which was the standard way of doing language modeling back then. And those models could generate text quite convincingly. And you combine those models with the retrieval capabilities of dense embeddings. And the Facebook Research paper that introduced the term RAG based their work on dense embeddings of Wikipedia articles.
So you had the initial makings of the vector embedding-based retrieval that we take for granted today. Later in 2020, GPT-3 was released, and you had the coinage of the term LLM, or large language model. You could think of GPT-3 as one of the first large language models that really extended the capabilities of the pre-existing models of that time. But what really made RAG take off, in my opinion, from 2021 and beyond, was the arrival of a whole host of new systems that we call vector databases. They began offering specialized features to make retrieval over these dense embeddings far easier, and also more scalable. This is what led to the explosion of all those vector database companies that we discussed last year. And I hope this gives some context on the term RAG itself, and how it grew into what it has become today.
So I guess we’ve been teasing this for a while on the show here… We’ve talked about graphs, we’ve talked a little bit about RAG… Everyone’s waiting for us to ask you the question about graph RAG, and get into the topic. So without further ado, if you want to dive in and kind of give us an intro to that, we’d love to hear that.
[00:24:25.23] Absolutely. So let’s understand what we were doing with RAG, and then go into graph RAG. The early approach to doing RAG is what we now call naive RAG. In that approach, you just create chunks of your data, embed them using an embedding model, and store them in a vector database. So essentially, you store the chunks and the chunk embeddings in a vector database, and when you do a retrieval, you convert your query into an embedding, using the same embedding model that you used to embed the data. And this returns the chunks that are most similar to the query vector.
So you typically return like the top K, let’s say top 5 or top 10, whatever number you choose. And these top K chunks can then be sent to the LLM as context to synthesize a response in natural language. So in a nutshell, that’s kind of what you could say traditional RAG does.
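As a rough sketch of that naive RAG loop, here's what the indexing and retrieval stages might look like with sentence-transformers for embeddings and LanceDB as the vector store. The chunking is deliberately naive, the model choice is arbitrary, and the final generation call is left out:

```python
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on sentences or sections.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Indexing stage: embed each chunk and store text + embedding together.
docs = ["...your documents here..."]
chunks = [c for d in docs for c in chunk(d)]
db = lancedb.connect("./rag_demo")
table = db.create_table(
    "chunks",
    data=[{"text": c, "vector": model.encode(c).tolist()} for c in chunks],
)

# Retrieval stage: embed the query with the SAME model, take the top k.
query = "Who did Pierre Curie work with?"
top_k = table.search(model.encode(query).tolist()).limit(5).to_list()
context = "\n".join(hit["text"] for hit in top_k)
# `context` is then placed into the LLM prompt for generation.
```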
Now, on paper this naive approach to doing RAG is great. But it quickly became obvious that it has limitations. The first limitation is that the dense embeddings are typically done at the sentence level, while many user queries use keywords. And keyword-based search methods like BM25 can do a fair job at this - they’ve been around for a long time.
So towards the end of last year, you could see a lot of these vector database vendors starting to offer hybrid search methods, and the term hybrid search itself becoming more popular. Here you perform both keyword-based search, which is a form of sparse vector search, and dense vector search, which is search via dense embeddings. And you pass the retrieved chunks from both of these approaches to a re-ranker module. So you had specialized modules that do re-ranking, and give you the most relevant chunks across the two retrievals. And this is how you combine sparse and dense vectors into what we call hybrid search.
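The re-ranking step can be a learned cross-encoder, but a simple way to illustrate the idea is reciprocal rank fusion, which merges the sparse and dense result lists by rank alone. This is one possible fusion strategy, sketched here from scratch; it isn't necessarily what any particular vector database implements:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks ranked
    highly by either the sparse (BM25) or dense retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked chunk IDs from the two retrievers:
bm25_hits = ["c7", "c2", "c9"]
dense_hits = ["c2", "c4", "c7"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits])[:3])  # e.g. ['c2', 'c7', 'c4']
```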
Now, even hybrid search can have its limitations, which is, I guess, where people began exploring further options earlier this year and maybe beyond. Because neither sparse nor dense embeddings can capture explicit relationships between entities very well. And I’ll demonstrate this with an example. In certain cases, you can really benefit by modeling some of these entities explicitly.
So let’s look at an example of a professor, and let’s say the PhD students the professor is advising. Say you had a block of text talking about the students and the professor and a bunch of other things related to their work at the university. In natural language, we understand the relationship between the professor and the student as follows: student X worked with professor Y, because we know that the act of being a student of a professor means that you worked with them. But in the text itself, it may not have been expressed that way. The text may be written as: person X was a student of person Y. Now, if you try to search this using the query “Who did X work with?”, that’s a very intuitive question in natural language. We humans can immediately put two and two together that “worked with” and the student relationship are more or less semantically similar here… So we’re able to piece together this information and know that if a person was a student of someone, they inherently worked with that person. However, if you try to search for this using vector search, the dense embedding may not capture the relationship correctly, if “student of” isn’t close enough to “worked with” in the vector space.
So your vector search alone may not retrieve this answer, because you didn’t model the relationships in that explicit way. However, if you had chosen to model this as a graph, you would explicitly capture this relationship using this concept of a triple, which is “person X worked with person Y.”
[00:28:03.21] So this is where triples come in. A triple is essentially two nodes connected via a relationship. You have a source and a target - person X is the source, person Y is the target - and “worked with” represents the relationship.
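As a sketch of how such a triple might be stored and traversed in a property graph, again using Kùzu's Python API, with X and Y standing in for the two people (all names here are placeholders):

```python
import kuzu

db = kuzu.Database("./triples_db")
conn = kuzu.Connection(db)
conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE WorkedWith(FROM Person TO Person)")

# The triple: (source) -[relationship]-> (target)
conn.execute("CREATE (:Person {name: 'X'})")
conn.execute("CREATE (:Person {name: 'Y'})")
conn.execute("""
    MATCH (a:Person {name: 'X'}), (b:Person {name: 'Y'})
    CREATE (a)-[:WorkedWith]->(b)
""")

# "Who did X work with?" is now an explicit one-hop traversal,
# not a semantic-similarity guess.
res = conn.execute(
    "MATCH (:Person {name: 'X'})-[:WorkedWith]->(p:Person) RETURN p.name"
)
while res.has_next():
    print(res.get_next())
```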
So the very powerful idea here - where graphs come into this whole picture, and why they’re relevant to RAG - is that you can actually provide additional valuable context to an LLM by modeling these relationships explicitly, and simultaneously retrieving both from a dense embedding vector search and from a graph traversal. You then use the retrievals in combination with one another to provide additional context to the generation LLM, so that you can actually include this explicit relationship in your answer. And this has actually been borne out in practice by some recent work.
I want to ask a question that may be a bit of a stretch, but… As you were describing that failure to make those relationships explicit - when that happens, could it lead to hallucination in terms of your output from the model? Because it’s not able to make that connection, and so it’s still trying to provide information, and it comes up with whatever it comes up with. Is that a possibility there?
That’s a very fair point, and you’re absolutely right, hallucination I think is definitely a high possibility, or it’s definitely likely in certain scenarios. You obviously cannot predict when hallucination happens, which is a big issue in general with LLMs. Now, I definitely want to expand on this a little bit. The selling point for Graph RAG – by the way, the process I was just talking about before is essentially what you could loosely define as Graph RAG, where you bring a graph as part of the retrieval process.
Now, the so-called, I guess, benefit of Graph RAG, as it was, you could say, marketed in the last several months by various sources, is the fact that it reduces hallucinations. And it’s important to note that whenever context is provided to an LLM for the purposes of generation, you essentially provide a prompt, and the prompt is what the LLM uses to produce a response. Whenever you provide such a prompt, it’s always possible that at some point you will get a hallucination. It doesn’t matter whether the information came from a graph or from a vector retrieval. The source of the information is irrelevant. The very act of using an LLM to generate text means that there is an inherent chance of hallucination.
So I wouldn’t state that the benefit of Graph RAG is that it eliminates or reduces hallucination. What I would say the benefit is, is that it actually increases the chance of factual accuracy, in the sense that a relationship that was not explicitly captured in the vector embedding is now explicitly captured in the graph. And by providing both these pieces of context to the LLM, you’re essentially increasing the chances of a factually correct or more relevant response.
Yeah. So one of the things I’m thinking about is people might have some thoughts in their mind… I know maybe people have built some sort of naive RAG system, or maybe even implemented some advanced RAG methodologies, but most of the time what they’ve had is a sort of set of documents. Like you say, they split up into chunks, they embed those, they retrieve one or more chunks, maybe even in a hybrid way or in some advanced way… But here you’re saying you’re combining the vector approach and the graph approach. I’m wondering if you could break down very concretely for us the data side. So if I have documents, or if I have internal data, what I would need to have in place and how I would construct the data side to be ready to do Graph RAG.
And then at the time that, let’s say I receive a user query question in my chatbot or whatever that is, what is actually kind of concretely retrieved, and how is that combined, or how could that be combined with the prompt into the model? So just walking us through those kind of very concrete things might be helpful for people.
[00:32:00.25] Sure. That makes a lot of sense. So the two key stages in any RAG application, not just Graph RAG, are an indexing stage and a retrieval (or serving) stage. Getting the data in, storing it, and indexing it is what we call the indexing stage. This is the stage that is upfront, or upstream, where you have data that already exists in different structured or unstructured sources.
Now, you could apply a variety of techniques, including using LLMs themselves for this stage, where you extract the entities - or, you could say, named entities - from the unstructured or structured data that already exists, and store them as nodes in a graph database. And simultaneously, you can also extract relationships from this unstructured text. There are many different methods you could use to extract the relationships.
Now, what I’ve noticed in recent times is that a lot of these recent papers are using LLMs to help with this information extraction step, where you actually use the LLM to extract portions of your chunks, and then from those portions you refine further and say “Okay, from this block of text, tell me what triples - nodes connected to other nodes via a relationship - exist in that block of text.”
So once all these triples are extracted, they’re essentially stored in a graph database. And simultaneously, you can also have a parallel pipeline that stores the vector embeddings in a vector database. In some cases, you can have both the vectors, as well as graph entities sitting in the same database. Certain databases have those features. In other cases, you may not want to have them both in the same database. You may want to leverage the graph database for its strengths, and the vector database for its strengths. And you may also have pre-existing workloads that already have the data in those systems.
So it’s perfectly valid to have your independent sources of data move the respective data into those respective databases. For example, your graph entities would go into your graph database, and the vector embeddings would go into a vector database. Downstream of this, you could do additional post-processing like linking the chunk reference IDs to individual entities in the graph.
Essentially, the node that represents an entity in the graph can have an ID that links it back to the chunk it is a part of, so that when you retrieve a particular chunk, you can actually point to which entities exist in that chunk. So there are a lot of additional upfront steps that people are doing to construct both the graph and the vector store. And once this indexing stage is complete, the retrieval stage can actually begin, which is the stage we’re very familiar with. A user query comes in in natural language, you transform that into an embedding, you run a similarity search on that embedding, which returns the most similar vectors from the vector database… And then simultaneously, you can also use whatever methods you have to translate that query into a graph query, and retrieve the entities and relations from the graph that answer that same question, and then use a re-ranker to combine the retrievals in a way that provides additional context to the LLM for generation.
So again, just to summarize, the two key stages in the RAG pipeline include indexing and serving… And each of these stages has a suite of tools that you can use to help the user achieve the required outcome.
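One hedged sketch of the LLM-based extraction step described above, calling the OpenAI chat completions client directly. The prompt, the model name, and the expectation of clean JSON output are all illustrative choices here - frameworks like LlamaIndex and LangChain wrap this same idea in higher-level APIs:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_triples(chunk: str) -> list[list[str]]:
    """Ask an LLM for [source, relationship, target] triples found in a chunk."""
    prompt = (
        "Extract factual triples from the text below as a JSON list of "
        "[source, relationship, target] lists. Return ONLY the JSON.\n\n" + chunk
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # LLM output is not guaranteed to be valid JSON -- validate in real use,
    # which is exactly the reproducibility concern discussed later.
    return json.loads(resp.choices[0].message.content)

triples = extract_triples("Paul Langevin was a student of Pierre Curie.")
# e.g. [["Paul Langevin", "student of", "Pierre Curie"]] -> load into the graph DB
```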
Break: [00:35:28.27]
So here we are after break, I am still thinking about what you were telling us going into break, and trying to grok it myself… And I’m kind of thinking about how I can use it in a practical sense to help me get it down, and kind of get it from the notional sense into more of a practical thing that I can go do after we stop talking on the podcast. Can you give me like an example, something really hands-on that folks out there might be doing, that really puts it into that “Okay, I get it. Now I’m going to go do it” kind of context?
I actually have a repo that I can share once we’re done with this conversation, and I would love for people to pick this up and experiment with it. And obviously, because I work at a company that builds a graph database, I’m very eager to talk to users who are using these kinds of tools.
So for my experiment, I’ve used Kùzu as a graph database, and LanceDB as a vector database, because each of these, as I mentioned, have their own benefits in their domains. And the practical example that I want to demonstrate is this. So the dataset I’m thinking about, which I can showcase in the repo, is you have a block of text about the scientist Madame Curie. You know, she discovered radium and polonium, and she won two Nobel Prizes… And she was related to Pierre Curie, they were spouses. And she also was related to other scientists in the whole ecosystem.
So I have a text sample that contains Madame Curie’s contributions to science, and the relationships that existed in her life… So this is unstructured text. So the first step is, we can do two things here - we can do the conventional, naive RAG retrieval, where I try to ask a question, “Who did Pierre Curie work with?” And that’s a very simple question to answer, and a vector search will definitely give you an answer if you embed the text and do the required steps.
So what I noticed in this dataset is there is an implicit relationship, and this is why I gave that example earlier about the professor and the students. There is one particular person in this dataset, Paul Langevin, who was a student of Pierre Curie, and who later had a relationship with Marie Curie after Pierre Curie passed away.
So it’s mentioned in the text that Paul Langevin was a student of Pierre Curie. Now, the question asks, “Who did Pierre Curie work with?” We obviously know that Pierre Curie worked with Madame Curie to discover these elements. And we also know implicitly that Pierre Curie worked with Paul Langevin, who was his student. The vector search, if you naively chunk these and store them in a vector database - which I do in LanceDB - gives me one of the answers. It gives me “Marie Curie worked with Pierre Curie.” But the graph search retrieves both, because I explicitly insert the relationship - and I have some code that shows how the information was extracted from the unstructured data using the LlamaIndex framework. So it’s very intuitive and easy to begin experimenting once you install the required packages.
So what I’m getting at here is that in the graph I was able to retrieve both of the answers - Paul Langevin as well as Marie Curie, who both worked with Pierre Curie. And rather than just using the result from the graph, there may be other scenarios where my question is a little more vague or fuzzy, and a vector search might give me a better result. I’m sure if you tinker with the dataset and the questions here, you’ll find such examples.
[00:39:57.17] So what I’ve done is I’ve included a re-ranker downstream of the vector search and the graph search, and when I retrieve the result and pass it as context to the LLM, I’m adding that re-ranker step so that I get the most relevant graph search results, as well as the most relevant vector search results. In this case, the vector search missed one of the entities, but the graph search captured it.
So the combined context from both these retrievals allowed me to get the generator model to actually give me the correct response. And if you want to experiment more with this, I think it’s pretty straightforward to come up with other queries that show the reverse, where the semantic match between the query and the vector search results is closer. You might get a more relevant result from the vector search, while the graph search misses the result because of a mismatch.
So that’s where I’m going with this. There are many ways you can combine vector search and graph traversals to improve retrieval accuracy. So I’d love for the community to think about these individual parts.
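A sketch of that final combination step: assuming you already have chunks back from the vector search and triples back from the graph traversal (both hard-coded below for illustration), the merged context might be assembled into one generation prompt like this:

```python
def build_prompt(question: str,
                 vector_chunks: list[str],
                 graph_triples: list[tuple[str, str, str]]) -> str:
    """Combine both retrievals into a single generation prompt.

    `vector_chunks` would come from the vector search and `graph_triples`
    from the graph traversal, ideally after a re-ranking step.
    """
    facts = "\n".join(f"{s} --{r}--> {t}" for s, r, t in graph_triples)
    passages = "\n---\n".join(vector_chunks)
    return (
        "Answer using ONLY the context below.\n\n"
        f"Known relationships:\n{facts}\n\n"
        f"Relevant passages:\n{passages}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Who did Pierre Curie work with?",
    vector_chunks=["Marie and Pierre Curie discovered radium and polonium..."],
    graph_triples=[("Paul Langevin", "WORKED_WITH", "Pierre Curie")],
)
# `prompt` then goes to the generation LLM.
```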
I think my biggest takeaway from what I’ve seen in the last few months of reading the literature and talking to people who are questioning what Graph RAG is, is that people tend to think of Graph RAG as like a graph-based solution alone… Whereas the more I think about it, I think that the two approaches of using dense vector retrieval and using graphs kind of go hand in hand, for the purposes of RAG.
All this being said, there are a lot of, I guess, conflicting articles that you may see online, and blog posts claiming that this is the way to do Graph RAG. The key takeaway for everyone should be to think of it as a suite of tools and methodologies that come together to enhance retrieval in a way that gets you better generation.
I would love to maybe double-click on one of the things that you mentioned, which I had in my mind, and I think other people might have in their mind… Mainly because - I don’t know how many years ago it was; five or six years ago I was doing some graph-related data work… I know I looked at a lot of things at that time. There’s even a whole area of research called automatic or automated knowledge graph creation. There’s this idea that once you have a graph, so much is opened up to you in terms of querying, and the rich structure, and all of that… But sometimes it can be daunting to construct the graph in the first place. You mentioned this nice tooling around LlamaIndex - and of course, we had Jerry on the show in the past, and it’s such an amazing project that’s helping many people… But I’m wondering if I could double-click on that point… What has been your experience, current state, in terms of how much work it is and how hard it is to do that graph construction piece? Because I think one of the things people love about RAG is you just put a bunch of documents in, and hands off, you construct these chunks, and “Oh, cool”, you get some nifty things out of it.
Of course, there’s the element that you highlighted, which is sometimes it’s hard for people to get that last bit of performance that isn’t captured by naive RAG… But yeah, on that graph data construction piece, how does that look right now, and what have you found to be useful?
You hit the nail on the head. I think the biggest issue that the people I speak to are having, in relation to both Graph RAG and using graphs in general, is how to create the graph from existing data that you have in other forms. I was going to touch on exactly this as my next point, which is that Graph RAG is by no means perfect, and the most significant challenge is indeed around graph construction.
Now, there’s two things regarding graph construction that we can delve into. The first is that the quality of the graph is absolutely paramount, because as we know, in any RAG system the quality of your retrieval greatly impacts the quality of the generation downstream. A poor retrieval with garbage results is going to result in a garbage output from the generation model.
[00:44:03.27] So first of all, we have to stress the fact that to get the most out of Graph RAG, we need a high-quality graph. But then you go one step further back, which is: how do you even get a graph from unstructured text? So as you mentioned, LlamaIndex - and I think LangChain as well - some of these frameworks offer valuable tooling to help you extract triples, or entities and relationships, from unstructured text, and they do this primarily through the use of LLMs. But as we know, LLMs themselves have issues with hallucination, or just issues in general with reproducibility.
You are not guaranteed to get the same results if you run the same LLM multiple times, and although some APIs provide seeds that let you control reproducibility, that still doesn’t make the output deterministic. The output of an LLM is still more or less unpredictable.
So there’s a lot of other parallel work going on that doesn’t use LLMs to extract triples from an unstructured text source. And I can talk about a few of those that I’ve found really exciting. In fact, that’s the kind of stuff I’ve been exploring in this space, and I’d love to chat with people who’ve been doing this as well. So there are custom machine learning models coming out. One of them is called REBEL - that’s R-E-B-E-L. It’s been around for a while, but I think they’re upgrading their internals now for a newer version. That’s a model trained explicitly for the purpose of extracting triples for a graph. And there’s another model called ReLiK - R-E-L-I-K. That’s also an open source model that was released recently, and people have been using it to extract relationships from unstructured text.
Now, that’s not to say that these are mature models. They still require a little bit of coaxing, and people need to understand their outputs and what they’re getting from them… But the idea here is that these models are, I guess, more controllable in what you can get out of them. It’s not like an LLM, where you just don’t know what you’re going to get. And of course, because they’re small models, you can actually use them at scale on very large data.
So that’s one angle of things. I’m sure a lot of users have heard of the NLP library spaCy. I’ve been using spaCy for many years myself, and it’s an amazing library. On its own, spaCy doesn’t have the tools to extract relationships from text - of course, you can extract named entities, via named entity recognition, or NER. But recently, there have been some add-on modules or libraries that plug into spaCy. Two of them, which I’ll name here, are called GLiNER and GLiREL. And as their names suggest, one of them is for extracting named entities from text, and the other one is for extracting explicit relationships from text. But they plug into the underlying spaCy tokenizer and the underlying spaCy representation of the data. So I feel like it’s a lot more usable, and there are some experimental notebooks that I’ve been working on, and I’m going to be experimenting more on this, to see how they compare to using LLMs alone to extract data.
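A sketch of what that spaCy plug-in style looks like for the entity side, assuming the gliner-spacy package (`pip install gliner-spacy`); the "gliner_spacy" component name and the "labels" config key follow that project's documented usage as best I recall, so treat the details as assumptions rather than a guaranteed API:

```python
import spacy

# Zero-shot NER via GLiNER, plugged into a spaCy pipeline.
# The label set is free-form -- no retraining needed for new labels.
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config={"labels": ["person", "chemical element"]})

doc = nlp("Marie Curie discovered radium and polonium with Pierre Curie.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected output along the lines of:
#   Marie Curie -> person
#   radium -> chemical element
#   polonium -> chemical element
#   Pierre Curie -> person
```

GLiREL follows the same plug-in pattern for relationships, which is what ultimately yields the triples to load into a graph database.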
So yeah, I think this is a very active space, and there is no right answer in terms of how you can use these tools… But I feel like there’s enough options and methods out there that people don’t have to feel that I’m relying on completely black box, unreliable LLM to do things.
What you’ve covered so far is really fascinating to me, and I’m learning a lot in this conversation. As we’re kind of winding up on this, and we’ve been covering Graph RAG, as well as the components that make that up, and kind of talking about how to combine vector search, graph search… And it feels like a brave new world that we’re leaping into, even for the AI topic, which is that way anyway. Where’s this going? Where do you envision this going over various timelines in the future? Things are happening at light speed, so you just go a few months and there’s a lot of change… But if you could maybe pick a couple of points out and tell us kind of what you think might happen… If you’re wrong, no big deal, but I’d love to see what your imagination has in store for us there.
[00:48:10.23] Yeah, I’d love to see this a year from now, because I know half the stuff I’ll say will be out of date or wrong…
Of course, but that’s fine. That’s this [unintelligible 00:48:16.20]
The stuff I said last year - I think I was still spot-on, in the sense that I did talk about graphs and vectors last year too, and it’s still relevant today. So yeah, I’ll take my chances this time around. [laughs] So I think - yeah, you’re absolutely right. Things change way too fast. And my personal take here is that graph RAG is at a point in time, and we don’t even know in the next three-to-five-year timeframe whether RAG is going to be as hot as it is today. I don’t think LLMs are going away. Let’s face it… I mean, they’re here to stay. And they’re going to continue to evolve. We’ve just seen with the o1 model that came out from OpenAI that they’re able to do reasoning… They’re able to do so much more than what people give them credit for.
So as this capability keeps evolving, and LLMs keep morphing into whatever else they become, there’s no guarantee that the kinds of tasks being done today using custom models, machine learning models, won’t be done by an LLM. Now, some of the wild takes that I’ve seen involve the fact that you don’t need to index anything in a database. You could potentially have your LLM’s parameters store the index in a very fuzzy way, and you could retrieve from that index, assuming that research goes in a certain way. But obviously, that’s a long way away. I don’t see that happening in the next few years.
What I’m particularly excited about in the next few years is - everyone’s been talking about agentic systems. And if you look at the pivots that all these framework companies have had in the last year - LangChain and LlamaIndex - they’re heavily trying to push this field forward in terms of how agents can help build, I guess, more capable systems. Now, RAG is just one small subset of the things agents can orchestrate. They can actually do many other things related to recommendation, search and retrieval, and a host of other activities.
So the conventional or standard way of doing agents has been using this framework called ReAct, which is - you could say it’s sort of a graph-based framework, but it’s sort of static, where you decide the behavior of the system by programming in these paths, and the agent acts on them, and then you send it back after the reasoning step is over, and so on. But where I think graphs are going to be very interestingly used in the future - and I’ve seen some examples of frameworks that are already doing this - is: can you use an underlying graph representation of action spaces to guide your agents? That is, can the agent actually actively update the action space via a graph structure? Once that happens, you can essentially have more powerful agents that are not constrained by a rigid, static framework. You have a dynamically updating agentic system, but it’s not as open-ended as the recursive agent calling that we have today. You still have a framework; the graph acts as a base structure for the agent to take its actions. So that’s definitely one aspect of where agents are going.
But that being said, the field of graphs - or, you could say, knowledge graphs - and their role in symbolic systems has been around for so long. And in the future, there could potentially be symbolic systems that utilize graphs in conjunction with these statistical models based on LLMs - a hybrid symbolic-statistical system that combines these tools together. That may not technically be called an agent, but it’s still a hybrid sort of AI system, in the sense that you might leverage the power of an LLM for its language capabilities, but then use symbolic systems to do the other tasks.
All this is to say that I think graphs are way broader of an entity than what we are seeing today in terms of how they’re used. There’s a lot of other use cases that we couldn’t go into in this discussion. And Graph RAG obviously is a very, very small part of it.
So I would say I’m currently excited about Graph RAG, but I don’t see this being a topic that is going to be as talked about in the long term… But graphs as an entity and graph databases - these are systems that are going to actually be a core component of many systems of the future.
Yeah, it’s a good perspective to have, because this is such a broad topic… And of course, we’ll look forward to you coming back on the show to give us more of a deep dive, and/or to reading your blog posts. I highly recommend people check out the links to Prashanth’s blog posts… And we’ll also make sure to include in the show notes links to the things we’ve discussed here.
But yeah, thank you so much for joining again. This has been a real pleasure, and I feel like I’ve got a lot of things now that I want to go check out and try hands-on and learn more. So thanks for bringing out my curiosity on the topic.
Yeah, no, thanks a lot, Daniel. Thanks a lot, Chris. And I just want to end by saying Kùzu is an embedded database, so you can just pip-install Kùzu, and it’s super-easy to get started. And you can find me on Twitter and LinkedIn - I’m sure Daniel will share the links - and we can always chat more with anyone who’s interested about graphs.
Yeah, definitely check it out, and we’ll also link some of the examples that you mentioned as well, so people can get hands-on and try some things. I love this element of these embedded tools out there, that both let you try things locally, and then even have a pathway to getting those things into production.
Also, thank you for building great tools. That’s a great contribution, so… Thanks, and we’ll talk to you soon.
Thank you very much. Yeah, it was a pleasure talking to you both.
Our transcripts are open source on GitHub. Improvements are welcome. 💚