There’s so much talk (and hype) these days about vector databases. We thought it would be timely and practical to have someone on the show who has been hands-on with the various options and has actually tried to build applications leveraging vector search. Prashanth Rao is a real practitioner who has spent a huge amount of time exploring the expanding set of vector database offerings. After introducing vector databases and giving us a mental model of how they fit in with other datastores, Prashanth digs into the trade-offs related to indices, hosting options, embedding vs. query optimization, and more.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Typesense – Lightning-fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!
| Chapter Number | Chapter Start Time | Chapter Title |
|---|---|---|
| 1 | 00:00 | Welcome to Practical AI |
| 3 | 01:40 | What are vector DBs |
| 4 | 03:21 | Semantics and vectors |
| 5 | 04:20 | How vectors are utilized |
| 6 | 09:53 | Evolution of NoSQL |
| 7 | 15:52 | Sponsor: Changelog News |
| 8 | 17:53 | How do we use them |
| 9 | 20:37 | Use cases for vector DBs |
| 13 | 33:04 | Speedrunning the blog points |
| 14 | 34:48 | In-memory vs on-disk |
| 16 | 44:21 | Exciting things in the space |
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the founder of Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing well today, Daniel. How are you?
I am doing so good, because a lot of my dreams are coming true in terms of topics to talk about. I’ve been wanting to talk about vector databases on the show for quite some time. I know that we’ve mentioned them, but we haven’t had a full episode on them… And I was scrolling through LinkedIn and saw a set of amazing posts, and very practical posts about vector databases that I quickly shared, and also sent a message to Prashanth Rao, who is a senior AI and data engineer at the Royal Bank of Canada. Welcome, Prashanth.
Hi. Thanks so much for having me.
Yeah. Well, now you have a three-part blog series on vector databases: “What makes each one different?”, “Understanding their internals”, and “Not all indices are created equal”. I hope we can get into a bunch of that. But maybe to start out, could you just let us know what a vector database is? And in particular, why are people talking about them now?
For sure. So I think the way I want to answer this question is I’d like to break it down into parts, and answer each bit sequentially.
So to answer what a vector database is, let’s start with what data is in the first place. The definition in my head is data is an organized collection of structured or semi-structured information, and it’s stored digitally in a computer. Now, when you have data, you need somewhere to put it. So that brings us to the question, “What is a database?” So a database is the system that’s built for easy access, management and updating, and also querying the data at hand.
We also need to talk about what vectors are. Vectors - you could call them a sort of compressed data representation that contains semantic information about any underlying entity. It could be text, images, audio, anything like that. So now we’ve put all of these things together. What is a vector database? A vector database is a purpose-built database that efficiently manages, stores and updates vectors at scale. I think the scalability is a very key factor there, and it also retrieves the most similar vectors to a given query, in a way that considers the semantics of the query. So I think all of these terms holistically come together to form what we understand as a vector database.
And when you say “semantics”, what do you mean in terms of semantics and how that maps onto a vector?
So I’m sure everyone is mostly [unintelligible 00:03:29.13] familiar with the concept of language models; LLMs are everywhere. So the thing about semantics is this: typically, if you type a query into a search bar on Google or something, you’re thinking in terms of keywords. You’re just thinking, “Okay, I want this particular thing, this item”, whatever you’re thinking about, and you type the word in there. Where semantics comes in is: did you type in something along the lines of what the data itself contains? Can that query actually translate into something that the database understands, and produce the result that is most meaningful to the query that you put in? So it’s not just about the words or the features of that word, it’s also about the meaning of that word, and how that comes together in the underlying internal [unintelligible 00:04:15.10]
Cool. So before we keep going, just because – you know, you have developers, and data scientists, and they’ve worked with kind of all the other database types that most of us have worked with for decades… And we have multiple times over the years had to kind of understand the new thing that’s out, and what the value is…
There you go. And so I’m gonna jump on that - you know, we started with the SQL query language for relational databases, and then we went to NoSQL, which has its variants, and things called object databases… I understood your definition of a vector, but I didn’t understand how it related to the utility, or lack thereof, of some of those other approaches. Could you kind of lay the groundwork, or the landscape, of what that is?
For sure. So yeah, I’m a total database junkie. I love thinking about the various kinds of databases out there. So actually, before we go into that, a quick summary in terms of where I’m coming from.
So I started off as a data scientist. So I’m fully in your world, Daniel. And it’s been a few years down that road for me, and I think I’ve hit that point where I’ve been lost in the world of models and hyperparameter tuning [unintelligible 00:05:30.03] But the more I began thinking about it - there are people who have entire Ph.D.s in database theory and implementation. And the more I’ve worked with data, the more I realized that you don’t need a Ph.D. to understand enough to build a working application on top of a database. That’s when I began thinking about what exactly these different flavors out there are. Of course, we’ve all come across SQL databases at some point in our careers if we’ve worked in tech.
[05:59] So to answer that question, I think the general history of how these things panned out is quite interesting. I believe the origins of SQL databases go way back to the ‘70s, when the field called relational algebra was formalized. It’s a formalization of the mathematics around what it means to join data, query data, and store data in a database in a way that is queryable. I think SQL databases are so mature, so tried and tested, and the reason they’ve withstood the test of time is because they view the underlying storage, or the underlying data, as structured. And in many cases, you have structured data that is in the form of transactions. What a transaction basically means is some event happens in the real world, and you log that information. You essentially build up a sequential chain of data, which is basically a table, and that’s where the relational data model came from.
And where relational models get interesting is you have tables that are related to other tables. And that kind of maps into real world complexities, where not all data is independent. Some of the data depends on other things. A person’s metadata could depend on what company they work at, and things like that. So that’s how relational data kind of became the norm. People were gathering data from digital systems, and then putting them together… And SQL became the sort of standardized query language that you could use to query data.
Fast-forward to the mid-2000s, and the NoSQL movement starts to pick up. And where that comes from is - there’s a point beyond which relational data modeling can become a bit inflexible, a bit rigid… Because in the real world, you have data that comes in from various sources. Now, some of that data can come in very rapidly. With the advent of big data, and streaming, and all these rapid ways of gathering data that we have today, it became very obvious that the schema-based approach had limits - a schema basically defines what kinds of data types exist in your table. So the way relational models were built was you needed to define a schema, and the schema was the ground truth; the data has this data type only, and that’s what you expect in there, all the time.
I think the NoSQL movement was sort of built on top of the limitations of the relational approach, of being pre-decided by a schema… Because to be truly flexible in terms of the massive amounts of data coming in from various systems, you need to have a schemaless approach at times. And a schemaless approach basically means you store documents - you dump data in semi-structured JSON blobs, and things like that - in a scalable way. And I think horizontal scalability became very, very important in that period.
The earlier databases were relational. I think they were more vertically scalable, in the sense that you could just add more and more compute, and you essentially scaled up your data that way. But now, with NoSQL, the idea of distributing the data as documents across multiple machines and having those machines communicate with one another became a new paradigm. But I think the challenge with NoSQL is that because of the underlying nature of the data - it need not necessarily be dependent on itself, in the sense of relational tables - they didn’t adhere to the SQL language standard, and they kind of diverged. MongoDB was among the first, and there were many others that came after it, using JSON-based query languages.
So there was a big bifurcation, I guess, in, you could say, the database community, where on one hand you have SQL enthusiasts, who swear by the declarative nature of SQL, and then you have the other community, NoSQL, who use JSON, essentially, to query the database. They claimed it was developer-friendly - JSON is a developer-friendly interface, language-agnostic and so on… So in some ways, it does have its benefits. But then, depending on your use case and depending on what you’re trying to do, there are people who will argue on both sides - that SQL should be the only thing you should use, or NoSQL should be the only thing you should use, and so on. So does that clarify aspects of both those camps before we move into the modern ones?
It does. And then if you could distinguish as you go kind of how a vector is different from those others; that would be helpful for me, and I know maybe some other folks in the audience.
[10:00] Yeah. And I think maybe one thing that I loved about your blog posts is I see some of the players from the world that we just talked about represented within that landscape. And then also some that I’m not familiar with, or at least that I’ve seen only recently. And so you’ve got these different axes - like Postgres, a relational database with a SQL-based query language, has some part to play in this vector database ecosystem… But then others seem to have their own query language. So maybe you could also start to break down for us - we want to store vectors in databases now to do these sorts of semantic queries. Does that need to be stored in one or the other of these types of databases that you’ve talked about, developed over time? Or how has that happened, and what are the major categories of players in the vector database space?
Absolutely. So I think, before we get into the specifics of databases, to answer Chris’s point, we definitely do need to talk about the evolution, right? I see vector databases as a natural evolution of the NoSQL class of databases. If you imagine a Venn diagram, you have a circle that represents SQL, and the other circle represents NoSQL; you have an intersection. That intersection point - I believe they’re called NewSQL now; I’m not sure if you’ve come across that term. It’s quite interesting. But NewSQL databases technically use SQL-like languages, but they also claim horizontal scalability, and a bunch of other things related to ACID compliance and so on. So it marries the benefits of both SQL and NoSQL paradigms. I was thinking initially, “Where do I place vector databases? Does it go in that intersection, or does it sit purely in the NoSQL camp?” Then I imagined it as if you extend that circle that has NoSQL, it becomes like a blob - a fuzzy, amorphous blob. NoSQL is huge, and in my head, vector databases are an extension to NoSQL. And why they came about - to understand what vectors are, and how they’re stored in a database, I think it’s important to understand what search is, and what essentially you’re doing when you query a NoSQL database.
So where it comes from is, in the early days, I guess people were just submitting an exact query, using a JSON sort of query language, like MongoDB has… And that query has to have all the terms or parameters in there that tell you what you want to fetch from the database. In the SQL world, it would be done with a declarative query in SQL, whereas in NoSQL, you typically do it in JSON.
Over time, I think the idea of full text search became very important, because everyone wants to be able to retrieve information from massive blobs of data sitting around. And how do you query that, right? If it’s in a NoSQL sort of format and you can’t write a SQL query to retrieve it, how do you get that information? So the idea of a full text index came about. And what that essentially is - it uses the concept of inverted indexes - inverted file indexes, sorry - where you consider the frequencies of terms that appear in a certain document, and obviously, the relative frequency of how often those terms exist in a document versus the entire dataset.
So you combine all those things together, similar to how [unintelligible 00:13:13.21] is in data science; there’s an algorithm called BM25, which is the most popular inverted file index algorithm. It’s the most commonly used one for full text search. So the early days of search involved: how do you scale that up, because you have massive amounts of data? How do you build that index very, very efficiently? And then the query interface sits on top of that, so you essentially submit a query with the keyword you put in, and the inverted file index - the BM25 algorithm - considers the word’s frequency, it considers subword features, and a bunch of other things to intelligently retrieve relevant documents that contain that term… while also throwing out useless words, stop words, and things like that. So it was more of a bag-of-words sort of approach… Considering an NLP analogy, it’s kind of like a bag-of-words way of approaching text.
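To make the BM25 idea concrete, here is a minimal, self-contained sketch of the scoring formula described above (a simplified illustration, not any particular database's implementation; the toy corpus and the common parameter defaults k1=1.5, b=0.75 are illustrative assumptions):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency n(t): how many documents contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)  # term frequencies within this document
        score = 0.0
        for t in query_terms:
            # Rarer terms across the corpus get a higher IDF weight.
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency is saturated by k1 and normalized by document length via b.
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

corpus = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "quick foxes outpace lazy dogs".split(),
]
print(bm25_scores("quick fox".split(), corpus))
```

Note that the third document scores above zero only via the exact term "quick" - "foxes" gets no credit for "fox". That lexical brittleness is exactly the gap the semantic, vector-based approach discussed next is meant to close.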
[14:05] Now, fast-forward a few years, I think ever since the transformer revolution happened, people began observing the obvious power of transformers in encoding semantics. A transformer is way better at isolating meaningful terms in a document, especially when you’re doing things like classification, retrieval, and so on. So how can you merge those benefits of a transformer with what you have in a database?
So I think vector databases - the term got coined, I think, much later, after transformers came about. They were mostly called search engines before that - a more generic term, I think a catch-all term for anything that involves search. But nowadays, I believe “search engine” refers to something more specific - you consider semantics as a key component. And essentially, vectors are the only thing that can do that.
So to really describe what a vector is - essentially, you have a language model, typically a transformer-based language model, that you use to embed the representation of a sentence, and that representation is stored as a vector. The vectors for particular sentences are typically produced using sentence transformers, which is the most common kind of model used. That essentially embeds the entire semantics of the sentence in the vector. And then the way this scales up is you consider the context of each and every token in that vector, in a way that when you submit a query, the semantics of the query are mapped to the vectors in your database, and you can find a similarity between what you entered as a query and what exists in the data. So a vector is a very powerful way of, you could say, compressing the representation of meaning in a sentence or a document, in a way that scales up numerically, and you can rapidly query that in [unintelligible 00:15:49.24]
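Here is a toy sketch of the similarity lookup just described, over precomputed embeddings. Everything is illustrative: the four-dimensional vectors and document names are made up (a real model such as a sentence transformer produces hundreds of dimensions), but the mechanics - compare the query vector against every stored vector and rank by cosine similarity - are the core of what a vector database does:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; similar topics get geometrically close vectors.
docs = {
    "canine companions": [0.9, 0.8, 0.1, 0.0],
    "feline friends":    [0.8, 0.9, 0.2, 0.1],
    "stock market news": [0.0, 0.1, 0.9, 0.8],
}
query = [0.85, 0.85, 0.1, 0.05]  # e.g. an embedding of the query "pets"

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # pet-related documents rank above the finance one, with no shared keywords
```

The point is that "pets" matches "canine companions" by meaning, not by any overlapping term - something a BM25 keyword index cannot do.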
So Prashanth, you kind of alluded to this, and I think that explanation was amazing, of how this vector-based semantic search really exploded around the time that transformers and large language models did. I think even in this past, let’s say year, there’s been this huge explosion of interest in vector databases. Could you maybe describe a little bit – so we know that you can search a vector database to find similar statements, let’s say, or similar chunks of text, where the similarity is based on semantics… How are people using them with regards to their AI workflows, and how does that kind of correspond to what’s sort of popular right now in terms of what people are exploring with AI?
So I think I need to highlight the fact that I’m both fascinated and frustrated by the current state of marketing in vector databases, both at the same time. I’m genuinely interested in the use cases, don’t get me wrong, when combined with LLMs, large language models like ChatGPT. You could say any sort of language model layered on top of a vector database can be used to build some very, very interesting applications.
One of those interesting applications is querying your data via natural language. I think this has always been a dream of data scientists and people who work with data, right? Rather than writing my query by hand, or constructing the query painstakingly from the ground up, can I just talk in natural language and have the database respond to that query in natural language as well? Such an application would be built with an LLM at the core, and essentially that would power the whole translation of human instruction to machine instruction and back to human.
I could go into the details of specific applications, but one thing I do want to maybe throw back at you is - I know this is the Practical AI podcast, so I guess what I was hoping to get into is… I have an idea for a fourth blog post in the series, basically. Part of it is the trade-offs. What really interests me about the various vector databases out there, and why I began writing about these at length, is when it comes to understanding what tool to use in the real world - when you have a business problem, when you have a particular case you’re trying to address - obviously there’s tons of information out there; you could go out and read a bunch of blogs and papers and come up with your trade-offs. But I think it makes sense to actually walk through some of these trade-offs. And my understanding is that as you go through these trade-offs, you actually begin formulating the value of these things much more clearly. And in my head, it makes sense to talk about the use cases once we go through some of these key trade-offs. Because in many ways, using a tool depends on what goes into it and what you thought about the different options.
You can dive right in, because yeah, I had follow-ups, which were essentially what I think you’re about to cover anyway… So I’ll just leave the mic with you, man.
For sure, yeah. So basically, it makes a lot of sense to write about this, and obviously you can read it in your own time… But this is a great place for me to begin talking about it, and eventually I’ll put these down in words as well. So I’ve broken these down into, I think, roughly eight categories of trade-offs; I’m specifically speaking about what you need to think about when you’re choosing a database. And this will answer exactly what you talked about earlier, Daniel - the first thing Daniel mentioned is the idea of deciding between existing databases that have been around - document databases and things like that - versus newly designed databases built specifically for vectors. So I’m going to call it purpose-built vendors versus incumbent or existing vendors.
I think it’s very important to understand, in many cases you might just be looking to add semantic search capability, or just retrieving information using semantics on top of an existing application. And that existing application could very well be built on a well-known, tried and tested solution like Elasticsearch, Postgres, and so on. There’s many solutions out there. And obviously, in those cases it makes sense to just say “Hey, why can’t I just leverage the vector index or the vector storage of that database itself?”
[21:56] For example, you mentioned Postgres. One real big concern with this is if you look at some of the material online on the performance of these systems - pgvector is basically the vector plugin, an add-on to Postgres. And there’s been enough documentation about this, but essentially, the way it’s been slapped onto Postgres is as an add-on. It’s built by a third party, Supabase, and it adds vector functionality to the existing engine that Postgres has.
So by its very nature, because it’s not tightly integrated with the underlying internals of the database itself - the storage layer, the indexing and all of that - you’re going to miss out on a lot of optimizations. Not you, but the technology is basically not optimized from the ground up for speed of indexing, performance during querying, and so on. And this has been well documented. So that is a very big concern. Depending on your use case, and how much accuracy and what quality of results you want, are you better off using an existing database that you already have in your stack, or actually bringing on a new, tried and tested, purpose-built database for that very reason? And from my experience - I’ve been tinkering around with quite a few options out there from purpose-built vendors - in my opinion, they’re always a better solution in terms of scalability, efficiency, and also accessing the latest technology: what indexing algorithms are out there, how you get the best bang for your buck in terms of speed of indexing, the quality of query results, the latency of those results, and so on.
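For readers who haven't seen it, here is roughly what the pgvector route looks like, shown as Python string constants rather than executed (no Postgres connection here). The table name, column names, and the 384-dimension size are illustrative assumptions; `<=>` is pgvector's cosine-distance operator:

```python
# Sketch of the pgvector workflow. Assumes the pgvector extension is
# installed in your Postgres instance; you'd run these via a driver
# such as psycopg.
setup_sql = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    body text,
    embedding vector(384)  -- dimension must match your embedding model
);
"""

# '<=>' is cosine distance in pgvector: smaller means more similar,
# so ORDER BY ascending returns the closest matches first.
query_sql = """
SELECT body
FROM documents
ORDER BY embedding <=> %(query_embedding)s
LIMIT 5;
"""

print(setup_sql)
print(query_sql)
```

This convenience - vector search inside a database you already run - is exactly the appeal; the speaker's point is that the indexing and storage layers underneath were not designed around it.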
So I feel like in the long term, if you actually are serious about building vector search, or a large-scale information retrieval system that considers semantics, it makes far more sense to think about a purpose-built solution. Many, many database solutions are out there; I’ve listed some of those on my blog. And I think those are going to win out over the incumbent vendors who have kind of bolted on vector offerings, if you can call them that.
What we’re talking about is exactly what I had hoped we would talk about in this episode, because your blog posts were so practical. In terms of how you think about the infrastructure that you work with day to day, would you recommend - because sometimes you don’t know how much you need to optimize at the beginning, and you can over-optimize… So would it maybe be a valid stepping stone to say, “If I’m already working with Postgres, I could try out the vector capability of that, and if it works for my use case, and I don’t have 3 million documents that I’m searching over, maybe it’s fine. I’m just doing my personal blog”, or something. And then kind of optimize as you hit a wall? Or is there danger in trying to put a square peg in a round hole, sort of thing, and getting yourself in trouble?
You hit the nail on the head. It is exactly putting a square peg in a round hole, because I faced those issues myself. I won’t name exact database vendors, but I’ve worked with SQL and NoSQL databases which obviously have vector solutions. I think the challenge and the issue with saying “Okay, I already have something that works” is you’ve got to remember that every single database that has existed for, I think, more than 10 years comes with baggage, and they have their own tech debt associated with the underlying programming language they’re built on; there’s years of decision-making and architectural decisions under the hood that they’ve taken to implement solutions the way they have. So they can’t just throw all of that away and then build a vector solution that is optimized from day one. It’s gonna take a fair amount of time before these incumbent vendors are able to optimize their offerings to a point where they perform as well as purpose-built vendors, because these purpose-built vendors have spent thousands of man-hours, I guess, per offering, just tuning and building for a very specific goal.
So what I’ve noticed in my experiments is that a lot of features that you take for granted in a purpose-built offering are not even available in the existing solution. Pgvector is a very, very young solution right now. Elasticsearch’s vector offering - I’ve worked with that as well.
[26:00] Considering Elasticsearch has been around for so long, they only released their first vector search - ANN algorithm - I think last year, in 2022. So in terms of a database’s capabilities, that’s very, very young. So I would say there’s a lot of things that you could potentially be missing or lacking. And I’ll cover some of those in the other trade-offs that I list as we go forward.
Yeah, yeah. Let’s go on to those. I’m curious what number two is.
For sure. Number two - I came across this in my first blog, reading some of the comments on there. One of them brought up the trade-off between using a database that allows you to build your own embedding pipeline, versus using a built-in, hosted sort of embedding pipeline. And by that, I mean: how do you generate these embeddings, these vectors? Many people are familiar with sentence transformers; they’re available on Hugging Face and a bunch of other open source platforms… So it’s quite easy - you could say trivial - to put your data into these pipelines and generate sentence embeddings that you can ingest into a database alongside your actual data. So you have your document data, with all the fields and attributes in there, alongside the vectors that encode the useful information you want to query on. So that’s a relatively trivial thing to do. But there are certain database vendors who offer convenience features on top of that, where they embed the API of these models inside their own offering. So if you’re just getting started, and you don’t know much about how vectors work, or how LLMs work, or any of these things, that might be something to consider. You might be better off using something like [unintelligible 00:27:32.10] which has pipelines built in, where you can just tell it, “Okay, connect to Hugging Face so-and-so model”, and it will build the embeddings for you… As opposed to writing your own custom transformer pipeline that actually takes in the data, generates the vectors, and so on.
Now, if you have experience with transformer models, you might be far better off doing all of the embedding work upstream - parallelizing and optimizing that portion, generating those at scale, and optimizing from a cost perspective; getting those done with the least resources and the most quality that you can - and then just sending the vectors over to your database. So this is an important thing to consider, depending on the level of experience that the developers on your team have, to actually bring the vectors in.
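A minimal sketch of what "doing the embedding work upstream" might look like. Here `embed_batch` is a stand-in for a real model call (for instance, a sentence-transformers encode step); the dummy vectors it returns, the batch size, and the worker count are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(texts):
    """Stand-in for a real embedding model call. Here it just returns a
    fixed-size dummy vector per text so the pipeline shape is visible."""
    return [[float(len(t))] * 4 for t in texts]

def embed_corpus(texts, batch_size=32, workers=4):
    """Embed documents in parallel batches upstream, before any database ingest."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    vectors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so vectors line up with the input texts.
        for batch_vecs in pool.map(embed_batch, batches):
            vectors.extend(batch_vecs)
    return vectors

docs = [f"document {i}" for i in range(100)]
vecs = embed_corpus(docs, batch_size=32)
print(len(vecs))  # one vector per document, ready to ship to the database
```

Owning this stage lets you tune batch sizes, hardware, and cost independently of the database; the built-in hosted pipelines trade that control for convenience.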
Gotcha. That makes perfect sense. What are some of the other trade-offs?
So then the other thing is the two key stages, right? You could break down – when you use a vector database as a developer, the first stage is the input, which is essentially building the index. I go into the indexing methods in a bit more detail. That’s not really a trade-off, it’s more about knowing what the indexing even does under the hood. But what indexing means is you have data that you need to encode into a vector. Now, it’s not as simple as just dumping a vector, which is like an array of numbers onto your database. You have to be able to search through those vectors.
So the goal of indexing is to design efficient data structures, and store the vectors using those efficient index data structures in a way that they can be queried efficiently, and at scale. So that is an upstream process, and you do that once upfront; you bring all your data in, it’s indexed, and now you have a bunch of vectors in there that are searchable.
The downstream portion of that is querying. It’s basically like inference in NLP. The query stage involves taking the user input and transforming that into a vector, just like you did with your raw data - using the same embedding model that you used to transform your data, so that they’re compatible. So that’s a downstream step. You’re clearly separating the indexing step from the query step.
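The separation of those two stages can be sketched with a toy brute-force index - a stand-in for the real ANN data structures (HNSW, IVF, and so on) a vector database would build, with illustrative names and two-dimensional vectors:

```python
from math import sqrt

class TinyVectorIndex:
    """Toy exhaustive-search index illustrating the two stages:
    one expensive upfront indexing step, then many cheap queries."""

    def __init__(self):
        self._items = []

    def index(self, ids_and_vectors):
        # Upstream, one-off step: normalize each vector once at indexing time,
        # so every later query reduces to a plain dot product.
        for doc_id, vec in ids_and_vectors:
            norm = sqrt(sum(x * x for x in vec)) or 1.0
            self._items.append((doc_id, [x / norm for x in vec]))

    def query(self, vec, k=3):
        # Downstream step: the query vector must come from the SAME
        # embedding model that produced the indexed vectors.
        norm = sqrt(sum(x * x for x in vec)) or 1.0
        q = [x / norm for x in vec]
        scored = [(sum(a * b for a, b in zip(q, v)), doc_id)
                  for doc_id, v in self._items]
        return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

idx = TinyVectorIndex()
idx.index([("a", [1.0, 0.0]), ("b", [0.9, 0.4]), ("c", [0.0, 1.0])])
print(idx.query([1.0, 0.1], k=2))
```

A real ANN index replaces the exhaustive scan in `query` with an approximate structure, which is precisely where the indexing-speed vs. query-speed trade-off discussed next comes from.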
So the trade-off here is, is your database optimizing for indexing speed, or query speed? Or is it mature enough that it has optimized for both? And if you look through all the offerings out there, many of the existing vendors have focused more on one end of the pipeline, and not so much on the other. Some of them are faster at indexing, and not so much at querying, but some of them are way better at querying, and much, much slower during indexing.
[30:05] So generating that index can actually be a very expensive step, because it’s not only about using a sentence embedding model or a transformer; it’s also about the database being able to translate those vectors into an index that it can actually query. So depending on the size of your data, this could take hours or even days. Indexing periods on the order of days are not unheard of. And of course, depending on the amount of money you’re throwing at it, you could use GPUs to speed up the vectorization, and use multiple parallel instances of the database to scale that portion up. But that’s exactly it - the trade-off here is how important is indexing speed? If your data is coming in as a stream, at a very rapid rate, it’s important to consider indexing speed as an important criterion. But if you’re not so interested in dumping large amounts of data very quickly, and more interested in serving results to a very large number of users asynchronously, then queries become very, very important.
I know, we don’t want to necessarily call out certain players in this space, but I think a lot of people are already familiar with a lot of the names here… So maybe if you could just highlight, from your perspective, what are maybe some of the ones that are maybe more, like you’re saying, mature in how they’re thinking about both of those phases, whereas maybe certain ones that are optimizing more on one side or the other? …which, like you said, depending on your use case, it’s going to be a good thing, or it might be a bad thing. So it’s really about use case, it’s not so much about the goodness or how amazing a certain offering is, but more about use case.
Yeah, absolutely. So as you said, I’m not going to call out specific players… I mean, to be fair, every vendor makes trade-offs. They themselves are obviously juggling a lot of their own trade-offs when they build these things. And I obviously haven’t used every single one out there, but of the ones I have, I’ve worked with the most mature ones. I think Milvus is an open source, purpose-built database. It’s been around among the longest, I think, in the vector database market. It’s extremely scalable. I mean, I’ve written in my blog, I call it “Milvus throws the kitchen sink and the refrigerator at the vector problem.” So it can really handle billions of data points. It’s designed for that, and obviously it has had time; it’s been around for about four or five years. I wouldn’t say that that would be my go-to first choice, but that’s my own personal preference, to be honest. It’s more about, I guess, usability, how accessible their Python client is, and so on, than other vendors [unintelligible 00:32:29.25]
So you could say that these are very, very powerful solutions, they scale really well, they ingest data really quickly, and they also supply query results very quickly, and relatively accurately as well.
To be fair, I think the existing database vendors like Elasticsearch, Postgres - they’re not there yet in terms of the speed, and that’s partly because they’re general-purpose databases; they’re not specialized vector databases. So it makes sense that they have to deal with other priorities, and they cannot optimize for all of these things with the laser focus that purpose-built vendors have.
Thank you so much, Prashanth, for helping us start to pick apart some of these trade-offs. I’m starting to structure things in my mind in a useful way, which is really great, because I’ve also been exploring a lot of these, and I agree with you, there’s also a lot of new entrants into the field that show a lot of promise, even the ones that aren’t quite as mature yet… What are some of the other – you mentioned eight. I don’t know if we’ve been through three or four yet, I wasn’t keeping track…
I might have to speed things up a bit.
Just kind of list them off, at least. Yeah.
Okay. So maybe I quickly go through at a high-level, and then we can go into the finer details of which ones you think are the most interesting. So okay, let me summarize the first three… It’s basically purpose-built versus existing solutions. That’s number one. Number two was external embedding pipeline versus built-in hosting pipeline. Number three is indexing speed versus querying speed. So I think the others are going more into the actual indexes and generation of those indexes in more detail. I’ll go through them.
[34:07] Number four is recall versus latency. That’s more related to how accurate are the results, versus how fast am I retrieving those results. Number five is in-memory index versus on-disk index. I think this is a very big one for the future, so we definitely want to go into, I think, some of the details of that. Number six is sparse versus dense vectors. The kind of vectors themselves that are underlying the index. Number seven is the importance of hybrid search, where it’s full-text search combined with vector search. They both have their own trade-offs. And I think the last one is the importance of filtering. So pre-filtering versus post-filtering to decide the quality of your search result.
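That last point — pre-filtering versus post-filtering — is easy to see with a toy example (made-up data, not any particular database’s API). Pre-filtering restricts the candidate set by metadata before the similarity search; post-filtering takes the top-k by similarity first and then drops non-matching results, which can return fewer than k hits:

```python
docs = [
    {"id": 0, "vec": [0.9, 0.1], "lang": "en"},
    {"id": 1, "vec": [0.8, 0.2], "lang": "fr"},
    {"id": 2, "vec": [0.7, 0.3], "lang": "fr"},
    {"id": 3, "vec": [0.1, 0.9], "lang": "en"},
]

def score(q, v):
    # dot product as a stand-in for cosine similarity
    return sum(a * b for a, b in zip(q, v))

def pre_filter(q, k, lang):
    # filter first, then rank: always returns k hits if enough candidates exist
    cands = [d for d in docs if d["lang"] == lang]
    return [d["id"] for d in sorted(cands, key=lambda d: -score(q, d["vec"]))[:k]]

def post_filter(q, k, lang):
    # rank first, then filter: matching docs can be crowded out of the top k
    top = sorted(docs, key=lambda d: -score(q, d["vec"]))[:k]
    return [d["id"] for d in top if d["lang"] == lang]

q = [1.0, 0.0]
pre = pre_filter(q, 2, "en")    # [0, 3]
post = post_filter(q, 2, "en")  # [0] -- doc 1 crowded doc 3 out of the top 2
```

Pre-filtering gives complete results but can defeat the vector index (the filtered subset may not map cleanly onto the index structure), which is why real databases invest heavily in making filtered vector search fast.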
Yeah. I am very curious about this in-memory or on-disk one… Well, I’m interested in all of them, but I know one of the things that has come up in several of the applications that I’ve worked on has been “Okay, do we self-host one of these things? Do we use the managed service, because they’re going to be able to scale up and optimize things?” There’s also the choice of, “Oh, well, I could just load one in-memory, on-the-fly, ephemerally.” I could have an embedded case where I load a bunch of vectors in, and then there’s some persistent file that I can pass around… And then there’s I think more of what you were getting at, which is “Is this index represented on-disk or in-memory?” Could you maybe help us parse through some of those things and go into a little bit more detail of what you mean there?
So yeah, now that you mentioned self-hosted versus cloud, I think that’s the number nine that I will add eventually. It’s a very good point that you brought that up.
Perfect. Yeah, maybe we can find a number ten to round it out before the end of the episode…
I’m sure there’s way more, yeah. I could go on all day. So yeah, going back to your in-memory question, I think it’s a very important one. I think this is one of the things that is defining what you would call the race towards vector supremacy. I don’t think the term is very accurate, but anyway. I think the challenge with most of the vector indexes out there – I think the most popular one by far is called HNSW, hierarchical navigable small world graphs. And I go into the details of the algorithm in part three of my blog, so I’d be happy to discuss more with anyone else outside of this, if required… But the HNSW index is known for its relatively good trade-off between recall and latency. It’s fast, and it’s relatively accurate, but it is also memory-hungry. And where this becomes an issue is as datasets get larger and larger - this is called the trillion-scale vector problem now. A lot of vendors are talking about it; it’s not hard to imagine that at some point you’re going to have to index a trillion vectors. And that is no mean feat. It’s a very challenging problem.
So the dataset in that situation would be way too large to fit in memory. Now, HNSW already does a lot of optimizations under the hood. The algorithm is designed to store a sparse graph in memory. Essentially, you search through the sparse graph, and then through the layers of that graph you narrow down on the nearest neighbor to the query that you input. But as we go and get larger and larger into data, even that sparse graph does not fit in memory.
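The core search routine being described can be sketched in a few lines (a deliberately tiny, single-layer toy — real HNSW runs this greedy walk over a hierarchy of progressively denser graph layers, entering at the sparse top layer):

```python
def dist(a, b):
    # squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

# a tiny hand-built proximity graph: node id -> vector, node id -> neighbours
vectors = {0: [0.0, 0.0], 1: [0.5, 0.0], 2: [1.0, 0.0], 3: [1.0, 0.5], 4: [1.0, 1.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0):
    """Walk to whichever neighbour is closer to the query; stop at a local
    minimum. HNSW repeats this layer by layer to narrow in on the neighbour."""
    current = entry
    while True:
        nearer = [n for n in graph[current]
                  if dist(vectors[n], query) < dist(vectors[current], query)]
        if not nearer:
            return current  # no neighbour improves on the current node
        current = min(nearer, key=lambda n: dist(vectors[n], query))

result = greedy_search([0.9, 0.9])  # walks 0 -> 1 -> 2 -> 3 -> 4
```

The memory pressure comes from keeping the neighbour lists (`graph` here, but with millions or billions of nodes) resident in RAM so each hop is a fast pointer chase.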
Databases have come up with different solutions for how to deal with this out-of-memory issue. One example would be Qdrant - they use this thing called memmap. It’s a memory-mapped file approach, where you don’t actually keep all the vectors in RAM, but persist them to disk and let the operating system’s page cache serve the reads. And it’s still better than reading directly from the solid-state drive, which is one level below.
So in terms of latency hit, it’s not as bad, so you don’t lose that much performance, and you’ll notice that a lot of vendors fight really hard to avoid persisting any vectors of the index to disk. Because the moment you go onto solid-state drive, there is a massive performance hit in terms of retrieval. Because the speed at which you’re able to retrieve things from memory is, as you know, much, much faster than what you could do on disk. That’s a general trend, I think, across the board right now. Most vendors are largely working with storing the HNSW index in memory, and then adding some sort of caching layer to avoid having to repeat the queries and waste time in that sense.
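Python’s standard library makes the memory-mapping idea easy to demonstrate (a minimal sketch — real engines use carefully laid-out binary formats, but the mechanism is the same):

```python
import mmap
import os
import struct
import tempfile

DIM = 4
vectors = [[float(i + j) for j in range(DIM)] for i in range(10)]

# persist the vectors to disk as packed float64s
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as f:
    for v in vectors:
        f.write(struct.pack(f"{DIM}d", *v))

# memory-map the file: nothing is loaded up front; the OS page cache
# serves reads, and only the pages you actually touch occupy RAM
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get_vector(i):
    off = i * DIM * 8  # 8 bytes per float64
    return list(struct.unpack(f"{DIM}d", mm[off:off + DIM * 8]))

v7 = get_vector(7)  # reads one vector without loading the whole file
mm.close()
```

Warm pages behave almost like RAM reads; cold pages fall through to the SSD — which is exactly the middle ground between fully in-memory and fully on-disk indexes being described here.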
[38:15] There’s this entirely new index called Vamana. I’ve written about that on my blog. It’s optimized for solid-state disk retrieval, and the system built around it is called DiskANN. Not every database vendor has implemented this; it’s still early days… But if I look at where the future is going, there are many options vendors could pursue. They could choose to implement HNSW on disk, but the record suggests that’s not a great idea, because its performance would drastically reduce. DiskANN seems to be the agreed-upon standard across many vendors, but the challenge with DiskANN is that the original implementation, from the Microsoft research team behind the paper, does not directly translate into database internals. Many of these databases are written in Go or Rust, while the reference implementation was written in C++. So it’s not a direct transplant of the algorithm from the source to the database; it required a lot of rewrites and a custom approach to optimizing for that speed.
But that being said, I have to point out one particular vendor that I think really stands out from everyone else on this trend. They’re called LanceDB. They are, I believe, the youngest database out there. They’ve just come about I think at the end of 2022, early 2023. And they are the only solution, as far as I know, who only support on-disk indexes. They don’t do an in-memory index at all. And I was initially very surprised as to how they even do this, how can you go about this… But as I dug into it - and I’ve spoken to some of the team as well; they’re really, really open about their research that they’re doing and all the models they’re building… But essentially, they innovate on multiple fronts, but the biggest innovation is the underlying storage layer, storage format. They built this format called Lance, which is essentially optimized for on-disk reads of data. And the database itself is built on top of this open source format, Lance. So the whole thing is open source, it’s built in Rust, so the performance there is already close to bare metal, it’s relatively fast… They have already built an experimental DiskANN implementation.
So when it comes to these on-disk versus in-memory trade-offs, Qdrant is going its own way in terms of how it achieves on-disk rather than in-memory storage. [unintelligible 00:40:22.21] LanceDB is innovating on a different front… I feel like these are the three vendors I’ve interacted with and used the most, and I think the future is heading towards one where on-disk becomes a requirement and the standard way of implementing an index… But the engineering challenges are still ongoing.
Let me ask you a slightly different question; it’s not completely unrelated. The things that you’ve been addressing there are leading me to the next step on that. So when you’re thinking about the kinds of environments you want to deploy into… Like, I know if you look at the other database types before vector, you would have some that are scaled massively in the cloud, and you’d have others, as we’ve moved more and more intelligence and data out onto edge devices, that are either embedded or designed to serve in a very constrained environment… What are the options for vector databases in that? I’m assuming that there’s obviously the cloud capability, because that’s kind of always the baseline… Do you also have - you know, as we’re moving into an increasingly autonomous world out there, and more and more things are being pushed outside of the data centers and the clouds, or at least the central parts of the clouds - are there options for either embedded or micro-serving, if you will, on the vector side?
That’s an amazing point, yeah. I covered this in my blog post number one, in terms of the architectures of these databases. And you’re absolutely right, I think there is a lot of room for embedded databases to become the norm. I know DuckDB is making waves in the SQL market on this front. I think a lot of vendors are emulating what DuckDB has done in SQL. As you know, DuckDB is an embedded database, unlike Postgres, which is a client-server architecture database.
[42:08] So what happened in the SQL world is now translating into, we could say, the vector world. Two databases are following this embedded approach: LanceDB, as I mentioned, and ChromaDB. ChromaDB is quite well-known; people have been talking about it for a while… But between the two of these, I do think that LanceDB stands out more in the underlying technology, because Chroma, from what I understand, is still building out its underlying layer. It was initially wrapped around an existing database internally; it didn’t have its own purpose-built storage layer to begin with, but they’re building that out as we speak. So I think between these two vendors, it’ll be interesting to see how each of them rolls out their own features and targets a specific part of the market.
Going back to the point of cloud versus on-prem, that’s another big thing, I think, that’s going to come up. Honestly, with Pinecone and services like that, which are completely on cloud, there could be a real bottleneck in companies being okay with just sending all their data to some cloud… Even if Pinecone says they would deploy on your infrastructure, at the end of the day it is still a purely cloud-based solution. There’s a lot of infrastructure-related hurdles around that.
Self-hosted is, I think, as you say, going to become more and more common, and certain options, like [unintelligible 00:43:15.25] Qdrant - they offer self-hosted options in their licensing as well. So the question for me that remains unanswered is “Which model in vector databases and vector search will dominate in the longer term - embedded, or client-server?” We are so used to the client-server model; that’s been working for more than a decade now. Pretty much every database we’ve used is based on the client-server architecture, where the server sits remotely, and I don’t have to have the server running anywhere near where the application’s running. But I think embedded databases, especially with LLMs in the picture, make a lot of sense in terms of data privacy, and things like that. And the scalability of these has, I guess, not truly been tested. DuckDB is just three or four years old, LanceDB is less than a year old, Chroma as well… So it’ll be interesting to see how embedded databases compete on that front, and how well adopted they are… Because I think industry generally tends to favor things that are tried and tested. At scale, for this sort of thing to catch on, it will have to offer real business value. And the way these databases monetize their offerings - I think that’s gonna be interesting to see.
And I guess we’ve already started moving this direction a little bit, but as we draw closer to an end here, I’m curious - you have explored probably more than, well, many people, certainly myself, in terms of how all of these offerings compare, what the trade-offs are related to vector databases… I’m curious, as you look towards the future, what are you excited to try that you haven’t yet tried? And then maybe what excites you about this space? I know you mentioned certainly there’s things that are hyped, or maybe different marketing that plays into this… But what are you actually practically excited about as a practitioner in the future of this vector database space?
I think the low-hanging fruit is the immediately obvious one, so I’ll start with that. In the past, when it came to search, we imagined the Google search bar, and the idea of building something like that in-house was inconceivable a few years ago. Having a scalable, reliable search engine that you could build on your own proprietary data was really difficult to do at scale. But today, with the combination of vector databases and LLMs - with GPT-4 now, and all the other models out there - I really think it’s become available to the masses. The average company that does not have massive compute is still able to build very, very valuable search and information retrieval solutions on top of their existing data.
There are additional offerings, like Haystack, and search engines that build on top of vector databases. But I think the foundational layers are actually being enabled by vector databases, which is why I’m so interested in those use cases. So those applications are very interesting, first off.
[46:06] The other thing is retrieval-augmented generation. This is a term that came about – I think it was introduced by Meta in one of their papers. Essentially, the idea is this: typically, information retrieval involves sending a query and receiving a response that surfaces information relevant to that query. Where the generation comes in is that LLMs now add an additional layer on top of that. You could send a query in natural language and retrieve the most similar documents to that query, but rather than just returning the document itself, you could have the language model go through the document, look at your query, retrieve only the part of the document that is relevant, and then generate a response that could potentially answer a question you had. Like, “What is the birthday of so-and-so person who runs this company?”
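The retrieve-then-generate flow just described can be sketched end to end (a toy: the bag-of-words “embedding” stands in for a real sentence-embedding model, and `call_llm` is a placeholder for whatever model or API you would actually use — both are made up for illustration):

```python
VOCAB = ["birthday", "founder", "company", "revenue"]

def embed(text):
    # toy bag-of-words "embedding" -- placeholder for a real embedding model
    words = text.lower().replace("?", "").split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

documents = [
    "The founder of Acme was born on March 3rd, so her birthday is in spring.",
    "Acme company revenue grew 40 percent last year.",
]

def retrieve(query, k=1):
    # step 1: vector search for the most relevant documents
    qv = embed(query)
    return sorted(documents, key=lambda d: -cosine(qv, embed(d)))[:k]

def call_llm(prompt):
    # placeholder for a real model call (hosted API or local LLM)
    return f"[LLM answer grounded in a {len(prompt)}-char prompt]"

def answer(query):
    # step 2: stuff the retrieved context into the prompt and generate
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

top = retrieve("What is the birthday of the founder of this company?")[0]
reply = answer("What is the birthday of the founder of this company?")
```

In production, `documents` would live in a vector database, `embed` would be a transformer model, and the prompt template would be tuned — but the pipeline shape is exactly this: retrieve, then generate over only the retrieved context.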
So these kinds of things were really, really – like, almost impossible to do before, but now I think it’s really actually achievable with the kinds of tools and technologies that are available today.
I think retrieval-augmented generation is really skyrocketing right now as a term. I think everyone’s talking about it. But what I want to add to that is, I want to throw this out here to any of the listeners, and potentially I’m going to talk about this to other people in industry as well… Can we add another layer to retrieval-augmented generation? And what I’m really interested in is how the two worlds of graph databases and vector databases come together. And I posted about this a couple of times… But what’s really interesting right now is most graph databases, like Neo4j, for example - they use declarative query language interfaces, like Cypher. Cypher is – you could say it’s a SQL equivalent for graphs. The good thing about knowledge graphs is they encode factual information, and in a very human-interpretable way. So the things that form nodes and edges in a knowledge graph - they are something that we as humans put in there; we encoded our knowledge of the real world into the data. Where vector databases sit complementary to this is, in many cases I might have connected data - say a person knows another person, like a social network situation; a person follows another person, a person lives in a city, and so on… These are all meaningful, connected entities in the real world. But you add some layers of data on top of this - you know data about a city, you know data about a person, where they worked, you know what information that company has… There’s so much additional unstructured data that attaches onto a node in a graph, and that is actually hard to query using conventional graph algorithms or graph query languages. So I think vector databases are uniquely placed to add new value in that space, in terms of - I call it factual knowledge retrieval.
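One way to picture this graph-plus-vector combination (a toy sketch with made-up entities — the real thing would use a graph database like Neo4j alongside a vector database): a fuzzy natural-language query is matched by vector similarity to the unstructured text attached to a node, and then exact graph traversal takes over to return the connected facts.

```python
# tiny knowledge graph: exact, human-curated edges between entities
edges = {
    "alice": [("works_at", "acme")],
    "acme":  [("located_in", "toronto")],
}

# unstructured text attached to each node
node_text = {
    "alice":   "alice is a software engineer",
    "acme":    "acme is a fintech company",
    "toronto": "toronto is a large city",
}

VOCAB = ["engineer", "bank", "company", "city"]

def embed(text):
    # toy bag-of-words "embedding" -- placeholder for a real embedding model
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def fuzzy_lookup(query):
    # vector search picks the entry node from a fuzzy query...
    qv = embed(query)
    return max(node_text, key=lambda n: cosine(qv, embed(node_text[n])))

def expand(node):
    # ...then exact graph traversal returns the connected facts
    return [(node, rel, dst) for rel, dst in edges.get(node, [])]

node = fuzzy_lookup("which company does she work for")
facts = expand(node)
```

The vector side tolerates queries that don’t exactly match any term in the graph; the graph side keeps the retrieved facts exact and interpretable — which is the complementarity being described.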
Now, the problem with knowledge retrieval is that sometimes the queries you have need to be exact. The ability to submit a fuzzy query that does not exactly match the terms in your graph is something you didn’t have before; it’s very difficult to generalize your query in a way that retrieves useful information. So I’m very interested to see how the power of natural language querying interfaces enabled by LLMs can be built on top of vector databases that store all the information related to an entity, with that entity encoded into a knowledge graph. And then you tie all these things together so that you can actually retrieve information, and explore and discover aspects of your data that you couldn’t otherwise. So I call it an enhanced retrieval-augmented generation sort of model. And this would obviously require tools like LangChain, or LlamaIndex… I mean, these additional frameworks that allow you to compose these different tools together, and pass data and instructions back and forth between the human and the different underlying databases themselves. So I’m super-excited about those technologies.
Yeah, I think it’s great to hear that perspective. Also, usually the answer is not “Only this technology, and nothing else is the solution”; a strategic combination of things is often where things end up. I think those are really interesting topics to explore, and I look forward to the next parts of your blog series as you explore those. I’m definitely going to be following your writing now.
Thank you from the community and from us for your work on this topic, and sharing that work with the community. It’s super-practical, and we’re very privileged and happy to have you on the show. Thank you so much, Prashanth.
I appreciate it. Thank you so much.
Our transcripts are open source on GitHub. Improvements are welcome. 💚