Johnny is joined by Marty Schoch, creator of the full-text search and indexing engine Bleve, to talk about the art and science of building capable search tools in Go. You get a mix of deep technical considerations as well as some of the challenges around running a popular open source project.
Hello, and welcome to this week’s episode of Go Time. I am your host, Johnny Boursiquot, and joining me today is none other than Marty Schoch, best known for Bleve, the full-text search and indexing library, of course built in Go. Welcome, Marty! How have you been?
Thank you, Johnny. Thank you for having me.
Yeah, absolutely. I’m very surprised that you have not been on a podcast, talking about Bleve for this long.
This is a first for me.
I know, right? You’ve been at conferences, you’ve talked about it, and the trials and tribulations of working on that project at times… I was watching a talk you gave more than a year or so ago now, at GopherCon U.K.
That was GopherCon U.K, right?
Right, yeah. And I really appreciated how you went through this journey of re-envisioning the indexing engine behind the project… And we’ll get into the reasoning and why you did that. That’s something I wish more talks were given about - the process, the journey of actually creating, of going back to the drawing board and saying “You know what - we’ve run out of time.” Being faced with those difficult times in a project, whether they be open source projects, or things at work… I really appreciated that, and this is something that I hope we’re gonna get into as well.
For those who don’t know, you are also on the East Coast, yeah?
I am on the East Coast. I live just outside of Washington DC, in Vienna, Virginia.
That’s right, that’s right. We’ve run into each other a few times at the Go DC. Is it Go DC…?
It is. Well, I think they might go by Golang DC still, I haven’t gotten to retire that name… But the group is still active, and I think they had a meeting here in September. I unfortunately couldn’t attend that one, but… Yeah, it’s alive and well.
Good, good. I’m always happy to hear of meetups that are thriving, that are serving the local communities; that’s something that’s near and dear to my heart. But yeah, the last time we saw each other at a meetup was like a couple years ago maybe.
It’s been a while.
It’s been a little while, yeah. So I’m glad to see you, and to see that you’re still doing your thing. So you work for Couchbase, yeah?
No, I left Couchbase last year…
[00:03:59.07] …in October, 2018 I left Couchbase. Some of your listeners may know I’ve been working there, and working on Bleve, a search library. You know, the time had come, and one of the exciting things about working on an open source library is that – you know, the project was started by Couchbase, but it gets adopted by these other companies… So what I decided to pursue was an opportunity to work with some of the other companies that were out there using Bleve in a different way. It’s always eye-opening when you get that chance to see your same codebase, but being used for some whole different application. So I was very fortunate…
I did some contract work with two different companies, and that ultimately led me to where I am now - I’ve just actually started a company called Bluge Labs. What we’re trying to do is get companies that are using Bleve on board to support Bleve, in sort of a new way. You could probably have a whole separate podcast on the economics of open source… So here we are, trying to support Bleve and open source search in Go in a slightly different way… And that’s just getting off the ground now, so you guys will all have to stay tuned for more information on that later, but… That’s been keeping me busy - figuring out how to take Bleve to the next level in terms of successful open source.
Yeah, yeah. I mean, more power to you, man. The startup game is – you have to have the right mindset, the right patience, and a boatload of energy to really give it your all in the day-to-day… There’s like a string of failures until you hit success, right?
Sure, sure… And I should clarify - we’re really not approaching this from the perspective of like a startup that’s got some hot, new product that’s gonna get VC investment… What we’re really saying is “Hey, if we have these libraries that have this community interest, and companies are using them, can we all pull together our effort enough that we all get what we want out of it?” Just be sustainable is really what we’re focusing on. So it’s a little different mindset, and stay tuned; you guys will hopefully hear more about how that goes.
Yeah, I’m looking forward to it. You touched a little bit on the relationship between Couchbase and Bleve… So what brought that on? Obviously, Couchbase is the maker of a popular database technology, so where does Bleve fit into that?
Sure. This was all the way back in 2014. Couchbase has this – obviously, storing data is the primary thing that databases do, right? But then people need to access their data, and you’re always looking for different ways to express the kinds of things that they’re looking for. It could be a key-value lookup, where you already know the key; it could be a SQL query, where you’re writing a query to describe the sets of records that you want returned, or it could be now this new thing, search, where you’re able to do full-text search capabilities across your document.
So Couchbase was in this position of looking to add that capability to their product. We were already adopting Go at that point, and had been successful using Go to – from our perspective, the value-add of Go was really faster development time. Maybe we could write a higher-performing thing in C, but there was also a chance that it crashes all the time, and the code quality is no good, and it takes maybe twice as long to get it to that same point. Go has always been a very – to me it’s like it’s an engineer’s mindset; it’s the right trade-offs for what you need right now.
So again, we set out to write what we needed in Go, but also we had this vision from the very beginning of making it open source. And I don’t mean open source in name only, which is what you see a lot of companies initiate, or they write something first and then they open-source it later, but there’s not really that community working on something together approach. We really set out to build a true open source community around it.
[00:08:02.23] And again, you can debate how successful or not successful we are, but it’s a tough thing to set out to do, and I’m pretty proud of what we accomplished, as led by Couchbase.
Yeah, that’s pretty cool. I’ve been aware of Bleve for quite a while… How old is the project at this point?
Like I said, 2014 is probably the oldest commit you will see. It could have been late 2013 when some of the first draft versions were coming together… But that’s roughly about the right timeframe.
Okay. So we’ve mentioned some terminology already that we were definitely gonna need to ground our users in. We talked about a full-text search, we talked about indexing… As a developer, usually when I need to find a string inside of a larger body of strings, the naive approach would be to say “Well, let me just import the strings package and do an index, and look at the position where it shows up.” But obviously, it’s not as simple as that. What is full-text search? What is indexing at its core?
Sure. The basic way I think about it is you really have the overall process divided between the two phases. We think of indexing, which is the process by which you take your sets of things that you wanna work with - “documents” is another word you’ll hear us use a lot - and you’re gonna ingest those and build the index. You’re gonna spend the CPU time to crunch some things around in the documents, and ultimately create some representation that we call the index. That could be in memory only, or more commonly you also wanna be able to persist that to disk, so that you can sort of stop your process, start it again later, and so forth. All of that is what we talk about as the indexing phase.
Then once you have an index built, you often want to then use it to run searches. The idea is, like you said, I have some notion of “I wanna find all the documents that have this word, or this set of strings.” So that’s your search phase of operation there.
The basic idea is you wanna think in advance about what kinds of searches you wanna run, and ultimately that will help you decide what the right index to build is. It’s not like a one-size-fits-all solution; you do need to give thought to what kinds of searches do I wanna run, and then make sure that I build the appropriate index to serve those kinds of queries later.
When you were first envisioning what Bleve was going to be, what could be, what prompted you to build your own, versus look for some other maybe open source popular project out there, to do what you needed to do?
To be completely honest, everyone would probably agree that Lucene is probably considered to be the state of the art in terms of this space, for full-text search. It’s been around for a long time, it’s open source as well, it’s written in Java, and it has a lot of people that have used it. Elasticsearch is really a whole company, starting with a server, and now a whole suite of things that started by building on top of Lucene.
There’s Solr, which is another product out there that’s (again) built on top of Lucene… And Lucene has contributors from both Elasticsearch and Solr pouring improvements into it… So that’s really what I would say is the state of the art. And when I say that, what I mean is it’s proven, it’s been around for a while, and it’s like - you’re not just gonna sit down and say “Let’s just rewrite Lucene. Let’s just port Lucene.” Those are big efforts, just because of the sheer number of 15-20 years of effort going into these projects.
Now, in Couchbase’s position it was sort of a unique situation. Again, I put on the engineering hat - could we have just used Lucene? Yes. But at the time, nothing else inside of the Couchbase server world was using Java. So it would have been this first thing pulling in like “Oh, now we need to have a JVM available. Oh, well now we need to think about how do we distribute the product. Now the Oracle licensing might mean there’s some complexities to how we distribute things.” So at the time, it was sort of a reluctance to pull down the full thing of Java…
[00:12:13.12] And also, our goal was really – again, taking this sort of 80/20 approach, can we deliver the most important 80% of Elasticsearch or Lucene? Can we pull in that kind of capability? We don’t need to build the whole thing; there’s this long tail of features we may never get to… If we could just build that most important part, that ought to be enough to meet our customers’ needs. And then let’s learn and iterate from that. If customers say “Hey, this is great. We really do need the other 20%”, then we’ll make the investment and keep building it out. But that was the approach that led us to building it.
I would say also, just to peruse all around the Go ecosystem, there wasn’t really a good full-text solution at that time. So again, the notion of “Could we have just used something else in Go?”, we didn’t see what we were looking for at that time. And again, we perceived that to be an opportunity. That was a chance for us to contribute back to the Go community and create some value, and share that with other people.
Right. Kind of like a right place/right time kind of situation.
Yeah, I would definitely say timing was key.
Let’s talk a little bit more about the mechanics behind searching and indexing. The simplest example we can think of is one where if I say “Go find me a word/term/phrase in a dictionary”, generally you might be able to flip through the pages and find the appropriate letter, and peruse, and do a sort of linear scan kind of thing to find what you’re looking for… But often you flip back to the back of the book, to the index, as it were, and you identify the term you’re looking for, or something that closely matches it, and then you jump to where you need to approximately, and then you’re doing another scan. So there’s a multi-step process to this.
The naive way of thinking is that “Well, let me just toss some terms inside of a map, and then do look-ups”, but it definitely – to me, I don’t know a ton about this way of building software, but there’s some complexity, there’s more involved with it. Can you talk a little bit more about the process of indexing? What is that about?
Sure. As you mentioned, the notion of an index at the back of a book really is a great mental model for people to have, to think about how the search index works. The first data structure is something we sort of loosely call the “term dictionary”, and that’s just the list of all the terms that your documents use. Again, if you were to think of the back of the book index analogy, the “term dictionary” would just be the list of all those words. Every term or word that was used in the book, that’s what we call the term dictionary.
[00:16:16.16] Now, I would say that’s a logical data structure, and what I mean is there’s all kinds of different computer science data structures we could use to actually implement that, but for now let’s leave that aside. Let’s talk about, like, logically, we start with that term dictionary, which is all of the terms that are used. So if you think about it, if you get some new book and you say “Hey, I want you to index this document as well”, one of the first things we do in the indexing phase is we have to go through that whole document, find all of the unique terms that are used, and keep track of not just where they occurred, like in the case of a book on what page it would have occurred on, but in terms of the index we’re building, we’re also gonna keep track of byte offsets or position offsets inside of your document.
And again, that’s not needed for the simple search of just “Which page did this thing happen on?”, but if you wanna get into phrase searches and more advanced searches later on, you need additional information about where those documents occur. So that’s the first, I would say, logical data structure, is what I call the term dictionary.
The second one that’s important for search is something we call the postings list. The idea is that for each one of those terms, we now need the set of documents which happen to use that term. Again, in the book analogy, the postings list is that list of page numbers that use the term. But in our index, that’s gonna be the list of document IDs or identifiers for the documents. And again, at the logical level, once we get to the next level, there’s all kinds of computer science data structures we could use to see what’s gonna be an efficient postings list. And there’s different technological choices that we can go into there.
So the key is really like a two-phased thing. If you say find all the documents that use the term “johnny”, what I’m gonna do is I’m gonna start by going to that term dictionary, find johnny, and then that’s gonna give me the postings list, and I can iterate that postings list and get “Okay, now I know all the documents that use that term.” And that’s really the building block.
If you think about more advanced searches, they’re all composed by doing one or more of those other simpler searches that we’ve just talked about.
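The two logical structures described above can be sketched in a few lines of Go. This is only an illustration under simple assumptions (whitespace tokenization, an in-memory map as the term dictionary); the type and function names are hypothetical and are not Bleve's actual data structures.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Posting records one document that uses a term, plus the word
// positions at which the term occurred (useful later for phrase search).
type Posting struct {
	DocID     string
	Positions []int
}

// InvertedIndex maps each term (the "term dictionary") to its postings list.
type InvertedIndex struct {
	postings map[string][]*Posting
}

// NewInvertedIndex returns an empty in-memory index.
func NewInvertedIndex() *InvertedIndex {
	return &InvertedIndex{postings: make(map[string][]*Posting)}
}

// Add splits text on whitespace, lowercases each word, and records
// where it occurred. Real analyzers do far more than this.
func (ix *InvertedIndex) Add(docID, text string) {
	seen := make(map[string]*Posting)
	for pos, word := range strings.Fields(strings.ToLower(text)) {
		p, ok := seen[word]
		if !ok {
			p = &Posting{DocID: docID}
			seen[word] = p
			ix.postings[word] = append(ix.postings[word], p)
		}
		p.Positions = append(p.Positions, pos)
	}
}

// Search performs the two-step lookup: find the term in the dictionary,
// then iterate its postings list to collect document IDs.
func (ix *InvertedIndex) Search(term string) []string {
	var ids []string
	for _, p := range ix.postings[strings.ToLower(term)] {
		ids = append(ids, p.DocID)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	ix := NewInvertedIndex()
	ix.Add("doc1", "Johnny hosts Go Time")
	ix.Add("doc2", "Marty wrote Bleve in Go")
	fmt.Println(ix.Search("go"))     // IDs of documents containing "go"
	fmt.Println(ix.Search("johnny")) // IDs of documents containing "johnny"
}
```

A Go map is the simplest possible term dictionary; as the conversation notes, production engines pick more compact structures for both the dictionary and the postings lists.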
Okay, so that still sounds – great explanation, but it still sounds like there’s a lot of machinery going on there. So when you’re building such an engine, what are the primary concerns that you’re grappling with? Obviously, performance has definitely gotta be something you have to keep in mind. You mentioned also about writing things to disk… So what are the concerns you must always have at the forefront when you’re building something like this?
Performance, I would say, is front of mind for most people building any sort of indexing solution, mainly because you’re focused on utilizing the equipment that you have in an efficient way. At the end of the day, even if you say “Well, it’s fast enough for me”, there’s always somebody who might say “Well, but you’re using five machines. Could you improve it a little bit and only use four? Then we could save a little money.” So performance is this sort of endless game, and it’s really about figuring out where to stop, at times… It’s often important.
Now, with the search index in particular, I would say there’s a couple things going on. I mentioned that when we’re ingesting these documents into the index, we’re gonna figure out what those terms are… But you alluded to this earlier - sometimes you wanna find maybe not exact matches, but similar terms. So one of the things that we do in full-text search is we’re gonna mutate and modify the terms that come in… And there’s various reasons you do this. A simple example would be we put everything in lower case, because typically when you’re matching these terms you don’t care about the case… So in our index we’re just gonna put everything there in lower-case.
A second example that we use occasionally is something called stemming. In languages like English you have various root forms of words, and then plural versions, or adjective versions that have extra letters, so what we do is we do something we call stemming, to take all those terms that are similar and basically transform them into a single term that ends up in the index.
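As a rough illustration of the lowercasing and stemming steps just described, here is a toy analysis function in Go. The suffix-stripping "stemmer" is a deliberately naive assumption for the sake of the example; real stemmers (Porter or Snowball, for instance) are far more careful, and none of these names come from Bleve.

```go
package main

import (
	"fmt"
	"strings"
)

// toyStem strips a few common English suffixes. It is a deliberately
// naive stand-in for a real stemming algorithm.
func toyStem(term string) string {
	for _, suffix := range []string{"ing", "es", "s"} {
		// Only strip when a reasonably long root would remain.
		if strings.HasSuffix(term, suffix) && len(term) > len(suffix)+2 {
			return strings.TrimSuffix(term, suffix)
		}
	}
	return term
}

// Analyze turns raw text into the terms that would be stored in the index:
// split on whitespace, trim punctuation, lowercase, then stem.
func Analyze(text string) []string {
	var terms []string
	for _, word := range strings.Fields(text) {
		word = strings.Trim(word, ".,;:!?\"'()")
		if word == "" {
			continue
		}
		terms = append(terms, toyStem(strings.ToLower(word)))
	}
	return terms
}

func main() {
	// All three surface forms collapse to the same index term.
	fmt.Println(Analyze("Searching searches, Search!"))
}
```

The same function has to run on the query text at search time, otherwise the query terms would not line up with what was written into the index.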
[00:20:09.25] Now, the reason I mention this in the context of performance is those kinds of transformations are CPU-bound things. There’s some string in memory, we’re gonna run some algorithm on it, and then we’re gonna have some new string… So keeping the CPU busy is one aspect of what we’re doing. But it’s not the only one. If you think about it, we’re also writing this index to disk, so one of the things that you also wanna do is say “Well, I wanna keep my IO channels busy writing to disk. If I can’t saturate the disk, then what am I doing here? I should be indexing faster.” So one of the things you’re also trying to do is keep your disk busy. And generally, I would say in most of the situations we encounter, that should be ultimately the limit. You wanna try and be able to – again, depending on your application, it should be possible to saturate the disk while you’re building this index.
Now, that’s just at the indexing time. The second thing you have to deal with - oftentimes the same systems that are building these indexes are then answering queries for these indexes… So you have some query time performance as well. A good example there would be if you think about how Google works - you run a search, and they only show you the top ten or whatever results on that first page. They’re not giving you every document on the internet that uses that term. Similarly, full-text takes that approach of “I’m trying to give you the most relevant information.”
Now, that’s just one kind of query that we can answer though. People also use this same technology for different kinds of queries, that are not really full-text. You can use the same system to support more like relational style queries, where you’re trying to find complex logical things of A and B or C, and so forth. The reason I mention that is at query time how many results are matching your query is gonna consume memory. If you just think about it, you have some grand system that’s written to disk, and you can page things in and out. But if you’re building a result set that’s gonna hold millions of records, or whatever - that’s something you have to consider as well. I’m drifting off-topic here, but basically you’re trying to balance several things from the performance perspective - CPU utilization, IO utilization…
You also need to think about space. If you just think about text, a lot of text is repeated. A lot of these strings are repeated, a lot of the strings have overlapping substrings… So the ability to compress your data while you’re building the index is also important. And like I said, one of the benefits there - it’s sort of non-obvious, but if you just think about it, by making the index smaller, you can make it faster to answer queries later, because more of the data is gonna fit into memory, more of the data is gonna fit into cache, and so forth. All of those things sort of compound in the best-case scenarios, where you’re really achieving that optimal performance.
I have two sets of questions, or two ways I could look at this - one from sort of an operator standpoint and one from a user standpoint. From a user standpoint, I know I’m looking for something in particular; the word maybe is something that could be misinterpreted, or that has multiple meanings… So I’m gonna know - kind of like when you search on Google, you’re putting something there, you’re kind of half expecting to have to tweak it a little bit to get finer-grained results, or something that is closer to what you’re looking for. So when you’re doing this ranking, this prioritization of what you assume is the best match, or the best guess what the users are looking for, how are you deciding what is most likely to be what the user wants?
[00:23:46.10] Sure. That aspect of full-text search relates to what we call the scoring of the results. Once we determine that a particular document matches your search, then the question is “How do we score it and ultimately rank it, so that we can compare it with the other documents that matched?” And our goal is to show you what we perceived to be the highest-ranking or most relevant documents for what you search for.
Bleve, which is the library that I’ve worked on the most, uses a model called tf-idf. The tf stands for term frequency. The way to think about that is in one of the documents that we’ve found, how often did that term occur? You search for “johnny”, and if the word johnny occurred five times in the document, that’s gonna be more relevant than another document where it only occurred once. So that’s one component to it.
The other part of it is called idf, which stands for inverse document frequency. The idea here is if every document in the dataset contained the term “johnny”, what we can conclude is it’s just not a very useful term for search.
For example, if I indexed – let’s say you have PDF scans of all of your bills, and they’re all addressed to you, they’re all gonna match “johnny”, so just searching for the word “johnny” doesn’t help us discriminate one document from another, because it occurs in all of them. So what we do is we sort of penalize terms where they occur in a large segment of the population, because it’s not contributing to the score being high.
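To make the two components concrete, here is a small Go sketch of one common textbook variant of tf-idf (raw term count multiplied by log(N/df)). It is an assumption chosen for clarity, not a claim about the exact formula Bleve uses; the names `Doc` and `Score` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Doc is a bag of terms: term -> number of occurrences in the document.
type Doc map[string]int

// Score returns a simple tf-idf score of term for doc within corpus.
// tf is the raw count; idf is log(N / df), so a term present in every
// document gets idf = 0 and contributes nothing to the score.
func Score(term string, doc Doc, corpus []Doc) float64 {
	df := 0 // number of documents that contain the term
	for _, d := range corpus {
		if d[term] > 0 {
			df++
		}
	}
	if df == 0 {
		return 0
	}
	tf := float64(doc[term])
	idf := math.Log(float64(len(corpus)) / float64(df))
	return tf * idf
}

func main() {
	// Three scanned bills, all addressed to "johnny".
	corpus := []Doc{
		{"johnny": 5, "electric": 2},
		{"johnny": 1, "water": 3},
		{"johnny": 2, "internet": 1},
	}
	fmt.Println(Score("johnny", corpus[0], corpus))   // occurs everywhere, so idf is 0
	fmt.Println(Score("electric", corpus[0], corpus)) // rare term, positive score
}
```

In the bills example, "johnny" appears in all three documents, so its idf is log(3/3) = 0 and it cannot separate one bill from another, while "electric" appears in only one and scores above zero.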
Now, that’s why if you go back to the process you described, when you run a search, users are sort of conditioned “Okay, I’m gonna run my search, though that’s not quite what I’m looking for… Let me change this term from this word to this other similar word”, and what you’re doing is you’re actually sort of gaming the system to try and – by adding or removing words, what you’re trying to do is help the computer understand what’s relevant to what you’re looking for. In this case, it’s a human being trying to tweak the inputs to get the computer to do what you want, which is to find that thing that you happen to be looking for.
In more advanced systems, that’s where you try and understand what the user wants. A good example I always come back to - I used to use a library called Selenium; it was an end-to-end testing framework, I think… Or an automated testing framework. So when I would go to Google and I would type in Selenium, Google figured out that “Okay, he means the testing framework, not the metal or the medication or whatever else the word selenium could mean to someone else, in a completely different context.”
So in more advanced search systems, what you’re actually trying to do is go beyond just the textual analysis, but you’re gonna sort of like learn and have some deeper sense of the words. That gets beyond what Bleve can do out of the box, but it’s important to understand, that’s really the game you’re playing. The computer doesn’t understand the terms, doesn’t understand that that same term might mean two different things in a different context, but you’re sort of – by adding additional terms, you’re providing clarity.
If I search for “selenium test framework”, then even Bleve is gonna figure out “Okay, he means the testing framework.” Because what you’ll find is the documents that happen to use all three of those terms and then get boosted appropriately are gonna be the ones that match, and in my case would be the ones I’m looking for.
So one way that I think - and probably others, too - is that I tend to relate certain terms with other terms… It’s almost like in my head I’m creating sort of a graph of how one document relates to another document, that perhaps I may not be thinking of right now, or may not be remembering right now, but I expect the system that I’m querying, that I’m asking for it, to be able to tell me “Hey, maybe you also meant this other thing, which is not an exact match for the term you put in, but I know these things are related, and therefore you might find these other documents interesting.”
Right. So one of the things you can do is – if you go back to what is our search… When we’re typing in a search, as we’ve described it, we’re just typing words in a box. But if you think about it, you could imagine the documents that come in - you could think of those as also a list of terms. Not search terms, but just terms that occur in the document. But they have this added dimension that they’re weighted by their frequency.
[00:28:05.15] Again, if the word “johnny” occurred in this document five times, you could imagine it almost being like a vector, the term “johnny”, and then the magnitude 5. So now you can say “Okay, every document I can kind of think of as this vector, or set of vectors.” I’m trying not to get too mathematical here… But the reason I bring that up is now there’s sort of this parity between – like, a search is just a list of terms; that could be that same vector, but all of my frequency is just one. Now, the twist I’m gonna make is there’s a type of search called a “more like this” search, which is similar to what you’ve just said. “If you like this document, you might also like these other sets of documents.” And the way we can do that is we can take that document and turn it into a search by taking that document’s set of terms - which is, again, just a list of terms, and it could be weighted by the frequency of those terms…
So we can basically turn any document into a search for similar documents by just interpreting the list of terms a different way. So that’s exactly how you would implement a “more like this” search, it’s just by saying “Oh, that list of terms in the document - that could just be my search terms”, and you could make it hidden from the user, and just sort of a really elegant way of saying “Oh, if you like this one, show me more that are similar to that.”
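The "document as a weighted list of terms" idea can be sketched as follows. Cosine similarity between term-frequency vectors is used here as one plausible way to compare the seed document with the others; that scoring choice, and all of the names, are assumptions for illustration rather than a description of how Bleve implements it.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// termVector counts how often each lowercased word occurs in text.
func termVector(text string) map[string]float64 {
	v := make(map[string]float64)
	for _, w := range strings.Fields(strings.ToLower(text)) {
		v[w]++
	}
	return v
}

// cosine returns the cosine similarity of two term vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for t, x := range a {
		dot += x * b[t]
		na += x * x
	}
	for _, y := range b {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// MoreLikeThis treats the seed document's terms as the query and returns
// the IDs of the other documents that share terms, most similar first.
func MoreLikeThis(seedID string, docs map[string]string) []string {
	seed := termVector(docs[seedID])
	type hit struct {
		id    string
		score float64
	}
	var hits []hit
	for id, text := range docs {
		if id == seedID {
			continue
		}
		if s := cosine(seed, termVector(text)); s > 0 {
			hits = append(hits, hit{id, s})
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].score != hits[j].score {
			return hits[i].score > hits[j].score
		}
		return hits[i].id < hits[j].id // stable order for ties
	})
	ids := make([]string, len(hits))
	for i, h := range hits {
		ids[i] = h.id
	}
	return ids
}

func main() {
	docs := map[string]string{
		"a": "selenium test framework for browsers",
		"b": "selenium browser test automation framework",
		"c": "selenium is a chemical element",
		"d": "cooking pasta at home",
	}
	fmt.Println(MoreLikeThis("a", docs))
}
```

A real engine would run the seed's terms through the inverted index instead of comparing against every document, but the interpretation is the same: the document's own term list becomes the query.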
Yeah, that’s kind of clever, actually. So the other way to look at this, as I mentioned, is from an operational standpoint. When building my index, what is the expected mechanism, what is the capacity in which I’m supposed to use Bleve as a library? I’m gonna be importing my library, and I’m gonna be feeding it all the documents, the entire body, whatever it is that I wanna be able to search - I’m gonna be feeding it a ton of documents. Basically, I have to have a repository of things to search for, obviously, for you to be able to make this process “Hey, this term - anything I can match in any sets of documents.” Are we feeding all of that raw text in?
Yeah, so the interface exposed by Bleve is actually very simple. We have an index method which takes an interface - so you can literally take any other object you’ve constructed in Go and pass it in to the index method. Now, that’s both a good thing and a bad thing. It’s a very simple interface; anybody can do that. But the thing is now you have to think about what is Bleve gonna do behind the scenes with that random object you passed in?
As you might guess, since it’s an interface, we do use reflection to walk your document and try and build the right thing. I’m open to admitting, this is an aspect of Bleve I would change in the future. One of the things we’ve found is we emulated Elasticsearch’s model of “Just throw me a JSON object and I’ll just do my best to consume it and make sense of it.” And you can refine that later, but the goal is you can just hand me something and it’ll try and do the right thing.
We have a lot of rules and magic, if you will, and that’s ultimately (I would say) a challenge to new users. But the reality is we have this object called a mapping. And the mapping is this sort of like side document, if you will, which describes how you want to take documents that you passed in and put them into the index.
The mapping is really where a lot of that logic gets expressed in Bleve today, and that’s what allows us to say “Okay, you have a field called Name, and we wanna also have a field in the index called Name.” Or “You have a field called Description. Let’s also have a field in the index called Description.” Again, the mapping allows you to do more exotic and complicated things… But again, one of the goals of Bleve was that default mapping - we take what you give us and try and do something intelligent. So you can, in large part, take a simple map with strings as keys and values, and it’ll do the right thing, in large part.
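To give a feel for what "using reflection to walk your document" can look like, here is a standard-library-only Go sketch that flattens an arbitrary struct or map into named fields. It is a hypothetical illustration of the general technique, not Bleve's code, and it ignores everything a real mapping would add (field types, analyzers, per-field options).

```go
package main

import (
	"fmt"
	"reflect"
)

// join builds a dotted field path such as "Author.Name".
func join(prefix, name string) string {
	if prefix == "" {
		return name
	}
	return prefix + "." + name
}

// walk flattens a struct or map into "path -> value" pairs, recursing
// into nested structs and maps. Unexported struct fields are skipped.
func walk(prefix string, v reflect.Value, out map[string]interface{}) {
	// Unwrap pointers and interfaces to reach the concrete value.
	for v.Kind() == reflect.Ptr || v.Kind() == reflect.Interface {
		if v.IsNil() {
			return
		}
		v = v.Elem()
	}
	switch v.Kind() {
	case reflect.Invalid:
		return // nothing to index (for example, a nil document)
	case reflect.Struct:
		t := v.Type()
		for i := 0; i < v.NumField(); i++ {
			if t.Field(i).PkgPath != "" { // non-empty means unexported
				continue
			}
			walk(join(prefix, t.Field(i).Name), v.Field(i), out)
		}
	case reflect.Map:
		for _, k := range v.MapKeys() {
			walk(join(prefix, fmt.Sprint(k.Interface())), v.MapIndex(k), out)
		}
	default:
		out[prefix] = v.Interface()
	}
}

// Flatten returns the indexable fields discovered in doc.
func Flatten(doc interface{}) map[string]interface{} {
	out := make(map[string]interface{})
	walk("", reflect.ValueOf(doc), out)
	return out
}

type Author struct{ Name string }

type Book struct {
	Title  string
	Author Author
}

func main() {
	fields := Flatten(Book{Title: "Search in Go", Author: Author{Name: "Marty"}})
	fmt.Println(fields)
}
```

The convenience is that callers can hand over any value; the cost, as noted above, is that what ends up in the index depends on rules the caller cannot see from the method signature alone.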
[00:31:55.05] Okay. So to effectively use a library like Bleve, from a developer standpoint, what prerequisite knowledge do I have to bring to bear? Is it just saying “Hey, you know what - maybe I have a pool of documents, PDFs, whatever, that I’m just gonna be feeding into the index”, and that’s it? It’s as simple as that? Or do I have to really know how to feed the data in to really use it the right way?
I always recommend users start at the end. Think about your users and think about what types of searches are they going to be running. And in particular, think about not just what kinds of searches, but what’s the data type of the result? Let me make that concrete. If I have a collection of books, and you run a search, the results that you get back could be books themselves, it could be authors, it could be comments about a book, it could be not the book as a whole, but it could be pages within the book. Those are all possible things a user might want back, and you need to think through “How do users wanna think about their results? What’s that unit of result?” Because Bleve search results always come back in that unit, like “This page matched. This book matched”, and so forth. So that’s one of the first things… You just wanna think about your data and the data model. Again, you can do complex, hybrid things, but you just wanna know that upfront, that that’s what you wanna do.
Once you’ve done that, you’re gonna sort of have a sense of what fields you need. If I’m indexing books, books generally have titles, or they might have the full content, depending on what your dataset is. Maybe you have book reviews, so you have comments about the books. All those - that will dictate what fields you’re gonna use. Then once you’ve determined what kinds of queries people are gonna run, and you have a sense of what the fields that you want in your index are going to look like, now you can sort of work back and say “Okay, what’s the right index to build?” In particular, you would need to know additional things like “If my titles are all in English, I could take advantage of that and index a certain way. If I have titles in a bunch of different languages, I might need to bring some different approaches to the table, to make search work well in that case.”
I would say certainly the language of your text would be an important detail that you would wanna think through in advance. And again, if it’s heterogeneous, then you need to plan and budget even more, because it’s gonna be more complex to handle.
And then the other thing would be to think through how you wanna combine full-text with other things. A good example would be oftentimes in your dataset you have other strings that you still wanna index, but you don’t wanna do full-text-like things on them. A good example would be identifiers. Maybe there’s an ISBN number for every book, which looks a lot like a string, and there’s a lot of benefits to indexing it as a string, but you don’t generally do partial matches on those; you just wanna do an exact lookup or nothing at all. So Bleve has support for those types of strings as well.
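To make the distinction between full-text fields and exact-match identifiers concrete, here is a minimal sketch in plain Go. It does not use Bleve's API; the `index` type, the `indexText` and `indexKeyword` helpers, the book title, and the identifier value are all made up for illustration. The analyzed field is split into lowercase tokens, while the identifier is stored as one untouched term.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// index maps a term to the set of document IDs that contain it.
// This is a toy structure for illustration, not Bleve's index.
type index map[string]map[string]bool

func (ix index) add(term, docID string) {
	if ix[term] == nil {
		ix[term] = map[string]bool{}
	}
	ix[term][docID] = true
}

// indexText stands in for full-text analysis: lowercase the text,
// split on anything that is not a letter or digit, index every token.
func (ix index) indexText(docID, text string) {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for _, tok := range tokens {
		ix.add(tok, docID)
	}
}

// indexKeyword indexes the whole value as a single term, so only an
// exact lookup can match it.
func (ix index) indexKeyword(docID, value string) {
	ix.add(value, docID)
}

// has reports whether docID is listed under term.
func (ix index) has(term, docID string) bool {
	return ix[term][docID]
}

func main() {
	titles, ids := index{}, index{}
	titles.indexText("book1", "Searching With Go")
	ids.indexKeyword("book1", "000-0-00-000000-0") // made-up identifier

	fmt.Println(titles.has("searching", "book1"))      // one word of the title matches
	fmt.Println(ids.has("000-0-00-000000-0", "book1")) // exact lookup matches
	fmt.Println(ids.has("000", "book1"))               // partial identifier does not
}
```

The same value could be indexed both ways if partial matching on identifiers were ever needed; the sketch keeps the two fields separate to show the default trade-off.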
And then, again, other ancillary things - we support indexing numbers, we support indexing dates, and we support indexing geo points. Those are, I would say – I mean, they can be used on their own, as a core capability, but what we find is they’re really useful to use in conjunction with full text. So you might say “If all the things I’m indexing are newspaper articles and they all have a date associated with them, I might wanna limit my date to… Okay, I wanna search for ‘Clinton’, but I wanna search just in the last year, not in the last four years.” So that’s an additional thing that you would be able to filter on.
And then even more powerful is when you wanna use those additional data points to adjust the score. Maybe what I really want is not to limit it to the most recent year, but I wanna boost the score of documents that are within the last year. So that won’t preclude an older document from coming back at all, but it means newer documents are gonna rank higher.
So you need to give some thought to how you wanna incorporate other types of data. Like I mentioned, we have numeric range, date range, and then geo boundaries as well.
One of the things I’ve found interesting was that you could decide, as a developer, which storage mechanism you could use for storing things like the index. I remember BoltDB was one of the options, and there were others… But recently, it sounds like you’re sort of navigating away from that interchangeability, for some reason.
Let’s talk about that a little bit.
Sure. As you pointed out, when we first conceived of Bleve, one of the things that was new and different that we were bringing to the table was this idea that we had this notion of an indexing scheme which would take all of the index and be able to represent it as keys and values. Now, if we could represent the entire index as just keys and values, what it meant was any key-value store - and at the time, 2014, was like a hotbed of key-value stores; there’s LevelDB, RocksDB, all this excitement going on about key-value stores… So we thought “This is great. Even if we choose wrong now, we could just plug in a faster key-value store later and that will solve all of our problems.” That was the initial idea that we conceived. And to be fair, it did allow a lot of flexibility early on in the project.
A good example was at the time BoltDB was one of the only pure Go key-value stores. And pure Go was, again, a benefit to us, because we’d already been burned by cgo and some other projects. So the idea that there was this pure Go – you could use the go get command without having to set up a bunch of other C libraries first, and it would work. So the fact that we had support for BoltDB was huge early on.
But as I alluded to, it all revolved around the fact that the index could be distilled down to sets of keys and values. And what we learned over time was it didn’t matter which key-value store we used, it was that encoding itself, that representation of all the index as keys and values - that in and of itself was not a particularly good encoding, either for storage size in terms of writing the index, but also in terms of query time, being able to answer queries quickly.
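A rough sketch of what "the whole index as keys and values" can look like, in plain Go. The key layout below (`t/<field>/<term>/<docID>`) and the `kv` type are invented for this example and are not Bleve's actual row format. Note how every posting repeats the field and the term in its key, and how a term query becomes a scan over keys sharing a prefix.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// kv is a toy key-value store. A real store would keep keys sorted
// and support prefix iteration; here we simply scan the map.
type kv map[string]string

// termKey builds one row key per (field, term, document) posting.
func termKey(field, term, docID string) string {
	return "t/" + field + "/" + term + "/" + docID
}

// indexDoc writes one row per distinct term in text; the value is
// the term frequency within the document.
func (s kv) indexDoc(docID, field, text string) {
	freq := map[string]int{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		freq[tok]++
	}
	for tok, n := range freq {
		s[termKey(field, tok, docID)] = strconv.Itoa(n)
	}
}

// search answers a term query by collecting every key that starts
// with the prefix for (field, term) and returning the document IDs.
func (s kv) search(field, term string) []string {
	prefix := "t/" + field + "/" + term + "/"
	var ids []string
	for k := range s {
		if strings.HasPrefix(k, prefix) {
			ids = append(ids, strings.TrimPrefix(k, prefix))
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	store := kv{}
	store.indexDoc("d1", "title", "Go search go")
	store.indexDoc("d2", "title", "search engines")

	fmt.Println(store.search("title", "search")) // both documents contain "search"

	// Total bytes spent on keys alone, to show the repetition overhead.
	keyBytes := 0
	for k := range store {
		keyBytes += len(k)
	}
	fmt.Println("rows:", len(store), "key bytes:", keyBytes)
}
```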
So as I said, we learned, basically – because Couchbase ultimately wrote another key-value store called Moss; I spoke about Moss at GopherCon… Moss is great for everything that it is, but it was still just another faster key-value store that ultimately didn’t solve that problem. So coming out of Moss in the 2017-2018 timeframe - as you said, we started our new indexing scheme called Scorch. The insight was basically – the project had grown up. In the beginning, people loved the flexibility “I can just pick and choose whatever key-value store I want”, but what we’ve found later was users didn’t care what key-value store. They wanted it to work; it should do everything it says on the box, and it should be as fast as you can make it go, and it should be as small as you can make it go. People want us to own the implementation of the bytes on disk; they don’t wanna worry about that, they don’t wanna have to upgrade to a new version of LevelDB in the future to fix some issue… They want us to own those problems.
[00:40:17.02] So the approach basically involved - okay, let’s set this old index scheme aside; we’re gonna have a new index scheme, which is not built on top of a key-value store, it’s gonna just write its own representation of the bytes directly to disk. Yeah, we have to own that piece now, and that was something we were comfortable with doing… And we had to sort of engineer that.
You mentioned that talk I gave at GopherCon UK… I really enjoyed giving that talk, because as you said, I tried to not just sugarcoat it and show you the finished product and say “Look, we went off to rewrite this thing, and here it is. It’s awesome.” In a nutshell, that’s how a lot of tech talks are… And I felt that just wasn’t honest. It was hard getting to where we got, and I thought the more interesting story was sort of going through all those things. Again, if anybody’s interested, it is a talk worth going back to. I hope that holds up over time, and people still enjoy it.
So that did lead us to bringing in Scorch. At the time I gave that talk, Scorch was still pretty new… But Scorch is production-ready today. It’s still not the default with Bleve, for reasons that are, again, disappointing… Bleve is like a lot of early Go projects. It got popular before there was good versioning, and even vendoring. It predated even vendoring.
The trouble we have now is there’s a lot of people that have adopted it that are using the old index scheme, so we need to be mindful of them, we need to have an upgrade path that doesn’t break things… So again, Go modules is like a hot topic for Bleve right now, and that’s one of the things that at Bluge Labs I hope to spend a lot of time working on for Bleve. Anyway, that’s where we are today. Again, we all recommend people using Bleve to use the Scorch index scheme, even though it’s not the default yet as of today.
Yeah, let’s dive a little bit deeper into the whole module thing and how that has affected the project. Is it more of sort of having to make sure you don’t break other people’s worlds?
That’s a big concern of ours. Again, we’ve taken this approach for a long time… Go’s model initially was “You don’t change it, you don’t break APIs. You just never change it. Once it’s popular, that’s it. You can add new methods, but pretty much any other change is gonna be a breaking change for somebody”, and so you’ll see that in the Bleve codebase. We have the function named “advanced”, with some new signature, or all kinds of naming schemes that aren’t even consistent across time now in terms of how we’ve attempted to do that.
So the first thing with Go modules is - yeah, we’re mindful that people have adopted Bleve without any notion of Go modules, without any notion of versions… They’re all just sort of living off of master, or some commit that they’ve checked out at some point in time. And we know we wanna graduate from that. But it can be a difficult challenge, because as I mentioned, Bleve has been supported by Couchbase and a handful of other companies over the years, so it’d be crazy to break it for the people that have put their money into it.
First and foremost, the people that have financially supported Bleve - we need to make sure they are happy using Bleve. That’s one of the things. But the Go community as a whole has moved forward to modules, so we can’t ignore that as well. It’s one of those things where we’re trying to balance multiple needs. I think we have a plan going forward now.
What we’ve done, just to be open with you, is we actually have a fork at the moment, where we’re able to sort of experiment with modules. So we have a fork that is sort of more modules-ready…
[00:43:49.23] I should go a little deeper into why modules is problematic. The simple one that a lot of people are already aware of is once you have a version 2, you can start having some additional challenges… And the reason is with Go modules the version becomes a part of the package identifier in the URL space. In a project like Bleve we have a lot of nested subpackages, which if you think about it, means all of our internal imports have to be rewritten when the major version changes. And that then, again, for people that have not adopted modules, now there’s this issue… Because if you’re not using modules, you have import paths that are referring to things… And I know that they’ve added a bunch of stuff to the Go tooling to mitigate that. I don’t wanna reopen this whole can of worms… Let’s just leave it that ideally, with Bleve, what we would do is we would release version one today with the old index scheme, and we would release version two tomorrow, with the new index scheme. That’s our vision of how this would work, so that everybody with backwards-compatibility issues stays on 1.0, everybody who wants to use Scorch and the new index scheme starts with 2.0 and goes forward. That’s where we’re headed.
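The import-path consequence of a major version bump can be shown with a small helper. This is only an illustration of the Go modules rule that a module at v2 or later carries a `/vN` suffix in its path; the `withMajor` function and the `example.com/search` module path are placeholders, not anything from the Bleve repository.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// withMajor returns the import path a package inside the module at
// root must use once that module declares the given major version.
// For v0 and v1 the path is unchanged; from v2 on, "/vN" is inserted
// directly after the module root. Paths outside root are untouched.
func withMajor(root, pkg string, major int) string {
	inside := pkg == root || strings.HasPrefix(pkg, root+"/")
	if major < 2 || !inside {
		return pkg
	}
	return root + "/v" + strconv.Itoa(major) + strings.TrimPrefix(pkg, root)
}

func main() {
	root := "example.com/search" // placeholder module path
	imports := []string{
		"example.com/search",
		"example.com/search/analysis",
		"example.com/search/index/store",
		"example.com/other/lib",
	}
	// Every import of a nested subpackage changes when the major
	// version changes, including the module's own internal imports.
	for _, p := range imports {
		fmt.Println(p, "->", withMajor(root, p, 2))
	}
}
```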
But what we’ve found is one of the challenges is all of those nested subpackages are a little bit of a liability. A good example would be one of the recommendations from the Go community is “Oh, just copy your module over into a v2 folder.” Well, first of all, I’m glad you’re laughing, because I find that suggestion just laughable on the face of it… But then if you just look at Bleve and the number of packages and submodules… It would be like a second copy of hundreds of files. It’s a complete non-starter.
But on that topic, that also is partly some stuff that we need to clean up. Our package was fine as it was conceived, but as a Go module it’s now too many things in one module… And we would benefit from the ability to version those independently.
I mentioned the Scorch index scheme - that’s gonna be broken out as a separate module. And the benefit there is that we’ll be able to version that independently of the top level of Bleve. Second, there’s another layer – if you peel back the onion even more, inside of Scorch there’s the actual disk file format; we call it zap. That is gonna be broken out as a separate module.
So just having these three independent pieces that can be versioned independently is going to be a huge benefit for the Bleve project. I can give you a very concrete example. If you’re someone like Couchbase and you’ve shipped a version of your product, and it’s out there – it’s not the cloud world entirely, right? There’s customers running it on their actual hardware somewhere, right?
And you’ve told the customer “Yeah, we’re gonna support that for three years”, or whatever their promises are… You’re in the position of actually having to support that, and stand behind that. What that means is when you ship the next version of Couchbase, you’ve still gotta be able to read that old format, even if you have some new, faster, even more efficient format; you’ve gotta keep being able to read and serve queries from that older data, or at least have the ability to migrate it if you choose to. That’s a capability that Bleve really lacks today; you can’t do a single build of Bleve, a single executable that reads and writes two different formats, even though the format has evolved over time.
The good news is modules actually can be a part of the solution for that. We can import multiple versions of Scorch, multiple versions of zap. That’s supported by Go modules as one of its core tenets. Again, that makes it sound simple, but there’s still some engineering behind it to make that work, too… But that’s our vision. That’s gonna allow some of the really important adopters of Bleve to gain an important feature, and it sort of gets us all on board with Go modules, and gets things going forward.
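One way to picture a single executable that reads more than one on-disk format is a registry of readers keyed by a version marker in the file. The sketch below is a plain-Go illustration under stated assumptions: the `segmentReader` interface, the two toy formats, and the one-byte version header are all hypothetical and say nothing about how Scorch or zap are actually laid out. With Go modules, the two readers could come from two major versions of the same module imported side by side; here they are local stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// segmentReader decodes the term list from one format version.
type segmentReader interface {
	Terms(data []byte) []string
}

// v1Reader is a hypothetical older format: terms separated by '\n'.
type v1Reader struct{}

func (v1Reader) Terms(data []byte) []string {
	return strings.Split(string(data), "\n")
}

// v2Reader is a hypothetical newer format: terms separated by a zero byte.
type v2Reader struct{}

func (v2Reader) Terms(data []byte) []string {
	return strings.Split(string(data), "\x00")
}

// readers maps the version byte at the start of a file to its decoder.
var readers = map[byte]segmentReader{
	1: v1Reader{},
	2: v2Reader{},
}

// open reads the version from the first byte and dispatches to the
// matching reader, so old and new files can be served by one build.
func open(file []byte) ([]string, error) {
	if len(file) == 0 {
		return nil, errors.New("empty segment")
	}
	r, ok := readers[file[0]]
	if !ok {
		return nil, fmt.Errorf("unsupported format version %d", file[0])
	}
	return r.Terms(file[1:]), nil
}

func main() {
	oldFile := []byte("\x01go\nsearch")
	newFile := []byte("\x02go\x00search")

	for _, f := range [][]byte{oldFile, newFile} {
		terms, err := open(f)
		fmt.Println(terms, err)
	}
}
```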
By having that fork that I mentioned, we’re able to experiment, and if we break things, we just try and unbreak them, and go from there. And then once we have that final picture of like “This is our desired end state”, then we’re gonna – we don’t want that fork to be long-term; this is an experimental thing that we wanna then merge back in and have a healthy Bleve project going forward.
So it sounds like you’re working out some of the issues you’ve had under the hood, but from a feature set standpoint… Which kind of ties into my next question around sustainability - what is it that you’re looking to do? Where are you looking to take Bleve next? …be it in terms of features, as an offering… How are you looking to support the project and keep it maintainable and sustainable? How are you planning on doing that?
[00:48:12.07] One of the things, as I said, would be the versioning of it. We’ve never had a 1.0 release, and I mentioned this started back in 2014… We’re not a healthy project in terms of having regular releases, so that’s one of my main goals. Like, let’s get on the release train model, let’s have two releases a year that are well thought out and planned. What that will allow adopters to do is stop running off of master, which is what everyone’s doing today. They find some bug, we’ve gotta fix it on master, and they re-roll their new release… It’s just not a healthy state. So once we have regular releases, that enables the adopters to say “Let’s stick to released versions” and then “Let’s backport bug fixes to released versions, and approach it in a sane way.”
And that maintenance is expensive. At Bluge Labs, what we’re looking for is the companies that ultimately sponsor the work, that’s some of the things that make sense to pay money for - maintaining older releases for a period of time, because that’s where there’s the value-add for the company, whereas the bleeding edge stuff is really what gets developers excited.
So regular releases is one of the main things that we wanna do. By adopting Go modules, I think that will make us more approachable. One of the big issues we have today is because we’ve sort of lagged with Go modules, I would say half of all new users come in and they say “Oh, when I do a go build, it’s broken.” And I say, “No, go build works fine”, and I show them and they’re like “Oh, well I have modules turned on”, and it pulls in some arbitrary, older version of one other library… It gets the right version of everything except one thing, and it’s because of a tag; it chooses the latest tag version… Anyway.
So that’ll be a big help as well, for getting new people on board - just having proper module support I think is important. And there’s a handful of other things - documentation, tutorials are all things that… I guess every open source project probably would list those as things that they could/should do better on.
So that’s just a handful of things… I would say we’ve been successful in the full-text use cases. Another thing that we’ve identified is sometimes people are using Bleve not for full-text. I mentioned that ability to do exact string matching… People sometimes push that to the limit. They say “Well, I’ve got 100 fields that I wanna do exact matching on, but I wanna do complex ands and ors across a hundred different fields.” No text analysis, so none of the interesting full-text stuff, but just the core actually works really well for that. But what we’ve found is there’s additional optimizations we can put in to make the index even smaller and even faster.
One of the projects I worked on this summer - I can’t go too much into details, but one of the benefits we got out of it is that, just by tweaking the customer’s settings and making a few code changes inside of Bleve, we were able to cut their index size in half… And they were already talking about terabytes, so this was a useful thing, to cut that number in half.
So that’s an area where we’d like to find out how people are really using Bleve. Like I said, early on I was only seeing how Couchbase used Bleve; I’ve gotten a little bit of a wider vision to see how other people are using it… And we wanna just take that further. I’m sure there are other companies out here that are using Bleve - and please, reach out to me; it’s a great time to get involved and help me understand your use case.
That’s really where we’re headed - just to make sure that the whole community of people using Bleve are all being heard, and we build this thing that’s useful for people.
That’s awesome. Yeah, absolutely, folks should definitely be reaching out to Marty… So how many other folks work with you on this project at this point?
I’m the only one that’s full-time working on this through Bluge Labs at the moment. Other companies that use Bleve all contribute and support in a way. Couchbase has two additional programmers that are full-time; I would say they’re full-time on their product, which heavily uses Bleve. So that’s still not the same as being 100% on Bleve. They make significant contributions to the project, but even they would be honest and say “We’re not full-time on Bleve.”
[00:52:11.11] The other companies that I’ve contracted with as well - I would say they have varying degrees of expertise in Bleve, like many companies. When you use a library, whether you intended to or not, you end up learning a little bit more about it than you probably wanted, because you had to support it, you had to figure out some corner case, some issue.
So we do have other developers contributing… And we have a good amount of random, one-off contributions that we get from the community. I don’t have the page open in front of me, but there’s a long list of contributors to the project over time at this point. I think we’d be well-served to clean that up a bit is what I would say. And what I mean is oftentimes you get a one-off contribution from someone; it doesn’t meet any of the guidelines that we generally follow for the code quality, it’s not designed the way we would have designed it… But it does work. You know what I mean? This is a very common problem, and I’m sure all open source developers face this. It’s like “What do we do?” We could say no, because it doesn’t really fit with everything else. We’d love to say yes, because it does add another useful feature… And then there’s the middle ground of what ideally would be like “Well, maybe we could clean it up a little bit, and then massage it.” That’s really the hard politics of open source, in terms of how you cope with that, how you deal with that… There’s all kinds of different philosophies there, and it’s an area where we could still improve.
Like many projects, you have contributions that come in that are very successful, and you have other contributions that come in and people will feel burned, or don’t feel like they got their change in, and that’s the reality of it, too.
Right. So having done all of this really beautiful work on this project with Go, have you ever had an instance where you thought “Maybe Go is just not the right tool for the job”? Or have you been completely happy with the language for this particular task?
I’ve given a talk at the DC meetup that we spoke about earlier, about – in particular with memory management in Go, and how that relates to optimizing application performance… And I think Bleve is maybe somewhat unique in this regard, but if you think about all the stuff we’ve talked about earlier about how this search engine works, one of the things you realize is it’s a lot of just loops within loops within loops. What I mean by that is “Okay, indexing? Well, I’m just gonna loop over all of my documents as the first loop. Now, my documents all have fields; I’ve gotta loop over all those fields… Now, for each term that I find inside of those fields - I’ve gotta loop over that and do some work.” So you have this loop structure. And what the downside is - you write your API in a really clean way the first time.
You just sit down at your editor and you write the API, and this is nice and clean, and then you find out “Oh, it turns out every time I call this function, it has to allocate something to return it.” So your simple API that you just thought up in your head ends up being one that not only performs poorly, but it’s magnified across all these loops. So when you look at your profile, your CPU profile, you see “Oh, I’m spending all of my time allocating memory and handing it back.”
Now, there are techniques… You find a million Go talks on techniques to avoid that. And in my experience, what gets a little frustrating is when I have to, in my mind, clutter the API, I have to add a new argument to my API, so that instead of always allocating something, it could optionally reuse the thing that I’ve passed in. So now my API has gotten cluttered, and it just hurts readability. The thing I love about Go is how readable the code is. I can pull up code I wrote two years ago, and it’s pretty straightforward. The code does what it says; that’s what’s to like about Go. And my concern is the memory management hurts that. It’s code that we write, that ends up – I have to remember “Why did we do it this complex way? Oh, that’s right, it was a performance optimization.”
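The trade-off described here, a clean API that allocates on every call versus a slightly cluttered one that can reuse what the caller passes in, can be sketched as follows. The function names `Tokens` and `TokensInto` are invented for this example and are not part of Bleve; the tokenizer only splits on spaces to keep the code short.

```go
package main

import "fmt"

// Tokens is the clean API: simple to call, but it returns a freshly
// allocated slice every time.
func Tokens(text string) []string {
	return TokensInto(nil, text)
}

// TokensInto is the cluttered variant: the caller may pass a slice
// whose backing array is reused. The result is only valid until the
// next call that reuses the same slice.
func TokensInto(dst []string, text string) []string {
	dst = dst[:0] // keep the capacity, drop the old contents
	start := -1
	for i := 0; i < len(text); i++ {
		if text[i] == ' ' {
			if start >= 0 {
				dst = append(dst, text[start:i])
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		dst = append(dst, text[start:])
	}
	return dst
}

func main() {
	docs := []string{"go search", "full text search in go"}

	// Inside the document loop, one buffer is reused across
	// iterations instead of allocating a new slice per document.
	var buf []string
	total := 0
	for _, d := range docs {
		buf = TokensInto(buf, d)
		total += len(buf)
	}
	fmt.Println(total) // 7
}
```

The extra `dst` parameter is exactly the readability cost being described: a reader has to know that the returned slice aliases the buffer and must not be retained across iterations.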
[00:55:58.09] And the pushback is like “Oh, well in so many cases that’s just premature”, but what we find is for some reason in all the code I need it’s not premature optimization. It’s what we have to do to get it to the place where it’s meeting the metrics. So if I have one concern with Go, for me it’s – I don’t mind garbage collection in principle, and I think the improvements to the garbage collector could actually address some of the issues, or some of the mid-stack inlining, it would solve some other concerns… But it’s just one of those minor gripes, where it’s like “I have to write code in a less clear way in order to get it to perform well”, and that’s not the spirit of Go. The spirit of Go is you just write it in this really clear way, and you just run it on faster hardware. That’s more of the spirit of how Go programmers think about things.
Cool. I was gonna ask what’s next for you, but you’ve been touching on that a couple times… Obviously, with the new startup, and you being pretty much full-time on this project now. That is in effect what you’re gonna be sticking with in the short-term.
Yeah, my goal is to really make Bleve – I feel like it’s an open source project that… We’ve made a really good technology, and that’s evidenced by the fact that companies have adopted it and are using it. To me, the technology is sound. But the community, the project part of it has sort of just lagged a little bit… And it’s mainly because - and this is my opinion, obviously - when you have companies that are sponsoring open source, they always have a little bit of a selfish pull. They’re more concerned with their issues than anyone else’s.
When I was at Couchbase, you would find some people saying like “Oh, when I use Bleve in this use case, it’s slow.” Or even better, the performance optimization that Couchbase put in makes it even slower for someone else using it in a different use case. So that’s a good example of how I just felt like to get Bleve to that next level, it needs a little bit of independence, it needs a little bit of someone focused just on the project as a whole, really clean up the issues, clean up the pull requests… We have a lot of backlog of stuff - which is great; it’s evidence of how much interest there is. But if we can’t have a system in place where we continually make progress and people have confidence that we’re making progress, at some point we’re gonna find we’re behind the curve, and that people have switched off. Someone will fork Bleve and be doing a better job of it than we are.
So my goal is really to try and step up and provide some of the stuff that’s been missing. Again, the biggest challenge for me is those regular releases. That’s difficult to do, and I say that because I keep promising people Bleve 1.0 and I keep not delivering it. So that’s how I’m asking people to measure our progress - “Hey, are we making regular releases?” And then how good are those releases in terms of the features and bug fixes that you want? I think if we hit those marks, then the people that invest in Bluge Labs are gonna be happy with what we’re doing.
Yeah, I think it’s really important to shine a light on this, because a lot of projects - especially the popular ones - are often run by folks who are not full-time on those things. Maybe they have a day job, maybe the employer is supporting the project, maybe not, maybe they have to get work from elsewhere… It’s rarely that you have folks who are dedicated to the project and its longevity. With you being full-time on Bleve, I think that gives this a much greater chance of success, as you envision it. So yeah, I wish you the best of luck with that project, for sure.
Thank you. It’s very exciting, and… Yeah. Just getting started, so it’s all bright future in front of me right now.
Awesome, awesome. Well, Marty, it’s been a pleasure having you on the show. As always, we always get into some interesting conversations. I hope to see you face-to-face again at some point in the coming weeks and months…
It’s been great having you on the show, and I hope you had a good time, and I hope our audience had a good time hearing about this awesome project.
Yeah, thanks again for having me, Johnny.
Our transcripts are open source on GitHub. Improvements are welcome. 💚