Johnny is joined by Marty Schoch, creator of the full-text search and indexing engine Bleve, to talk about the art and science of building capable search tools in Go. You get a mix of deep technical considerations as well as some of the challenges around running a popular open source project.
Marty Schoch: [23:46] Sure. That aspect of full-text search relates to what we call the scoring of the results. Once we determine that a particular document matches your search, then the question is “How do we score it and ultimately rank it, so that we can compare it with the other documents that matched?” And our goal is to show you what we perceive to be the highest-ranking or most relevant documents for what you search for.
Bleve, which is the library that I’ve worked on the most, uses a scoring model called tf-idf. The tf stands for term frequency. The way to think about that is, in one of the documents that we’ve found, how often did that term occur? You search for “johnny”, and if the word johnny occurred five times in the document, that’s gonna be more relevant than another document where it only occurred once. So that’s one component to it.
The other part of it is called idf, which stands for inverse document frequency. The idea here is if every document in the dataset contained the term “johnny”, what we can conclude is it’s just not a very useful term for search.
For example, if I indexed – let’s say you have PDF scans of all of your bills, and they’re all addressed to you, they’re all gonna match “johnny”, so just searching for the word “johnny” doesn’t help us discriminate one document from another, because it occurs in all of them. So what we do is we sort of penalize terms where they occur in a large segment of the population, because matching them isn’t contributing much to the score.
Now, that’s why if you go back to the process you described, when you run a search, users are sort of conditioned: “Okay, I’m gonna run my search… Oh, that’s not quite what I’m looking for… Let me change this term from this word to this other similar word.” What you’re doing is you’re actually sort of gaming the system – by adding or removing words, you’re trying to help the computer understand what’s relevant to what you’re looking for. In this case, it’s a human being tweaking the inputs to get the computer to do what you want, which is to find that thing that you happen to be looking for.
In more advanced systems, that’s where you try and understand what the user wants. A good example I always come back to - I used to use a library called Selenium; it was an end-to-end testing framework, I think… Or an automated testing framework. So when I would go to Google and I would type in Selenium, Google figured out that “Okay, he means the testing framework, not the element or the supplement or whatever else the word selenium could mean to someone else, in a completely different context.”
So in more advanced search systems, what you’re actually trying to do is go beyond just the textual analysis, but you’re gonna sort of like learn and have some deeper sense of the words. That gets beyond what Bleve can do out of the box, but it’s important to understand, that’s really the game you’re playing. The computer doesn’t understand the terms, doesn’t understand that that same term might mean two different things in a different context, but you’re sort of – by adding additional terms, you’re providing clarity.
If I search for “selenium test framework”, then even Bleve is gonna figure out “Okay, he means the testing framework.” Because what you’ll find is the documents that happen to use all three of those terms and then get boosted appropriately are gonna be the ones that match, and in my case would be the ones I’m looking for.