What's up, DocQuery? with Ankur Goyal, founder & CEO of Impira (Practical AI #196)

Chris sits down with Ankur Goyal to talk about DocQuery, Impira’s new open source ML model. DocQuery lets you ask questions about semi-structured data (like invoices) and unstructured documents (like contracts) using Large Language Models (LLMs). Ankur illustrates many of the ways DocQuery can help people tame documents, and references Chris’s real life tasks as a non-profit director to demonstrate that DocQuery is indeed practical AI.

42 minutes
Recorded Oct 5, 2022
Published Oct 12, 2022

Featuring

Ankur Goyal
Chris Benson

Transcript

Welcome to another episode of Practical AI. This is the podcast that likes to bring practical issues in artificial intelligence, and learn as we go. I am your co-host, Chris Benson. Daniel Whitenack is unfortunately traveling right now, so he’s gonna miss what I’m sure is gonna be a pretty cool conversation. And without further ado, I would like to introduce our guest today, Ankur Goyal, who is the founder and CEO of Impira. Welcome to the show.

Thank you so much for having me. I’m really excited.

Yeah, absolutely. So we’ve got a bunch of cool things to dive into today… I guess if you could just, as a start, before we actually dive into the topic, tell us how you got here. Who are you, what’s your story, and how did you arrive so that you could tell the world about what you’re going to be talking about today, which is your company, and DocQuery in particular?

Awesome. Yeah, so I actually don’t have a mathematical background in machine learning or AI. I’ve been working on relational databases for a really long time. I actually started doing research on them in school, and worked at a company called SingleStore; I joined as the second employee, and was the VP of engineering there for some time. And what got me into the space is actually talking to our customers, who were able to make use of data that is structured, but really struggled when the data that they wanted to work with didn’t fit inside of the relational database that we built. And so I thought, there has to be a better way. And looking around me, it was clear that the progress – and this was back in 2017, and a lot has changed since then, but even back then it was really clear that the progress on the machine learning side would make it possible for people to work with any data, no matter how structured or messy or complicated it is. And that’s what we’re all about at Impira - we’re one part database and one part machine learning technology that basically makes it really easy to work with unstructured data.

Very cool. As you came into the industry, and getting ready to set up your company, and looking at that with unstructured data… Could you tell us a little bit about what you were walking into and why you chose that particular path in the industry that you did? What was it that attracted you down the path that you did as an entrepreneur?

Yeah, the first thing I’ll say - it was definitely a windy road, and we didn’t know exactly what we were getting ourselves into when we started. So actually, when we started, we thought that the really big problems in helping companies work with unstructured data would be in helping them work with image and video content. And I think as it’s becoming really clear now with images and videos, the bottleneck is actually creation, it’s not understanding. And so we learned that just purely on the market side of things a few years ago.

[04:09] And as a funny coincidence, because one of the models that we ran on data that people uploaded was OCR, which is optical character recognition, some of our customers started asking, “You can do this stuff with images and videos, but can you also analyze the data that is in my invoices, and my forms, and other documents?” And we realized that there was actually a really exciting opportunity for us to help companies work with this unstructured data. And so kind of a happy accident, we discovered together with our customers.

It’s interesting that you mentioned that particular example, because I know when I think of things like invoices, and – separate from this I run a nonprofit, so I have that business hat I have to wear separately… I’m thinking of things like PDFs and things that are not typically what we’re thinking of when we’re training models. It’s not the form that we’re usually – we’re not going and pulling a bunch of data out of a database to train on, or sources off the internet, or whatever… So that’s a little bit of a different take from your typical avenue into machine learning off the bat. As you started recognizing that was a challenge, did that worry you at all, in terms of recognizing that you were going to take a different approach?

It probably should have… But as usual with myself and co-founders, and how we think, it didn’t. Actually, Richard, who is our CTO, came up with a really powerful approach to solving this problem that uses primarily computer vision, actually, to reason about PDF files. And so for a long time - and we’re foreshadowing a little bit, but DocQuery, which brings these worlds together… But for a long time, actually, a lot of the work that we did used computer vision. And so we thought a PDF is like a hybrid of text and visual stuff; we leaned on the side of the visual stuff. And that has a number of advantages and disadvantages, which we’ve learned over time as well.

So you’ve mentioned PDFs… Do you focus strictly on PDFs, or are there other file formats that you end up working with as well?

What we do is we take almost any file you could throw at the system that self identifies as a document, anything from PDF files, to emails, HTML files, scanned images, pictures from your phone - just about anything… And we do a bunch of pre-processing upfront that basically normalizes anything you upload into a fairly consistent data structure. So from whatever you put into the system, we normalize it into a bunch of pixels, a bunch of text, and a bunch of bounding boxes that tell you where the pieces of text are, as well as a few kinds of other things.
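As a rough illustration of the normalized form described here (pixels, text, and bounding boxes), the sketch below models one page. The class and field names are hypothetical and chosen only for this example; they are not Impira's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical types for illustration only.
# A bounding box is (left, top, right, bottom) in page pixel coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    bbox: BBox


@dataclass
class Page:
    width: int
    height: int
    pixels: bytes  # rendered image data for the page (left empty in this sketch)
    words: List[Word] = field(default_factory=list)

    def plain_text(self) -> str:
        """Join the words in stored order, discarding all layout information."""
        return " ".join(w.text for w in self.words)


page = Page(
    width=850,
    height=1100,
    pixels=b"",
    words=[
        Word("Invoice", (50, 40, 120, 60)),
        Word("Number:", (125, 40, 200, 60)),
        Word("INV-001", (210, 40, 290, 60)),
    ],
)
print(page.plain_text())
```

Whatever the input format (PDF, email, scan, phone photo), a structure along these lines gives later steps one consistent shape to work with.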

Gotcha. So before we dive fully into how you’re approaching it at this point, what was in place, both from the early machine learning days, as we’re going back a few years talking about that, but also, you mentioned OCR - what were the approaches people were taking and– What was the mental model around that, that you were looking at and saying, “That’s not good enough” based on what you were starting to think? What was the world looking like at that point?

What’s really interesting about this is OCR is not a new thing, neither is reading data from invoices, or other kinds of documents. But for some reason, most businesses don’t take advantage of it. And I think that’s because the solutions out there are just not easy enough to use. And so we’ve always thought about this from the standpoint of “What does it take to make something that’s actually so easy to use that it provides value for someone?”

[07:58] The solutions that existed prior - they fell into a few different buckets. One is something called an OCR template, where basically you take OCR text, and then you draw a box of XY coordinates around exactly where the text needs to be. And if you’re working maybe at the DMV or something, and taking identical documents and scanning them with an identical scanner every time, that approach can actually work really well. In reality, I’m sure with the invoices that you’re working with in your business it’s never that simple, right? And so that’s an example where the user experience and cost barrier in practice can be just prohibitively high.
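A minimal sketch of the OCR template idea, assuming OCR output arrives as (text, bounding box) pairs. The function names, field name, and coordinates are invented for illustration. It shows why the approach works for identical scans and fails as soon as the layout shifts.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; an assumption for this sketch.
BBox = Tuple[float, float, float, float]
OcrWord = Tuple[str, BBox]


def inside(word_box: BBox, region: BBox) -> bool:
    """True when the word's box lies entirely within the template region."""
    left, top, right, bottom = word_box
    r_left, r_top, r_right, r_bottom = region
    return left >= r_left and top >= r_top and right <= r_right and bottom <= r_bottom


def extract_with_template(words: List[OcrWord], template: Dict[str, BBox]) -> Dict[str, str]:
    """For each template field, collect the OCR words that fall inside its fixed box."""
    result = {}
    for field_name, region in template.items():
        hits = [text for text, box in words if inside(box, region)]
        result[field_name] = " ".join(hits)
    return result


# The template hard-codes where the invoice number is expected to be.
template = {"invoice_number": (200, 30, 300, 70)}

scan_a = [("Invoice", (50, 40, 120, 60)), ("INV-001", (210, 40, 290, 60))]
# Same document type, but everything sits 100 pixels lower on the page.
scan_b = [("Invoice", (50, 140, 120, 160)), ("INV-002", (210, 140, 290, 160))]

print(extract_with_template(scan_a, template))  # {'invoice_number': 'INV-001'}
print(extract_with_template(scan_b, template))  # {'invoice_number': ''}
```

The second scan returns an empty field even though the value is plainly on the page, which is the kind of brittleness described above.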

Another technique that was really emerging as more popular when we started is this really big pretrained model approach. So AWS has a product called Textract, for example, which is actually a great product. And what it allows you to do is upload any document into it, and it will give you back some data structure about what’s in the document. And the nice thing about this approach is you don’t need to do any of that template definition, or anything like that. But the challenging thing about it is that if the results aren’t what you expected, then you don’t really have any recourse to solve for it. A number of our early customers were using Textract and building machine learning models on top of Textract to normalize the data to be consistent, and they realized “This is just not – what are we doing here?”

So it was essentially a bandaid that they were creating on top of the product they were using, or the service they were using.

A very fancy bandaid, yeah.

I know that we have seen evolutions over quite a long time in OCR, in terms of that… You mentioned something though that made me curious. You were talking about like if you are using one of those early models that were pretrained, and then you didn’t get what you wanted out of it… Can you talk a little bit about what kinds of problems might arise in terms of like why weren’t they getting it out of those models, to define a little bit about the space that you’re fixing going forward?

Absolutely. Yeah, so there are two or three classes of problems. I think there are three. So the first problem is, let’s say you take a relatively low-quality image, like a scan that maybe is actually hard for a human to even decipher - maybe it’s really bad handwriting, or something like that - and you upload it into one of these products. If it can’t read the handwriting, or it can’t read past the quality, there’s really nothing you can do about it. And so that’s one class of problem.

Another class of problem is if you just consider a single document, and you upload it into a service like this, it may not actually pick up all of the fields that are in the document. So one of the problems that we see - it’s almost like bald spots or something in the document; it’ll just miss things. And if it misses something, there’s no way of telling it like, “Hey, please don’t miss this field next time.” There’s no input like that that you can provide.

Because it’s all pre-trained and you’ve got what you’ve got to work with at that point. Right?

Exactly, exactly. And the third thing is that, if you imagine working with many, many documents, they all might have different bald spots. And so you might have two documents, which for a user have the same schema, meaning they have the same fields that you want to extract, but you upload them into a schemaless service, and you get back two different schemas. And that’s actually where some of our early users were implementing their own machine learning models to try to translate from the schema that the pretrained model produces to the schema that they actually want to work with.

That is not a problem I had really considered. That’s an interesting side effect that you get on that. So you end up training in those early models – having to train a model, you’re running the document through the model, it comes up with both the whitespace issues, and it also leaves you with the problem of an inferred schema that was not intended… And then I assume that at the end of that you’re trying to get it all corrected back to what it needs to be.

[12:18] That’s right.

So that’s a lot of manual effort there. You may have some tooling to help you along, but there’s kind of a manual cleanup process that you’re having to go through.

Yeah.

Definitely interesting. One of the things I wanted to ask about as far as that goes is you talked about OCR, but we’re also talking about language models here… And you said that you were starting with the visual model, so we’re not yet talking about any NLP, natural language processing or anything like that, I’m assuming.

That’s right.

We’re talking about some sort of early visual model that’s pretrained.

That’s right. Although in Impira, the model that we had early on was actually not pretrained. Because of how it works, it would actually learn just on the user’s documents that they uploaded.

Interesting.

Yeah.

Break: [13:08] to [13:44]

So having laid the landscape there of what you were walking into in terms of problems to solve and ways of making a better experience for people that needed this, can you describe how you started thinking about that process, in terms of specifically where could you see things that needed improvement, so that we get a sense of how we would ultimately get to what I’m gonna get to in a moment, which is where DocQuery has landed? Tell us a little bit about what that pathway from “I’ve identified the landscape” to “Here’s a much better way of doing it.”

Yeah, so we kind of set ourselves up with a few constraints early on. One of them was that we wanted to make the product completely self-service. And our definition of that was that a user can sign up on our website without talking to anyone, onboard onto the product, and then evaluate whether it works on their documents or not. The second thing is that we wanted to support documents of any schema. So if we hadn’t seen that particular document type before, that’s totally fine. We’d be able to learn about it on the fly. And the third thing was that we wanted the product to be incredibly easy for a non-technical user to use and work with.

And so what we did after performing a lot of user research is realized that most of our users are either beginner or advanced Excel users, meaning we could safely assume that our users were able to work with Excel at a basic level, like entering data, and some basic formulas, and stuff. And then we could also assume that some of our more advanced users are really, really powerful Excel users. And so in Impira, even from the very start, you’ve been able to create these really complex expressions, and formulas, and stuff. And we realized the reason for all of this is that – and if you sort of tie it back to what I was saying about pre-trained models not evolving when you notice something is wrong, we really wanted to create an experience where users could easily see whether the predictions were right or wrong, and then if the predictions are wrong, or if they feel compelled to give us feedback that they’re right, they could correct or confirm things. And every time they do that, we drive the feedback into the model and incrementally train it.

[16:06] And so because of that design, we basically structured the machine learning approach to be one that is very, very lightweight, and something that can train and evaluate really, really quickly. And so that’s the overall approach for how we tackled it.

So I have what seems to me like maybe an odd question… But as you were kind of talking your way through that, it’s what came to mind. What are the things that you need to really be able to do with the document? With DocQuery being called DocQuery, for instance, what does it mean to query a document? Because that can be interpreted in so many ways. It started with something as simple as people doing Ctrl+F to do a find on a document…

Oh my God, I love this question.

Yeah, what are the things that matter? Because it occurred to me, I don’t know what those are.

Yeah, so I’d say from a user standpoint, there are a few different things that they’re really interested in. And then we can talk a little bit about Impira’s technology, and what part of that we hit, and what part of it we missed until we introduced DocQuery. But you know, users care about – one is integration. So a really common workflow for a lot of different types of documents - and I’m sure you’ll relate to this from your nonprofit business as well - you receive documents through email. You have to interpret them, to some extent. And that could mean reading the whole document, or just eyeballing something and figuring out where it should go next. And then you need to take that information and shove it somewhere. And what that looks like in a workflow, like accounts payable, for example, is receiving an invoice through email, opening the invoice on your screen, and then manually keying in the information into your ERP system. And there’s usually some judgment or interpretation that goes in as well. So these things are never totally literal; you might be making sure that the purchase order number that’s on the invoice is actually one that’s in your database. You might check that shipping plus subtotal plus tax equals the total, and sending an email back to the vendor if it doesn’t, or doing some other stuff, as well. So that’s like the basic workflow.
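The two checks mentioned in this accounts payable workflow can be sketched in a few lines. This assumes the fields have already been extracted as strings; the function names and the one-cent tolerance are assumptions made for the example.

```python
from decimal import Decimal


def totals_match(subtotal: str, shipping: str, tax: str, total: str,
                 tolerance: str = "0.01") -> bool:
    """Check that shipping + subtotal + tax equals the stated total, within a small tolerance."""
    parts = Decimal(subtotal) + Decimal(shipping) + Decimal(tax)
    return abs(parts - Decimal(total)) <= Decimal(tolerance)


def po_is_known(po_number: str, known_pos: set) -> bool:
    """Check that the purchase order number on the invoice exists in our records."""
    return po_number in known_pos


known_pos = {"PO-1001", "PO-1002"}  # stand-in for a database lookup

print(totals_match("100.00", "5.00", "8.25", "113.25"))  # True
print(totals_match("100.00", "5.00", "8.25", "115.00"))  # False
print(po_is_known("PO-1001", known_pos))                 # True
```

`Decimal` is used instead of floats so that currency amounts add up exactly. A failed check is where the workflow would branch to emailing the vendor.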

The other thing that people really want to do is ask questions. So not just sort of run the formula of like, “This plus this plus this equals that” but say like, “Are these two numbers equal?” Or “Of these 100 invoices, which ones are due next week?” Or “What was the most expensive line item on this invoice?” And that overlaps with search, although what we see is that people - they’re looking for answers to questions that are fairly analytical in nature. And a lot of this is done very, very manually today.
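Once fields are extracted, analytical questions like these reduce to ordinary filtering and aggregation. The sketch below uses made-up invoice records and reads "due next week" as "due within the next seven days", which is an assumption.

```python
from datetime import date, timedelta

# Hypothetical extracted records, for illustration only.
invoices = [
    {"vendor": "Acme", "due": date(2022, 10, 14), "total": 120.00},
    {"vendor": "Globex", "due": date(2022, 11, 30), "total": 900.00},
    {"vendor": "Initech", "due": date(2022, 10, 18), "total": 45.50},
]

line_items = [
    {"description": "Printer paper", "amount": 32.00},
    {"description": "Toner", "amount": 88.00},
]


def due_within(records, today, days=7):
    """Invoices whose due date falls between today and today + days, inclusive."""
    cutoff = today + timedelta(days=days)
    return [r for r in records if today <= r["due"] <= cutoff]


def most_expensive(items):
    """The line item with the largest amount."""
    return max(items, key=lambda item: item["amount"])


today = date(2022, 10, 12)
print([r["vendor"] for r in due_within(invoices, today)])  # ['Acme', 'Initech']
print(most_expensive(line_items)["description"])           # Toner
```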

It is. So it’s kind of funny, and it’s funny that you referenced me doing the nonprofit thing, because these are agonies, they’re little things that I know for a fact, because – just to bring in my own experience into the conversation, my wife and I are doing these administrative things that we have to do; we have a group of volunteers, and all, but most of the admin falls to us. And there are tasks that neither one of us is particularly trained in, nor particularly are they things we love to do. And so as you’re describing that, I was like, “Oh yeah, that was a pain. Oh, yeah, that’s painful. Yeah, that’s painful as well.” So it’s interesting that you’ve identified all of these pain points. And I realize you’re not specifically talking about nonprofits or small organizations, but indeed, they are things that definitely impact us as users.

We do actually have quite a few nonprofit users and customers of Impira, so we’ve heard this feedback very directly from them as well, yeah.

So as you’ve recognized all this, can you talk a little bit about what Impira has done, and how DocQuery fits into that, and within the scope of – you’ve laid out the problem and you’ve laid out kind of an approach to a solution… Could you talk a little bit about how that is realized in Impira broadly, and specifically in DocQuery?

[20:12] Absolutely, yeah. So if you think about what I mentioned with Impira, there are a few things that really stand out. One is that users can work with any field that they want, they can create any schema that they want. And the second is that we really care about ease of use and simplicity. And so if you rewind back a few months, we were in a state where you could create whatever field that you wanted, but you had to provide at least one label on the document; you had to highlight and click something to teach the model. And even though you didn’t have to do it for every single format that you uploaded, you had to do it for most of the formats that you uploaded at least one label.

So if you imagine, with invoices, if you had like a hundred different vendors, you might need to provide like 50 or 60 labels to teach the model about the breadth of vendors that you had. And so what we started thinking is, “Okay, how do we solve the problem of making it so you don’t need to provide any labels?” in this case. And not only would that provide a much better user experience, but it also would mean that we’d be able to address the long tail of variety a lot better. And that means that if you upload something that we haven’t seen before and it doesn’t look like something that you’ve trained your model on, it still has a fighting chance at extracting the data correctly.

And so we started open-endedly exploring, like pull our head out of the sand of all of our Impira context, and open-endedly started exploring what else was out there. And actually, the first thing I did - I remember doing it on the car ride to the airport from New York back to San Francisco - was copy-paste manually the text out of a bunch of invoices… To your point earlier, PDFs have all this structure, but I was just copying it out, and basically ignoring all the structure. And on Hugging Face’s website I was trying out a few different models that are pure text question answering models, and pasting the text into the website and asking questions like “What is the invoice number, and what is the total?” and I was just blown away by how accurate it was. It wasn’t even like 60% accurate, but still, with no context about this problem, nothing to do with invoices, no training data about invoices, no PDF structure or anything like that, it was that accurate. And so that kind of blew my mind. I mean, if it was that accurate with something that was so distant from what we were doing, it meant a few things. One is we could probably do better if we put in a little bit of effort. Two, we had this epiphany that the framework of question answering allows a sort of infinite canvas of any fields or any questions that you want, which is very in-line with our product’s philosophy. And then three, because something that has never looked at any documents, like the ones I was pasting into the textbox - because it was working so well with that, that probably meant that it would solve that generalization problem that I mentioned earlier. And so that sort of experience - I still remember the car ride, and I still remember working on my hotspot and stuff, and furiously playing with it. That sort of kicked off this whole idea.

That’s very interesting. And I know this is fairly recent. You’ve actually hit a whole bunch of things that I want to touch on with a couple of follow-up questions… First of all, this was a recent announcement. It was only on September 1st that you announced DocQuery. And another thing that you mentioned just now was Hugging Face, and stuff. So I’m curious about several things; I’ll throw several out to you. How has that model evolved, that you’ve had, as you’ve done this; you had started in the visual, you talk about large language models in your Twitter… There’s obviously an evolution of deep learning technologies that you’re applying here… And as you did that, how did Hugging Face fit into that? We have a habit of talking about Hugging Face quite a lot on this show; we’re big fans. So how did all that come together? …the evolution, Hugging Face, everything.

[24:25] We’re also big fans of them, and we’ve actually had the distinct pleasure of collaborating with them on this problem. So essentially, what happened is they have this cool thing called a pipeline. And a pipeline, for people like me who are not machine learning experts and barely understand what [unintelligible 00:24:44.17] like any of this kind of stuff, abstracts away all of that complex machinery and makes it really easy to work with models. And so the pipeline that I was experimenting with is called the question answering pipeline, and it’s all over their website, and any model that fits the question answering framework works with it.

So after we saw this, Richard and I chatted, and we were aware of some work out of Microsoft for a project called LayoutLM, which is a language model that, in addition to taking text as input, also takes bounding boxes for each word of text. And so that introduces the geometric information into the model that is actually super-relevant to our problem. And just to give you an example, you might have the text invoice number, and then the actual invoice number might be to the right of it. And if you turn that into plain text, then even a plain text model could pick up on that relationship.

On the other hand, you might have the word invoice number, and then the text beneath it. And then you might have some other text to the right of the word invoice number. And without the bounding box information, it’s actually really hard for a model to be able to pick up on that relationship. And so LayoutLM seemed like a really promising approach to solving that. But for some reason, when we dug around Hugging Face, and scoured GitHub, and Google at large, to see if there was a question answering pipeline that worked with LayoutLM, we just couldn’t find anything… And it seemed to us like, wow, if we had this awesome experience working with text-based question answering, and we know we’re not the only people trying to work with documents, but there’s nothing quite that easy out there, maybe we should take the lead on this and make it just that easy to do document-based question answering as well.
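The layout problem described here can be made concrete with a toy example. The hand-written rule below is only an illustration of why geometry matters; it is not how a layout-aware model works internally (such a model learns from the bounding boxes rather than following a fixed rule). The words and coordinates are invented.

```python
# Each word is (text, (left, top, right, bottom)). The value sits BELOW its label,
# and an unrelated label sits to the RIGHT of it.
words = [
    ("Invoice Number", (50, 40, 200, 60)),
    ("Due Date", (300, 40, 380, 60)),
    ("INV-001", (50, 70, 130, 90)),
    ("2022-10-12", (300, 70, 400, 90)),
]


def reading_order(words):
    """Flatten to plain text order: top-to-bottom, then left-to-right."""
    return [text for text, box in sorted(words, key=lambda w: (w[1][1], w[1][0]))]


def value_below(words, label):
    """Find the nearest word that starts below the label and overlaps it horizontally."""
    left, top, right, bottom = next(box for text, box in words if text == label)
    candidates = [
        (box[1] - bottom, text)
        for text, box in words
        if box[1] >= bottom and box[0] < right and box[2] > left
    ]
    return min(candidates)[1] if candidates else None


print(reading_order(words))
# ['Invoice Number', 'Due Date', 'INV-001', '2022-10-12']
print(value_below(words, "Invoice Number"))  # INV-001
```

In the flattened text, the token that follows "Invoice Number" is "Due Date", so a text-only reader has to guess at the pairing. With the boxes available, the relationship is unambiguous.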

And so we reached out to the team at Hugging Face, actually just by filing a GitHub issue, and they were incredibly receptive to the idea. And over a month of collaboration and working with them, we actually contributed the document question answering pipeline that’s now in Hugging Face, and a model that’s pre-trained and MIT-licensed and everything that you can play with, and work with, and even put into production, that works with it and actually makes it that easy.
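For readers who want to try the pipeline mentioned here, the following is a minimal sketch. It assumes the `transformers` library is installed along with Pillow and pytesseract (the pipeline runs OCR on the image when no word boxes are supplied) and that the `impira/layoutlm-document-qa` model can be downloaded. The helper names and the file name `invoice.png` are hypothetical.

```python
def top_answer(results):
    """Pick the highest-scoring answer from pipeline-style output.

    Accepts either a single result dict or a list of them, each with
    at least "score" and "answer" keys.
    """
    if isinstance(results, dict):
        results = [results]
    if not results:
        return None
    return max(results, key=lambda r: r["score"])["answer"]


def ask_document(image_path, question, model_name="impira/layoutlm-document-qa"):
    """Ask one natural-language question about a document image."""
    # Imported lazily so top_answer can be used without transformers installed.
    from transformers import pipeline

    nlp = pipeline("document-question-answering", model=model_name)
    return top_answer(nlp(image=image_path, question=question))


# Example call (needs the model download and a real image file):
# print(ask_document("invoice.png", "What is the invoice number?"))

# The helper can be exercised on its own with made-up results:
print(top_answer([
    {"score": 0.21, "answer": "2022-10-12"},
    {"score": 0.93, "answer": "INV-001"},
]))
```

Wrapping the result handling in `top_answer` keeps the calling code the same whether the pipeline returns one candidate or several.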

Break: [27:11] to [27:24]

So that’s really cool. What motivated you to make this an open source distribution? From the business – as you have put together, you’ve identified the problem, you have a new approach that you want to take, you’re taking advantage of really, really leading edge technologies from Hugging Face in terms of their pipeline… What made you decide as an entrepreneur to release DocQuery as open source? What was the business motivation there?

Yeah, so I think there may be three reasons for it. The first is not the business motivation, but just the personal motivation. When things are open source and they’re easy to work with, it removes all barriers to innovation. And I think selfishly, as someone who cares a lot about innovation, but also as a member of the tech community at large, I think being able to contribute to people innovating and making it easier for them to innovate and play with ideas - it’s just very important to me.

[28:21] From a business standpoint, the second thing is you could think about it in terms of distribution. So in exchange for providing something that’s generally useful to a large community of people, we have the opportunity to get some mindshare, and for them to familiarize themselves with us as a company, to experience technology that we create and form an opinion about how credible we are as product builders, and so on, in a way that doesn’t require them to give us email, or talk to a salesperson, or anything like that. So just purely from the standpoint of distribution, it’s actually really valuable to us as a company to have the mindshare and attention associated with it.

And then the last thing is being confident about what our sort of proprietary strategy can be in the context of having open sourced DocQuery. And there are a couple of things that make me really confident that we can still be a really successful proprietary product. The most important one is that when you as a customer use our product, you have this really real-time data flywheel, which allows you to correct things, review things, integrate things, and the models will keep improving just for you to be able to do that. And time and time again, we’ve seen how important that is, for people to put models into production in commercial settings. And we know that the ease of use, UI, security, integrations, and workflow involved is something that is actually really hard to build and engineer yourself. And so we know that that’s extremely valuable, and we feel confident in that. And so for things outside of that, that kind of opens up the possibility of open sourcing them and still being able to derive a lot of value from this core, proprietary product.

You said something that struck me right there about having that level of confidence and the fact that you already knew it was hard to build those things out… That is something that stops a lot of would-be entrepreneurs right in their tracks. You’ve dived into the deep end of the pool… You’ve said a couple of times in our conversation that you were not coming into this as a world-class deep learning expert yourself. You’ve built a team obviously around you, but you were coming in as someone with an idea. What gave you the confidence or the bravery to dive into the deep end of the pool and do something that we normally associate with people who might have a different background? … you know, have all that heavy math, and years of deep learning, modeling and stuff like that. How did you get past that? Because there are probably 1,000 people listening right now that want to be entrepreneurs, they’ve tried it, maybe they’ve tried and failed… How did you get past these hurdles?

Yeah. So I’ll give you the real answer, and the inspiring answer.

Okay, fair enough.

The real answer is just stupid naiveté. Like, I didn’t even think about that, and I’ve learned and been humbled so many times, by so many smart people over the past 10 years… And I’m still pretty stupid, and still pretty naive, and I hope I am that way for some time… But that’s the real answer. Now, I think the more hopefully inspiring version of that is that as someone who is not deeply familiar with the math, and deeply entrenched in the existing workflow for how things operate, it gives you a really unique perspective on what it would take to make something easy to use, and simple enough that non-experts can take advantage of it. And I think a lot of what you’re doing as an entrepreneur is bringing together two perspectives.

[32:21] One perspective is that of the people who you can feel need something, and the other perspective is the perspective of the people who feel they can build that thing. And as someone who’s not a machine learning person, it’s very easy for me to go on to Hugging Face’s website and play with the question answering model, and then try to read the documentation about the LayoutLM model, which had no examples and nothing that easy to use, and see the difference… Simply because I just didn’t understand enough about the model complexity and so on to actually understand, and so I was able to see that difference. And I think actually knowing more than I had at the time would have prevented me from doing so. And now that I’ve actually learned a decent amount about this stuff, I don’t have that same experience when I’m reading through papers about models or documentation and I almost miss it.

I’m curious, as you’ve done that - and by the way, I really think that what you said was quite wise, in terms of having that always willing to learn, knowing that you’re never there… So you’ve had several really great insights in your process, one of which was the benefit of doing it as open source, which scares a lot of people off, obviously, in terms of as a business model. But one of the things that we know is that when you have a great product, you’re solving a problem well, and you put it out there like that, it makes it very accessible, as you mentioned earlier, so adoption tends to be much higher when you do that, because people can dive in at whatever level they’re comfortable with and give it a shot and figure out how to engage you going forward. As you do that, what do you think are the next steps for DocQuery and Impira at large there? And then I’ll ask you a broader question after that… But I’m curious, very specific to DocQuery, where do you think it’s going to go over the next year or so from an adoption standpoint, and in terms of like what’s your short-term vision for that?

Yeah. So in the very near term, thanks to like just a fantastic flood of feedback from users, both through GitHub and Discord among other channels, we have a pretty good sense of the types of questions that people want to ask about a document that they can’t currently ask with DocQuery. And the really beautiful thing about the question answering framework is that it actually encourages that creativity. People can really easily type whatever question they want, and either get an answer or not get an answer.

And so the two kinds of questions that people keep trying to ask, that we’re not able to answer about a single document with DocQuery, are what document something is… So like, for example, is this an invoice? Or is this a purchase order? Or is this an invoice from this vendor? And the other thing they’re trying to ask are questions about tables. And an example of a question about tables would be like, “Give me all the line items on this invoice.” Or “What are all of the descriptions?” Or “What is the first, or second, or third description?” Or “What is the highest total value?” or something like that. And these are things that we’re actually fortunate to have a good amount of data for, and in the very near term are basically expanding the question answering model to be able to support.

[35:57] We have looked at other model frameworks, for example things like document classification as a framework or visual table detection, and stuff… We have a lot of experience trying these things out within the Impira product. But we feel pretty confident that we can basically expand the question answering framework to support them. And we just love the fact that it’s an infinite canvas.

The next step from there, which I’m extremely excited about, is allowing people to ask natural language questions over multiple documents, or a pile of documents, if you will. And that can be things as simple as like, “What are all the invoices?” or “Find me all the invoices in my Google Drive folder”, or things that are more complicated, like “What are the invoices that are due next month?” Or “Which invoices am I past due on?” Or “Which invoice from this vendor is the one that’s most relevant to this contract?” Or something like that.

Is that farther out, though? Is that something you think is – like, are you close to that, or do you think it’s going to take a little while to get to that point?

Well, the model is training right now.

Okay… [laughs]

There’s a few moving parts that we’re trying to figure out.

That was a great answer right there…

Yeah… I mean, like literally training right now. I kicked off the most recent run right before the podcast. I’ll give you the teaser for how that works, which is that – actually, we’ve studied this problem a lot through Impira’s product, because long story short, people actually do this kind of stuff with Impira. You can extract fields, and then you can write queries over the fields as well. And we actually have a pretty powerful query language that makes all this possible. And what we’ve realized is that you can take natural language and basically compile it into a query which consists of both relational algebra and other models or questions to ask of documents. And so we’re cooking this framework and making it work, and we’ve seen some really exciting initial results… But I don’t think it’s going to be too long before that’s possible.
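[Editor's sketch: the idea of compiling a natural language question into a query that mixes relational operations with per-document questions might look roughly like the toy below. Nothing here is Impira's or DocQuery's actual implementation. `Document`, `ask`, `Plan`, `compile_question` and `run` are hypothetical names, `ask` is a plain text lookup standing in for a question answering model, and the "compiler" only recognizes one hard-coded question.]

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Document:
    name: str
    text: str  # a real system would hold pages or OCR output instead


# Hypothetical mapping from a question to the "key: value" line that answers it.
FIELD_FOR_QUESTION = {
    "What type of document is this?": "type",
    "What is the due date?": "due date",
}


def ask(doc: Document, question: str) -> str:
    """Stand-in for a document question answering model (assumption):
    it scans 'key: value' lines rather than running any model."""
    key = FIELD_FOR_QUESTION[question]
    for line in doc.text.splitlines():
        k, _, v = line.partition(":")
        if k.strip().lower() == key:
            return v.strip()
    return ""  # no answer found


@dataclass
class Plan:
    questions: Dict[str, str]                    # column name -> question per document
    predicate: Callable[[Dict[str, str]], bool]  # relational selection over the row
    project: List[str]                           # relational projection


def compile_question(nl: str, today: date) -> Plan:
    """Toy 'compiler': maps one known question to a plan."""
    if nl == "Which invoices am I past due on?":
        return Plan(
            questions={
                "doc_type": "What type of document is this?",
                "due_date": "What is the due date?",
            },
            predicate=lambda row: (
                row["doc_type"] == "invoice"
                and row["due_date"] != ""
                and date.fromisoformat(row["due_date"]) < today
            ),
            project=["name", "due_date"],
        )
    raise ValueError(f"question not supported by this sketch: {nl!r}")


def run(plan: Plan, docs: List[Document]) -> List[Dict[str, str]]:
    """Ask each document the plan's questions, then select and project."""
    results = []
    for doc in docs:
        row = {"name": doc.name}
        for column, question in plan.questions.items():
            row[column] = ask(doc, question)
        if plan.predicate(row):
            results.append({c: row[c] for c in plan.project})
    return results


DOCS = [
    Document("a.pdf", "Type: invoice\nDue date: 2022-09-15"),
    Document("b.pdf", "Type: invoice\nDue date: 2022-11-01"),
    Document("c.pdf", "Type: purchase order\nDue date: 2022-09-01"),
]
```

[Calling `run(compile_question("Which invoices am I past due on?", date(2022, 10, 5)), DOCS)` keeps only `a.pdf`: the extraction step turns each document into a row, and ordinary selection and projection do the rest. In a real system the interesting work is in the two parts stubbed out here, namely the model behind `ask` and the translation from free text into a plan.]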

And then as we think about it further, one of the things that we did - and I encourage anyone who’s interested in the space to throw any idea you have at it - is we opened up discussion on GitHub about like what are things that you’d like to be able to type, that have to do with documents? And what’s interesting is a lot of the questions or things that people want to type are also actions. So things like “Organize all of these documents into folders by their document type”, or “Forward along all of these things to this email address.”

And so I’m not exactly sure how we’re going to tackle this in the open source part of the equation, versus our product, versus integrations with other products, because even our product doesn’t do all of these things… But I think purely from the machine learning standpoint, we’re starting to think about what the right framework looks like, both on the machine learning side and on the application side, to make it possible to type things like that.

And then the last thing I’ll say is that as we push further into DocQuery, it’s become increasingly clear to us that even though this question answering approach is incredibly relevant to working with documents, and it happens to work really well, this framework of having one or more pieces of data and asking questions about them is an incredibly powerful paradigm for people to work with data. And so our vision is increasingly becoming making it really easy for anyone to ask anything, of any data. And how we sequence those parts together, we’re still learning. I suspect one of the really great benefits of open sourcing DocQuery is going to be engaging people in the community who have different flavors of this use case to apply it in different domains. We probably won’t build models that analyze video, but you could use like 75% of DocQuery to manage getting the question, semantically representing it, turning it into relational algebra, yadda-yadda-yadda, and someone really smart in the community could plug in the video aspect of it.

And so that’s where I see the future of this… And I think open source in particular is going to be a really powerful vector for us to engage a much larger audience than our limited engineering bandwidth has the capacity to support over the long term.

Well, Ankur, that is very inspiring. It’s funny, because on a day to day basis many of us would think of just document management as a fairly mundane thing… But it’s such a huge impact on people’s lives in a billion small ways, in terms of making that better…

Oh, yeah.

It’s definitely something that brings a lot of value to a lot of people around the globe. So thank you so much for coming on the show. It was a fascinating conversation. Thank you for what you’re doing. Thank you for taking the approach that you’ve taken, and I’m looking forward to finishing up as this little nonprofit manager. I’m excited to use that to make my life just a little bit better going forward. Thanks a lot.

Awesome. And send us any feedback you have. We’d love it.

Absolutely.
