While scaling up machine learning at Instacart, Montana Low and Lev Kokotov discovered just how much you can do with the Postgres database. They are building on that work with PostgresML, an extension to the database that lets you train and deploy models to make online predictions using only SQL. This is a super-practical discussion that you don’t want to miss!
Click here to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing very well, Daniel. How’s it going today?
It’s going great. I missed you last week. I had a good conversation about various interesting topics, but it’s good to have you back with me.
Yeah, sorry I missed that one. The day job got in the way on that one.
Yeah, the actual practical AI of your life got in the way…
The practical AI of my life definitely got in the way on that one. Sorry I missed it. I’m glad to be here today, though. I think we’re gonna have a great conversation.
Yeah, this is definitely something that I think fits very well within our theme of practical AI, and something that I’m really excited to talk about, because I think it might solve some of my own struggles in my own development life. So today we have with us Montana and Lev, who are the founders and creators of PostgresML. Welcome, Montana and Lev.
Hi. Thanks for having us, Daniel.
Thanks for having us.
Yeah, definitely. So before we get into all the details about PostgresML, and what that means, and what it is, do you want to give us maybe a little bit of back-story about how two people sort of found their way together into connecting machine learning to a popular database? Maybe we’ll start with you, Montana. Do you want to give your side of that?
Yeah, I mean, it’s kind of a long-winded story. It’s definitely not the first time that I’ve taken a stab at machine learning infrastructure and trying to make things simpler. I joined Instacart about seven years ago. I had been a chief data scientist prior to that, and mostly it was all self-taught; I didn’t really deserve the title. But at small startups you get to pick your own title, and that’s what I wanted to be when I grew up, so to speak.
Anyway, when I joined Instacart, it was a really exciting time. There were a couple dozen engineers in the company, and we were getting large enough that we needed to move out of a monolithic Rails app into more of a distributed architecture that would be horizontally scalable. One of the first projects that I did when I started there was pulling all of the product catalog data out of the single Postgres database that we had, moving it into a new Postgres database, and then fronting that with Elasticsearch, so that we would have this horizontally scalable catalog system that could power the whole eCommerce website as we added thousands and thousands of stores to the Instacart platform.
[04:16] And that was really exciting, that was really fun. I learned a ton. I had worked with natural language processing and search prior to this. I had some experience with Lucene and distributing that, so it was cool to get some new technology and to really leverage that and to start unlocking data scientists and how they could impact the product in a more direct way.
But data science was very nascent at Instacart at that point. Fast-forward a couple of years, I got to lead several SWAT team initiatives around the company to pull more systems out into distributed architectures and help stitch these things together. And as our team grew, we brought on a VP of engineering, Jeremy Stanley, who’s a brilliant data scientist; one of the best people that I’ve ever worked with. And he put out a call for help: “Hey, if anybody can help us get some of these models that we’re building on our laptops to actually impact the product somehow, we’d love to talk.” I got to work very closely with him to figure out how we would productionize a lot of these systems, and to help build a lot of the tooling that data scientists need. If they’re going to use Python, should they be using Conda? What does a pip install actually look like? How do you get that into production? The whole nightmare of dependency management and lifecycle management of models when they’re not just built once, but have to be rebuilt continuously with new data as it comes in. And then you have to get the feature data to the actual model, but it can’t be the data coming out of your Snowflake warehouse - at the time we were on Redshift - because that’s too slow and latent.
We were learning everything and building it on the fly, and it was chaotic, but fun. We actually published a lot of that work as a library called Lore, which was Instacart’s open source platform. Now, the ecosystem evolved around us really quickly over the last five or six years; things have changed at breakneck speed - there’s a new platform, library, or company coming out every day doing something really cool… And so we grew that internally, but it didn’t really make sense to keep a lot of the stuff that we had built, because the original libraries built better embedded solutions; they actually built the bridges, and we could take some of our tape and glue out of the middle, and things got simpler.
But fast-forward another couple of years at Instacart - the original system that we had built, with Elasticsearch as the heart of our data architecture, our data infrastructure… It really became the beating heart. If anybody had any data, and this included all of our machine learning feature data, we would just shove it into the Elasticsearch document. And then anybody who needed data would just get it right out of the Elasticsearch document. And so our documents grew to several hundred fields, and many of these fields were themselves nested JSON documents; they could be tens of kilobytes of additional payload data, and so our Elastic document size blossomed.
And Instacart, I think, is somewhat unique in a couple of its constraints. One is the real-time nature of the business. Instacart is not like Amazon, where if it’s not in stock, Amazon tells you it’s not in stock, or it’ll be two days late. Instacart can’t do that. If we say, “Sorry, we’re not gonna be there in 45 minutes; we don’t have the entree that you’re planning to cook for your family”, your family is gonna go hungry, and that’s going to be a really, really bad customer experience.
[07:52] And so everything at Instacart, from the product to the machine learning, has to be rapid, and online, and responsive. It can’t be an offline 24-hour batch job that we get around to eventually. And so I think that that is a really challenging technology problem. It’s a really challenging business and product problem as well. At the same time, Instacart is a platform for hundreds of different grocers throughout the country, with tens of thousands of stores, all with different inventory. So it’s true that we have one product, one box of Cap’n Crunch cereal; it has an image, it has nutrition facts, it has this universally true data about it. But then it also has little facets of data that are specific to every single store, like what is it actually being sold for in that store? Is there a manager’s special that day? Is it in stock on the shelves, or did they just sell the last box? And so if you think about this from a data architecture perspective, it’s pretty classic - you have two tables, one is your products, one is your offerings; you join those two tables together, you denormalize that data into Elasticsearch… Easy. Except we actually have a million products, we have 10,000 stores; you multiply that together, you get ten billion. And so all of a sudden, this is an incredibly large Elasticsearch cluster, and it’s growing very, very rapidly… Because Instacart was at the time expanding into whole new verticals beyond grocery. It was basically all of retail. And it’s like, “Oh, now we have this whole other dimension, and we want to join whole other things… How are we going to scale the cluster?”
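The two-table denormalization Montana describes can be sketched in SQL - table and column names here are hypothetical, not Instacart’s actual schema:

```sql
-- Universally-true product data joined to per-store offering data,
-- producing one denormalized row per (product, store) for the search index.
SELECT p.product_id, p.name, p.image_url, p.nutrition_facts,
       o.store_id, o.price, o.on_sale, o.in_stock
FROM products p
JOIN offerings o USING (product_id);
-- ~1,000,000 products x ~10,000 stores => up to ~10,000,000,000 rows.
```

The join itself is trivial; it’s the scale of the denormalized output that made the indexing pipeline so expensive.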
I remember seeing a graph plotted of our Elasticsearch capacity increase per node added to the cluster. There were some diminishing returns there; you don’t get perfectly linear scaling when you add nodes to a cluster. And while that curve is asymptoting and flattening, there’s another curve coming up exponentially, which is Instacart’s growth rate, both in terms of writes to Elasticsearch and in terms of timely ingestion into the system… And this is another thing - Instacart had contractually agreed to provide updates to the website on behalf of retailers within very short amounts of time.
One of the things that I’ve heard that Walmart, for instance, does is they have a green/blue deployment for Elasticsearch: they will spend 24 hours filling up their green cluster with updates, then flip over to it so all traffic hits it, and for the next 24 hours they will refresh their blue cluster, and then flip back. So you can just rebuild your cluster every night, flipping back and forth between the two, and that way you avoid a lot of the incremental update penalties you get in a Lucene index, in this inverted-index world. That’s not a strategy Instacart can employ, though, because of the tight time constraints.
And so we were all sitting around, kind of pulling our hair out, trying to figure out how we were going to out-scale the business with our technology, and getting a little desperate, honestly. I think Postgres was not the first idea, but eventually we did decide that, fundamentally, this is a join problem. If we could do the join at read time, rather than at index time, that would potentially eliminate a huge amount of work, because many of the documents that we were joining and indexing were actually never read before they were reindexed again. By not doing that useless work, we could reduce the load on the system substantially.
And so we built a prototype for this system of what it would look like - you know, people have been sharding Postgres for decades; it’s something that people know how to do. It’s a little finicky, you have to get it right… But more recently, things like TimescaleDB and Citus Data make sharding a lot more manageable, a lot more tolerable. So we started looking at some of those options, and we also started looking at the full-text search capabilities Postgres has built in. They don’t have nearly the bells and whistles of Elastic, but the basics are there.
[12:14] So I started talking with our NLP guys and our search engineering team, asking, “What are we actually using in Elastic? What machine learning functionality in Elastic do you have?” And it was like, “Oh, we tried, but it fell over, so we couldn’t do it. We couldn’t actually use a lot of it. It’s too much load on the cluster; it’s already on fire…” And so what we learned is that most things happen at the application layer, and most things are joins between various microservice data stores - feature stores that had gotten kicked out of Elasticsearch because they were creating too much load on the heart of the company. And then we would join those all at request time at the application layer; sometimes that would take eight seconds for our P90. It could be quite slow. And sometimes we would do this eight seconds of work, and then at the last step we would find that all of our high-quality candidate results were out of stock, so we would have nothing to show, because those constraints would have had to be applied several layers upstream. When we really got into the nature of the system that we had built - this beautiful distributed machine learning beast - it was not a pretty picture. It was a very complicated picture.
And so we just said, “Well, we don’t really have any other options. We’re going to try to do this in Postgres.” We stood up our prototype, we had it running… We were shadow-testing its search results against Elasticsearch, comparing what we were getting back from each. We were finding lots of data ingestion bugs - bugs that had been in our data pipeline going into Elasticsearch for years; we discovered several of those… Because we had to rebuild the pipeline in parallel for Postgres; it was a totally different pipeline. Obviously, we found several bugs in our Postgres implementation as well; when you’re doing a second-system rewrite, that’s never a fun thing. I don’t typically recommend people go that route.
But things were looking okay, until the pandemic started. We had plotted the intersection of those two curves at something like a year out, which gave us about a year to figure this thing out, and to experiment and prototype. And we went through that year of growth in about a week - the first week of the pandemic. I remember getting paged the next Sunday; everybody does their grocery shopping on Sunday morning, so if there’s a new load issue, it’s going to be Sunday morning when we get paged. Elasticsearch was timing out 30-second requests… We had stopped indexing, so we were in danger of not meeting SLAs unless we could get indexing going again and traffic went away.
We did all kinds of things. We thought about putting up a Stop sign on the website and saying, “Sorry, Instacart is full. You have to come back another time.” Luckily, we never had to actually deploy that. Luckily, we were able to scale our way out of the pandemic… But it was a lot of work.
So while we were in the middle of this incident, we said, “Well, we’ve got this other cluster over here. We think that the results are about the same as the cluster that we’re using, which is kind of dead right now, because it’s just timing out at a hundred percent load…” So we just flipped the switch, put all the load against Postgres, and started using it. Of course, it immediately went to 100% CPU utilization and also caught on fire… But we were able to find a few missing indexes for some long-tail queries that we hadn’t really optimized, and within a couple of hours we got that cluster to a point where we could actually serve traffic again. And so that was really exciting. We really got the system bolted down over the couple of months after that… But for the most part, we had shifted what was the primary system and what was the secondary system. Elasticsearch, from that point going forward, was really the backup to this new system that we had. And we had a couple of incidents with the new system as we started throwing more and more data into it. Because after we did the original optimizations, we got down to like 4% CPU usage or something in the Postgres cluster. And it was vastly underscaled compared to our Elasticsearch cluster - I mean, it was just tiny compared to it. It was really amazing.
[16:22] But at the same time, I mentioned all these other feature stores, model stores, everything else that we had… And all of those, whether they were Redis, or Postgres, or Cassandra, were all melting down as well. Those were not horizontally scalable systems; we learned a lot about scaling every system we had, whether that was RabbitMQ, or Redis, or… If you can name a database, we probably tried it at some point at Instacart.
So we had lots of fun, but our solution in this case was basically like “Figure out what your database is that has the most CPU usage, pull all the data out of it, and dump it into this new, horizontally-scalable Postgres cluster that we have.” And so we just did that over and over again, and we barely kept ahead of our doubling week-over-week growth curve for the next eight weeks. Like I mentioned, sometimes we missed optimizations, and we didn’t really have the time to vet and test the system that we were building like we should have, or could have… But I think that we did the best that we could with the resources that we had.
And we spent at least a year after that iterating, adding more, really unlocking some new machine learning for our search team. But we didn’t get as far as I really wanted, because at the time there was - there is a library called MADlib, which is an Apache Foundation project; I think it’s ten years old, it’s been going for a while… But there were some constraints at the time. They were locked to a specific older version of TensorFlow, I think. My memory is fuzzy; I didn’t get a lot of sleep back then. But we weren’t able to actually take a lot of our deep learning models, put them into MADlib, and run them inside of the database to eliminate some of the microservices… So we actually kept quite a bit of the microservice architecture and kept building around that. But it kind of bothered me, because we were able to clean up so much of the distributed system. I felt good about the system that we ended up with - it was much better than the system that we started with. And it was kind of full-circle for me. You know, I joined Instacart and I was all about distributing everything… But by the end, I was all about pulling everything back into one fairly monolithic system.
So that was kind of eye-opening for me, about the complexity - both organizational and technological - that these systems can develop, and how powerful it can be if you can simplify the system. For example, when we were on the Elasticsearch pipeline we had a dedicated infrastructure team, and a dedicated catalog team just for the pipeline. We had a dedicated search team, dedicated machine learning engineering - all of those resources… And upstream of that we had catalog data acquisition specialists who would get new kinds of data to do new kinds of products and services, or add new features to the website. But it took multiple quarters of planning and execution. Like, you can sit a few product managers around the room and they conceptualize, “Hey, we want to add this feature to the website.” “Okay, we’ll go source the data, we’ll get the data into the pipeline, we’ll ingest it into Elasticsearch, the search team will start using it, and then they’ll display it. And they’ll start work this quarter, the search team will get to it the next quarter…” “Oh wait, we don’t have the data in the right format. Let’s circle back for another quarter of this whole process.” And so it was really, really problematic.
Montana, it’s fascinating to hear about this progression at Instacart - the scale-up, the issues, especially around the pandemic and having to respond in that way, and how that journey led you into Postgres. Before we continue into the PostgresML story, I’m curious - Lev, were you experiencing this alongside Montana? Or were you coming from a different side of things? How did you experience your own journey into thinking more deeply about Postgres, and where it intersects with data science, machine learning, all of these things?
I’m laughing, because that system that Montana is talking about - I might have been the guy who built it. [laughter]
The Elasticsearch system, or the Postgres system?
The Postgres system. I came in as a true believer. I was told that Elastic was wrong and Postgres was right. And I’m like, “Sounds good. Let’s build Postgres.” And it’s not like we built it on RDS or anything. We actually built it straight up on EC2, so I had to learn things like “Oh, how do I install Postgres on Ubuntu? Should I pick Ubuntu? What kernel version do I need?” And it wasn’t because we were like, “Oh yeah, self-hosting is the way.” It was because RDS was too slow to power our workloads. If your database is on RDS, you know that the disks are network disks, right? So latency is at least like 10 milliseconds, depending on the day. io2 is probably a little bit better, but still not quite there.
And RDS, for those that don’t know, is a managed database service, right?
Yeah, that’s the one - the AWS Relational Database Service. They have Postgres, MySQL, all the fun ones. Anyway, so instead we picked SSDs; the raw NVMe drives, the cool ones, that could do like four, five, six, ten times more workload than the network drives. And that’s when Postgres really came alive. All of a sudden, I could scan like a 100-gigabyte table in like 30 seconds. I’m like, “Oh, wait, so this thing actually scales?” [laughs] Because before, we’d just spend our time cutting these tables up, partitioning them, making them as small as possible… And I’m like, “Wait, but why? I could just read this whole thing from disk at like two gigabytes a second. What’s the big deal?” You know, technology came a long way.
Yeah, so those Sundays that Montana talked about - he was getting paged, and I was also getting paged. I worked on Instacart infrastructure for about three years cumulatively. I started as an app engineer, like him, and I was supposed to build widgets, and the checkout page that we might have used at some point, but… You know, a week later they told me, “Hey, we need to fix our Postgres setup. Our databases are falling over.” I’m like, “Alright, sounds good. I think I’ve heard of Postgres before. Let’s see what I can do.” And then I wasn’t an app engineer anymore. [laughs] I kind of learned it on the fly; I started memorizing the documentation and everything. And then a year later I got sat down in a room and Montana was presenting his new Elasticsearch replacement, and I’m like, “Whoa, we’re gonna shard this ourselves? Is that what’s going on? I think I might be able to do that, but okay…” [laughs] And yeah, we spent quite some time building that thing.
So as you get into PostgresML - this is a great story you two are telling together, and I’m really into it. But how does all of this lead there? Without losing the thread, where are you going that’s going to arrive at the moment where you start really thinking about what you need to build next?
[24:07] Yeah, I think Lev should tell you about pgcat and I’ll tell you about PostgresML, because I think they’re two different pieces of the puzzle. But to answer your question - like I mentioned, we weren’t able to get a lot of the deep learning models into Postgres. And so because we couldn’t get that, we didn’t even really start with all of the XGBoost models, or even simpler models that we were running.
I think there’s a big disconnect between what is flashy and hyped in academia, or coming out of OpenAI - there’s a lot of reinforcement learning, or unsupervised applications, or whatever - and the business context that we saw across at least 20 different models and machine learning applications at Instacart. They all basically boil down to: you’ve got some relational data and you’re trying to predict a single column; you’re gonna do some joins, and that’s it. First you pump it through a linear regression, you work your way up through the scikit-learn algorithms, you hit XGBoost, and your predictions are gold standard. You don’t even need deep learning 90% of the time.
And so thinking through like the convoluted data architectures, and data engineering, and everything else that we had to do to get features to models, I was like - well, the data is in Postgres right there; you can just write a join, and you can just do a select, and you don’t even need to bring a lot of it to the application layer if you can actually do your ranking, or whatever it is that you’re trying to do.
Ranking is obviously a big application of machine learning. You need to know what your popular things are, your trending things, your relevant things, your possible alternatives to this thing… There are all these different ways that things are associated with other things in this high-dimensional space that data scientists love. So being able to pull only the most relevant things along any of those dimensions out of the database into the application layer… That’s actually a really expensive operation, compared to something like deep learning. People think deep learning inference is expensive; it’s really not, relative to taking thousands of rows out of a database, serializing them, sending them over a copper wire that’s multiple feet long, reading them into a JSON blob on the other side in some dynamic language that’s allocating a ton of memory to do all of these operations, so that you can operate on them in Python or Ruby or whatever it is… That’s where most of the latency in the system actually comes from. It’s not the models themselves; the models are highly-optimized C code that, sure, may have millions or even billions of parameters, but is relatively fast and optimal.
And so I was just thinking, what if we could cut all of that complexity, all of that latency out, and keep things in the data layer? And I was talking to Lev about this, and I was like, “I’m gonna go on vacation, but when I get back, I’m gonna start on this project.” On vacation, on day one, Lev emails me and he’s got a new commit. He’s like, “I’ve got deep learning in Postgres!” Lev is really phenomenal this way, in that he’s very competitive. If you tell him about something, he’ll try to beat you at it. [laughter]
We all have a Lev in our life, you know?
No, it’s fantastic to be challenged by somebody that way… I really appreciate it. It’s a lot of fun working with Lev. But I think that was sort of my itch, what I would consider my thesis defense at Instacart: “Can we do this with Postgres? Can we actually go all the way? Can we get to a data architecture that doesn’t really involve any ETL (or ELT, whichever you prefer)? It’s just the database, and the data just sits there until you know that you want it.” That was what really drove the creation of PostgresML. But I think Lev had a different itch with the system that we built, and so he actually went off and built another solution that he should tell you about.
Shame on you for going on vacation and he beat you to it. [laughs]
[28:02] Yeah, I guess that’s a good segue into sharding and load balancing and running Postgres at scale. It’s funny, because Postgres itself doesn’t have any sharding capabilities; you can just spin up a single primary, and that’s what you get. You can have some partitions, or some foreign tables - FDWs, foreign data wrappers, if you’ve ever heard of those… But by the end of that sentence, people are like, “I don’t know what you’re talking about. Please just do this for me.” And I’m like, “Yeah, sure, I can do this for you.”
So I took the sharding logic that we kind of invented at Instacart - it kind of already existed, but you know, you always reinvent the same thing over and over… You put it into a proxy, essentially a pooler, and you put that in front of your database, and then your clients just connect to Postgres. They don’t know about sharding. They don’t know about replicas. They don’t know about load balancing, they don’t know about failover, they don’t know anything about it. They just get the data, whatever they want. I called it pgcat because I’m obsessed with cats. I just said that on the internet, so about 50% of people are like, “This guy’s amazing.” The other 50% are like, “Dogs are the best. I hate this guy. Unsubscribe, unsubscribe!” [laughs] That’s a good number, I hear…
Hey, at least you get the 50%.
Yeah, I mean… It’s a fun project. All the try-catch Ruby/Python logic we’d written at the application layer to talk to like five different replicas - we just implemented that at the infra layer, and now the database is magically sharded, magically load balanced, magically highly available. It’s everything that we wanted, but couldn’t have. Yeah, I don’t know. I liked it.
So the two sides - you have the pgcat stuff, and then you have this idea of “Can we pull thousands of rows out and do some sort of ranking or search or ML operation on them in the database?” There are these ideas floating around. Is PostgresML leveraging both of those things - is that some of what makes it what it is? Or how did those things influence how you think about PostgresML and what it is?
Yeah, I think we’re very early with PostgresML. I think we only released it a couple of months ago. We started working on it maybe ten weeks ago or something. So it’s still what we would consider public alpha, basically. The point was like, “Does anybody care about doing machine learning in Postgres? Is anybody interested in this idea at all? Or are we crazy?” Because I have a lot of qualms about putting more load on the primary data store. That is something that I think any time you can avoid doing that, that’s probably a good thing… At least that’s my naive take on it.
But when I started seeing Lev’s work with pgcat - sharding and pooling and failover and load balancing - then I think, “Well, actually, maybe the simplicity that you can get from having a single data store, instead of every database technology in the world, and the expertise and the muscle that you can build around that single technology, will lead you to a much better place in the end.”
PostgresML doesn’t include any of the pgcat stuff. Right now, these are two separate pieces. But we’re actually working to put them together in an online service offering. We’ve started a company together so that we could go full-time. I know Chris was mentioning he’s got those real-life obligations that sometimes get in the way of all the fun stuff… Lev and I also have real-life obligations that get in the way of these fun projects that we love talking about. So with going full-time on a new venture together, we get to put these two pieces together and really get to Postgres at scale, with machine learning capabilities. That’s the goal for what we want to build and offer, making it easy for other people.
So to clarify, you’ve moved on from Instacart, if I’m understanding that correctly.
I’m sort of curious, I think, about, like, when someone comes to PostgresML - I know it’s sort of new, it’s beta… What is the experience? Could you just sort of describe what is the experience like of doing ML in Postgres? Maybe you could give an example of like a training type of thing, and then maybe like a deployment or inferencing type of workflow, just to give people a sense of, like “Hey, I might know how to run SQL against Postgres, but what does it mean to “do” machine learning in Postgres?”
Yeah, so Python is really, I think, the dominant language in the machine learning ecosystem these days… And so what PostgresML is right now, in this alpha public release, is a wrapper around all of your favorite Python libraries. We defined a little PL/Python function that calls out to scikit-learn, or XGBoost, or whatever your favorite Python library is… And we defined these Postgres functions to take all of the parameters you could possibly pass to scikit-learn and just forward them on, so you get all of the scikit functionality for training these models. In PostgresML, training is a single function call: you select star from postgresml.train, and pass it a few arguments. You pass it the name of the algorithm that you want to use - linear regression, or XGBoost, or anything on the menu.
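As a rough sketch of the call Montana describes - the function lives in the PostgresML extension, but the argument names here are illustrative, not guaranteed to match the shipped API:

```sql
-- Train a regression model on an existing table or view of features.
SELECT * FROM postgresml.train(
    'order_total_prediction',          -- hypothetical project name
    task => 'regression',
    relation_name => 'order_features', -- your training table or view
    y_column_name => 'order_total',    -- the column to predict
    algorithm => 'xgboost'             -- or 'linear', or anything on the menu
);
```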
I think this is actually a really interesting rabbit trail to go down - why I think that’s the right approach. I think most business uses of ML can safely treat ML as a black box that they put inputs into and get outputs out of. Now, you need to be very careful about the outputs of that black box; you need to watch it closely and make sure that it’s doing the correct thing for your business. But you don’t need to understand the math behind how these algorithms actually work. Compute is cheap enough now that you can train your data with 50 different algorithms and just pick the best. I’ve seen a lot of theorizing from a lot of people about why a model is doing what it’s doing, and how they’re going to tweak something, and it’s pretty much a crapshoot whether that actually makes it any better or not. It’s always better to just test a bunch of different stuff.
And so that’s another feature of PostgresML, is that we have hyper-parameter search, and in this train you feed it a bunch of different configurations that you want, and then it will build all of the models for you. And then you can just compare which one is the best, and that will be the one that’s automatically deployed in your database.
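As a sketch of what that hyper-parameter search might look like in SQL - again, the argument names here are assumptions for illustration, not a documented signature - you hand the train function a set of candidate configurations and let it build and compare the models:

```sql
-- Hypothetical hyper-parameter search: train one model per combination,
-- compare them on a key metric, and auto-deploy the winner.
SELECT * FROM postgresml.train(
    'My First Project',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{
        "max_depth":    [2, 4, 8],
        "n_estimators": [100, 500]
    }'
);
```

The appeal is that the “train 50 things and pick the best” workflow Montana describes becomes one statement rather than a Python script per experiment.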
So people always focus on the math behind the algorithms, because I think that’s intellectually interesting. But what they don’t focus on, which is actually a lot of data science work, is the curation of the data. That data cleaning, that data curation, the feature engineering work that data scientists do day-to-day - Postgres is fantastic at that. You have SQL, you can manipulate your data in just about any way you want… Now, it may not have all of the typical functions that data scientists might be used to for treating data in a particular way, but if you want to impute a value, Postgres can coalesce [unintelligible 00:35:52.24] to anything you want. It can coalesce it to an average, it can coalesce it to a min, a max, or some random value.
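The COALESCE-based imputation Montana mentions is plain SQL; table and column names below are illustrative:

```sql
-- Impute a NULL value with a fallback: the column average, or a constant.
SELECT
    user_id,
    COALESCE(age, (SELECT AVG(age) FROM users)) AS age_imputed,
    COALESCE(last_order_total, 0.0)             AS last_order_total
FROM users;
```

The same pattern works with MIN, MAX, or any expression as the fallback, which covers a surprising amount of everyday feature cleaning.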
[36:02] As you’re telling me that, could you also walk us through a little bit of like what a simple workflow would look like? I wasn’t trying to cut you off, I was wanting to see if you would add that in, just so people kind of know like “I start here, and I go, bam-bam-bam, and I end up here, with that output.” And just to give me a sense, because as someone who hasn’t used it yet, I’m really curious about this, because so many of us in the development world, aside from just the data science world, are using Postgres every day. So I’m pretty excited about that… If you could just kind of fit that into what you were telling me there.
Yeah, absolutely. So you start with getting your data into a relation. And that’s either a table or a view. And this is gonna be your training data. So however you want to create that table or create that view - whether you want it to be a view with a bunch of joins out to your application tables - that’s fine. If you want to suck the data up into the application, munge it with a bunch of Python, dump it back out into your feature table in Postgres - that’s fine, too. But that is, I think, the bread and butter of a lot of data science, that kind of feature engineering, creating that table. I think it’s really magical if you can create it with a view… Because if you can create the view of your training data, you should be able to reuse the view for your inference calls. And you can get the same features out. You have to be very careful in an application database, where the application is updating rows, and so you have to make sure that you’re not training with a false view of the past, but you actually have the true append-only log somewhere that you can train from, with what the values were at the time. But if you can get that, and if you can build your OLTP database in such a way that it handles that, then you can get this very magical, “I have this view, I pass that view to my training function along with an algorithm name, whatever hyperparameters I want to pass to that algorithm.”
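A minimal sketch of the view-based feature engineering described here - all table and column names are invented for illustration:

```sql
-- A feature view joining application tables into training data.
-- Reusing the same view at inference time keeps features consistent.
CREATE VIEW order_features AS
SELECT
    o.order_id,
    u.signup_channel,
    COUNT(i.item_id)  AS item_count,
    SUM(i.price)      AS basket_total,
    o.delivered_late  AS label        -- the column you want to predict
FROM orders o
JOIN users u ON u.user_id  = o.user_id
JOIN items i ON i.order_id = o.order_id
GROUP BY o.order_id, u.signup_channel, o.delivered_late;
```

Note the caveat from the conversation: if the underlying rows are mutable, a view like this can silently show you a “false past”, so the training source should really be backed by an append-only history.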
What happens is PostgresML calls out to Python, runs the whole pipeline, you end up with a trained model, it serializes that model back into the database, and so it’s just stored in a PostgresML models table… And then later on, you can call the ml predict function, and you pass it the model name, and you pass it the parameters that you want to make an inference on, it loads that model from the model store, it makes your prediction… It’s very similar to what you would do at the application layer with online inference.
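The inference side might then look like this - a sketch only, since the exact predict signature is not spelled out in the conversation:

```sql
-- Load the stored model and make an online prediction for one row.
SELECT postgresml.predict(
    'My First Project',
    ARRAY[item_count, basket_total]   -- features, in training order
) AS prediction
FROM order_features
WHERE order_id = 42;
```

Because the model lives in a models table inside the database, there is no separate serving layer to deploy; the prediction is just another query.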
So let me ask you this… If you are coming at this as someone who has been in Postgres for a while, like so many of us have been, but you aren’t necessarily really strong on the ML side. You know, the idea of models is kind of new to you, you’re not the person necessarily that was naturally jumping into TensorFlow or PyTorch, or one of the other options out there… What is the delta between what you know in the Postgres world and what it takes to be productive with PostgresML, so that you’re getting model output and you’re like you’re making that leap? What’s that delta of learning or leveling up that the practitioner needs?
Yeah, absolutely. Again, we’re at an alpha level of functionality, so there’s a pretty big gap between where I want to take it and where it is now. But we’ve started work on what we call the dashboard. And the dashboard actually has a click-button wizard that you can go through, and you can select your algorithm from a dropdown list, you can select your source table data from a dropdown list, you can hit the Train button, it will do it for you… And you can do this with as many different algorithms as you want, and you just compare the output - and all of them are ranked by… You know, I’ve gone in and I’ve selected what is the way that you should compare the outputs of these algorithms. There’s a key metric for everything.
[39:52] So right now, there’s actually two main tasks that supervised learning is really good at, and that’s either classification, where you have some fixed number of classes - you wanna know if it’s a hot dog or not a hot dog, whatever it is - or it’s a regression, where you’re predicting some floating point value, whether it’s zero or one, or some gray area in between.
And actually, regression is probably a more advanced implementation [unintelligible 00:40:17.17] You can build a classification on top of regression by just rounding to zero or one in a lot of cases. It’s not completely true. Some algorithms are not amenable to that. But generally, I think that’s a useful way for us to think about it, is I’m either trying to predict some number, or I’m trying to predict some class of thing, in most business cases.
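Montana’s point - that classification can often be layered on top of regression by rounding - can be sketched in a few lines of plain Python, with no particular library assumed:

```python
def regression_to_class(score: float, threshold: float = 0.5) -> int:
    """Turn a regression output in [0, 1] into a binary class label
    by rounding at a threshold, as described above."""
    return 1 if score >= threshold else 0

# A model that predicts 0.83 "hot-dog-ness" becomes class 1 (hot dog),
# while 0.12 becomes class 0 (not a hot dog).
labels = [regression_to_class(s) for s in (0.83, 0.12, 0.5)]
```

As the conversation notes, this is a simplification - not every algorithm is amenable to it - but it is a useful mental model for most business cases.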
And so you can do that through a UI, we will tell you how good your predictions are for every different algorithm, and you just pick the best one. You don’t really need to know what the difference between a support vector machine and a gradient boosted tree model is. Maybe it’s fun for some people to learn about those things, but most people shouldn’t care. They can just be three-letter acronyms, but they should have a score next to them; you pick the one with the best score, and you move on with whatever your business is.
So one of the things that I was pretty excited to see in this sort of initial release, which I think you mention thinking about how do I run deep learning or advanced things in Postgres… What I’ve found in a lot of the cases where I’m applying – especially for NLP type of things, I’m applying a sort of sophisticated model, but really what I’m doing is a bunch of operations on embeddings, like word embeddings, or sentence embeddings, I’m doing similarity calculations and all those things… And I see that there’s this element within PostgresML of vector operations, which is really important for so much of my own work… Lev, I’m wondering if you could comment on why that was important to include in terms of like when you’re thinking through initial features, and also maybe like the future of kind of the set of models that you want to support in PostgresML, and how you would go about deciding the roadmap on that… Because there’s just such an amazing diversity of things out there.
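The embedding workflows Daniel describes mostly reduce to vector math - whatever vector operations PostgresML ends up exposing in SQL, the underlying computation for a similarity calculation is something like cosine similarity. A dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal ones score 0.0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Doing this next to the data - rather than shipping embeddings out to an application server - is exactly the kind of workflow the vector-operations feature is aimed at.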
Oh, yeah, for sure. I mean, I worked for about a year close to machine learning engineers and data scientists, and I used to look at their Python code every day, and it was never just a call to train and a call to predict. It wasn’t like a bunch of CSV files magically trained and then predicted something. There were so many transformations, so many different average calculations… It was never straightforward. It was never just one SQL query that magically gets you all the data you want.
The data, first of all, is never clean, right? Whatever business you’re running, you’re always gonna have kind of dirty, unvetted data. So data scientists, like yourself - you’re probably gonna tell me I’m telling you something obvious, but you always have to massage it and clean it up, and add averages… And then the actual value itself is just garbage; you just end up throwing it away or just adding something to it, right?
Daniel loves data cleaning, just to warn you, okay? Just letting you know…
[laughs] If you don’t, I don’t see what the point is of being a data scientist; because it’s all you do.
Yeah, I mean, one day deep neural nets are gonna clean their own data… But today, we still have to do it ourselves. [laughs] So you’ve got to have some kind of mathematical operations on your data. You have to be able to transform things. That’s one of the main talking points - somebody says, “Oh, but I can’t do my important data transformation in SQL”, and my answer is “Well, yes, you can. Of course you can.” We’re not going to limit you - you can build whatever view you want. You can transform and clean the data, and then pass it to the models and then get the results back. So that’s why that’s important.
[43:52] I think the important thing there, Daniel, is that me and Lev are not data scientists by trade. We work closely with them, but I think a big ask for your listeners would be like kick the tires on PostgresML, tell us what sucks and what’s missing, so that we can cater to – because I have experience with NLP and embeddings, and so that’s why I think vector operations are important. But there’s a ton of other feature engineering that needs to be done in the world. I’m sure there are huge holes in the functionality right now. So just filing GitHub issues of, “You know, I would rather have this function available” - that would be super, super-helpful feedback for us.
Well, I mean, the call has gone out. Many listeners will hear this, and I’m sure at least a good portion of those listeners have Postgres running in their company’s infrastructure or they’re working with it in some way… I know that I’m definitely going to jump in, and we’ll make sure to include links in the show notes. So listeners, go and find those links, kick the tires with PostgresML, see how it works out for you.
As we kind of wrap up on this discussion, I wonder if maybe you could both just briefly mention something that like really excites you about where this is headed. Maybe it’s something that’s not implemented yet, or like a reality that you want to see happen because something like this exists. What is it that like really kind of excites you and keeps you motivated to make sure that something like this exists and grows? Let’s start out with Montana maybe.
For me, I think it’s about the simplicity that we can bring back to workflows, and we can get to the parts that really matter and are really valuable. We used to believe a lot in end-to-end machine learning at Instacart in the early days, where a data scientist would need to become Python proficient, and production proficient, and be able to maintain and monitor their models in production. And it’s really unrealistic at scale to expect one person to have all of the skill sets necessary across data engineering, data science, machine learning engineering, infra operations, and just good software engineering, and expect them to go through the checklist of… You know, there’s probably a hundred items on a decent ML deployment checklist that you need to make sure that you’re covered on. It really requires a team of people right now.
And so being able to simplify a lot of that work, abstract a lot of that work, so that smaller teams like we were at Instacart at the time, like we started out with, can reasonably get into production at a very high level of quality, without dropping any of those things.
Montana kind of stole that line from me… I was gonna say simplicity, because I spend so much time reading complicated Python code, and then literally PhD-level mathematicians were talking to me like, “Hey, what’s the HTTP service? How am I supposed to launch my model into production?” And I’m like, “I’m sorry, I don’t know. I’ll just do it for you, don’t worry about it. Just give me your code. I’ll rewrite it and I’ll launch it.” That was the main sticking point.
I wish you could just run a query and deploy everything immediately. The simplicity of it and the ergonomics - I think that’s something that’s really exciting. I’m really motivated by making people’s lives easier. I want machine learning engineers to do machine learning that they actually enjoy, as opposed to figuring out how to load-balance a service. It doesn’t make any sense, right? [laughs] So the impact that’s gonna have on a lot of people, hopefully - that really excites me, honestly.
Awesome. Well, thank you both for such a great description and the story behind PostgresML. I know I’m really excited to see this materialize, and excited to get hands-on with it. Like I say, we’ll include links in our show notes, so that our listeners can find their way… And make sure to engage with the team - open some issues, open some discussions with the PostgresML team. Thank you both, Montana and Lev.
Yeah, thank you, guys. I appreciate it.
Our transcripts are open source on GitHub. Improvements are welcome. 💚