While scaling up machine learning at Instacart, Montana Low and Lev Kokotov discovered just how much you can do with the Postgres database. They are building on that work with PostgresML, an extension to the database that lets you train and deploy models to make online predictions using only SQL. This is a super practical discussion that you don't want to miss!
Montana Low: Yeah, I mean, it's kind of a long-winded story. It's definitely not the first time that I've taken a stab at machine learning infrastructure and trying to make things simpler. I joined Instacart about seven years ago. I had been a chief data scientist prior to that, and mostly it was all self-taught; I didn't really deserve the title. But at small startups you get to pick your own title, and that's what I wanted to be when I grew up, so to speak.
Anyway, when I joined Instacart, it was a really exciting time. There were a couple dozen engineers in the company, and we were getting large enough that we needed to move out of a monolithic Rails app into more of a distributed architecture that would be horizontally scalable. One of the first projects that I did when I started there was pulling all of the product catalog data out of the single Postgres database that we had, moving it into a new Postgres database, but then fronting that with Elasticsearch, so that we would have this horizontally scalable catalog system that could power the whole eCommerce website as we added thousands and thousands of stores to the Instacart platform.
[04:16] And that was really exciting, that was really fun. I learned a ton. I had worked with natural language processing and search prior to this. I had some experience with Lucene and distributing that, so it was cool to get some new technology, to really leverage that, and to start unlocking data scientists so they could impact the product in a more direct way.
But data science was very nascent at Instacart at that point. Fast-forward a couple of years, I got to lead several SWAT team initiatives around the company to pull more systems out into distributed architectures and help stitch these things together. And as our team grew, we brought on a VP of engineering, Jeremy Stanley, who's a brilliant data scientist; one of the best people that I've ever worked with. And he sort of put out a call for help of, "Hey, if anybody can help us get some of these models that we're building on our laptops to actually impact the product somehow, we'd love to talk." I got to work very closely with him to help figure out how we would productionize a lot of these systems, and to help build a lot of the tooling that data scientists need. If they're going to use Python, should they be using Conda? What does a pip install actually look like? How do you get that into production? The whole nightmare of dependency management and lifecycle management of models when they're not just built once, but have to be rebuilt continuously with new data as it comes in. And then you have to get the feature data to the actual model, but it can't be the data that's coming out of your Snowflake warehouse - at the time we were on Redshift - because that's too slow and latent.
We were learning everything and building it on the fly, and it was chaotic, but fun. Actually, we published a lot of that work in a library called Lore, which was Instacart's open source platform. Now, the ecosystem evolved around us really quickly; over the last five or six years, things have changed at breakneck speed - there's a new platform library company coming out every day that's doing something really cool… And so we grew that internally, but it didn't really make sense to keep a lot of the stuff that we had built, because the original libraries built better embedded solutions; they actually built the bridges, and we could take some of our tape and glue out of the middle, and things got simpler.
But fast-forward another couple of years at Instacart, and the original system that we had built with Elasticsearch at the heart of our data architecture, our data infrastructure - it really became like the beating heart. If anybody had any data, and this included all of our machine learning feature data, we would just shove it into the Elasticsearch document. And then anybody who needed data would just get it right out of the Elasticsearch document. And so our documents grew to several hundred fields, and many of these fields were themselves nested JSON documents; they could be tens of kilobytes of additional payload data, and so our Elastic document size blossomed.
And Instacart, I think, is somewhat unique in a couple of its constraints. One is the real-time nature of the business. Instacart is not like Amazon, where if it's not in stock, Amazon tells you it's not in stock, and it'll be two days late. Instacart can't do that. If we say, "Sorry, we're not gonna be there in 45 minutes; we don't have your entree that you're planning to cook for your family", your family is gonna go hungry, and that's going to be a really, really bad customer experience.
[07:52] And so everything at Instacart is built - from the product, to the machine learning - so that it has to be rapid, and online, and responsive. It can't be an offline 24-hour batch job that we get around to eventually. And so I think that that is a really challenging technology problem. It's a really challenging business and product problem as well. At the same time, Instacart is a platform for hundreds of different grocers throughout the country, that have tens of thousands of stores, all with different inventory. So it's true that we have one product, one box of Cap'n Crunch cereal; it has an image, it has nutrition facts, it has this universally true data about it. But then it also has little facets of data that are specific to every single store, like what is it actually being sold for in that store? Is there a manager special that day? Is it in stock on the shelves, or did they just sell the last box? And so if you think about this from a data architecture perspective, it's a pretty classic setup - you have two tables, one is your products, one is your offerings, you join those two tables together, you denormalize that data into Elasticsearch… Easy. Except we actually have a million products and 10,000 stores; you multiply those together, you get ten billion. And so all of a sudden this is an incredibly large Elasticsearch cluster, and it's growing very, very rapidly… Because Instacart was at the time expanding into whole new verticals beyond grocery - basically all of retail. And it's like, "Oh, now we have this whole other dimension, and we want to join whole other things… How are we going to scale the cluster?"
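To make that concrete, here's a minimal sketch of the two-table model in SQL - the table and column names are illustrative, not Instacart's actual schema:

```sql
-- Facts that are universally true for a product, shared across all stores.
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name       TEXT NOT NULL,
    image_url  TEXT,
    nutrition  JSONB
);

-- Per-store facets of the same product: price, specials, stock status.
CREATE TABLE offerings (
    product_id BIGINT REFERENCES products (product_id),
    store_id   BIGINT NOT NULL,
    price      NUMERIC(10,2),
    in_stock   BOOLEAN NOT NULL DEFAULT true,
    PRIMARY KEY (product_id, store_id)
);

-- Denormalizing into a document store means materializing this join up front:
-- ~1M products x ~10k stores comes to ~10 billion rows in the worst case.
SELECT p.*, o.store_id, o.price, o.in_stock
FROM products p
JOIN offerings o USING (product_id);
```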
I remember seeing a graph plotted of our Elasticsearch capacity increase per node added to the cluster. There were some diminishing returns there. You don't get perfectly linear scaling when you add nodes to a cluster. At the same time - well, while that curve is asymptoting and flattening, there's another curve that's coming up exponentially, which is Instacart's growth rate, both in terms of writes to Elasticsearch and in terms of timely ingestion into the system… And this is another thing - Instacart had contractually agreed to provide updates to the website on behalf of retailers within very short amounts of time.
One of the things that I've heard that Walmart, for instance, does is they have a green/blue deployment for Elasticsearch, and they will spend 24 hours filling up their green cluster with updates, then they'll flip over to it and all traffic goes there, and then for the next 24 hours they will refresh their blue cluster, and then they'll flip over to that. So you can just rebuild your cluster every night, flip back and forth between the two, and that way you avoid a lot of the incremental update penalties you get in a Lucene index, in this inverted index world. That's not a strategy that Instacart can employ, for example, because of the tight time constraints.
And so we were all sitting around, kind of pulling our hair out, trying to figure out how we were going to out-scale the business with our technology, and getting a little desperate, honestly. I think Postgres was not the first idea, but eventually we did decide that, fundamentally, this is a join problem. If we could do the join at read time, rather than at index time, then that would potentially eliminate a huge amount of work, because many of the documents that we were joining and indexing were actually never read before they were reindexed again. And so by not doing that useless work, we could reduce the amount of work in the system substantially.
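The contrast is easy to see against the same illustrative schema as above - instead of materializing all ten billion pre-joined documents, a read-time query only joins the handful of rows a shopper actually looks at:

```sql
-- Join at read time: touch only the rows for this shopper's store,
-- rather than re-indexing every (product, store) pair whether or not
-- anyone ever reads it.
SELECT p.name, o.price
FROM products p
JOIN offerings o USING (product_id)
WHERE o.store_id = $1      -- the shopper's store
  AND o.in_stock
LIMIT 50;
```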
And so we built a prototype for this system of what it would look like - you know, people have been sharding Postgres for decades; it's something that people know how to do. It's a little finicky, you have to get it right… But more recently, with things like TimescaleDB and Citus Data, sharding has become a lot more manageable, a lot more tolerable. And so we started looking at some of those options, and we started to look at the - well, Postgres also has these full-text search capabilities built into it. They don't have nearly the bells and whistles of Elastic, but the basics are there.
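For reference, those Postgres basics look something like this - a tsvector column, a GIN index, and ranked matching, again on the illustrative schema from earlier:

```sql
-- Keep a tsvector in sync with the product name automatically.
ALTER TABLE products
    ADD COLUMN search tsvector
    GENERATED ALWAYS AS (to_tsvector('english', name)) STORED;

-- A GIN index makes @@ matches fast.
CREATE INDEX products_search_idx ON products USING GIN (search);

-- Ranked full-text search, no Elasticsearch required.
SELECT product_id, name, ts_rank(search, query) AS rank
FROM products, to_tsquery('english', 'crunch & cereal') AS query
WHERE search @@ query
ORDER BY rank DESC
LIMIT 10;
```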
[12:14] So I started talking with our NLP guys and our search engineering team, and saying, "What are we actually using in Elastic? What machine learning functionality in Elastic do you have?" And it was like, "Oh, we tried, but it fell over, so we couldn't do it. We couldn't actually use a lot of it. It's too much load on the cluster; it's already on fire…" And so what we learned is that most things happened at the application layer, and most things were joins between various microservice data stores - feature stores that had gotten kicked out of Elasticsearch because they were creating too much load on the heart of the company. And then we would join those all at request time, at the application layer; sometimes that would take eight seconds at our P90. It could be quite slow. And sometimes what we would find is we would do this eight seconds of work, and then at the last step we would find that all of our high-quality candidate results were out of stock, so we would have nothing to show, because we would have had to implement several constraint layers upstream to prevent that. When we really got into the nature of the system that we had built - this beautiful distributed machine learning beast - it was not a pretty picture. It was a very complicated picture.
And so we just said, "Well, we don't really have any other options. We're going to try to do this in Postgres." We stood up our prototype, we had it running… We were shadow-testing its search results against Elasticsearch, comparing what we were getting back from both. We were finding lots of data ingestion bugs - bugs that had been in our data pipeline going into Elasticsearch for years; we discovered several of those… Because we had to rebuild the pipeline in parallel for Postgres; it was a totally different pipeline. Obviously, we found several bugs in our Postgres implementation as well; when you're doing a second-system rewrite, that's never a fun thing. I don't typically recommend people go that route.
But things were looking okay, until the pandemic started. We had plotted the intersection of those two curves at something like a year out, so we had about a year to figure this thing out, and to experiment and prototype. And then we went through that year of growth in about a week - the first week of the pandemic. I remember getting paged the next Sunday; everybody does their grocery shopping on Sunday morning, so if there's a new load issue, it's going to be Sunday morning when we get paged. And so I remember getting paged, and Elasticsearch was timing out 30-second requests… We had stopped indexing, so we were in danger of not meeting SLAs unless we could get indexing going again once the traffic went away.
We did all kinds of things. We thought about putting up a stop sign on the website and saying, "Sorry, Instacart is full. You have to come back another time." Luckily, we never had to actually deploy that. Luckily, we were able to scale our way out of the pandemic… But it was a lot of work.
So while we were in the middle of this incident, we said, "Well, we've got this other cluster over here. We think that the results are about the same as the cluster that we're using, which is kind of dead right now, because it's just timing out at a hundred percent load…" So we just flipped the switch, put all the load against Postgres, and started using it. Of course, it immediately went to 100% CPU utilization and also caught on fire… But we were able to find a few missing indexes for some long-tail queries that we hadn't really optimized, and within a couple of hours we got that cluster to a point where we could actually serve traffic again. And so that was really exciting, and we really got the system bolted down over the couple of months after that… But for the most part, we had shifted what was the primary system and what was the secondary system. Elasticsearch, from that point going forward, was really the backup to this new system that we had. And we had a couple of incidents with the new system as we started throwing more and more data into it. Because after we did the original optimizations, we got down to like 4% CPU usage or something on the Postgres cluster. And it was vastly underscaled compared to our Elasticsearch cluster. I mean, it was just tiny compared to it. It was really amazing.
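A missing index for a long-tail query typically shows up as a sequential scan in the query plan, and the fix can be deployed without blocking writes - a hypothetical example against the illustrative schema above, not the actual queries from that incident:

```sql
-- The symptom: EXPLAIN shows a Seq Scan over the whole table
-- for a filter that runs on every request.
EXPLAIN ANALYZE
SELECT product_id, price
FROM offerings
WHERE store_id = 42 AND in_stock;

-- The fix: a partial index on the filtered column. CONCURRENTLY
-- builds it without locking out writes on a cluster that is
-- already serving production traffic.
CREATE INDEX CONCURRENTLY offerings_store_in_stock_idx
    ON offerings (store_id)
    WHERE in_stock;
```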
[16:22] But at the same time, I mentioned all these other feature stores, model stores, everything else that we had… And all of those - whether they were Redis, or Postgres, or Cassandra - were all melting down as well. Those were not horizontally scalable systems; we learned a lot about scaling every system we had, whether that was RabbitMQ, or Redis, or… If you can name a database, we probably tried it at some point at Instacart.
So we had lots of fun, but our solution in this case was basically "Figure out which database has the most CPU usage, pull all the data out of it, and dump it into this new, horizontally-scalable Postgres cluster that we have." And so we just did that over and over again, and we barely kept ahead of our doubling week-over-week growth curve for the next eight weeks. Like I mentioned, sometimes we missed optimizations, and we didn't really have the time to vet and test the system that we were building like we should have, or could have… But I think that we did the best that we could with the resources that we had.
And we spent at least a year after that iterating, adding more, really unlocking some new machine learning that our search team could now do. We didn't get as far as I really wanted, though, because at the time there was - there is a library called MADlib, which is an Apache Foundation project; I think it's ten years old, it's been going for a while… But there were some constraints at the time. They were locked to a specific older version of TensorFlow, I think. My memory is fuzzy; I didn't get a lot of sleep back then. But we weren't able to actually take a lot of our deep learning models, put them into MADlib and run them inside of the database to eliminate some of the microservices… So we actually kept quite a bit of the microservice architecture, and kept building around that. But it kind of bothered me, because we were able to clean up so much of the distributed system. I felt good about it - the system that we ended up with was much better than the system that we started with. And it was kind of full-circle for me, coming from - you know, I joined Instacart and I was all about distributing everything… But by the end, I was all about pulling everything back into one fairly monolithic system.
So that was kind of eye-opening for me, about the complexity, both organizational and technological, that these systems can develop, and how powerful it can be if you can simplify the system. For example, when we were on the Elasticsearch pipeline we had a dedicated infrastructure team, and a dedicated catalog team just for the pipeline. We had a dedicated search team, dedicated machine learning engineering - all of those resources… And upstream of that we had catalog data acquisition specialists who would get new kinds of data to do new kinds of products and services, or add new features to the website. But it took multiple quarters of planning and execution - like, you can sit a few product managers around the room, and they conceptualize, "Hey, we want to add this feature to the website." "Okay, we'll go source the data, we'll get the data into the pipeline, we'll ingest it into Elasticsearch, the search team will start using it, and then they'll display it. And they'll start work this quarter, the search team will get to it next quarter…" "Oh wait, we don't have the data in the right format. Let's circle back for another quarter of this whole process." And so it was really, really problematic.