We have all used web and product search technologies for quite some time, but how do they actually work and how is AI impacting search? Andrew Stanton from Etsy joins us to dive into AI-based search methods and to talk about neuroevolution. He also gives us an introduction to Rust for production ML/AI and explains how that community is developing.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How are you doing this week, Chris?
Hey, I’m doing fine, Daniel. What’s up?
Not much. Busy day. It’s submission day for ACL, and I’m trying to get something ready; we’ll see if I actually make it. By the time this goes live I will have either made or missed the deadline.
And ACL is, for the listeners…?
It’s a large computational linguistics conference, but it’s one of the larger natural language processing community research conferences… So there’s EMNLP, and then there’s ACL, and there’s larger – right now, started maybe today, when we’re recording, is the start of NeurIPS, which is another large AI research conference. Hopefully I’ll be livestreaming some of that later and trying to keep up, because I’m not there… But yeah, it’s one of those sorts of conferences; we’ll see if I make it.
We’ll make it through. I’ve gotta say, I had one of the coolest weeks last week I’ve ever had. Started at Carnegie Mellon University, there was a big conference on the future of AI and STEM in society; I got to do a breakout on AI and ethics and such, and STEM, and what things could be done… That was a really cool conversation; it solved all sorts of world problems right there.
I bet so.
Yeah, I got to sit on a panel called “Protecting AI from threats”, and the guy beside me was General Cartwright, who used to be the vice-chairman of the Joint Chiefs of Staff, and he just had brilliant insights…
He’s not an AI person the way we are, but I was just really impressed with what he had to say. Then I did an opening keynote in Philadelphia later on ethics and AI, and finally, we finished out the week - listeners will probably recognize this - we had the championship for Alpha Pilot, which we had a previous episode on, in Austin, Texas. We handed out a one million dollar prize to team MAVLab; they’re from Holland.
So it was a pretty cool week.
Yeah, that sounds extremely eventful. I imagine that in the midst of all of that travel and logistics and all of those things you were utilizing some form of search in some way to manage your life…
I might have been.
[00:03:53.09] Today on the show we have Andrew Stanton with us, who is a staff product manager of search ranking and platform at Etsy. I’m excited to talk with Andrew about search, but also some other things. And also, this is the first episode, I think, where I told my wife who was coming on the show, and she recognized obviously Etsy and was pretty psyched that I was talking to someone from Etsy. So we’re all excited to talk to you. Welcome, Andrew.
Yeah, thank you so much for having me.
Yeah, definitely. So maybe to start us out, if you could just give us a little bit of your background, how you got into AI/ML-related things, and search, and eventually ended up at Etsy.
Yeah, great question. I’ve been kind of blessed to be working with machine learning and search on and off for about 15 years at this point… And the irony is I never actually intended to go into either when I was in school; I was much more interested in distributed systems. And the funny thing is that as our data grows, I kept running face-first into places where we needed to have more sophisticated search, we needed to have better predictive performance than standard heuristics.
When I was an undergrad, I was actually working full-time for AOL at the same time, and the big focus at that point was working for an online reggae show of all things… But it ended up boiling down to this predictive problem where we were trying to understand how basically our – our listeners would tune in from around the world… So that was my first faceplant into linear regression, and time series prediction.
When I left, I ended up moving into something called entity recognition, which is this process of trying to understand from unstructured data the different types of entities that might be represented in it. It could be people, it could be companies, it could be any type of entity that might be useful, and then building this typo-resistant search on top of it. That was also probably my first real interaction with extremely big data. We were dealing with billions and billions of records… And so how do you build performant search on top of this entity recognition system which is constantly ingesting hundreds of millions of records per day. It turned out to be kind of this scratch and itch sweet spot for me.
From there, I went and worked on a bunch of different problems. I ended up in a startup called Blackbird Technologies, where we were working on e-commerce search in the B2B space. Our big value-add was being able to leverage multi-modal deep learning to basically tease apart a lot of these products that companies had to provide a better search experience on top of it. We were acquired by Etsy back in 2016, and I’ve been making it my home ever since.
And that sort of multi-modal side of search - when you mean that, you’re meaning like images and text sort of thing?
That’s exactly right.
Alright. So like products, if you’re searching products on a website, there’s obviously product photography, right? I guess that could factor in somehow.
Yeah. I would say a good example would be to think about something like Craigslist, or Facebook Marketplace. You have an image, and then you have maybe a sentence or two about what that item is. But somehow you have to understand that when a potential buyer comes in and they type in this highly-specific query such as colors, and materials, and other types of attributes, you have to take this very unstructured piece of information and convert it into something which is both relevant and searchable.
I’m wondering - as we start to dive in and we’re talking about search right off the bat, before we get fully into what Etsy is doing with search, can you talk a little bit about what types of search problems are out there? We tend to use the word search in all sorts of different contexts; there’s full text search, web page search, product search, you name it. Can you give us an idea of the overall landscape of what search problems look like and how they’re related, if at all?
[00:07:53.24] Yeah, that’s a really good question. I would say there’s maybe three major areas of search. There is information search, which you know from things like Google; you go in and you type “CNN”, 99% of the time you’re intending to go to cnn.com; maybe find a Wikipedia page is number two, but largely you’re searching to find pieces of information.
The second type is probably e-commerce search, which I’m most familiar with. Amazon, Walmart, Alibaba etc. - how do you match these buyers who are oftentimes giving these very vague queries (like “jewelry”) and trying to understand what are the latent factors that are actually interesting to the buyer.
The third, which has grown out much more recently, is probably question answering, so kind of the Stack Overflow problem, where you are asking questions in much more natural language, but you’re trying to tease out this kind of community aspect of retrieval, where the intent is not necessarily on finding a single piece of information, but perhaps finding a collection of pieces of information, and the domain is a little bit more NLP-heavy.
In those different areas – I mean, obviously e-commerce developed at a certain time, and certain things like Stack Overflow have probably been more popular of recent times… Looking back over the history of search, when did machine learning and AI start being applied to these types of search problems? Was it always applied, or did it kind of start out as rule-based algorithms? …and “Oh, this thing includes this word this many more times than this other thing, so it’s ranked higher.” I’m assuming some of those things started earlier. When did AI start being applied to search?
It’s always been somewhat applied. You can think about the nexus as originally starting with catalogs. You had a bunch of records you wanted to retrieve across these old library systems. And I don’t know if you remember, but you used to have to give these kind of boolean queries, and there were these kind of quasi-rankers built into it. Through machine learning, we’ve always tried to use and tried to understand relevancy, so we’ve had things like TF-IDF and BM25, which have existed for several decades at this point.
And those are methods that are based on counting instances of tokens, and working off of tokens that are in certain samples and not other samples, and that sort of thing.
Correct. Oftentimes it’s statistical-based. The idea is that rarer tokens contain more information; that’s the primary motivator for it. And those were really the first attempts at understanding relevancy inside of free text search. [unintelligible 00:10:34.22] popularizing this in the e-commerce space called Endeca, that was I believe late ‘90s, early 2000’s.
I used to work with them.
Oh, wonderful. So you know them well. They kind of innovated on things such as filters and facets, and really kind of scaled out the initial ecosystem for e-commerce search. As it came to machine learning, I would say probably the biggest step functions in terms of improvements were kind of the learning to rank work in the late ’90s, early 2000’s, where we started to apply basic machine learning problems such as [unintelligible 00:11:10.01] to improve relevancy across a whole bunch of different signals. At that point it moved away from these kind of generative type models such as TF-IDF, to these more discriminative type models.
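To make the "rarer tokens contain more information" idea concrete, here is a minimal TF-IDF sketch. The three-listing corpus, the smoothing formula, and the function names are all assumptions made for illustration; production engines use tuned variants such as BM25.

```python
import math
from collections import Counter

# Toy corpus; these listings are invented for illustration.
DOCS = {
    "a": "silver ring with blue stone",
    "b": "silver necklace",
    "c": "handmade silver ring",
}

def tokenize(text):
    return text.lower().split()

def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms score higher."""
    n_docs = len(docs)
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log((n_docs + 1) / (df + 1)) + 1.0

def tfidf_score(query, doc_text, docs):
    """Sum of term frequency times IDF over the query terms."""
    counts = Counter(tokenize(doc_text))
    return sum(counts[t] * idf(t, docs) for t in tokenize(query))

def rank(query, docs):
    """Order document ids from most to least relevant to the query."""
    return sorted(docs, key=lambda d: tfidf_score(query, docs[d], docs), reverse=True)
```

In this toy corpus "silver" appears in every listing and "stone" in only one, so "stone" gets the larger weight and the query "silver stone" ranks listing "a" first.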
Awesome. Yeah, and has there been a lot of momentum in new deep neural network, unsupervised – all these sorts of kind of hype things that are happening now, has that impacted the search world a lot?
Oh… So much. [laughs]
Tell us more.
Oh, goodness. Okay, where to start…? I would say search is really an interesting problem space, because it’s really a confluence of a bunch of different technologies. You can think of a pretty standard stack as looking like something like Solr or Elasticsearch where you index all your documents, you retrieve some type of candidate set based on the input query and other conditionals like filters, and then you rerank them a bunch of times and then spit out some output to the end user.
[00:12:15.04] We have innovated in the industry across every single one of those elements. [unintelligible 00:12:18.24] have improved, the recent hotness in deep learning has really started to have an effect in search in the form of things such as neural IR. This idea that you can build these massive models, these massive neural nets which know how to translate from the query space to the document space and just replace the retrieval systems that were historically just text-based matching.
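As a rough sketch of the neural IR idea, retrieval can be framed as nearest-neighbor search between a query vector and document vectors. The hard-coded vectors below are stand-ins for embeddings that a trained model would produce; no model is trained here, and all names are hypothetical.

```python
import math

# Hand-written stand-ins for embeddings a trained model would produce.
DOC_VECTORS = {
    "silver ring": [0.9, 0.1, 0.0],
    "gold necklace": [0.7, 0.6, 0.1],
    "climbing chalk bag": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vector, doc_vectors, k=2):
    """Return the k documents whose vectors are closest to the query vector."""
    ranked = sorted(
        doc_vectors,
        key=lambda d: cosine(query_vector, doc_vectors[d]),
        reverse=True,
    )
    return ranked[:k]
```

Unlike text matching, nothing here requires the query and the document to share a token; closeness in the vector space is what counts.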
So I’m kind of curious, what types of data get involved when you’re building out a machine learning model these days? What data is relevant, where do you go for your data for search? It’s just not something that I’m familiar with, and I was rather curious.
Yeah. While Chris was saying that, I was thinking, are there datasets– like, you’re talking about learning to rank, and going from query to document… Are there existing datasets that are standard in that? Or is it still a lot of people using – you have to kind of build up your own internal dataset, and that sort of thing?
I would say it’s a bit of both. There’s definitely learning to rank datasets out there; the Yahoo! learning to rank challenges from the mid-2000’s, then Microsoft had a bunch of different learning to rank datasets over the years, such as MSLR-WEB10K and MSLR-WEB30K, you have [unintelligible 00:13:35.24] and you have a whole bunch of these historical datasets. The problem with them is that they were all universally overfitted to the information piece; that kind of web search element. And the problem is that we’ve learned that search inside e-commerce is actually quite a bit different… When it comes to taking a lot of these benchmark datasets and applying them, they don’t necessarily translate well from one domain to another.
Most companies will build up their own datasets internally, and they will apply a variety of different methods, some of which might be state of the art in the traditional information sense, some which might be bespoke to their own needs.
I’m kind of curious - can you tell us a little bit about why search is relevant to Etsy? Just to cover a little bit about the tie-in on why it is that you’re doing that. What does Etsy need search for in that way?
Etsy has over 16 million results, and most of them - or a good portion of them - are handmade, customized, one of a kind. We have a big vintage base where you only have an example of one. So unlike Google, or Bing, or Yandex, or any of these massive search engines where you’re constantly returning that top result, we have this constant turnover in inventory, and our inventory is just growing every day.
The other problem with Etsy is that we don’t have SKUs. Amazon is able to leverage a lot of the structured data that’s provided to them by the manufacturer; we basically rely on our sellers to figure it out, and our machine learning algorithms to try to tease apart the different pieces in there into the type of information which is actually useful to the buyer.
To give you an idea of why search is needed - before, back in the late ‘90s, inventories were so small you could properly navigate through it. You’d go through a dropdown box at the top, and click on Jewelry, and then you’d see all 200 items that were for sale. If you go on Etsy right now and you type in “jewelry”, you’ll get 18 million results. There’s no human out there that is gonna go through all 18 million results. So that’s where ranking and search and personalization - all these different elements kind of come together to try to hone down that 18 million total result set to something that’s actually digestible by the buyer.
[00:15:55.13] Yeah. And I was just thinking, while you were describing things, it seems almost like there’s so many outliers in Etsy, in the sense that – like, for example, I search for “R2-D2”, because we’ve been watching Star Wars stuff recently… [laughter] And I see R2-D2 gift pack of planting pots, and then I see an R2-D2 chalk bag for rock climbing, which… I’m just thinking, an R2-D2 rock climbing chalk bag seems extremely outlier to me in terms of products that you could create some rules around, right? It seems super-challenging.
It’s a really good point. Etsy - we have a lot of very niche niches, and that’s really spectacular for the buyer too, because that means there is likely something out there for you, that almost feels like it was made specifically for you; for that one person, that Star Wars lover who likes to get up on rock walls, that’s a very applicable gift… And understanding when to surface those versus when not to is a big part of the challenge.
So it was mentioned to us that Etsy is using neuroevolution for search, and I guess if you could tell us a little bit about what is neuroevolution… What does that mean? It’s a new term to me, so I’m curious not only to understand what it is, but how it relates to search.
Yeah, so neuroevolution has been kind of a moving definition. It originally started – or at least the first time I heard about it…
It is about evolving. It’s combining these evolutionary algorithms. You might remember them as things like genetic algorithms from back in the early 2000’s… But really, neuroevolution is kind of combining these evolutionary algorithms with neural nets, effectively. I first became aware of it from a project called [unintelligible 00:18:53.00] which I believe was mid-2000’s. The idea was that it could actually evolve both network structures, as well as the weights associated with those neural nets, to solve these black box problems.
I know jargon is often a point of confusion, but we’ve talked before on the program about meta learning, and sort of learning to learn, and different things involved with that… Is that how evolutionary algorithms are being applied to neural nets? You mentioned learning architecture and weights and other things, whereas evolutionary algorithms - is that a sort of different piece of the puzzle?
Meta learning is much more about, as you say, learning to learn… So you can either figure out a way to learn optimizers, to train models, or you can learn parameter weights, which make fine-tuning on those models a lot faster, such as the case of MAML and Reptile. Neuroevolution is more of like a competitor to things like stochastic gradient descent, I would say. It’s more of a way of learning models based on these kind of [unintelligible 00:20:01.23] populations of answers that can compete with each other… And based on a very rough estimation of Darwin, where the best survive, the candidates in the population which end up performing better end up persisting through multiple generations of work.
[00:20:20.29] So I’d say it’s more common to think of it as more of a learning paradigm. It became a little bit more popular recently. Back in 2017 OpenAI published this paper on how they applied this one particular technique from neuroevolution called Evolution Strategies to train agents in reinforcement learning. They applied it to the standard datasets, and they found out that it was actually very competitive.
So this field, which was much more popular in the early 2000’s, kind of got back-burnered when neural nets started really taking off, in I guess 2013 [unintelligible 00:20:53.06] that they were actually useful for solving these problems. We’re starting to see a resurgence, because, for much the same reasons that neural nets have become successful, neuroevolution has become successful; the computation is finally there.
For some clarification, can you talk a little bit about – you mentioned kind of as a replacement for stochastic gradient descent… Could you actually talk about where you might use neuroevolution instead of that? Because obviously, as a lot of our listeners, and certainly myself having come into this, we’re very familiar with stochastic gradient descent… Can you say where it would be productive to consider neuroevolution to replace it in kind of a use case?
Sure, absolutely. And I can speak specifically to Etsy’s use case. So whenever you can compute a gradient, it’s almost always better to use SGD. The problem that you have is there are a number of domains where it’s very difficult to compute the gradient of the actual objective function. If you think of reinforcement learning, we have this environment where we send in some actions and we get some rewards, and then we get an updated state, but there’s no real closed-form mathematical equation we can use to try to understand where to step the policy next. So most of the policy gradient methods, and Q-Learning and all of those are really trying to do credit assignment and figure out ways to compute gradients, which improve the model.
Neuroevolution is really nice, because it makes very little assumption about the underlying objective function. In fact, all it really needs is to be able to know what you’re putting into your fitness function, and get some type of fitness score out of it, where the higher the fitness score is, the better the model is, or the input space is that you passed into it. So anytime you have a situation where it’s very difficult to compute the gradients and you need to do it based on sampling or some other form of estimation, it can be quite competitive.
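The population-and-fitness idea can be sketched in a few lines. This is a generic evolutionary loop under assumed settings (population size, mutation scale, and a made-up fitness function), not Etsy's system; the only thing the optimizer ever sees is the score that comes back from the fitness function.

```python
import random

def fitness(params):
    """Black-box score: higher is better. Peak at (3, -2); a made-up target."""
    x, y = params
    return -((x - 3.0) ** 2 + (y + 2.0) ** 2)

def evolve(fitness_fn, dim=2, pop_size=20, keep=5, generations=60, sigma=0.3, seed=0):
    """Evolve a population of parameter vectors toward higher fitness."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fittest candidates unchanged (elitism)...
        survivors = sorted(population, key=fitness_fn, reverse=True)[:keep]
        # ...and refill the population with mutated copies of them.
        children = [
            [gene + rng.gauss(0, sigma) for gene in rng.choice(survivors)]
            for _ in range(pop_size - keep)
        ]
        population = survivors + children
    return max(population, key=fitness_fn)

best = evolve(fitness)
```

No gradient of `fitness` is ever computed, which is why the same loop would work if the score came from a simulator or an offline replay of search sessions instead of a formula.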
In those scenarios where it might be hard to compute a gradient, is that typically when… Like, in the problem with Etsy - I’m trying to connect this to the search problem with Etsy… Is that because there’s so much diversity in your dataset between query and product match or rank, where there’s not recognized categories of things, but there’s so much diversity? Is that what produces that sort of scenario?
In the case of Etsy, one of the challenges we have is that we’re a two-sided marketplace, which really means we have two customers - we have buyers and we have sellers. And one of the places where things like [unintelligible 00:23:26.06] have been very successful is in this particular subfield called multi-objective optimization… And the idea is that relevancy is only one of the factors that go into a healthy marketplace. Let me give you a hypothetical…
Imagine that you want to make your seller successful. Now, you have this kind of problem - you have sellers who have been on the site for a long time, and they’re relying on a consistent form of income from the marketplace… But at the same time, you wanna make sure that new sellers are also successful, so you need to expose them artificially higher in the rankings, to make sure that they can do it. Now, those two needs are naturally in conflict with each other, because there’s only so much space for search.
[00:24:10.15] Yeah, you can’t optimize one, because you would necessarily kill the other.
That’s exactly right. The trade-off between those two objectives is called the Pareto frontier, the Pareto efficiency of the actual problem… And it turns out that it’s really hard to do when you start combining a lot of these different objectives in there, and especially when you start boiling down the things which require relevance to be considered as well, which is a very hard gradient to compute in the best of times. So you have this kind of black box function which has all these different factors, which have these trade-offs between them; how can you, in a very principled way, train a model that’s able to adjust for those trade-offs and learn an optimal balance between all of them?
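A small sketch of what a Pareto frontier means for two competing objectives. The ranker names and their scores are invented; each tuple is assumed to hold (exposure for established sellers, exposure for new sellers), with higher being better on both.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]

# Hypothetical rankers scored on (established-seller exposure, new-seller exposure).
RANKERS = {
    "favor_established": (0.9, 0.2),
    "balanced": (0.6, 0.6),
    "favor_new": (0.2, 0.9),
    "wasteful": (0.5, 0.5),
}

front = pareto_front(RANKERS)
```

Here `wasteful` drops out because `balanced` beats it on both objectives; the three remaining rankers are all defensible, and choosing among them is a marketplace decision rather than a purely mathematical one.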
I’m kind of thinking… When you are implementing this - and I’m still very focused on how neuroevolution can be implemented in a practical way - what kind of challenges did you find yourself facing in implementing that into your algorithm, versus some of the more traditional approaches that might have been more obvious?
Yeah, it’s a good question. I think there are a couple of different challenges. First, neuroevolution is many things, but it is not computationally-efficient, because you’re basically relying on sampling. Basically, you have a model, let’s say, you have parameters with that model… The way you can estimate a gradient step is by sampling slight perturbations of the parameters around that model space, and trying to intelligently combine them into a gradient, which hopefully improves the model’s performance. Unfortunately, we’ve learned from certain types of research in zeroth-order optimization that as the dimension of the model increases, you need about a square of that in terms of samples to accurately measure that.
So as your model gets bigger, you need to spend more and more time in that exploratory space, and that gets really expensive. Now, where you can mitigate some of that is by using more efficient languages, being smarter about the size of your space, being smarter about the type of algorithms that you’re using to combine them…
One of these is something called evolution strategies, and it’s actually pretty good; you can combine it with these second-order approximators like Adam, or RMSProp, or Momentum, to take advantage of some of the work inside of the classic stochastic gradient descent space to speed up optimization… But it really becomes a question of how do you maintain the efficiency of the search, and at the same time get the results that you’re hoping for.
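The sampling-based gradient estimate described above can be sketched as follows. This is in the spirit of evolution strategies, with assumed hyperparameters and a made-up objective; it uses plain momentum only (not Adam or RMSProp) and is not presented as the method from any particular paper or from Etsy.

```python
import random

def es_step(theta, fitness_fn, rng, n_samples=50, sigma=0.1):
    """Estimate an ascent direction from fitness scores of random perturbations."""
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        noise = [rng.gauss(0, 1) for _ in theta]
        # Antithetic pair: evaluate the perturbation and its mirror image.
        plus = fitness_fn([t + sigma * n for t, n in zip(theta, noise)])
        minus = fitness_fn([t - sigma * n for t, n in zip(theta, noise)])
        for i, n in enumerate(noise):
            grad[i] += (plus - minus) * n / (2 * sigma * n_samples)
    return grad

def train(fitness_fn, theta, steps=200, lr=0.05, beta=0.9, seed=0):
    """Climb the fitness landscape using sampled gradients plus momentum."""
    rng = random.Random(seed)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        grad = es_step(theta, fitness_fn, rng)
        # Momentum: blend the new estimate with the running direction.
        velocity = [beta * v + (1 - beta) * g for v, g in zip(velocity, grad)]
        theta = [t + lr * v for t, v in zip(theta, velocity)]
    return theta

def toy_fitness(params):
    """Made-up objective with its peak at all ones."""
    return -sum((p - 1.0) ** 2 for p in params)

best = train(toy_fitness, [0.0, 0.0, 0.0])
```

Each step costs `2 * n_samples` fitness evaluations, which is where the computational expense comes from: as the number of parameters grows, more samples are needed to keep the estimate usable.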
Before we move on to some of those things that you just mentioned, like the language, which I particularly want to follow up on, I was wondering if you could give an update as far as how did this end up working… Did it improve things by leaps and bounds, was it marginal? And what are your thoughts in terms of after doing these neuroevolution experiments, what’s next in terms of upgrading search at Etsy? …or do you feel like there’s other things maybe that are more important to focus on now?
Yeah, great question. So the way we decide to integrate it - I mentioned before a rough topology of what a search stack looks like; you have an information retrieval system like Solr, Elasticsearch where you get some candidate sets back, and then you go through this cascade ranking system, where you’re constantly reranking and refining down the results set. That means you can go from simple models, which are very fast, but are operating on a fairly large candidate set, down to expensive models, which are slow, but are operating on a much smaller one. We put it at the very end; we call it the business intelligence layer, and it allows us to kind of incorporate both [unintelligible 00:27:56.07] about what would be beneficial for the marketplace, but apply it at the end of the rankings. So we’re always getting the best possible relevance we can out of the systems, but we’re adjusting the ordering at the end, to try to influence these other factors.
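A minimal sketch of a cascade, assuming three stages: a cheap scorer over many candidates, an expensive scorer over fewer, and a final adjustment layer. The listings, scores, and the 0.1 boost for new sellers are all invented for illustration.

```python
def cascade_rank(candidates, stages):
    """Run (score_fn, keep) stages in order, narrowing the candidate set each time."""
    current = list(candidates)
    for score_fn, keep in stages:
        current = sorted(current, key=score_fn, reverse=True)[:keep]
    return current

# Hypothetical listings: (id, cheap_score, expensive_score, is_new_seller)
LISTINGS = [
    ("a", 0.9, 0.70, False),
    ("b", 0.8, 0.95, False),
    ("c", 0.7, 0.90, True),
    ("d", 0.2, 0.99, False),
    ("e", 0.6, 0.10, True),
]

STAGES = [
    (lambda item: item[1], 4),  # fast model over the large candidate set
    (lambda item: item[2], 3),  # slow model over the smaller set
    # Final layer: relevance plus a small, assumed boost for new sellers.
    (lambda item: item[2] + (0.1 if item[3] else 0.0), 3),
]

final_ids = [item[0] for item in cascade_rank(LISTINGS, STAGES)]
```

Listing "d" has the best expensive score but never reaches that stage, because the cheap model filtered it out first; that is the cost a cascade pays for speed. The last stage only reorders what relevance already selected, moving "c" ahead of "b".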
[00:28:11.11] From online experiments, it worked about as well as we could have asked for. It is somewhat funny; there were trade-offs as well, as we find the metrics that we’re optimizing for. It’s one of those funny things where you almost have to be like a lawyer when you’re writing the type of fitness functions for these things to evolve… Because it will follow the letter of the law, but it will do it in weird ways.
For example, we were optimizing relevancy [unintelligible 00:28:35.10] where you wanna get the item that was purchased in the top ten results… And what we were finding is that it would oftentimes put that purchase at the tenth position, even though that’s not what we actually wanted; we wanted to move it higher up. Because as far as it cared, all it needed to do was to get into that top ten position, and it was able to make balances up there. So it’ll do what you say, not necessarily what you want it to do.
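The loophole is easy to reproduce. A "purchased item in the top ten" metric gives the same score for position 1 and position 10, while a position-sensitive metric such as reciprocal rank does not. The item ids below are made up, and reciprocal rank is offered only as one example of a position-aware alternative, not as the metric Etsy adopted.

```python
def hit_at_k(ranked_ids, purchased_id, k=10):
    """1.0 if the purchased item appears anywhere in the top k, else 0.0."""
    return 1.0 if purchased_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, purchased_id):
    """1 / position of the purchased item (1-indexed), or 0.0 if absent."""
    for position, item in enumerate(ranked_ids, start=1):
        if item == purchased_id:
            return 1.0 / position
    return 0.0

items = [f"item{i}" for i in range(1, 21)]
first = ["bought"] + items                  # purchase shown in position 1
tenth = items[:9] + ["bought"] + items[9:]  # purchase shown in position 10
```

Both rankings earn a perfect `hit_at_k`, so an optimizer has no reason to prefer the first; `reciprocal_rank` scores them 1.0 and 0.1 respectively.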
Yeah, so it sounds like part of the future is really now that you have some of these things implemented, really exploring the policies that you’ve put in place and the strategies that you’re using to rein these things in… Is that right?
That’s exactly right. A lot more work on metrics, a lot more work on understanding what the trade-offs in the marketplace are.
So I know that one of the things that we talked about earlier on in the conversation, that I’ve been kind of waiting to get to because I’m pretty fascinated with it, is that you guys are using Rust in your line. You’ve mentioned it a couple of times. And I know Daniel and I are both interested in that; we’re both actually Gophers, and there’s a friendly competition a little bit culturally between the Rust and the Go people. We certainly have a great respect for each other; I’d love to understand how you’re using Rust in productizing your machine learning systems.
It’s a wonderful question… For folks who don’t know, Rust is a language that was developed by Mozilla, and Mozilla - this was back in 2010 - had a problem. Mozilla is the most well-known for its browser, Firefox… And every week it felt like there was some type of security fault that was being found, that they had to release a patch for… And they started looking at the core reasons behind, and they realized that these low-level systems languages which browsers like Chrome and Firefox and Internet Explorer and all those are written in are really not optimized for solving these common problems that you run into, that can result in things like buffer overflows, or Use After Free, or pointer dereferencing - all these kind of problems that you might run into in practice when you’re writing in a language like C or C++.
So they got together and they started looking around at modern programming language theory, and they kind of picked and chose some of the best pieces from languages in the ML space, such as Haskell and OCaml, as well as practical pieces from ALGOL-based systems languages such as C and C++, and tried to combine it together with really strong static analysis, to produce a language which was both extremely fast and a suitable replacement for C and C++ as a systems language, but at the same time had the static analysis you needed to write safe and efficient code.
Just as a quick follow-up, could you describe a little bit about how one would apply Rust in that environment? Does it basically replace the software architectures that are wrapping your machine learning pipeline, or how does that work? Where does it fit into the overall architecture?
I’ll talk about it more generally, and then I’ll talk about the Etsy-specific case. More generally, the kind of frameworks that most people are accustomed to using are actually written in Python… But the lie about Python is that none of the fast bits are actually written in Python. It’s all indexing into C and C++, or Cython, or in some cases Fortran. And what Python really becomes is this kind of domain-specific language for gluing these together.
[00:32:10.22] The ones that are probably the most familiar to everyone on the show are scikit-learn, TensorFlow, PyTorch, LightGBM, XGBoost… All of those have the core performance pieces written in C and C++, and they also aren’t immune to these problems. You can actually look and find TensorFlow has had to release patches, because they also, by nature of being in C and C++, have these problems with safety and reliability.
The place where Rust tends to have the biggest benefit is by replacing those back-end components with a safer, faster language… And we’ve had a lot of work recently done in this space to make it a little bit easier to integrate with Python. There’s a project called PyO3 out there which makes it much simpler to interface between the back-end and the front-end. That applies to Etsy.
In the learning to rank space we have to do a lot of feature engineering. The state of the art is still gradient boosting models for the most part, and that means that a lot of the benefits you get from neural nets, that [unintelligible 00:33:13.27] being deferred to the algorithm - you have to do it manually. And we were running into this case where every night we were training hundreds of millions or billions of records, and we were trying to [unintelligible 00:33:23.11] through a whole bunch of different features, and it was taking an exorbitant amount of time.
The second piece that’s really challenging in the search space is that the machine learning algorithms are traditionally written in Python, but Solr and Elastic and these kind of [unintelligible 00:33:38.28] systems are actually written in languages like Java. And what we really didn’t wanna have to do was write feature engineering twice, so do something like the hashing trick in Python, and then have to port that same implementation over to something like Java, to get models trained in Python and then deployed on Java.
So we were really looking for a language which would allow us to embed it in both Python and Java at the same time, and that kind of put some restrictions.
You mentioned that you both are gophers… One of the problems with Go is that it actually had managed memory, and Java and Python don’t necessarily work particularly well with managed memory, while managing their own memory. So those types of constraints made it hone down the number of opportunities we had, and we were mostly focused on trying to find one where we felt we could be productive quickly, but at the same time didn’t have to pay a performance penalty.
And with that, as you looked into Rust for those particular problems, like you were talking about with feature engineering, but also considered maybe some of the neuroevolution things that you were exploring - I’m assuming that some of those fundamental or foundational papers like the OpenAI paper that you mentioned, if you went and looked at the implementation, maybe the model is implemented in Python/PyTorch, or TensorFlow, or something like that… How does that piece fit in, along with this sort of feature engineering, pre-processing stuff? Are you taking models from those frameworks and then doing a lot of the feature engineering and that sort of thing with Rust? How does that play together?
We really have two main systems that are written in Rust. Both are powering hundreds of billions of predictions a day. Our first one was Buzzsaw, which we wrote a paper on back in 2018, and it really is kind of a backbone of how we do feature engineering at Etsy at this point; we pump a lot of data through it, it can scale across clusters, we can embed it inside Python and inside Java… And it’s really nice, because when you’re training models, especially in the search space, you wanna make sure that what you’re training against doesn’t change. So you can imagine that by adjusting the implementation of, say, TF-IDF just slightly, you can actually have these changes to the prediction space. But because we’re able to ship the library to both our cluster compute which does the preprocessing, and run that exact same code in our Python-based learning to rank prediction services, we don’t have to worry about that gap in terms of implementation.
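To make the TF-IDF point concrete, here is a small sketch under stated assumptions. It compares two widely used inverse-document-frequency formulas, a plain logarithm and a smoothed form with an added 1; the function names and the corpus numbers are hypothetical and are not taken from Buzzsaw. Both count as "TF-IDF", yet they produce different feature values from identical statistics.

```rust
/// Plain IDF: ln(N / df).
fn idf_plain(num_docs: f64, doc_freq: f64) -> f64 {
    (num_docs / doc_freq).ln()
}

/// Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
/// A term that appears in every document still gets weight 1 here,
/// while the plain form gives it weight 0.
fn idf_smoothed(num_docs: f64, doc_freq: f64) -> f64 {
    ((1.0 + num_docs) / (1.0 + doc_freq)).ln() + 1.0
}

/// Raw term count times the chosen IDF.
fn tf_idf(term_count: f64, idf: f64) -> f64 {
    term_count * idf
}

fn main() {
    // Hypothetical corpus statistics, for illustration only.
    let (num_docs, doc_freq, term_count) = (1000.0, 10.0, 3.0);
    let plain = tf_idf(term_count, idf_plain(num_docs, doc_freq));
    let smoothed = tf_idf(term_count, idf_smoothed(num_docs, doc_freq));
    println!("plain = {plain:.4}, smoothed = {smoothed:.4}");
}
```

A model trained on one variant and served with the other would see shifted inputs at prediction time, which is why running the exact same compiled code in both places removes that gap.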
[00:36:15.13] As for the neuroevolution space, as I’ve mentioned before, it’s not super sample-efficient, so there’s been a lot of work around trying to figure out how to scale up these systems… And one of the ways we were very successful in doing that is by moving the neuroevolution pieces down into Rust, rather than from Python. So when we originally prototyped this out in Python, it worked. It was slow, but it worked. But by moving it into Rust, the main core implementation, we were able to speed it up by some hundred x, and reduce the memory footprint of it, and just scale up both of our data, as well as the size of the models.
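For readers unfamiliar with the idea, the sketch below shows the general shape of an evolutionary weight search in Rust. It is an assumption-laden toy, not the system described here: a (1 + lambda) strategy with uniform mutation noise, a linear model, a hand-rolled xorshift generator to avoid external crates, and hypothetical names throughout.

```rust
/// Tiny deterministic PRNG (xorshift64) so the sketch needs no crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [-1, 1).
    fn next_signed(&mut self) -> f64 {
        let unit = (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        2.0 * unit - 1.0
    }
}

/// Fitness is the negative squared error of a linear model on toy data,
/// so higher is better.
fn fitness(weights: &[f64], data: &[(Vec<f64>, f64)]) -> f64 {
    let mut err = 0.0;
    for (x, y) in data {
        let pred: f64 = weights.iter().zip(x).map(|(w, xi)| w * xi).sum();
        err += (pred - y).powi(2);
    }
    -err
}

/// (1 + lambda) evolution: each generation mutates the parent `lambda`
/// times and keeps the best child only if it beats the parent.
/// No gradients are computed anywhere.
fn evolve(
    data: &[(Vec<f64>, f64)],
    dims: usize,
    generations: usize,
    lambda: usize,
    sigma: f64,
    seed: u64,
) -> (Vec<f64>, f64) {
    let mut rng = XorShift(seed);
    let mut parent = vec![0.0; dims];
    let mut parent_fit = fitness(&parent, data);
    for _ in 0..generations {
        let mut best: Option<(Vec<f64>, f64)> = None;
        for _ in 0..lambda {
            let child: Vec<f64> = parent
                .iter()
                .map(|w| w + sigma * rng.next_signed())
                .collect();
            let fit = fitness(&child, data);
            if best.as_ref().map_or(true, |(_, b)| fit > *b) {
                best = Some((child, fit));
            }
        }
        if let Some((child, fit)) = best {
            if fit > parent_fit {
                parent = child;
                parent_fit = fit;
            }
        }
    }
    (parent, parent_fit)
}

fn main() {
    // Toy targets generated from y = 2*x0 - x1.
    let data = vec![
        (vec![1.0, 0.0], 2.0),
        (vec![0.0, 1.0], -1.0),
        (vec![1.0, 1.0], 1.0),
    ];
    let (weights, fit) = evolve(&data, 2, 200, 8, 0.1, 0x9E3779B97F4A7C15);
    println!("weights = {:?}, fitness = {}", weights, fit);
}
```

Every candidate needs a full fitness evaluation, which is where the sample inefficiency comes from and why the speed of the inner loop matters so much when scaling up.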
Oh, wow. Was that sort of reimplementation overhead - was that high? Or did you find it going fairly smoothly? And I’m not sure about – maybe you had experienced Rust people and that sort of thing, or was it more like you have experienced AI Python people and they’re kind of dipping into Rust?
I would say it’s more the latter than the former. Rust is a new language by the general timelines of language. C++ came out in 1985, I believe, and Java was ’98, Python was 1990… So there’s been a lot of time to bake engineers there. Rust only hit 1.0 I believe back in 2015, so it’s still new and there’s still kind of this community building that’s going on. So most of our developers are coming from “I know Python, I know a little bit of C++, I had to deal with the Java in school, but that’s largely my experience in language that I use on the day-to-day.”
From an implementation perspective, swapping from Python to Rust there was some cognitive overhead in terms of learning how to work with static analysis, learning how to use these more advanced features that come in the language to your benefit, and then there’s just the nature that when you’re writing code in Python, you’re really gearing for kind of a prototyping space; you’re not really necessarily thinking about performance. But when you move to a language like Rust, and you’re trying to squeak out performance, you have to think about things such as memory allocation, using SIMD, potentially using CUDA, and how those all kind of play in to building robust systems.
I’m kind of curious… Recognizing that Rust is still a very new language, and kind of scratching that itch of the languages that it’s replacing, as you’re looking further into using it in the AI/ML space, do you think the community - though it be small today - is likely to grow and develop going forward? Do you think it’s a substantial enough use case for Rust to really blossom in that area?
I do. I don’t think we’re necessarily all the way there yet, but I think there’s a bunch of indicators that are really positive for it. First, big companies are starting to depend on it. Dropbox, for example, uses Rust in their storage layer, because they need reliability. Facebook came out earlier this year with their cryptocurrency which is written in Rust, because they need the security. Microsoft a few days ago published the result of a project where they were trying to replace core pieces of their Windows 10 code with Rust, because they got tired of security flaws. So we’re kind of seeing a lot of big companies starting to adopt it, at least in prototyping cases, and more seriously in the case of companies like Facebook.
I think that it is going to be really valuable, because the one thing that I’ve noticed for us is that there’s really kind of three payments that you have to make when it comes to building machine learning. There’s the cost of development, there’s the cost of running, and then there’s the cost of maintenance. And Python does a great job at the cost of development. We have this rich ecosystem, filled with libraries out there, you can just pip install them, they just kind of work… You can prototype an idea very quickly. But it’s not an efficient language, so that cost to run starts adding up, especially when you have to clone it over hundreds of machines to train these very large models.
[00:40:08.04] The final piece - and this is the more insidious one - is the cost of maintenance. What is the cost of failure in your systems. And when we think about things like production, whenever our learning to rank models go down, we start losing money really quickly. We have 18 million results for jewelry, right? It turns out that there are better answers in there than others. So we need to be very cognizant of what that long-term effect actually is when it comes to building out “production”.
Yeah, that’s something that definitely resonates with us, like we mentioned, having worked a little bit in Go… I’m really curious to start dipping into Rust a little bit and get my hands dirty. Are there good places for people that have some experience doing Python stuff, and machine learning and AI - what are some good ways for them to start getting into Rust a little bit? Are there tutorials, or machine learning parallels that they could go to out there in terms of common machine learning problems, and that sort of thing?
Yeah, it’s a really good question. I will say this about Rust - I feel like I’m marketing right now, but the truth is that it’s the nicest community that I’ve ever run into. The IRC channels, the Rust subreddit… Just generally, the way Mozilla has run the community is the most welcoming I’ve ever seen.
Yeah, with Mozilla behind it I think one could expect some good things in that sense.
And they’ve done a really good job at baking it, and kind of – I heard a really interesting quote recently, which I think that kind of resonates. Google built Go for Google, they didn’t necessarily build it for the community… Whereas it really does feel like Mozilla kind of built Rust for the community, and it just so happened to also work extremely well for them. That really has kind of permeated through all layers of it, so you can find good books on Rust, on kind of adopting it from other languages, you can find a lot of GitHub repos… I know there’s not much value in popularity contest, but I believe it’s the most loved language of Stack Overflow for like five years running now. So there’s a lot of joy involved in that space, and there’s lots of people very eager to help you with your problems.
[00:42:21.15] Great. Well, as we come near to a close here in our conversation, we’ve talked about search, we’ve talked about evolutionary algorithms, we’ve talked about Rust… I’d be curious just to hear about what you’re excited about in terms of the search space, and how AI is influencing that space as we look forward. What are the biggest open problems that you think are really interesting that people are working on, or what are you just excited about over the next years as the community grows?
Yeah, that’s a wonderful question. We’ve never lived in a more exciting time for search… We have this kind of openness now in industry in terms of publishing state of the art. You have Alibaba, who’s been making a lot of work recently, and then the classics like MSR, and Google and other folks, who are really publishing great research on how to both build production-grade systems and how to push the state of the art in terms of retrieval.
At KDD this year one of the big workshops was from LinkedIn, and they published this fabulous deck on how they’ve integrated machine learning at every level of their search platform, using things like GANs, and using BERT to tease apart the NLP pieces… Then you have great papers talking about how folks like Amazon are embedding billion parameter neural nets inside of their information retrieval stacks… So these real production problems that folks like Chinese e-commerce companies deal with, such as Singles’ Day; how do you handle the scale of those systems.
One of the things that I’ve been starting to see as a trend over the last few years is that the blending between lines of where machine learning starts, and distributed systems and systems engineering starts, is starting to be a little bit fuzzier. And it turns out that the best search systems are really gonna incorporate techniques from both worlds into the code that’s being built, rather than having them segregated apart.
So all the improvements that we get out of conferences like NeurIPS, and ICLR, and SIGIR, and all these wonderful conferences - we’re finding they’re making their way faster into search to actually solve real problems.
Awesome. That’s great to hear, and I definitely resonate with a lot of what you said, like I mentioned. I certainly hope that we do see some of those trends, and of course, we’ll keep looking for great things coming out of Etsy and what you’re contributing to search. I really appreciate you releasing your findings, and other things, like you said, with Buzzsaw, and other things… Great work, and thank you so much for taking time to talk with us.
I really appreciate being here, thank you.