Practical AI – Episode #97
MLOps and tracking experiments with Allegro AI
featuring Nir Bar-Lev, CEO of Allegro AI
DevOps for deep learning is well… different. You need to track both data and code, and you need to run multiple different versions of your code for long periods of time on accelerated hardware. Allegro AI is helping data scientists manage these workflows with their open source MLOps solution called Trains. Nir Bar-Lev, Allegro’s CEO, joins us to discuss their approach to MLOps and how to make deep learning development more robust.
Welcome to another episode of Practical AI. This is Daniel Whitenack, I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How are you doing, Chris?
I am doing okay. It’s summertime here in Georgia, so it is hot, and humid, and so I’m just trying to keep from melting.
Yeah, not unexpected for where you are, I imagine…
It happens on a regular basis. The hot/humid, not the melting part.
I think I mentioned this a couple times on the podcast, but my wife owns a candle business, so during the summer you’ve gotta figure out the right shipping and tracking, so that you minimize the likelihood of candles melting on people’s porch before they actually get into their house, if you’re sending them to Texas, or Arizona, or that sort of thing…
That’s a good point.
Yeah, so it’s an interesting thing. On another shipping front, I’ve got a pile of boxes sitting next to me in a computer case; all the components for a computer are here, at my house. I’m about to build the first AI workstation of my very own, so I’m excited about that…
Yeah, it would be fun to have an episode detailing all of the mishaps that happen along the way as I hopefully don’t ruin it, but get this thing running.
So I’m curious, since you brought it up - I know in the past when we’ve talked we both have typically gone to cloud services, especially for personal things that we’re doing at home for our own interest. What caused you to decide to go this way this time, with a desktop?
I think it was twofold. I think partly it was that I haven’t built a computer since I was in college probably; that’s over 14-15 years probably since I’ve built a computer maybe… I thought it would be fun to just do it again… So that’s partly just fun. But then also, I’m getting into a lot more of audio models, so speech recognition things and spoken language identification, and the datasets associated with those are quite large… So carting those around to various cloud machines, and also running models for maybe days instead of hours starts to get fairly expensive. I think those two things made sense to me.
[04:27] Well, good luck with it. We’ll definitely have to get an update from you to share with us all what happened, and what went wrong, and what went well.
For sure. And today we’re gonna keep the practical train moving with some more topics that are extremely practical. Actually, I had seen what we’re gonna be talking about today, which is some tools from a company called Allegro AI - one of my friends pointed me to that, which I’ll mention maybe a little bit later on… But I also saw PyTorch mentioned recently that Allegro Trains, which is one of the MLOps and experiment managing/versioning things that we’re gonna be talking about today joined the PyTorch Ecosystem project… And I thought that sounded really exciting, also very practical, so today we’ve got with us Nir Bar-Lev, who is the CEO and co-founder of Allegro AI. Welcome, Nir.
Thank you for having me, guys.
Yeah. Before we jump into all of those exciting things about experiment tracking, and versioning, and MLOps, and all of that, it’d be great to hear just a little bit about your background and how you got involved in this field.
Sure. I’ve been in the high tech industry for longer than I’d care to actually say… Probably three decades. I started as an engineer actually, and spent about a decade on large ERP systems, and that kind of thing; this was way back. And then by way of an MBA at Wharton I joined Google, and spent a decade at Google, doing everything from working on the mobile team… This was right after Google had bought Android, and before the iPhone went out… And to actually helping set up Google’s Tel Aviv R&D center, to leading Google’s European search advertising, product and strategy, and a number of other roles.
My next role - I was a GM of mobile payments… And yeah, when I decided to look for something else to do, I joined two folks who are actually my partners now, to basically start Allegro AI. The way I came at it is I was looking to do something big, that can impact the world, and that would impact cutting edge technology. After being at Google and doing everything I did - you know, you don’t wanna do anything less than that, really…
Yeah, I was gonna say, being at Google - if you’re thinking of projects that make an impact, worldwide impact, that are innovative, it seems like that sets a pretty good trend for your path, or a high bar to reach, for sure.
That’s correct. I don’t know if I’ll be able to build a company as large as Google, but certainly that’s the target you wanna put for yourself.
Yeah, and it is interesting - I don’t know, maybe you have a perspective on this as a CEO and founder, but it seems like there are a number of really innovative startups in the AI space that are kind of playing at the same level as the major players of OpenAI and Google and Microsoft. At the same level, at major research conferences you see startups - I’m thinking of like Hugging Face, or those sorts of startups that are really right there… And it seems like such a huge impact for a small team.
I don’t know as a CEO if you think about those things, but it seems really interesting to me that there can be these small, really focused teams that make a very large impact on that level.
Yeah… Being at Google, you kind of think that you can probably do anything. The reality is that - and I’ve seen that personally; some of the big projects I was involved with - Google didn’t execute as well as a startup, or as fast. Google ended up acquiring them, for less, or even a lot of money. And there are a number of examples for that.
[08:24] As a company grows larger, the targets get bigger… So doing anything requires a very high bar. I remember at some point - I’m talking about like 2007, I think, I remember pitching something to Susan Wojcicki, who was at the time – I mean, now she’s the CEO of YouTube, and at that time she was the head of advertising… And the bar was “If it’s not about 100 million dollars of revenue, don’t talk to me about it.” And you can imagine, this was back in 2007. So imagine today. And this gives an opportunity for small companies who are very nimble to identify opportunities.
There’s also a different perspective when you’re outside of Google as when you’re in… Especially in the B2B space there are opportunities where – you know, at least Google specifically is still relatively behind companies such as Amazon and Microsoft, for example… So you can identify small companies, niches, and if you understand that those are going to grow, then that’s opportunity.
I’m kind of curious, as you were at Google, how did you come up with this idea for what would become Allegro AI? You had been doing that at Google for a while, moving through your position… So what made you think “I have this idea. I’m gonna make a major change in my life”? What gave you the motivation to go off and do a startup, find partners? Can you give us a little bit of that back-story?
First of all, I can’t really take credit for the original idea behind Allegro AI. That’s actually one of my partners. I guess I can take credit for what we formed out of it, and what it became, because obviously, as in any company, especially startups, we change and we adapt quickly to find product-market fit. So obviously, the vision as it was set or thought of by my partner needed to improve and get better, and that’s something that I was involved in. But the original idea was not mine; from my position, I’d felt like I had – you know, I’m in Tel Aviv, Israel, and it was about relocating my family back to the U.S., and at the time (this was about four years ago) it didn’t make sense… And coupled with the fact that – you know, I joined Google when it was 3,000 people; I think it’s about 100,000 now or so, or on that sort of scale. It’s a different company, in many ways, and I felt like it was an amazing experience, and I learned so much, especially being at Google in that time of growth.
When I left Google, it was a big company, with all the things that we all like less about big companies… And I felt like this was an opportunity to do something different, and really go out and try to build something on my own. As I mentioned, I looked for something really big, that can change the world. Basically, as a potential founder, I started “dating people” to find partners, that we could come up with something that we’d like to do… And through that “dating process”, I met my current partners. We hit it off, as we like to say, really quickly; they’re amazing guys. When you’re in a startup and you have partners, you’re practically in a Catholic marriage for the time, until the exit, so you wanna make sure that you have people that you can trust, and that they’re people that you can work with, and obviously, amazingly capable and talented… And I’ve found all of that with them.
[11:57] Basically, as I mentioned, one of my partners was bringing that idea, and it came to him – he’s a long-time serial entrepreneur, and he has a very interesting profile, where he has both a very strong engineering background, as well as a data science background. So the most prestigious lab today in AI in Israel - it’s run by professor Lior Wolf, who’s actually now at Facebook, and my co-founder, his name is Moses Guttmann, was his first Ph.D. student, which basically means that they set up the lab together. So he’s really one of the pioneers of deep learning, machine learning, computer vision in Israel, and he basically saw what Allegro AI is really all about - the fact that we need to bring in engineering methodologies into the AI process. That was not the way that he said it at the time, but basically that’s the idea - how do we actually scale things up?
And I’m kind of curious on that front - you stated it well, and I think that this has been brought up on our show multiple times, from different perspectives, so I definitely think it is a theme that’s kind of surging through the community, that we need to be more rigorous in terms of the engineering we put into our workflows, and the AI-driven products that we’re building and putting out, and the tools, and all of that sort of thing.
I was wondering, from your perspective, what you see as the challenges to – what are the sort of main challenges to getting people on board, that are currently in data science and AI positions, and kind of convincing them that they need to start doing things differently? What are some of those challenges? Does it have to do with the variety of backgrounds that people come from, that it’s not just engineers, or is it more than that?
Yeah, that’s a great question, and the answer, actually - it’s a moving target… Because our industry, the one that you guys are talking about, and the one that I’m currently in is rapidly evolving and changing as we speak, at an amazing rate. I’ve never experienced that kind of rate before in my career.
Generally, I say this - basically, it’s a very different paradigm; it’s a scientific paradigm. And initially, people thought “Well, I’ll get data scientists or research scientists. That’s what I need, and then we’ll be able to do the job.” Obviously, we now know that’s not enough. The thing is that there’s still a core and critical part of a team that needs to build something… But data scientists, research scientists have a very different mindset and outlook. They’ve been trained differently. At the end of the day, they’re scientists, and if you actually take it to the extreme, think of that mad scientist - nothing is in order, everything is hectic, it’s all about the creativity and finding the solution… And there’s a lot of truth in that. Obviously, that’s an extreme exemplification, but there’s a lot of that. And that’s changing.
But we’ve found throughout the course, in the last three years - data scientists, research scientists have been very much against adopting any tools, because they came out of university, they were focused on the science. Tooling - they didn’t understand the value of tooling, the value of processes… And in some ways, maybe they were even a little bit wary of tools, and is that gonna be good for them or bad for them? It wasn’t even something that they were exposed to during the curriculum of their training.
On the flipside, they felt like, for example “I’m a Ph.D. out of (whatever) Stanford. I should know everything.” A lot of times we saw relatively very junior data scientists leading AI teams. Not just in small companies, but in very large companies. Because if you’re not a Google, or a Microsoft, or a Facebook, you’re not gonna get the cream of the crop.
And the last thing is their bosses didn’t know what the heck they were doing. They didn’t even know how to actually measure what they were doing. And as I mentioned, they thought that bringing those people in would be enough. So a lot of that created this situation in the background of “Why do I need tools?” And a lot of that still exists now, I think… But a lot of people who have an engineering background are actually doing data science, or ML engineering, and data engineering, because it’s new, it’s interesting, the salaries may be higher etc.
[16:26] Companies have realized that they’re not seeing productivity out of the data science teams… And so that shift has been happening in the last year, or year and a half. We’ve seen companies integrate their data scientists and research scientists into a larger product team, that has the engineers, and the product leadership, the DevOps etc. to really push them to ultimately build a product… Because it’s not about coming up with a research paper, right? Ultimately, if you’re sitting in a company, most of the time it’s about building a product or a service.
So I think now what we’ve seen is oftentimes a situation where there’s a very big under-appreciation of what it takes to build a state of the art tool chain, support chain. I remember talking to someone that was way back in the day pushing SQL databases. Imagine that. This is prehistoric times. And he was telling me how he had trouble pushing that into organizations, because they thought they were gonna build it themselves. Obviously, anyone who tried to do that fell flat on their faces. Same thing here. And we’ve had situations – that’s changing a lot, but we’ve had situations where a couple years ago companies would tell us “What are you talking about? I can build this in three weeks.” And they believed that, after we showed them what we’d built. Today, a lot of these companies, because tools didn’t necessarily exist, or they weren’t aware of them, or they felt they could build it, have invested internally and built something. And you know how it is, “Not Invented Here”, and once you’ve built a small toy, you’re enamored with it… But for the sake of us engineers – I remember as an engineer I was enamored with some of the things that I built… So that’s kind of the hurdle that we as an industry that is building and pushing tools need to try to get around or over.
So Nir, as we were starting to get into tools, and you were talking about whether organizations were starting to recognize the need for tools and how did they get productive and measure that productivity, and in a world that already has things like DevOps, and ML engineering, and data engineering and such, we’re kind of moving into that area. I noticed that front and center on your website you have this concept of MLOps, and as you were mentioning DevOps in passing and the tooling before that really kind of triggered that - I’m wondering if you can tell us what MLOps means to you in the organization, and how does that differentiate itself from DevOps on the software side, and other types of ML engineering and data engineering.
[20:05] Absolutely. Actually, that’s a great question, because it touches on one point where – you know, MLOps itself as a term is not something that is set already, and different companies are using it to mean slightly different things. That’s one of the issues with our industry - it’s so early on that terminology is not set.
When we talk about MLOps, we talk about the ability to move from the data scientist on a machine or laptop to training models at scale on some remote machine cluster. We’re talking about the ability to orchestrate that and do that within a larger team, not just a single data scientist, and we’re talking about the ability to automate that process. That in general is what we talk about when we talk about MLOps. How is it different than DevOps? Well, it’s actually very different. Let’s define DevOps very high-level like the – I mean, basically the idea behind DevOps is that you want to make sure that a piece of software, that usually is already tested, QA-ed, and stable, that has left development and is now going into production to serve users or workloads, needs to work at scale and needs to stay up all the time; and you need to make sure that it can scale across many machines etc. That, at the end of the day, is what DevOps has to do.
So basically, what do we say here? We said that there was a single piece of software, that it was tested and it works, and that you need to take that and you need to scale that out and replicate that. And that happens only in production. Well, in AI, everything around that is actually different.
First of all, as you guys know, machine learning/deep learning experiments can be very heavy workloads. You actually mentioned that yourself when you were talking about building your own computer at the beginning; you’re gonna run things that are gonna take hours, or even days. So unlike regular software you need to be able to run stuff on large machines, from day one, in terms of development. So that’s one big difference.
The second thing is what you’re doing is you’re running software that’s not tested, because you’re doing it during development. The third thing is you’re running experiments, and what you’re trying to do obviously is a lot of experiments, because that’s the whole process; you’re doing lots of experimentation, until you reach your goal.
So with experiments you’re basically running pieces of code that are slightly different from each other. And that’s a different thing than running the same piece of code on lots of machines. Basically, this is a very different problem. How do I, as a data science team, manage my workloads on clusters of machines; how do I handle lots of experiments that I need to run, from one or more data scientists, or a team of data scientists, and do that effectively, when we’re talking about pieces of code that continually change. How do I actually take the environment that I built - because in AI, again, the piece of code that you’re running is actually much more complex on one dimension than regular software, because really it’s an amalgamation of the model, the neural network, for example, the code that wraps it, with the data. How do I actually take that environment that I’ve built - me as an ex-researcher - and the model that she built, and then run it on a remote machine that has a different environment? All of these are different challenges, and these are the challenges that we attempt to solve, and this is what we call MLOps.
[24:05] Yeah, that’s a really good summary. I like how you set that up in terms of the comparison to DevOps, because it is maybe a shock for people starting to get into this field, where - like you say, from day one, in order to actually make progress on their things, they might have to know about “Oh, spinning up this GPU instance in the cloud, or CUDA libraries, and running things in a repeatable way…” It seems like a really high barrier for people to overcome from day one to get things working, and also do it in a repeatable way.
I also wonder - you were talking about experiments, and that sort of thing… I know one thing that is definitely true of myself (and my wife could confirm) is that I’m not very good at remembering what I’ve done, or what needs to be done… [laughter] In terms of the experiment tracking side of things - of course, there’s the running of things, which is definitely important, and I think that’s what maybe you focus on mostly, but there’s also this weird documentation almost piece of the puzzle; it’s not quite documentation, because it’s like a very specific type of documentation that’s really documenting “What have I done? What haven’t I done? How successful was that?” and it’s not really like you wanna have a research paper necessarily, especially if you’re developing these things as a product, or maybe even a trade secret.
Especially if you’re on a team, you wanna have that common understanding of what has been done and what hasn’t been done… How soon do you see teams encountering that issue when they start working on this problem, and what are those essential elements, I guess more the documentation or tracking side of things, that need to be in stone somewhere over time?
Yeah… Well, you know, as you were saying about documenting and how it’s not exactly documenting - if you come up with a term, please let me know… [laughter]
Naming things is the hardest thing. Doc ops maybe…
That’s probably already taken. That has to be taken.
That’s a real issue. It’s a real struggle, to name it. So I’ll preface and say actually - you were saying we’re focusing more on the MLOps; actually, one area where we’re pretty unique is that we have a very highly integrated solution. We think that you can’t focus on just one thing if you don’t have a highly integrated platform that actually takes care both of [unintelligible 00:26:40.01] the data management, the versioning and the MLOps, you don’t have the best scalable solution… But we can talk about that later, if you’d like.
The experiment management part of it, the documentation part of it - when do people realize that they need it? The answer is actually when someone on the team that usually has some sort of engineering background says “Stop. This is crazy.” You know?
That’s exactly the point.
I remember, Doug, if you’re out there - his name was Doug; he’s a great engineer at one of the startups I worked with… He was my wake-up call to this.
[laughs] Yeah… You know, it can happen with a team of one, and we’ve seen it happen – well, we’ve actually seen teams of tens of data scientists that didn’t have that. It really depends – if you have that person who realizes that and has the influence and/or power to actually say “We need to change this.”
Yeah, yeah. And I guess this is something that we have kind of talked about in passing, but that’s this interaction between AI developers or data scientists and the rest of an engineering organization. So maybe a follow-up question to Chris’ question about differentiating MLOps and DevOps - what is the integration point, from your perspective, between the two worlds? Because things eventually end up in a product; I’m importing a model into some API handler in some code that is production/product code - there has to be an integration point somewhere… Where does that exist, and what challenges are at that integration point?
[28:21] That’s a great question. Actually, the integration is something that happens continuously if you’re actually running things well. It’s exactly as I said - ultimately, what you wanna do is you want to take this model that you’ve built to predict something or to solve something, and then integrate that into a wrapper, or some larger piece of code that actually carries out the ultimate task of that product.
The thing is oftentimes you could test your model a lot, in an environment that’s kind of clean, but ultimately you’re gonna wanna test it in the field when you add that wrapper.
The other point also is that once you wanna get into automation, and even if you’re still within the data science part, if you wanna get into automation and create lots and lots of experiments, and you want to – maybe you’re actually continuously feeding in new data that’s coming in… Let’s say you’re building an autonomous vehicle and you’re constantly getting new videos from your cars driving around, and you wanna actually improve your models based on that - that also creates an integration point.
So the integration points are on those two levels. One is when you have to hand over the code so that it gets wrapped, and two, when you actually want to integrate those experiments within a larger pipeline that helps improve them. And there’s another point that we actually try to facilitate with our product, which is “How can I lower the barrier to entry?” And I’ll explain.
Let’s say you’re a company and you’re building a solution to - let’s say computer vision; it’s easy. You’re building something to identify cats, right? Let’s take the ultimate example. But you also need to identify dogs, because you’re building a pet detector, or whatever.
You’re speaking right to Chris’ heart. He’s an animal lover.
[laughs] I was keeping my mouth shut this time. Yes, I am. Daniel can’t normally shut me up on that, so… Go for it. Let’s hear it.
Alright. As a data scientist, you understand that if you’ve built a model - and I’m talking about the code right now, that facilitates object detection for cats. Well, if you now want to do the same thing for dogs, what you need to do is you need to take that code that you built for the experiment, and probably the same neural network that is the one that you chose for identifying objects in whatever scenario, and now marry it with a different dataset. That’s it. Why would you necessarily need a data scientist for that? Why couldn’t an engineer do that? And that’s behind a lot of the stuff that we’re doing also - the ability to actually have the data scientist work on the core pet detector model, and then have engineers facilitate optimizing that for the different objects.
Yeah, and I think that example itself illustrates another unique feature of this. I think you’re right in that those later stages could be (the popular word, I guess, is) “democratized” too, to other people within the organization, right? But also, it’s still not quite the same as like a normal DevOps, in that if you’re running with a different dataset, somehow you need to have a kind of unique tracking that’s going on, with like “What dataset was used to train this particular artifact or serialized model, at what time?” Because the code might actually be exactly the same. The difference might be in the data, right?
[32:05] Exactly, exactly.
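The point about identical code trained on different data can be made concrete with a small sketch. This is only an illustration, not how Trains records runs: the `fingerprint` and `training_record` helpers, the commit id, and the file contents are all hypothetical. It shows one way a run record could tie a serialized model to the code version and to a fingerprint of the dataset it was trained on.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(files: dict[str, bytes]) -> str:
    """Hash file names and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def training_record(code_version: str, dataset: dict[str, bytes], model_path: str) -> dict:
    """Describe one training run: which code, which data, which artifact, when."""
    return {
        "code_version": code_version,
        "dataset_fingerprint": fingerprint(dataset),
        "model_artifact": model_path,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical toy datasets: file name -> raw bytes.
cats = {"img_001.jpg": b"cat pixels", "labels.csv": b"img_001.jpg,cat"}
dogs = {"img_001.jpg": b"dog pixels", "labels.csv": b"img_001.jpg,dog"}

# Same (made-up) code version for both runs; only the data differs.
cat_run = training_record("a1b2c3d", cats, "models/cat_detector.pt")
dog_run = training_record("a1b2c3d", dogs, "models/dog_detector.pt")

print(json.dumps(cat_run, indent=2))
```

The two records share a code version but carry different dataset fingerprints, which is the distinction that file naming schemes tend to lose.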
I see so many people develop really sophisticated naming for their files, and such, which you’d probably need your own documentation to document that… What about the data side? We mostly talked about process and the operations, infrastructure… What about the data side of things?
So the data is the Holy Grail, at the end of the day. And I think that obviously, experienced and senior data scientists get this. It’s all about the data. Novices are focused more on the models. But at the end of the day, the difference between a product that meets the threshold of whatever KPIs you want it to hit and something that doesn’t, is about your ability to train it on the right dataset, and be on top of your data, and be able to fit the exact (what we call) data view to train that model.
So iterating on the data, identifying the skews within the data and handling those, identifying the holes where you need to add more data, or build synthetic data, or augmentations around that - that is the key piece. And as we talked about, that’s why it’s an experiment process, so being able to actually version that and track that… Because as an experiment process, you’re going one track and then you realize “I actually wanna go back to the model I built two months ago, and actually take a different direction.” You have to be able to version that. And not just the model, you wanna be able to version the dataset. If you have enough experience as a data scientist, you know that you’re always going to find datasets that work better for whatever reason, and you don’t even know why. You don’t know why.
There are so many examples of datasets that are “wrong”, because the metadata on them isn’t necessarily correct, but somehow they produce better results than a dataset that’s better. So you have to version your data. You have to version not just the files, but the metadata around that, so that you can effectively go through that process and make sure that you’re building the best solution that you can.
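A rough sketch of what versioning "not just the files, but the metadata" might look like, under stated assumptions: `DatasetStore` is a hypothetical in-memory helper, not part of any Allegro product, and the file hashes and labels are invented. The version id covers both the file list and the labels, and each version keeps a link to its parent so an older dataset can be recovered later.

```python
import hashlib
import json


class DatasetStore:
    """In-memory store of immutable dataset versions (hypothetical helper)."""

    def __init__(self):
        self.versions = {}

    def commit(self, files, metadata, parent=None):
        """Save a version; its id covers file hashes, metadata, and parent."""
        payload = json.dumps(
            {"files": files, "metadata": metadata, "parent": parent},
            sort_keys=True,
        )
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]
        self.versions[version_id] = {
            "files": dict(files),
            "metadata": dict(metadata),
            "parent": parent,
        }
        return version_id

    def history(self, version_id):
        """Walk parent links from a version back to the first commit."""
        chain = []
        while version_id is not None:
            chain.append(version_id)
            version_id = self.versions[version_id]["parent"]
        return chain


store = DatasetStore()
# File name -> content hash (made-up values for illustration).
files = {"img_001.jpg": "sha256:ab12", "img_002.jpg": "sha256:cd34"}

v1 = store.commit(files, {"img_001.jpg": "cat", "img_002.jpg": "cat"})
# Same files, one corrected label: this is still a new version.
v2 = store.commit(files, {"img_001.jpg": "cat", "img_002.jpg": "dog"}, parent=v1)

print(store.history(v2))  # newest first, ending at v1
```

Because the older version stays addressable, a model trained on the "wrong" labels in `v1` can still be reproduced even after the labels were corrected in `v2`.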
So before we got to the break you were talking about versioning the data, and I wanted to let you finish that thought… And then I actually wanted to also explore how Allegro is moving MLOps in a practical way; what you’re actually focusing on and how you’re implementing MLOps. But if you’d finish your thought on data versioning… I would love to hear it.
Sure. With respect to data versioning, at the end of the day we think that that’s the Holy Grail. Being able to have a set of tools that enables you to effectively manage your datasets, and their versions, and effectively also be able to obfuscate the connection between the code and the data, so that we can facilitate, for example, the ability to move from a cat detector to a dog detector, because now you’re using a different dataset. And again, as a data scientist, you all know that taking one dataset with the code and actually switching it to a different dataset is not as trivial as one would like it to be. So those are some of the goals that we set out to do with Allegro Trains, and the ability to actually switch between the datasets and the code in the models, as easy as Plug-and-Play.
What is the range of things that Allegro focuses on in its actual offering? I know that there’s the Trains project, which was mentioned in that tweet that got me interested to join the PyTorch Ecosystem project… So how does that fit into the wider scheme of what Allegro is offering, and how does a data scientist interact with it, I guess?
Sure. What we provide is a platform, or a toolchain, or a set of tools that basically takes care of the experiment process, the MLOps part of it, the ability to actually scale and actually run things effectively, and the data. And the full platform, which isn’t completely available as open source, basically has all these key pieces together, highly integrated. What we’ve open sourced, or what’s available as an open source project, is the experiment management part of things, which is all about the documentation.
We talked about the ability to document things, version your models, your experiments, your hyper-parameters, everything around that; reproduce, compare etc. and everything that has to do with the basic MLOps… The ability to actually manage a cluster - whether it’s on-prem, or on cloud, or a combination - by a team of data scientists, and really self-help themselves with orchestration, scheduling etc. and automation on top of that… And then some basic – actually, it’s not basic, because it’s on-par with whatever else is out there, but data management, or what we call data tracking (at least) is available in the open source.
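To illustrate what recording hyper-parameters and comparing experiments can mean in practice, here is a minimal sketch. It does not use the Trains API; the `Experiment` class and `diff_hyperparameters` function are hypothetical, and the parameter and accuracy values are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One tracked run: a name, its hyper-parameters, and logged metrics."""
    name: str
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


def diff_hyperparameters(a: Experiment, b: Experiment) -> dict:
    """Return {param: (value_in_a, value_in_b)} for every parameter that differs."""
    keys = sorted(set(a.hyperparameters) | set(b.hyperparameters))
    return {
        k: (a.hyperparameters.get(k), b.hyperparameters.get(k))
        for k in keys
        if a.hyperparameters.get(k) != b.hyperparameters.get(k)
    }


# Illustrative values only; these are not real results.
baseline = Experiment("baseline", {"lr": 0.01, "batch_size": 32, "epochs": 10})
baseline.log_metric("val_accuracy", 0.81)

candidate = Experiment("lower-lr", {"lr": 0.001, "batch_size": 32, "epochs": 10})
candidate.log_metric("val_accuracy", 0.84)

changed = diff_hyperparameters(baseline, candidate)
best = max([baseline, candidate], key=lambda e: e.metrics["val_accuracy"])
print(changed, best.name)
```

With every run stored this way, answering "what did I change between these two experiments?" becomes a lookup rather than a memory exercise.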
The enterprise version adds on top of that much more sophisticated data management, more sophisticated scaling and data pipelining on top of the platform, so that companies can actually build the specific pipeline for what they need… And obviously, the standard enterprise relevant features like user management, permissions, managed services, all that stuff.
I’m curious, as you’re describing this - and I appreciated you talking a little bit toward what was open source versus what was the enterprise offering… As you look at different potential customers out there and there is a variety of ways they may implement how they are allocating resources for their own MLOps prior to you coming into the picture with them - you know, some people are strictly cloud-based, they may be doing Google or AWS or Azure, some organizations are maybe buying a bunch of DGXes from NVIDIA and have a cluster set up locally, or some hybrid form… Which of these scenarios does Allegro fit into? And if multiple, how does it change how you would implement Allegro?
[40:29] Actually, we fit into every one of those scenarios; any hybrid scenario that you can think of. And actually, the more complex your environment is, the more Allegro Trains shines. And I’ll explain. Basically, the way that Allegro Trains is set up is you have a server backend that basically manages the processes and records and logs everything, and then sets up the instructions for the clients, which are basically what the data scientists connect with, as well as for the agents that run on the machines that do the actual training. The system is built so that you can set it up on any type of machine for training. It could be a DGX, it could be any type of GPU by NVIDIA, it could actually be a CPU; it doesn’t really matter. It can sit on the cloud, on-prem, any combination, on any cloud that you like, and it all works.
In fact, a significant portion of our customers have a hybrid solution where they have on-prem systems, and then they actually burst into the cloud, when they have specific times when they need actually more processing power. And that becomes really effective for them. We have other customers that are completely on the cloud, and everything in between.
The reason Allegro Trains shines more as your environment gets more complex is, on the first level, that the interface to manage these clusters is really simple. You can actually try it out; we have a demo server up on the web. The data scientists actually manage queues, where they can set up the machines. “I want one GPU” or “I need a cluster of eight GPUs”, or whatever, and it’s completely invisible to them where those machines sit.
With the enterprise version we go even further and we provide three layers of software caching, and what we call zero data movement. So if you have a complex system where you have data in multiple locations, we’ll make sure that the data goes to the right machine, one that is close to it, for training. We’ll make sure that there’s local caching, so that it doesn’t have to go back for the data again and again… And so the data moves as little as it can. And we go even further - you can actually do federated learning on the platform. So you can actually have data being trained in multiple locations geographically around the world, and then combined into a single model.
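The episode does not say how the per-location models are combined. One common approach is federated averaging, where each site trains locally and only its weights travel, averaged in proportion to how much data the site saw. The sketch below assumes that technique; `federated_average` and the example numbers are hypothetical and are not taken from the Allegro platform.

```python
def federated_average(site_updates):
    """Combine per-site model weights into one global weight vector.

    site_updates: list of (weights, num_samples) pairs, one per site.
    Each site's weights count in proportion to its number of samples.
    The raw data never leaves the site; only the weights are shared.
    """
    total_samples = sum(num_samples for _weights, num_samples in site_updates)
    size = len(site_updates[0][0])
    combined = [0.0] * size
    for weights, num_samples in site_updates:
        for i, w in enumerate(weights):
            combined[i] += w * num_samples / total_samples
    return combined


# Two hypothetical sites: one trained on 100 samples, one on 300.
global_weights = federated_average([
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
])
```

The larger site pulls the combined weights toward its own values, which is the usual design choice when sites hold very different amounts of data.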
Really interesting. And I think you hinted at some of these things, but just for my own understanding - it sounds like there’s the Allegro Trains server, which kind of aggregates all of this information. Is that experiment management server maybe the central brain, if that’s a way to think of it?
In my understanding – let’s say I just have my own machine, and I have some code on it, and I want that to be tracked by the Allegro Trains server… I think, based on what I was reading - you just kind of decorate that code with a certain snippet that connects to the centralized Trains server… Is that the workflow for that scenario?
We try to make it as simple as possible. We dubbed it “automagical”. There’s a snippet of code - basically, two lines of code - that you put just once in your code, in a header, and that’s it. You’re done. Everything is on track for you. And you could potentially have the server, the training, and your client be on the same physical machine if you’d like. It doesn’t matter.
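For readers following along, the two lines look roughly like the sketch below, based on the `Task.init` call in the open source Trains package. The project and task names are placeholders, and the `TRACKING_ENABLED` switch and the `train` function are hypothetical additions, there only so that the example runs on a machine without the package installed or a server configured.

```python
# Assumption: flip this to True once `trains` is installed (pip install trains)
# and a Trains server is configured. It is not part of the Trains API.
TRACKING_ENABLED = False

if TRACKING_ENABLED:
    # These are the two lines themselves; they go once at the top of the script.
    from trains import Task
    task = Task.init(project_name="examples", task_name="my first experiment")


def train():
    """Placeholder for the existing training code, which stays unchanged."""
    return "training finished"


result = train()
```

In a real script the switch would not be needed; the import and the `Task.init` call would simply sit at the top of the file.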
[44:09] Actually, I’ll reveal some of my cards now, because a little while ago – so I have a friend here close by, geographically, in Indiana, and we kind of have regular calls to just talk about AI things, because we both work in companies where there’s not that many AI-type people, so we like to share things that we’re learning, and all that stuff… His name is Will - shout-out to Will out there, if you’re listening… But I asked him – one of the first times we were talking about his workflow and all of those things, we got into this topic of MLOps and all of that, and he’s like “Oh, I used this Allegro Trains thing. It’s amazing.” And I was just talking to him earlier today actually, and I was like “Hey, I’m gonna talk to the Allegro AI people later today. What do you want me to say?” and one of the things he said was for him it’s super-easy - like you were saying, pretty low barrier to add this snippet to your code and things happen automagically, like you were talking about… And the other thing he definitely wanted to mention was that the team is super-responsive. He mentioned raising various things on GitHub and all of that, and the team is very responsive… So great job; you’ve got a very happy user in Will here in Indiana.
Well, thank you, Will. [laughter]
He was telling me about how some of that works… And then you mentioned the agents - those have to do with the more automated runs that happen across a set of shared resources, or where does that fit in?
The agents - if you basically want to run your code on a remote machine, you basically set up an agent on that machine, whether it’s a DGX, or a GPU, or whatever you have it… And that agent is then associated with certain queues that you create; it could be associated with one or more queues. So it’s a little piece of code that sits on any machine that is potentially a target for running your experiments on.
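The relationship between the server, the queues, and the agents can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the Trains agent: the `Server` and `Agent` classes, the queue names, and the machine names are all hypothetical. It shows only the idea that data scientists push experiments into named queues and each agent pulls from the queues it is associated with.

```python
from collections import deque


class Server:
    """Toy stand-in for the backend: holds named queues and a run log."""

    def __init__(self):
        self.queues = {}
        self.log = []

    def enqueue(self, queue_name, experiment):
        self.queues.setdefault(queue_name, deque()).append(experiment)

    def dequeue(self, queue_names):
        # Return the first waiting experiment from the given queues, in order.
        for name in queue_names:
            queue = self.queues.get(name)
            if queue:
                return queue.popleft()
        return None


class Agent:
    """Toy agent: sits on one machine and serves one or more queues."""

    def __init__(self, machine, queue_names):
        self.machine = machine
        self.queue_names = queue_names

    def run_next(self, server):
        experiment = server.dequeue(self.queue_names)
        if experiment is None:
            return False
        # A real agent would run the training here; this one only records it.
        server.log.append((experiment, self.machine))
        return True


server = Server()
server.enqueue("single_gpu", "exp-a")
server.enqueue("eight_gpu", "exp-b")

workstation = Agent("workstation", ["single_gpu"])
dgx = Agent("dgx-01", ["eight_gpu", "single_gpu"])

workstation.run_next(server)
dgx.run_next(server)
idle = dgx.run_next(server)  # nothing left in either queue
```

The person submitting `exp-a` only names a queue; which machine ends up running it depends on which agents serve that queue.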
One of the things I’m curious about - and I meant to ask you this a while ago when you were touching on it - was some of the motivation you had for going with an open source business model that builds an enterprise business on top of that. Did you always know that that was gonna be the approach you guys were gonna take, or did you consider any others? And how has that model worked out for you?
That’s a very revealing question for us. When we started out, we probably erred on where the market – I guess one of the things you do with a startup is you’re trying to time the market; I saw several articles talking about timing being the number one critical aspect of startup success, and actually one of the hardest to hit, and sometimes even VCs call it luck. But we were trying to time the market, because what we had built initially was around the Holy Grail, around the data. And we basically built a system with the thought in mind of “Well, companies are now doing development, but they’re gonna get to scale, and they’re gonna have to be able to manage huge datasets that constantly change”, you have to version that, you have lots of experiments, you’ve got these things running on multiple clusters… How do you handle all of that?
[47:25] So we actually set out to build this really big, robust system… And then we found out that very few companies were at the stage where they needed this, or realized its value… So we got back and started thinking “Where is the industry now, and how can we help the industry progress?” And we figured that the right thing to do is to meet the industry where it is, which was before that [unintelligible 00:47:48.05] and come up and say “Alright, so what are the low-hanging fruit of things that can bring immediate value to data scientists out there?” and the first thing was the [unintelligible 00:47:56.04] and then immediately after that the MLOps, or at least the MLOps in its later form. Don’t think of a huge conglomerate running hundreds and thousands of experiments, but using small teams.
We thought that the best way to do that was to really contribute to the community, help spur that along, and make it something that lets a lot of people do things better, in the way we think would be better and more helpful. And ultimately, obviously, we’re a company, we’re about making money, but being able to do that was something that we felt was the right thing to do, that will be a win/win for everyone. It will be a win for the community, and ultimately it’ll be a win for us, because when the larger companies that do have money to pay, and do feel like they need to get more, they’re gonna come back to us.
Yeah, that’s great. And I think, as evidenced by users, and also attention, and joining in with the PyTorch ecosystem, in that blog post, and other things - I think that that really allows people to solve a pain point that they really have, really quickly… And hopefully it does eventually spur them on to - especially if they’re a part of larger companies or teams - integrate more with your enterprise systems…
But it’s been amazing to talk today. The topic is very close to what I’m super-passionate about, and I think Chris as well. Part of the reason why we do this podcast is to talk about those practicalities of how people do their AI development… So I really appreciate you joining. We’ll link the demo server and the links to Allegro Trains on GitHub, and also your main website, which talks about all of your offerings. We’ll put that in the show notes, for sure, and I encourage people to go there and check out those things and let us know in Slack, or on LinkedIn, or other places, what you think and how you like what they’re doing.
I really appreciate you joining, Nir. It’s been a great conversation.
Thank you so much. It was a pleasure, it was fun, and really, thank you so much for having me.
Our transcripts are open source on GitHub. Improvements are welcome. 💚