Practical AI – Episode #169
Deploying models (to tractors 🚜)
with Alon Klein Orback & Moses Guttmann
Alon from Greeneye and Moses from ClearML blew us away when they said that they are training thousands of models a year that get deployed to Kubernetes clusters on tractors. Yes… we said tractors, as in farming! This is a super cool discussion about MLOps solutions at scale for interesting use cases in agriculture.
Featuring
Sponsors
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Changelog++ – You love our content and you want to take it to the next level by showing your support. We’ll take you closer to the metal with no ads, extended episodes, outtakes, bonus content, a deep discount in our merch store (soon), and more to come. Let’s do this!
Notes & Links
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International. I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
I am doing well, Daniel. How are you today?
Doing well. Yeah, it’s been an interesting couple of months and lots of new projects kicking off, so keeping busy… But we just came back from a little bit of vacation last week, which was nice.
You’ve got to tell everyone where you went now, now that you’ve actually brought that up.
Okay, yeah. We drove down to warmer weather. We live in the Midwest of the United States and drove down to Alabama, did some hiking, kind of back up through Mammoth Cave, which I learned is the world’s largest cave network, system, thing… I don’t know the proper terms, but yes, that was fun. So we went underground for a bit. Yeah, it was a good time. I’m pretty psyched for a couple of reasons for this show, Chris, because, of course, I always enjoy talking with you, but we’ve also got some familiar technology…
I don’t know if you remember, a while back we had an episode – I think at the time was called Allegro AI, which is MLOps. They’ve since rebranded to ClearML, and we’ve got Moses Guttmann, who is ClearML’s CEO and co-founder, but we’ve also got one of their partners, Greeneye, we’ve got Alon Klein Orback, who is the CTO and co-founder at Greeneye, which is an agricultural AI company. So this is going to be fun. We’re going to talk about agriculture, AI, MLOps, and get to chat with some old friends as well. So welcome, Alon and Moses.
Thank you. Thank you for having us.
Pleasure to be here.
[04:06] Yeah, it's great to revisit MLOps with ClearML. We had that show previously, and of course, MLOps - even since we had that show - has just been kind of exploding as a topic that's on people's minds. What has that been like, Moses, in terms of just like this meteoric rise of people caring about MLOps and how they actually practically do machine learning?
So I think the market really matured in the last two years. I guess, it’s probably COVID accelerating the process, where everyone is working remotely, so you have to have automated processes and you’ve got to log everything, because you cannot call your colleagues every minute or so. And I think that the problems that we kind of discussed in theory two years ago became very day-to-day practical problems that companies and individuals run into on a daily basis.
And what was a nice-to-have before has now become a must for most companies. Back then, only a few understood the benefits and the need for this very comprehensive approach where everything is streamlined. And I think now it's kind of common knowledge; probably not that common to actually implement, but at least the understanding is there.
Yeah. And Alon, in terms of your company, which has customers who are in the agriculture vertical, but is very much like at its core an AI company, from my understanding… I don’t know the full history of Greeneye, so maybe you could give a little bit of that, but did you do a bunch of AI as Greeneye and then come to the MLOps problems? Or from the beginning, was that something that you needed and was problematic for you, so you kind of started with that early on?
Hey, Daniel. It's a good question. We actually started from different goals, different objectives of Greeneye, until we established what Greeneye is about. We started talking about the laptop and training on the GPU on the laptop, and stuff like that, and really understood there is no scaling it in many, many, many ways. And even on the server there, we needed to install CUDA and cuDNN, and use nohup to keep the script running - really all the worst practices that you can do.
There is no shame. We’ve all been there. [laughs]
Yes. So we started with no MLOps at all. Really, no MLOps. I think a few years ago one of my brothers told me, "Try Docker." And I said, "Ah, no, no, no." And the rest is history. Three years later, today, we are completely Dockerized from end to end, Kubernetes all the way, cloud and edge… And this has changed everything, because then we realized we can do a lot of MLOps all around and move stuff.
Yes. And maybe just stepping back in terms of Greeneye, I think - Chris, I don’t know if you remember this, but I forget in what episode it was. I think it was one of our Fully Connected episodes, we were using the example of spraying…
Yup.
…in crop fields as an example of like this scale between completely human manual process, up to automation, and how that’s changed. It struck me, Chris, that neither of us are farmers and truly really know much about that process. So Alon, this is very much like the world you live in. So maybe you could just step back and let us know a little bit about AI in agriculture before we talk about some of the other MLOps things that your company is doing, along with ClearML, what does AI in agriculture look like generally, and how has that developed over time?
[08:06] Sure. So I must be honest, I'm not a farmer as well. I am from the technical side, but I like what we are doing. To give the big picture, I think you can divide the industry into two: the group that gives tools to the farmers to get more information and more details about the field, the crop, the yield, or anything like that, and the other group, the tools that make the decisions by themselves. So you get a lot of cool companies, like in Israel, Taranis, Prospera, and others that are doing a lot of intelligence in the field. They collect the data, they analyze it and show the farmers insights. You get drones there, you get satellites there, you get pivots and cameras etc. And on the other end, you get decision-making tools, driving the tractors, autonomous sprayers - the same domain that we are in, like Blue River and John Deere; an acquisition that was, I think, four years ago.
As a follow up to that, I’m curious… You know, we’re so used to, on the show, kind of talking about these very technical topics. And yet the clientele that your company is serving is one that is getting into technology, as we’ve talked about, but if you look at the broad history, hasn’t really been something that we associate with high tech, and all. What is the merger of something as cutting edge as deep learning MLOps on one side, with farming on the other, where that’s making this massive transition? What’s it like being in that space where you’re presumably tying together two very, very different worlds?
We really like it. We are really a very diversified company. We have chemistry, we have agronomic data, we have data science, we have real-time… We have everything - cloud, and MLOps, and the business side, and sprayer operators. And being in this spot where everything is connected, technology is related to the field… We really like it. You need to have the business to run and to make sense from a monetization perspective, but it's also nice to do something nice.
Yeah, it's nice to apply AI, I'm sure, to a problem that we all have, which is we all need food. I saw in one of the videos on your site - and this is where my knowledge of all the machinery and stuff ends, but I'm assuming this is like a spraying machine. I don't know if it's specifically for spraying, but it sprays. And then you've kind of got cameras. I don't know - if you could maybe just describe this machine and the arms of the machine, just so people have a visual of where your technology fits in.
So maybe I will start with the problem that we are solving. So imagine you have a garden and you grow vegetables there, and what you do in your free time - you are weeding; you're taking the weeds out, because they compete with your vegetables for resources, sun, water etc. And when you are getting bigger, you are starting to use mechanical tools. And when you are getting really big, like farmers in the Midwest, like Nebraska and Iowa and all of this area, you start to put chemicals, because you cannot control it at that size of farming. You put chemicals and you want to do it fast. So you don't do it with a small tractor. You have dedicated sprayers, [unintelligible 00:11:33.28] it's a monster. You can walk underneath without the need to bend under the sprayer. So it's a really big monster that you just drive, to do as much acreage as possible in a short time.
[11:57] And what we are doing in Greeneye - instead of assuming the worst-case scenario that every spot in the field has weeds there, we are putting in sensors, cameras, and our computers and nozzles and everything else we are using, to spray only when you need to spray. So instead of putting 100% of the chemicals over a 5,000-acre field, you're putting 10% of it. So you're saving money, and the world is saving chemicals. It's a win-win situation.
So about your question - sorry, I forgot… So we have a big sprayer, we have cameras, each covering like three meters, and we are just filming the entire boom. We are putting the cameras looking a bit ahead, so we have time to process. Everything is done in real time, no connectivity at all to the cloud.
So Moses, as I’m listening to Alon talk about his story and as we’re doing this, I’m thinking back to starting with these cutting edge MLOps… When you’re looking at the landscape on your side, as someone who’s bringing this technology to bear in the marketplace, how are you evaluating different opportunities in industry? I mean, is it just that everything is open? Do you have a way of looking and saying, “I see an opportunity where this technology in a particular industry is going to be very useful”? How do you make those kind of judgment calls on how to engage?
Good question. So first, it’s all about intuition, I guess, these days. Everything is changing so fast. It’s very hard to actually try to predict something, if I’m referring back to machine learning. And what we do is an iterative process, and we try every time, and then we refine. And that’s exactly how we look at the field itself.
So if we see a company or a specific field where we feel that what they need to do in order to solve the problem they’re trying to approach is actually an iterative process, where you have a model and you’re constantly refining, rebuilding a better model, then this is a great fit for MLOps, because that basically means that if you have enough automation, you can really accelerate the process. If not, obviously, you have to do the same process only manually, which time-to-market-wise really increases the timeframe from a research phase to actually something that is working, where you have some alpha in the middle.
And when you see a process where you can say, "You know what? With a bit of automation, this model would really work." Like not 90% of the time, which means one out of 10 you fail - in theory, 90% sounds fine, but in practice, this is not something that you can actually sell. You think to yourself, "Okay, the only thing that I need is a bit more information from the field itself, and then I can just refine the model, rerun it, get better performance and then just repeat the process. Then basically I'm golden. I can take it to different scenarios and get my model up and running." And every time we see one of those scenarios, that's the kind of moment where we say, plainly, "Okay, this is a perfect fit for automation, for MLOps as kind of a holistic approach."
So Alon, following up on what Moses was just talking about with where the value of MLOps really comes in with automation, you gave the example, sort of the concrete example of the spraying, detecting places to spray within a field with this massive machine, that has the sensors or the cameras on it… So how does the automation or the retraining of models that you’re using, where does that come in? How often, and what sorts of things are you automating in practicality?
I think I can spend hours answering this question fully… But I must say, the thing about MLOps and ML in general, if you compare it to other industries, like coding, deploying servers etc., is that there is no best practice yet. The entire industry - everyone invents something on their own and does something else. We're starting to get to some main path, but there is no known best practice that everyone has agreed on. So I think from our side, we had a lot of challenges. We have challenges of data and controlling the data; we have a massive amount of data, more than millions of images from the field.
One of the challenges is getting a model that was trained to the tractor. And before we had any automation, we trained the model, and then we froze it, and we were using TensorRT - ONNX to TensorRT, and stuff like that. And we did it manually, each step at a time. I think we got 10 models a year to the tractor. When we got automation, we have thousands of models to the tractors, and reruns… And when a researcher finishes training, there is a click and everything is done automatically. It's converted, it's got metrics there in ClearML, and it's checking the conversion, that we didn't miss anything in the converter, like performance… And we get the engine file there to the tractor by a single click. So I think this is one example of automation that MLOps really changed.
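To make that concrete, here is a minimal sketch of what one automated "freeze and convert" step could look like - exporting a trained PyTorch model to ONNX, building a TensorRT engine with NVIDIA's trtexec tool, and logging the artifact to ClearML. The project, file names and input shape are illustrative assumptions, not Greeneye's actual pipeline.

```python
# Hypothetical sketch of one automated export step: PyTorch -> ONNX -> TensorRT,
# with the resulting engine file registered in ClearML for the edge deployment
# step to pick up. All names, paths and shapes are placeholders.
import subprocess
import torch
from clearml import Task

task = Task.init(project_name="weed-detection", task_name="export-model")

# Assumes the checkpoint stores a full nn.Module (not just a state_dict).
model = torch.load("checkpoints/best.pt", map_location="cpu")
model.eval()

dummy = torch.randn(1, 3, 512, 512)  # example input shape
torch.onnx.export(model, dummy, "model.onnx", opset_version=13,
                  input_names=["image"], output_names=["detections"])

# Build a TensorRT engine (requires a machine with a GPU and TensorRT/trtexec).
subprocess.run(
    ["trtexec", "--onnx=model.onnx", "--saveEngine=model.engine", "--fp16"],
    check=True,
)

# Register the engine so downstream automation can ship it to the tractor.
task.upload_artifact("tensorrt_engine", artifact_object="model.engine")
task.get_logger().report_text("Engine built and uploaded")
```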
[19:52] Do I remember correctly that you’re also running Kubernetes on the embedded devices?
Yes, yes. We're running Kubernetes on the embedded devices. This is completely a game-changer on our end, because – this is also a different conversation that I can speak for hours about.
And when you say “on the embedded devices”, we’re talking about like on the tractor?
On the tractor, yes. We have a few devices there, and they are running K3s. It's the lightweight version of Kubernetes. It's really nice, because we get best practices from the cloud, and we get best practices to the edge. And, for example, if we want to use ClearML from the edge or from the cloud, it doesn't really make any difference to us. So we know how to pass and move the secrets, and how to use it, and we just do it the same way, yes.
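As a rough illustration of that "same code on edge or cloud" idea: the ClearML client is configured through its standard environment variables, which Kubernetes can inject from a Secret whether the pod lands on the cloud cluster or on the tractor's K3s. This is a minimal sketch with placeholder project names, not Greeneye's actual setup.

```python
# Sketch: the same ClearML client code runs unchanged on the cloud cluster or
# on the tractor's K3s cluster. Server endpoints and credentials come from
# ClearML's standard environment variables, typically injected via a Secret:
#   CLEARML_API_HOST, CLEARML_WEB_HOST, CLEARML_FILES_HOST,
#   CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY
import os
from clearml import Task

print("Reporting to:", os.environ.get("CLEARML_API_HOST", "<unset>"))

task = Task.init(project_name="edge-monitoring", task_name="tractor-heartbeat")
task.get_logger().report_scalar(
    title="health", series="uptime_minutes", value=42.0, iteration=0
)
```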
So you have full visibility to the tractors inside the same dashboard that you’re developing again, and you have the entire cycle streamlined?
Something like that.
Good, good setup. I like it.
Yes, we're using Rancher, and I can connect to any running tractor around the world that is online now and just do anything, even SSH to the machine. I can view it all. But this is a bigger thing than the ML and the MLOps. We call it SpecialOps.
SpecialOps, I like that.
We actually have dedicated teams that are doing research around SpecialOps. So we have MLOps and DevOps and IoTOps, and now there is FinOps… This is the team that moves one step ahead and does a lot of tricks. I tried to find when we started to use ClearML - I said ClearML, and I understood I needed to say Allegro… And I didn't find the exact point of how we got to know Moses and his guys, so I'm not really sure… But before we got to know him, we used Kubeflow Pipelines to do the runs and to do the metrics, and we got there to a big wall of complexity, and we shifted the training from Kubeflow Pipelines to ClearML.
I’d like to ask a follow-up question about something you were saying a moment ago. It’s something kind of close to what I’m doing when I’m not podcasting, and that is - when you talked about having Kubernetes in all the places, in this case on the tractor, and you talked about K3s, can you tell us a little bit about – I’m a big advocate in Kubernetes in all the places, at various scales, as a setup. So since in your use case you have done that, I’d love to hear how you arrived at that and what benefit you think it’s given you. Why do that? Because most people don’t think about putting Kubernetes in all the places. They’re running kind of in the cloud, they’re not out on the edge yet the way you are, where you’re way out on the edge. And as someone who, also in my day job, works out on the edge, I’m curious what your thoughts are about how Kubernetes in all the places is a good model going forward.
So Chris, you don't think that when people think of the ideal deployment target for Kubernetes, they immediately think of a tractor?
I’m glad that he does. I’ll say that.
It should be right next to it.
We are in the sprayer world. It's a good question. When we started containers, we started with no Kubernetes and we did like our own deployment system. And at first – like, within months, we hit the wall of keeping them alive, versioning, and everything, and then we understood, "Okay, someone solved these problems." And then we got to Kubernetes on the cloud, and on the edge we still used containers, but with a different orchestrator. We had Azure IoT, if you know it. We used it for a while, and then we got to another wall, because Kubernetes has this concept of separation, [unintelligible 00:24:00.21] and we understood, "Okay, we need to change."
[24:12] And it was not easy, because nothing is really ready for ARM 64-bit. You'd be surprised with most of the libraries. Maybe today there are some more, but most of the libraries, like one or two years ago, had no version for ARM 64-bit, and it was like there's no way - we were on the edge of the edge as a young company.
Do you think that NVIDIA having bought ARM recently will have any impact on that?
Not buying ARM, failing to buy ARM. Failing to buy ARM.
Oh, okay. Yeah.
I think the regulation issues might affect it.
I forgot about that. That’s a good point.
Yeah. But definitely, I think ARM will be here for the edge. And Kubernetes, back to your question - that helped us a lot. For example, when we have a new researcher or a new programmer or whatever, he does not do anything on his laptop. The laptop is only the gate to the pod, on the tractor or on the cloud; and this is the workspace. So if we get to the tractor, or in our subject, the researcher - the researcher comes to the company; a new researcher, we just hired a new one. She's a really good one, and she's got, "Okay, you get access to–" We still use Kubeflow for the notebook server. So if she needs a workspace, she just clicks and gets a new notebook, and she can use our tools. We use PyCharm for the remote interpreter, and you just connect to the pods there and you get all the data. And you can play from your PyCharm, you can play from your notebooks. And this is possible – you can get it in different ways, but this is mainly possible… All the games of sharing and forwarding with Kubernetes are much easier, and this is true for the edge and for the cloud. So no one is installing any dependencies – I don't know if you play with JAVA_HOME, or something like that, or npm install… No one is installing anything on his computer besides the IDE. Our philosophy in general: "Everything containerized, no workspace to be installed in any way." And this is really helpful specifically in an MLOps environment.
So Alon, in terms of the automation that you were talking about before, you mentioned that you're able to do thousands of models a year now, versus orders of magnitude lower when you didn't have automation. I'm wondering if you could talk a little bit more about, for your case – and I'm trying to think, maybe weeds look similar… So what needs to be updated so much throughout the year for these vision models, or whatever models you're running? And then, could you describe maybe a little bit more – you mentioned kind of bringing in new data, training the model and then having the pipeline to push it out to the tractor, and deciding when and when not to do that. It'd be interesting to hear about the when and when not question, in terms of what you test within your MLOps to determine when you push something out and how you do that.
It’s a good question. I will start with the first one, about the models and the number of models. I think in the end – well, from my experience; I might be wrong, but having one model to rule them all, or something like that - it’s not enough. You’ve got a vision of [unintelligible 00:28:00.07] in it or any other model, and you need to put more effort to solve a real-world problem, because you have a lot of variables, a lot of variance in the real world. And you need to combine the classical vision and the deep learning one.
[28:20] So I think then we have metrics on the cloud for the models, for the big models, and so on… But then, we want to have metrics on the devices, on the tractors that run them. So we are constantly testing ourselves in the real environment, how we are doing.
Also, in terms of performance, not only metrics; performance, speed-wise, a cycle clock. How fast we can go… Today, we can go about six meters per second. You guys speak in miles per hour; it’s about 12, around this area, miles per hour. So this is a really big factor for us.
Besides that, we have different crops, different geographies, and everything changes. The landscape changes, the weather changes, the sunlight changes. It's a completely different game to play in Israel, for example, and in the Midwest. The farmers change; some use till, and in the Midwest they mostly stopped using till. They just keep the old crops there, letting the grain do its magic, and seed above the old crops. So everything is changing and we need to react to those changes. And this is for the first one.
The second one - I think that for every tractor round that we are doing, or any other way that we are getting the data, we try to get as much variance as possible. So training our model with one or two or even 500 more images - again, it won't change a lot for the model; we have a lot of them. So we try to understand when a few models don't agree with each other, or something like that, or tricks like that, to understand when there is information that is interesting to rerun and train on.
So are you logging your data preprocessing and dataset creation, and your training runs in the MLOps? And are you building and training off of certain triggers or something that you have set up - or how does that work?
So one of the challenges of MLOps is reproducibility. I think this is a really hard one to get right. You get code versioning, and then you get dependency. And well, okay, let’s say you solve that with Git and Docker, but then you get data versioning. And then, in all of that, you need some system that will take everything from every place you need, and then you need to push it in, to just click play and rerun it. So reproducibility is really hard, and if you did – I don’t know, half a year ago you did something good and you want to go back to it, it’s really hard.
We are trying to log as much as possible, from the system perspective and from the training and research perspective. What's nice about ClearML, that we are using - it's not only for MLOps [unintelligible 00:31:02.01] in general. So we're just pushing everything that we want to use as metrics, and showing stuff there. So from this perspective, we just log everything possible, and if it's visible, we can use ClearML for it. But also, we want to push our limits to run faster and faster. And if we run faster, we can do even more stuff. We're not there yet today, but our mission is to spray less, grow more. So we want to do fungicides and pesticides, and fertilizer etc. So we need more compute power, or to be better at what we are doing and save compute power for different tasks. So we try to log everything and be better at what we are doing.
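For a sense of the reproducibility pieces being described - code, dependencies, and data versioning tied together - here is a minimal sketch using ClearML's Dataset and Task APIs. The project, dataset and path names are hypothetical, used only to show the shape of the workflow.

```python
# Sketch: version a batch of field images as a ClearML Dataset, then pin a
# training task to that exact dataset version and log metrics to the server,
# so the run can be reproduced later. Names/paths are placeholders.
from clearml import Dataset, Task

# 1) Version the new field images (run once per data drop).
ds = Dataset.create(dataset_project="field-data", dataset_name="midwest-2022-w14")
ds.add_files(path="/data/incoming/midwest/2022-w14")
ds.upload()
ds.finalize()

# 2) In the training script, reference that dataset version and log metrics.
task = Task.init(project_name="weed-detection", task_name="train-detector")
data_dir = Dataset.get(
    dataset_project="field-data", dataset_name="midwest-2022-w14"
).get_local_copy()

logger = task.get_logger()
for epoch in range(3):  # placeholder for the real training loop over data_dir
    logger.report_scalar(title="loss", series="train",
                         value=1.0 / (epoch + 1), iteration=epoch)
```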
[32:24] Quick question. You mentioned retraining models. Do you have a model per tractor, or scene?
No, it's not a model per tractor or scene, but we retrain a model – okay, there are different reasons why to do it. Fixes - maybe we want to improve it on different variance in the appearance of the backgrounds, or anything like that… Or we want to make sure that when we've got a new weed that we don't know, that we are not familiar with, instead of doing a zero-shot we are doing a one-shot, and while doing it we can stop the system from doing this for a specific weed that became more common. But in general, we are playing with a lot of tools. We are trying to get in line with the best practices in the industry, and we are also experimenting – it's not just for the experiment, but it's for the research to be better. So in this sense, we might be rerunning the same training, or we want to verify that we got the same result, like real research. Not really – it's not academic, you know, but in that sense.
Okay. Well, given the fact, Alon, that you’re training so many models, you’re updating a lot of models, it sounds like there’s a lot of training scenarios that you’re encountering… You’re kind of doing this at scale, and you’ve been partnering with Moses and his team to do this.
I'm curious, actually, from Moses, from your perspective, looking back on the things you've been trying to enable with ClearML, and seeing someone use it at a larger scale like this - what are some of the things that you thought were going to be important and ended up being important, in terms of the things you're tracking or the features that you've enabled? And maybe what are some things that you didn't expect, and now you're thinking about differently than when you started things out?
As a follow-up to that, what insights are growing in your head?
I’ll try to cover everything. I’ll probably forget, so just remind me. Okay, so I think that Alon’s team were the first to say, “Guys, we want better connectivity with Kubernetes.” And the reason I remember – we started this discussion, a lot of our features are actually driven by the community, and Alon and his team started from the open source and kind of graduated, in a way. Basically, they just said, “We’re sick and tired of maintaining our own servers.” Plural. They had many. And they just said, “It’s not worth it. Just go do that for us.”
And we had multiple conversations even before. So we try to keep a very active Slack channel, and GitHub. This is how we develop features, right? Basically, people will say, “Hey, I want to build something”, and then it’s just a crazy idea. And then we try to think about, “Okay, maybe this is doable. It kind of makes sense.” And if it does, we try to figure out first how to hack it, so someone can continue with their day job and kind of build on top of it, and then try to realize, is there a way to actually structure into the platform itself? And if there is, we try to figure out a way to actually put it in there and see if there’s traction.
One of the things that I remember that Alon said we’re the first to do was to better connect the orchestrator with the Kubernetes cluster. Basically, when we started developing it, it was like – I don’t know, a long time ago. Kubernetes was not a thing. So containers were, but Kubernetes was not. It was just before Google just released Kubernetes as an open source solution, before it kind of killed Docker.
[36:16] So we started with Docker as kind of a bare metal. So we said, "Okay, fine. We'll have the orchestrator that will just pull jobs, set up the container, and then run everything inside the container." And it did that, and it's great. But the resource management or allocation of Kubernetes is terrific. So these guys came, and they said, "Look, guys. We have a Kubernetes cluster, and we like the idea of your orchestrator." So basically, the ClearML orchestrator will do – think of it as a dynamic Dockerfile, in a way; I'm oversimplifying. A base Docker image, with the ability to control what that Dockerfile needs to do at runtime, and then we also introduced some caching. Bottom line, you do not have to have like a container per job, just to accelerate. Because when you streamline a process, you cannot have every step containerized; you end up with thousands of containers that no one knows who is using, and no one will delete, because someone might be using them… Basically – yes, you get the idea.
Anyhow, so they said, “Okay, we love Kubernetes because it allows us to schedule resources very easily.” But then when the resource is scheduled, we want this dynamic approach, and obviously, visibility, which is always obviously hard with Kubernetes. We also don’t want our users, like the data scientist developers, to have actual access to the Kubernetes cluster, because - well, no.
So I think that was the first time we developed what we now call the Kubernetes Glue. It basically converts a job from ClearML into a Kubernetes job - basically trying to figure out whether this job can actually be executed on Kubernetes, and giving you better visibility into the cluster itself… So users can basically push jobs into what we call a queue, which is – think of it in Kubernetes terminology as basically a template YAML that you'll be using for that specific job, only you have a priority key on top of it. So it implicitly holds the setup itself, which is kind of the resources etc., but also priority on top. And then you use that in order to use Kubernetes as basically your resource scheduler, which it is terrific for, but it's lacking the scheduler itself, like order and priority etc. This is exactly what the glue itself adds.
And obviously, it solves the problem of making sure that the end users, meaning the data scientists, will not need direct access to the Kubernetes cluster, right? So that was the first feature that we added, just because of them. And this is how we heard, "So you guys are running Kubernetes on the edge device?", and we were amazed that someone is trying to do that.
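To picture the queue-based flow Moses describes from the user's side, here is a minimal sketch: clone an existing ClearML task, tweak a parameter, and push it onto a named queue that an agent (running in its Kubernetes "glue" mode on the cluster side) consumes. The queue, project and parameter names are assumptions for illustration.

```python
# Sketch of the queue flow: the data scientist never touches the Kubernetes
# cluster directly; they just enqueue work, and the glue/agent on the cluster
# maps the queue to a pod template and launches the job. Names are placeholders.
from clearml import Task

# Clone a previously run training task as a template.
template = Task.get_task(project_name="weed-detection", task_name="train-detector")
job = Task.clone(source_task=template, name="train-detector-lr-sweep-1")

# Override a hyperparameter for this run (section/name depend on your script).
job.set_parameter("Args/learning_rate", 0.001)

# Push onto a queue that the agent serves, e.g. one queue per GPU pool.
Task.enqueue(task=job, queue_name="gpu-training")

# On the cluster side, something like the following daemon consumes the queue:
#   clearml-agent daemon --queue gpu-training --docker
```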
So I feel completely obligated to throw yet another buzzword in…
Sure.
Just to ask you about if you have an opinion. And that is –
Don’t go blockchain, Chris. [laughter]
Well, that’s yet another one right there you threw out, Daniel. Hmm… No, I’m just going to stick with microservice architecture; when you were talking about all the containers out there, and managing that, and as people are moving more and more into microservice architecture over time and segregating off all their functions, and yet trying to keep them together… Does ClearML as a platform have an opinion on that in any way, or do you not care? Are you agnostic about where people end up? So if I throw the word microservices at you, what do you say?
Okay, that's terrific. I love it. Because microservices are the reason Kubernetes was invented - basically, to manage them. But the idea behind a microservice - it's alive, it's production-ready, and it has to be stable; the default of Kubernetes is "If it fails, restart it", because you had a good reason to put it there. This is not what is going on with MLOps. If it failed, then it will continue failing. Like, just drop it; that's the default. That's the total opposite of microservices. And that's basically our approach. Our approach is "Use Kubernetes for what it's good for."
[40:03] So you probably have another cluster doing whatever microservices you're running, which is terrific. But from the MLOps perspective, use Kubernetes as a resource scheduler more than anything else. That's basically the opposite of the default of Kubernetes. And then, I guess, the bridge is serving models, which is actually a microservice, but you want that elasticity, because you still want to be able to change it without building new Docker images all the time. You actually want to have in-flight model upgrades, canary deployments etc., that you probably want to control from outside, not from like an ELB perspective. So this is the bridge between the two, at least from our perspective.
So if I’m getting – just like stepping back and thinking… I’m kind of trying to connect some of the things, Alon, you’ve said, in terms of how things are working on your end. You sort of have data coming in off of the tractors, coming into various maybe data processing jobs, which might be queued up on a queue, which runs through ClearML. That might lead then into like model training jobs, which I agree – so I love your illustration, Moses, about like, you expect a lot of model training jobs to fail. Like, spinning up a service in Kubernetes to run a training is like the opposite of a lot of what they had in mind.
But anyway, so you sort of spin up these jobs in a queue… So from the data scientist perspective, you’re basically just saying, “Hey, I want to use this data to train a model, put it in a queue. It runs the training, and finishes or not.” But then like that model then, which has like a version that’s tied to the data, then sort of gets shipped out. In your case, does it get shipped out, like Moses is saying, to a service that’s running in your Edge, K3s cluster, as a REST service for what’s going on on the tractor? Or how does that piece work?
It's a good question. I think we can see two paths to start the training. One place is from the data, as you spoke about, and the other place is from the researchers, who want to test the code, or a new experiment, a new model etc., and want to fire a training.
So when we started - let's take the example of the researcher. He just goes to his IDE, connects to the remote workspace that is on Kubernetes - Kubeflow, Jupyter and so on - and he's just playing there, and he just clicks remote-execute. This goes to Moses' servers and tells the ClearML system, so to speak, to log everything, like "Okay, I want to use this container, and these are the changes I did." Like Moses said, it does not create new containers for every step. It just keeps the changes. Moses, if I'm saying something wrong here, feel free to fix me.
And then it goes to Moses' servers, and from there it goes back to our servers, to our agent that Moses spoke about, which is the glue, and this starts the training. So this is the training, and the training continuously reports to ClearML, to the main server - what's the status, and what are the metrics. And once we have metrics, we can decide one of the two: if we release a new version for testing on the edge, or we stop there and just keep this track and move on.
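The "click remote-execute" part of that researcher flow can be sketched with ClearML's execute_remotely call: the script starts locally (or in a notebook pod), registers itself, then hands execution off to a queue so the actual training runs on the cluster. Project and queue names here are hypothetical.

```python
# Sketch of the researcher flow: register the experiment, then re-launch it
# remotely on a queue. ClearML records the git diff, installed packages and
# container, so the remote worker reproduces the local environment.
from clearml import Task

task = Task.init(project_name="weed-detection", task_name="experiment-123")

# Everything up to this call runs locally; after it, the local process exits
# and the task is enqueued for a remote worker.
task.execute_remotely(queue_name="gpu-training", exit_process=True)

# From here on, the code only runs on the remote worker.
logger = task.get_logger()
for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training loop
    logger.report_scalar(title="loss", series="train", value=loss, iteration=step)
```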
And the inference you’re running on the Kubernetes cluster as a service, or on the K3s, as a REST service?
So this is a tricky one… We have something in the book that is not in production - when I'm saying production, like internal production - and research that we are planning, to have K3s inside a pod, to have a special environment for the developers that is separate; like a cluster in a cluster, something like that. But let's keep this aside. We are using at the moment KFServing to serve the model [unintelligible 00:44:06.08]
[44:08] Our metric system - when we get a new model, we just serve it, and the serving is just a REST API service locally. And this is something like a microservice that we're using. One is responsible for serving only; it does not know anything about the metrics themselves. You can just probe it to get results for each one, for what you want to do.
And there is another tool that is doing the metrics; it just probes the service, accumulates, and then reports metrics to the ClearML server. So we have three different services. ClearML is not micro, but the other two are, and I think this is the cycle. We have a human decision in between, so we don't do the entire cycle for every model, and not for all the data.
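A rough sketch of that second, metrics-only service: it knows nothing about how the model is served, it just probes the local inference REST endpoint on some validation images, accumulates results, and reports a summary to ClearML. The endpoint URL, response schema and labelling logic are all placeholder assumptions.

```python
# Sketch of a metrics "prober" microservice: POST validation images to the
# local serving endpoint, accumulate accuracy, report it to ClearML.
# URL, response format and ground-truth logic are hypothetical.
import glob
import requests
from clearml import Task

task = Task.init(project_name="edge-monitoring", task_name="on-device-metrics")
logger = task.get_logger()

INFERENCE_URL = "http://localhost:8080/predict"  # placeholder local service

correct, total = 0, 0
for i, path in enumerate(sorted(glob.glob("/data/validation/*.jpg"))):
    with open(path, "rb") as f:
        resp = requests.post(INFERENCE_URL, files={"image": f}, timeout=5.0)
    prediction = resp.json().get("label")            # hypothetical response schema
    expected = "weed" if "weed" in path else "crop"  # placeholder ground truth
    correct += int(prediction == expected)
    total += 1
    if total % 50 == 0:
        logger.report_scalar("validation", "accuracy", correct / total, iteration=i)
```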
So I have a final question that's addressed to both of you in turn. I want to start with Alon. I'm actually letting Moses cheat and hear what Alon's answer is. [laughter] So Moses, I'm throwing you a bone on that one. So here's the question. As you are thinking about these amazing uses of technology, both from the technology creator's and from the technology implementer's perspectives, and you're thinking about what's next - what are you wanting to do next? You've made it this far, you've had tremendous success, and you've got to have something that when you go to bed at night, you're going, "Maybe I could do that." Alon, I'd first like to hear from you, as the implementer in a very specific use case, what you're thinking. And then, Moses (hint-hint), now that you've heard the question, you can answer it as well. So Alon, I'll throw it to you first.
So I have many things in my mind, a lot of imagination before I go to sleep… Sometimes I see bounding boxes, and just seeing them and seeing them and seeing them. Sometimes I see different stuff. But yeah, on a larger scale, I think, still from the Greeneye perspective, we can do, and we are about to do, both of the things that I said about the industry - decision-making and helping the farmer - because we've already scanned the field, and we have the byproduct of a lot of data, a high quantity of data, high-resolution data from the field. So our goal is to do both. And we have lots of ideas in the pipeline - one, to make the decisions ourselves, and the other one is to help the farmer make better decisions about different stuff that's not related to sprayers at all.
That was a good answer. Moses, how about you?
[46:50] So two things I can expose here. One, we are working now - this is public in one of the repositories - on the new version of the serving solution. Basically, we're not fans of KFServing as an infrastructure, because it's very hard… If you have a single model and you're not changing it, that's fine. But if you're constantly changing them, it's not easy. Adding preprocessing is not easy, metrics… Everything is hard. So together with a lot of people from the community, we've redesigned the ClearML serving. So now it's in internal testing, and it's very, very nice. Basically, you can add the preprocessing without even code, deploy it, and you have it auto-scale on your Kubernetes cluster. So it's basically building a serving service that is flexible, and that you can change online, which is terrific. This is what we want from these types of services.
So this is something that we're working on, and I'm hoping that we will be able to release it before the end of next month. I think we have a talk at GTC, so before the talk - that's the deadline, basically. We will have to release it, and it's always good to have a deadline.
The other thing that we’re working on - and this is really research; we’re trying to wrap our heads around how do we actually solve it; that’s coming directly from Alon, actually. He was one of the guys that said, “Oh, I really want that” and he’s not the only one.
So we always want to make sure that we're not building for a very specific problem, that it's actually a widespread problem. And that problem is: I have a lot of data stuck there in the backend of ClearML, which basically means multiple databases. I just want to be able to query it more deeply; like, create a dashboard… Like, the data is there; I know, I put it there. Now I want to have a better interface to the data without actually accessing the database directly; this is doable, but probably kind of too risky. So we were trying to think how to hack together like a dashboard solution on top of it, to allow you to create better visibility for an entire process, because the entire idea is to make this whole MLOps approach holistic. Basically, it means that if you have data, you should be able to use it at whatever step along the way of developing your product; and this is something that is in the making, but will find its way out very soon, I'm hoping.
That’s awesome. I’m super-excited to explore those things. I’m a fan, so I’m pretty excited to hear about the serving and the other things coming along. I appreciate both of you being willing to kind of talk through… First of all, give us an update on what is now ClearML, which previously, we talked about as Allegro, and all the great things you’re doing, but also in the context of this use case with Greeneye. So I’m really impressed with what each of you are doing… And yeah, thank you both for joining.
Thank you. Thank you.
Our transcripts are open source on GitHub. Improvements are welcome. 💚