Stellar inference speed via AutoNAC with Yonatan Geifman, co-founder and CEO of Deci (Practical AI #148)

Yonatan Geifman of Deci makes Daniel and Chris buckle up, and takes them on a tour of the ideas behind his amazing new inference platform. It enables AI developers to build, optimize, and deploy blazing-fast deep learning models on any hardware. Don’t blink or you’ll miss it!

42 minutes
Recorded Aug 25, 2021
Published Sep 7, 2021

Featuring

Yonatan Geifman
Chris Benson
Daniel Whitenack

Transcript

Daniel Whitenack

Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

Chris Benson

I’m doing very well; it’s the dog days of summer as we record this, and I’m just trying not to melt.

Daniel Whitenack

Yeah, it’s hot like you’re sitting right on top of a bunch of GPUs, or something. [laughs]

Chris Benson

That’s exactly right.

Daniel Whitenack

Yeah. Well, speaking of hardware, today’s episode is kind of connected to that. I’m really excited to have Yonatan Geifman with us, who’s CEO and co-founder of the deep learning company Deci. Welcome, Yonatan.

Yonatan Geifman

Hi. Thank you for hosting me. How are you?

Daniel Whitenack

Yeah, doing wonderful. So I know a lot of what your company does is related to productionizing models, optimizing models… There’s a bunch of things that you do related to optimizing inference, and I’d love to dive into all of that as time goes on. But maybe let’s just start out by talking about why should we care so much about inference. A lot of effort and (I think) screen time, if you wanna put it that way, in terms of Twitter and other places is put on the training side of things and scaling up training… But why do deep learning and AI practitioners need to be concerned about inference?

Yonatan Geifman

Actually, the first thing to be concerned with is the training. But after you finish building a model, then you need to deploy it somewhere. It could be in the cloud or at the edge. And when you’re going to deploy it and do serving or inference at scale, you must be concerned or think about how this model is going to perform in terms of latency or throughput, what is the operational cost of this model, or are you actually able to deploy it on the edge device that you want to use in order to serve the inference of that model.

[04:14] And it’s kind of the second phase of the development cycle after you reach the accuracy… But you must think about it at the beginning when you choose the model in order to know that you are going to finish the project with a model that meets the SLA of going to production in your environment, with your characteristics of performance that you’re looking for in production.

If you think about it, you’ll see that a model is trained once or once in a while, but the inference workload is huge, because it kind of linearly scales with the amount of data in production. So if you think about an organization that for example develops a self-driving car, you’ll think about the number of data scientists that during the training or building the models each one of them needs some amount of GPUs in order to do his work; but when you deploy these algorithms, these are deployed among huge amounts of cars. So in order to optimize the performance, you’re more concerned about what type of hardware you’ll need to fit into the car, compared to how many GPU’s you’ll need to have in the cloud for your data scientists to build those models. So this is kind of the proportions. The training is kind of linearly scaled by the number of data scientists, while the inference is scaled by the amount of data, the scale of the deployment that you’re going to use, and you must think about the performance when you want to go to production at scale.

Daniel Whitenack

Yeah. And as a practitioner, you mentioned thinking about inference upfront, and maybe even before training, thinking about what models you might even be able to use in production. As you’re working with different companies and different teams of data scientists, do you see people running into this challenge a lot, where they’re really happy with their model that they’ve spent a lot of time and effort training, and they just are completely blocked in taking the model to production? How often do you see that, and what are the main reasons why they’re not able to do inference with that model? Is it a latency thing, is it a memory constraint thing? What are some of the challenges you see?

Yonatan Geifman

So it is really divided into several problems, but there is a simple principle that you can always take a larger model and get better accuracy. And it’s easy to get better accuracy with larger models. But this only happens if you don’t need to think about production. When you’re thinking about production, you understand that those models are not production-ready, let’s say… For example, if you want to deploy them at the edge, you have several constraints, like the memory and the latency; you want to run probably real-time, or giving some reasonable response time for the model. But in the cloud, you’re more concerned about the cloud cost, of serving this at scale, the amount of hardware that you will need to orchestrate in order to serve your users or your demand for inference. And these are kind of the things that you need to take into account while you develop your model, because as I said in the beginning, going larger is easier to get accuracy, but it’s not easier to get deployed.

We see companies all the time getting some huge models, trying to fit them into small devices for edge inference, or companies that just understand that they took the largest model that they can have and get the right accuracy, but then they’re starting to scale and see the costs of running these inference workloads, and then they’re trying to think about “Okay, how can I reduce those GPU costs?” or the CPU costs in running production in the cloud… And this happens all the time.

Our approach is to think about these problems at the beginning, while you start development, choosing the right architecture at the beginning instead of having kind of trial and error iterations when you get to the accuracy but then you understand that you need to change the model in order to get it productized.

Daniel Whitenack

[07:51] Yeah. And I actually have a follow-up on that, mainly because I only have a certain sphere of experience and I only know what my organization does and what some other organizations do… Just for my own curiosity, how often are you seeing people use GPUs on the inference side, versus just CPUs? Do you see that being done more and more, or what’s the majority of inference use cases that you’re seeing? Is it inference on GPUs, or inference on CPUs? Maybe specifically thinking about cloud deployments. Edge deployments might be somewhat different, but thinking about cloud deployment at organizations that you work with - is it often inference on the GPU, or is it often inference on the CPU?

Yonatan Geifman

I think it really depends on the task that you’re trying to do the inference for, and what you are trying to achieve. I think that if we’re talking about video analytics workloads you must use GPU. You have a lot of data, you want to process the data, the images in high resolution, and you need GPU performance.

If you’re talking about having some queries of an NLP model, you usually find those deployed on CPUs. But it doesn’t matter, because both of them are getting expensive when you get to scale. So if you look at, for example, prices on the cloud for having a 4-core CPU and a T4 GPU - it’s approximately the same. So it’s not like the problem is only the prices of GPU; also the compute of the CPU is getting expensive when you have to run large workloads, with large clusters with multiple nodes and cores.

Chris Benson

Could you talk a little bit – as we’ve kind of touched on edge specifically, could you touch on what some of the challenges you see about deployment to the edge and the inference that’s associated with that, or even beyond just the GPU/CPU consideration? That’s certainly something that I’m involved a lot in, and more and more people are now having to deploy to edge in all sorts of different use cases… And I think we’re pretty accustomed at this point to thinking about cloud-based deployment, because that’s matured a lot faster. But as more and more organizations are involved in edge deployments, I think they’re trying to explore their way through that. Do you have any guidance for that?

Yonatan Geifman

So first of all, there’s kind of a jungle of edge hardware types. There are a lot of types of hardware that you can use at the edge, and it really depends on the application. It could be a mobile phone, that if you deploy a mobile app with deep learning in it, you will find out that your users are spread across something like ten or more types of hardware, from iOS with [unintelligible 00:10:20.22] and Samsung devices with the Qualcomm Snapdragon, and hardware types like that. So first of all, you need to understand what hardware your users or you are going to deploy on. That’s the first task.

Then you need to understand what is the software stack that is best to use for that type of hardware. If we’re talking about Apple devices and iOS, we have the Core ML, but most of the other types of hardware will probably be better running with TF Lite, or those frameworks that are optimized for edge inference.

And above all that you need to understand what are the limitations of the hardware that you’re going to use. For example, memory constraints, memory bottleneck of loading the weights, loading the data, and stuff like this… And the performance - what are you going to get if you will run, for example, an object detection model on a Jetson, how many frames per second are you going to get? How many video streams can you put on that Jetson? And this is kind of something that if you will get to the accuracy, you’ll finish building a model and only then you will measure it on the device that you are looking to deploy on; you’ll get back to square one, to redesign the model in order to get the SLA, the latency that you’re looking for at the edge… And those are things that you need to have a holistic approach and to see how you solve all of them together. It’s kind of a multi-constraint optimization where the accuracy, the latency and the model size are things that you need to consider together when building an edge AI application.

Daniel Whitenack

[11:49] And that maybe leads me into another question about that. I do wanna get into the specific methods and technology that you’ve been involved in developing… But it sounds like you’re also sort of suggesting a different kind of workflow that people can have in their mind, where as you contrast it to “Oh, I’m gonna build my model, and the environments in which I’m training my model and testing it are totally different from those where I’m gonna deploy it”, how can people adjust their workflow to maybe – is it a matter of always making sure that you have a testing cycle where you’re testing on the hardware that you are targeting in the end, or how can that be integrated into data scientists’ workflows in a better way?

Yonatan Geifman

So we are kind of pushing a hardware-in-the-loop development approach, where you bring the inference hardware into the development stage very early, in the model selection stage, when you’re considering some models - usually based on some open source repositories and academic papers - and you have to measure them at the beginning. This is kind of the first step in understanding if you’re going in the right way. After that, you need to understand or think how can you bring that model that meets the SLA to the accuracy that you’re looking for. So this is kind of the opposite of the usual approach of first reaching the accuracy, and then reaching the latency. And this is for edge applications with constraints about the latency and the model size. But those need to be considered at the beginning, and not at the end.

Daniel Whitenack

It makes sense. I am wondering maybe if you can just give a broad-stroke sketch of what people – let’s say that they get to a point where they have the model that they want, and it’s not quite optimized for the hardware target that they have in mind… Maybe there’s too much latency, or they need to shrink the model, or something… What generally are people trying out there in terms of methods for optimizing their models for inference on certain hardware?

Yonatan Geifman

There’s something we call the inference stack, where at the bottom we consider the hardware itself. On top of that we have a layer that we [unintelligible 00:13:59.26] On top of that we have open source methods like pruning and quantization. Pruning is a method to reduce and eliminate unneeded neurons and connections in the network. Quantization is representing the weights and the activations of the network in lower-bit representation like 8-bit, or something like that. So those all are kind of open source and public methods and techniques in order to build more efficient inference.
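To make the two techniques just described concrete, here is a minimal sketch in plain Python. It assumes symmetric 8-bit quantization and simple magnitude-based weight pruning, which are only one common variant of each idea; the function names and the sample weights are hypothetical, and production toolchains do considerably more (calibration, per-channel scales, structured pruning of whole neurons).

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [v * scale for v in quantized]


def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights, chosen only for illustration.
weights = [0.50, -1.27, 0.03, 0.90, -0.02, 0.31]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
pruned = magnitude_prune(weights, 0.5)

print(quantized)
print(pruned)
```

The restored weights differ from the originals by at most half a quantization step, which is the accuracy cost traded for the smaller representation.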

On top of that, what we’re doing and specializing is kind of the model design approach, where you need to select the right model that is optimized both for the data to reach the accuracy, and also for the hardware to reach the latency. And what we understand today is that different hardware types prefer different types of models. I will give a simple example - if you think about a GPU which has parallelism capabilities, it will prefer large, wider layers, with fewer layers in the network, because it can parallelize the layer itself, but it cannot parallelize between layers.

In contrast, in a CPU there are low parallelism capabilities, and you will probably prefer having narrow layers, with fewer neurons, but you can have more layers. And this is kind of a general idea how the inference speed could be affected by the model structure. And by optimizing the model specifically for the inference hardware that you’re looking to use, you can get a significant boost in the performance, compared to having kind of an open source model or off the shelf model that you just took from an academic paper, or from GitHub, or something like that.
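The wide-versus-deep point can be illustrated with a small calculation. The sketch below assumes plain fully connected layers and uses hypothetical layer widths; it shows two networks with the same number of multiply-accumulate operations but a different number of layers that have to run one after another. Which of the two is actually faster on a given device still has to be measured on that device.

```python
def mlp_cost(layer_widths):
    """Return (multiply-accumulates, sequential layer count) for a dense MLP."""
    macs = sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))
    depth = len(layer_widths) - 1
    return macs, depth


# Hypothetical shapes: same total work, arranged differently.
wide_shallow = [256, 1024, 256]   # 2 wide layers
narrow_deep = [256] * 9           # 8 narrow layers of 256 x 256

print(mlp_cost(wide_shallow))
print(mlp_cost(narrow_deep))
```

Both shapes cost 524,288 multiply-accumulates, but the second needs four times as many sequential steps, which is the kind of structural difference that highly parallel hardware and less parallel hardware handle differently.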

Break: [15:49]

Daniel Whitenack

So Yonatan, I wanna maybe get into some of those design aspects that you were talking of. And I find it fascinating that you’re sort of diving into this area of thinking about what sorts of layers run best on what sorts of hardware. Has that research around that topic - has that been going on for some time, and you’re able to sort of build on that? I’m curious, because hardware is progressing so rapidly, even in terms of accelerators, like GPUs and VPUs and TPUs, and all sorts of U’s - all of that is advancing very rapidly. So it sounds like you have some intuition around what runs best on what hardware, but is that rules that you’re encoding into your methods in order to optimize models for certain hardware, or what’s going on there? I know you’ve got this sort of architecture search concept.

Yonatan Geifman

Yeah, there are many types of hardware, and we cannot have rules for every type of hardware. Designing neural architecture is very complicated, because it’s kind of a composition of multiple layers, with multiple types and sizes. So searching among neural architecture is a large search space, with a very complicated search. So we can’t have a rule of thumb, like “For this hardware use that operation, and for this hardware use this operation.” What we are doing - we are employing neural architecture search that is hardware-aware, in the sense that it’s connected to the hardware, and optimizes the structure of the network to both have better accuracy on the data, and better latency on the hardware. And this is something that we must connect to the specific hardware in order to get that, because you can’t really model the hardware and understand how a specific neural network will behave on that hardware.

So you can think about it as a hardware in the loop neural architecture search, where we have an automatic search algorithm that searches among hundreds of thousands of neural architectures and finds the best one that kind of sits on the sweet spot between the accuracy and the latency on the given hardware. Searching in that space is very complicated, because the latency can be measured or estimated by using the hardware, but in order to understand the accuracy of a given architecture, you must connect the data and train it, and training candidate architectures could be very expensive.

So the trick here or the secret sauce here is “How can we do it efficiently?” How can we scan hundreds of thousands of neural architectures and find the best one for your operational point in terms of latency/accuracy trade-off? And this is kind of what we do with our proprietary neural architecture search algorithm that is called AutoNAC (Automatic Neural Architecture Construction).
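As a toy picture of what a hardware-in-the-loop search does, the sketch below scans a tiny grid of candidates and keeps the fastest one that clears an accuracy floor. This is a brute-force illustration only, not AutoNAC, whose algorithm is proprietary and not described in the conversation. The two scoring functions are hypothetical stand-ins: in a real search the latency would come from timing the candidate on the target device and the accuracy from training and evaluating it on the data.

```python
import itertools


def measure_latency_ms(depth, width):
    """Hypothetical stand-in for timing the candidate on the target device."""
    return 0.5 * depth + 0.002 * depth * width


def estimate_accuracy(depth, width):
    """Hypothetical stand-in for training the candidate and evaluating it."""
    return 1.0 - 1.0 / (1.0 + 0.01 * depth * width)


def search(depths, widths, min_accuracy):
    """Return (latency, depth, width) of the fastest candidate meeting the floor."""
    best = None
    for depth, width in itertools.product(depths, widths):
        if estimate_accuracy(depth, width) < min_accuracy:
            continue  # violates the accuracy constraint
        latency = measure_latency_ms(depth, width)
        if best is None or latency < best[0]:
            best = (latency, depth, width)
    return best


best = search(depths=[2, 4, 8], widths=[64, 128, 256], min_accuracy=0.8)
print(best)
```

With nine candidates an exhaustive scan is trivial; with hundreds of thousands, and with every accuracy estimate requiring training, the cost of the accuracy step is what makes the efficiency of the search the hard part.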

Chris Benson

Could you take us through maybe kind of an example, just to make it very tangible for the practitioners that are listening to this, in terms of how the practitioners can use neural architecture search to do that, maybe ideally with an edge use case? Just to give a sense of how you’re approaching it and what that feels like if you’re targeting inference on the edge ultimately.

Yonatan Geifman

[20:04] Yeah. So we have a collaboration with Intel that we’ve recently published in [unintelligible 00:20:06.23] last September, or something like that, about the performance boost that we’ve made to image classification model ResNet-50 for a MacBook laptop, for example. We’ve done that for several types of hardware, but let’s take for example the edge use case with the MacBook Pro. So we’re having the baseline model; we have three inputs to the algorithm of the neural architecture search. We have the baseline model, which in this example was ResNet-50. We’re having the hardware itself and the data. The data was ImageNet in the MLPerf benchmark, and the AutoNAC algorithm kind of searching what types of changes or how can it change the original architecture of ResNet-50 in order to get a better architecture that preserves the accuracy, of 76% accuracy, for example, and minimizes the latency or maximizes the throughput? And it is very interesting to see that first of all it replaces some of the operations from dense convolution layers [unintelligible 00:21:10.27] and some other variants of convolutional layers that are, let’s say, less memory-bounded, because the cache memory also limits the inference speed of the network.

The second thing that we can learn from that idea is that some layers are more important than others. For example, if you think about what size to have on each layer in the network, we have some understanding that the initial layers that are doing the feature extraction in the image classification example are more important than the later layers in the network. So putting kind of most of the computation at the beginning of the network seems to have better accuracy-preserving properties. This is kind of what we see when we look at the results of that run of AutoNAC on this example… And we end up with a model that is something like 3x faster compared to the baseline, and having the same accuracy.

So this is kind of how we take an algorithm that is fully automatic and try to do a post-mortem to understand what happened there; why the output of this algorithm looks like that. Why the initial layers haven’t changed, or almost haven’t changed, but the later layers have changed significantly, replaced with other types of layers… Their size was smaller, and some other changes that we observed on the result of the algorithm.

Daniel Whitenack

I wanna actually get back to that later, because it’s really fascinating that you can sort of use these tools to learn more about the types of things that work well architecture design-wise on certain hardware; that’s really interesting. But first, just to sort of bring that example into focus - it sounds like you had this base ResNet model. I’m just thinking about the sort of inputs/outputs of the automatic network architecture search. Like, if I’m a practitioner and maybe I have my own custom model for object detection, or a custom model for speech recognition or whatever it is, and I train that, then in terms of doing this automatic neural architecture search I’m assuming one input to that is the serialized version of my model, in whatever format it’s in.

You’ve also mentioned that a dataset was input to that… Maybe you could give some details on that. Why is the dataset input? Is that so you can make sure that you’re not optimizing just for the hardware, but you’re also optimizing to make sure that performance doesn’t degrade in terms of prediction performance? And then also, do I need to have access to the specific hardware that I’m targeting? So I need like this dataset, and then do I need the specific hardware that I’m targeting and run the automated neural architecture search on that hardware? Or how does that work?

Yonatan Geifman

[24:02] Yeah, so you got it right. Let’s formulate the equation that the neural architecture search is trying to solve. We are talking about minimizing the latency of the model on a given hardware, subject to getting an accuracy that is above a given threshold. So the latency of the architecture or the model can be measured without training in most of the cases on the hardware. So we don’t need the data to understand what is going to be the latency of ResNet-50 on a CPU of Intel. But the accuracy is data-dependent. And if we want to put that constraint - and obviously, we want to put that - we need to have the data and verify that the model that is selected by the minimization problem of the latency still meets the accuracy requirements. Because [unintelligible 00:24:47.05] the accuracy constraint, we’ll end up with a model with one neuron that’s predicting nothing. So this is kind of the composition between the latency that is measured on the hardware and the accuracy that is measured on the data.
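The problem stated in words above can be written compactly. The notation here is an assumption introduced for illustration, not something used in the conversation: $\mathcal{A}$ is the search space of architectures, $h$ the target hardware, $D$ the dataset, and $\tau$ the accuracy threshold.

```latex
\min_{a \in \mathcal{A}} \; \mathrm{Lat}(a, h)
\quad \text{subject to} \quad
\mathrm{Acc}(a, D) \ge \tau
```

The objective depends only on the architecture and the hardware, which is why latency can be measured without training; the constraint depends on the architecture and the data, which is why the dataset is a required input.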

Chris Benson

Could you talk a little more about the output path on that, in the sense of – let’s say that you have a model that’s doing object detection, or it could be anything really, and you wanna put it on maybe a platform that’s on the edge, that has a bunch of sensors and maybe a bunch of cameras pulling in… When you’re going through the training process, how are you accounting for that variability out there? Kind of going back to if you’re not targeting the hardware in the architecture search that you’re gonna deploy to, how do y’all account for that? How do you say “Oh, there’s a very unique configuration for my output target, my deployment target?” How do you approach that?

Yonatan Geifman

Are you asking what happens when we don’t know the hardware in production?

Chris Benson

Well, yeah, I guess – like, if you’re targeting a particular environment that may be customized fairly significantly for deployment, as more and more practitioners are now kind of getting out of the data center and they are putting things on drones and they’re putting things into automotive being a big one, obviously… Anything like that that’s out on the edge and kind of has a custom environment - what does that look like from a practitioner perspective? Kind of out of the theory and into the hands-on.

Yonatan Geifman

So we prefer to connect to the actual hardware that the model is going to be deployed on. For example, if it’s Jetson, we connect the exact Jetson model to the neural architecture search. If we don’t know that, we need to have some proxies.

Chris Benson

Gotcha.

Yonatan Geifman

A good proxy might be the number of floating point operations, but we know that this is not such a good proxy, and some other methods, like pruning, target this metric. But this metric correlates to latency only on CPUs. So having proxies there - it’s not the right approach, in our perspective. Measuring the metrics that really matter on the actual device is the way to go, in my perspective… Like, measuring the actual latency, measuring the actual throughput on the device is the way to go in order to understand the exact performance that you’re going to see in production… Because I can show examples where [unintelligible 00:27:05.23] factor of two, and the latency is getting slower on GPU, on Jetson, on some types of devices. So these proxies are very problematic today, and I think that this is kind of the interesting part of doing hardware-aware neural architecture search, compared to other compression techniques, such as the pruning I mentioned, which reduces the number of FLOPs, or the number of parameters, or any proxy for the size or the complexity of the network.
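Measuring latency directly, rather than relying on a proxy, usually comes down to a small timing harness run on the device itself. The sketch below is one minimal way to do that with the Python standard library, assuming a few warm-up calls followed by a median over repeated runs; `dummy_inference` is a hypothetical stand-in for a real model's forward pass, and the warm-up and run counts are arbitrary.

```python
import statistics
import time


def median_latency_ms(fn, warmup=5, runs=30):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # let caches and lazy initialization settle before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_inference():
    """Hypothetical stand-in for a model's forward pass."""
    return sum(i * i for i in range(10_000))


latency_ms = median_latency_ms(dummy_inference)
print(f"median latency: {latency_ms:.3f} ms")
```

The number this prints is only meaningful for the machine it ran on, which is the point being made: the same model has to be timed again on each deployment target.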

Daniel Whitenack

So Yonatan, as you’re working in this space, one of the big things that I always start thinking about when I think of optimizing networks in this way is that there’s just so many different types of layers and custom layers that people are using, and new stuff coming out all the time… So what has it been like sort of maintaining your search space over time and growing that space to sort of include new things as they’re coming out? How do you approach that as an organization and figure out how to expand that search space, and what to include and what not to include?

Yonatan Geifman

[28:12] That’s a good question, because the research field of deep learning is progressing very fast… And we have a team of researchers that’s sitting on the latest academic papers that propose all those new layers and those new operators, and kind of reproduce all these models to understand which types of layers, which types of techniques are worth adding to the search space of the neural architecture search and which are not. And actually, the result is very interesting - in most cases, for example if we’ll take the computer vision domain, the basic operators that are well-known for the last five years or something like that are performing the best. Some of the tricks that we see now are not improving so much compared to using those blocks and operators that build ResNet and MobileNet and EfficientNet and those networks. So having the right composition of operators is more crucial than having all those fancy tricks that showed up in the last 2-3 years. This is something that we see. It’s quite a general claim; there are some cases that we see things that are worth adding to the search space, but in general, I would say that it’s not so easy to beat a ResNet model that is quantized and uses a graph compiler like TensorRT or something like that, and you need to work hard in order to build. So this is kind of how we see all the advancement.

Of course, we have other advancements in other fields, like training tricks, optimizers and stuff like this, and we have to be on the frontline on all of those. This is something that we are working really hard in order to reproduce all state of the art, all the types of models, all those new models that were just announced, and having the results and the operators in our search space.

Daniel Whitenack

And a follow-up on that, I guess, which is related to that approach, is - on the one side, you have all of these different types of architectures. On the other side, you have all of these different tasks being solved with deep learning models and AI models… So I’m curious, as you’ve experienced this over a number of years now, is it harder to optimize the inference of certain tasks versus other tasks? So maybe like NLP tasks versus computer vision tasks versus audio tasks versus - you know, maybe it’s time series modeling, or something… Are there certain domains of AI tasks that are harder to optimize than others?

Yonatan Geifman

So at the moment we are mostly focused on computer vision and NLP. And in those domains, we see that the principles that we are using, that are machine learning-based, are working across the board. Yes, I can tell that there are some tasks in a domain that are a little bit more complex. For example, semantic segmentation networks are more complex than classification networks, and they have to preserve the information along the network in order to do kind of image-to-image tasks. But also on those types of neural architectures - we can optimize them, and the principle is very simple. Most of the networks and the new, fancy algorithms are kind of built on top of three components. One is [unintelligible 00:31:29.11] few layers connected to the input. The second component is the backbone, and the third component is the prediction block. In most cases, 80% to 90% of the compute is happening in the backbone. And usually, the backbone are just a bunch of convolutional layers, in the case of computer vision. And by optimizing that significant part of the network, which is similar across classifications, semantic segmentation and object detection, you can get that boost of performance that we are looking to have in all tasks.

Chris Benson

[32:04] As we’re talking about this, I’m just visualizing everything in my head… As you get to the output and you’ve targeted a particular deployment target, and accounted for the hardware and what the capabilities are, I’m curious, how does your platform integrate in with whatever DevOps pipeline a practitioner might have put into place? What is a typical scenario for actually pushing the deployment out to the hardware that it’s gonna run on for inference look like?

Yonatan Geifman

So we look at our platform as an end-to-end platform, from development to production. We develop two production tools, one of them called Infery, and the second RTiC. Infery is a lightweight, edge inference engine that could be integrated into a monolith application easily. And RTiC is a containerized inference server. So if you’ll take that solution, it could be easily deployed by DevOps with the model inside fetched from the model repository that we provide as part of our SaaS offering, and kind of serve the model in a standardized API that contains all the packages, libraries and environment details that you need in order to go over Kubernetes for inference at scale.

Sometimes we see companies that already have their own infrastructure and don’t want to change their existing infrastructure, and we provide them with the specific model, just the exact model that came out of the optimization, in their format. We support all the types of formats, from ONNX to TensorFlow, PyTorch, Keras and all of these frameworks that you can run inference on.

So these are the two ways to get a Deci optimized model to a production environment, either by our deployment tools, or getting the model and using your existing stack.

Daniel Whitenack

I’m just kind of browsing around on some of the information about Deci, and it’s super-fascinating. One of the things that I see is this idea of DeciNets, which you share about, and share some of the successes that you’ve had taking this approach in various domains, for various types of models… Could you just share a few success stories in terms of what you’ve been able to achieve performance-wise with this approach?

Yonatan Geifman

Yeah, so DeciNets is a good example for that. We took a few well-known tasks like image classification, object detection and semantic segmentation, and we’ve taken the most famous open source datasets, for example ImageNet, COCO, and datasets like that, and we built kind of a catalog of pre-optimized models for each and every hardware. And what we are doing - we’re kind of plotting what we call an efficient frontier chart, with the latency on the X axis, and the accuracy on the Y axis… And plotting all the models that we know - the ResNet, the YOLO, and those models on those charts, and putting those DeciNets or [unintelligible 00:34:56.25] for detection, and kind of seeing what we call the efficient frontier - how those models reach better accuracy and better latency, and dominate this trade-off between the accuracy and latency. And now we provide those DeciNets for data scientists in order to try to fit them to their specific data, fine-tune them for their specific data, and having kind of pre-optimized results of AutoNAC that is ready for use immediately.
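The efficient frontier described here is what is usually called a Pareto frontier: the set of models for which no other model is both faster and more accurate. The sketch below shows one simple way to compute it. The model names and the latency and accuracy figures are invented for illustration and are not benchmark results for any real network.

```python
def efficient_frontier(models):
    """Return the models not dominated on (lower latency, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; on equal latency, most accurate first.
    for name, latency, accuracy in sorted(models, key=lambda m: (m[1], -m[2])):
        if accuracy > best_accuracy:
            frontier.append((name, latency, accuracy))
            best_accuracy = accuracy
    return frontier


# (name, latency in ms, accuracy) - made-up values for illustration only.
models = [
    ("model_a", 4.0, 0.76),
    ("model_b", 6.0, 0.74),  # slower and less accurate than model_a
    ("model_c", 9.0, 0.80),
    ("model_d", 2.5, 0.71),
    ("model_e", 9.0, 0.78),  # same latency as model_c, less accurate
]

frontier = efficient_frontier(models)
for name, latency, accuracy in frontier:
    print(name, latency, accuracy)
```

A slower model only earns a place on the frontier if it is more accurate than every faster one, so `model_b` and `model_e` drop out here.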

Daniel Whitenack

And I’m wondering, as you plot out this landscape – I love how you termed that the efficiency landscape. That’s a really cool way to think about that space. As you plot this out and explore that space yourself, I’m wondering - one way to think about what you’re doing and how you’ve expressed it is “I’m a data scientist, I’ve trained my model, now I run it through auto neural architecture search and get out my better model, faster and more efficient for the architecture, while still performing well for prediction.” But I’m wondering if, as you go through this sort of cycle more and more, you start building some intuition as a data scientist or practitioner, so that you start with a better model in the first place.

[36:10] If I look at your plots of the efficiency landscape, can I learn some things about maybe – maybe I start with better models in the first place, rather than relying as much on the neural architecture search. Do you think there’s that sort of feedback and that learning that can happen from what architectures are learned by the automatic neural architecture search?

Yonatan Geifman

Yeah, absolutely. We share some information [unintelligible 00:36:35.08] for a given hardware. For example, one of the things that we see is that those architectures are faster, but have more parameters. This is kind of an intuition that we get from these models: we aren’t always looking for smaller models, but for faster and more accurate models. We provide those models as a starting point.

For example, if you’re considering taking a ResNet-50 or EfficientNet B0, you can take the corresponding DeciNet model and start from that, and tweak that for your application, whether you’re doing object detection, classification, or anything like that. And this kind of gives you the ability to use a mass-produced model, compared to having a model that is off the shelf, for general use. So for example, EfficientNet is a result by Google from two years ago, or something like that, that is supposed to be efficient, but EfficientNets are not so efficient for GPUs, even when [unintelligible 00:37:41.27]

So having a pre-optimized model for the given hardware that you’re going to use is crucial, and in our example of DeciNets we show that you can have a model that is something like three times faster than EfficientNet B0 for a Jetson GPU, while having even better accuracy.
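
A claim like "three times faster" comes down to measuring inference latency for two models on the same hardware and taking the ratio. The sketch below shows one simple way to do that with the Python standard library, assuming each model is wrapped in a zero-argument callable. The function names `median_latency_ms` and `speedup` are hypothetical, and this is not how Deci benchmarks its models.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=10, clock=time.perf_counter):
    """Median wall-clock time of fn() in milliseconds.

    Warmup calls are executed but not timed, so one-off costs
    (caching, lazy initialization) do not skew the measurement.
    The clock is injectable so the function can be tested deterministically.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = clock()
        fn()
        samples.append((clock() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workloads; in practice each callable would run one inference
# of a real model on the target device.
def baseline_model():
    return sum(i * i for i in range(30_000))


def candidate_model():
    return sum(i * i for i in range(10_000))


baseline_ms = median_latency_ms(baseline_model)
candidate_ms = median_latency_ms(candidate_model)
print(f"speedup: {speedup(baseline_ms, candidate_ms):.2f}x")
```

The printed ratio varies from machine to machine, which is the point made in the conversation: latency is a property of a model on a particular piece of hardware, so the comparison has to be run on the device you plan to deploy to.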

So this is kind of a result that you can take off the shelf, instead of using EfficientNet B0; you can take that model and train it to your application, build on top of that some other prediction [unintelligible 00:38:18.11] some other tasks that you want to solve with that backbone, and get an AutoNAC result without running all the neural architecture search for that specific task.

Chris Benson

That’s very cool. This is so interesting, as you were talking about model optimization and things that Deci has done with it… What are you envisioning as you guys are looking into the future at this point, what kinds of things are aspirational for Deci, in terms of where do you wanna take the platform and what you envision will be the next thing in model optimization that you’d like to implement?

Yonatan Geifman

I think that we feel we’re at a good point in the model optimization space, and now we’re seeing that we need to expand to the whole development lifecycle. After you have the data, we’re looking at controlling all the training, optimization and deployment of the deep learning model on our platform, whether you’re using those DeciNets or using some off-the-shelf models, and kind of having a full workflow of development, optimization and deployment based on our platform. That’s because we understand that there’s kind of a triangle that we can draw: on one corner we have the model, on another one we have the data, and on the last one we have the hardware. And this is kind of a combined optimization, and every data scientist needs to understand how to solve it. And we are kind of providing the tools to optimize this triangle.

For now, we are mostly focused on the model side, but in the future we’ll also be focused on the data and the hardware side, in terms of not having them fixed, but having some techniques for data enrichment, data augmentation, self-supervised learning, having some hardware recommendation system, maybe having some FPGA capabilities and having our hardware that is optimized for the given model… And this is kind of the far future of how I see optimization in full.

Daniel Whitenack

That’s awesome. Well, I am really excited that we got to talk through this, because I know I learned a lot about this neural architecture search and the things that you’re doing. I’m really impressed with where you’re headed with this, and I appreciate you taking time to join us and chat with us about it.

Yonatan Geifman

Sure. Thank you very much. It was great to talk with you, and I look forward to hearing the episode.

