Fully Connected – a series where Chris and Daniel keep you up to date with everything that’s happening in the AI community.
This week we discuss all things inference, which involves utilizing an already trained AI model and integrating it into the software stack. First, we focus on some new hardware from Amazon for inference and NVIDIA’s open sourcing of TensorRT for GPU-optimized inference. Then we talk about performing inference at the edge and in the browser with things like the recently announced ONNX JS.
DigitalOcean – DigitalOcean is simplicity at scale. Whether your business is running one virtual machine or ten thousand, DigitalOcean gets out of your way so your team can build, deploy, and scale faster and more efficiently. New accounts get $100 in credit to use in your first 60 days.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.
Linode – Our cloud server of choice. Deploy a fast, efficient, native SSD cloud server for only $5/month. Get 4 months free using the code changelog2018. Start your server - head to linode.com/changelog
Hi there, this is Chris Benson and welcome to another Fully Connected episode of Practical AI, where Daniel and I will keep you fully connected with everything that’s happening in the AI community. We take some time to discuss the latest AI news, and we dig into learning resources to help you level up on your machine learning game. How’s it going today, Daniel?
It’s going great. I’m excited about some of the news we’ve got going on today.
Yeah, I love the format, the way we’re diving into it… For those of you who may have listened to our last Fully Connected episode - hopefully, it was as good an experience for you. We’re definitely listening to your feedback, trying to shape the show to better serve your needs going forward.
I’ve been talking to a couple people this week, and there’s just so much going on… It’s good to just have a chance - for me personally - to talk through some of these things, because there’s so much going on, there’s so many topics, there’s so much jargon… To kind of try to put some of that into words is, I think, helpful; we’re learning along with everybody listening, so keep us honest and let us know what we get right or wrong as we’re going through this stuff.
Yup. And if you haven’t already, we hope you will join us in our Slack community, at Changelog.com. We have great feedback, great conversations that are happening there between the shows… We’re also on LinkedIn, in a LinkedIn group, and we hope you will join us on LinkedIn. You can just search for Practical AI.
Awesome. Well, this week, as I was going through and looking through Twitter and various news sources, one of the themes that came up when I was looking through things was really having to do with all the things that happen after we train our AI. The question is, you know, we’ve trained an AI model - what next? In your opinion, Chris, what happens next? What happens after you train an AI model? What do you do? How is it useful?
It’s funny - before I answer that, I’ll just note that this is the side of things that we tend not to think about too much until we get there. The courses that are out there are really focused on training, and architecture, and people will kind of say “Okay, I’ve got it”, but your model doesn’t do any good until you deploy it into the real world and it’s useful for your customer, for your end user.
I know that as I was learning my way through the field, this has been a bit of a challenge, because the deployment environments and what you’re targeting for deployment can be very different, and the standards have been slow to arrive there. That’s changing now, but as we started out before, some of these standard approaches were starting to come into being, every vendor was different, and that was a real pain.
[00:04:12.10] Yeah, and for those of you that are new to some of this jargon too, what we’re talking about here - you can kind of think about this AI model as a sort of really complicated function that has a bunch of parameters in it, so when we do training, we’re using a whole lot of data through this training process to tune and tweak all of those parameters of our models… And we might have millions of these parameters, that parameterize our AI model function to do something, to transform an incoming image into an indication of objects in that image, for example.
So the question is, you know, once we’ve gone through that process and set our parameters, now we have this function that can transform data - what do we do with it? What are some of the things that you’ve done after training, or you’ve needed to do after training, or you’ve seen other people do after this kind of training process, Chris?
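As a rough illustration of the "function with parameters" idea described above, here is a minimal Python sketch. It is a toy, not anything discussed on the show: the `model` and `train` names, the two-parameter linear function, and the made-up data are all assumptions chosen so the whole thing fits in a few lines. Real models have millions of parameters, but the shape of the process is the same: training adjusts the parameters until the function maps inputs to the desired outputs.

```python
# A toy "model": a function whose behaviour is fixed by its parameters.
def model(x, params):
    w, b = params
    return w * x + b


def train(data, steps=2000, lr=0.01):
    """Tune (w, b) by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (model(x, (w, b)) - y) * x for x, y in data) / n
        grad_b = sum(2 * (model(x, (w, b)) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical training data generated from y = 3x + 1.
data = [(x, 3 * x + 1) for x in range(-5, 6)]
params = train(data)
```

After training, `params` holds the tuned values, and `model(x, params)` is the function that the rest of the conversation talks about deploying.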
Well, honestly, a lot of it involves cooperating with other teams in mid-sized or larger companies; if you’re in a small company, it may be just yourself… A model is only useful if you are able to integrate it into some software that’s gonna go out onto your target device, where you’re deploying. That’s a whole different set of skills.
So when you say “integrate it”, what is the integration, or what are you integrating, really?
So you would take a trained model and you have to put it into a software package, and therefore the model has to be in a form that’s usable. By usable, it means you have a trained neural network that is able to operate on the hardware and software environment that you need to put it in in the end, and it needs to be able to have access to the data that is gonna be feeding through it for inferencing purposes, so that you’re actually operating.
There’s a lot of stuff to think about there that your traditional data scientist may never have had to deal with before. There’s a lot of software engineering, and maybe even systems engineering involved in trying to get it out there, so I thought this was a great topic to go ahead and delve into, and talk about what those pain points are.
I’m glad you brought up the software engineering side of things. If you’re trying to code some AI stuff, whether you’re a software engineer or not, you probably know that this idea of functions or handlers or classes are part of software that we build… So in my mind, as I’m translating what you’re saying, Chris, I’m thinking about in a web server that’s serving a website, or something, we might have a whole bunch of functions that do something. You give it a specific request and it gives you content back; maybe a picture, a video, or just some HTML, or JSON, or something. So in integrating AI into that, really we’re saying that at some point in those functions or classes or other things that are part of the software that’s running in production in our company, somewhere in there we’re actually accessing this model that you’ve mentioned…
So it has to be in some form, like you said, to be accessed, and most of the time that’s a trained form. In other words, we train our model and then we save it somehow, and then we load that saved or serialized model into one of these functions, and then just execute the data transformation that it does, like I said, from image to objects, or something like that. That process of utilizing the function is called inference. With that – I don’t know, did I miss anything there, Chris? Or any jargon that you think is relevant?
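The save, load, and execute cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter values are made up, JSON stands in for whatever serialization format a real framework would use, and `save_model`, `load_model`, and `predict` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_model(params, path):
    """Serialize trained parameters to disk (JSON is a stand-in format here)."""
    with open(path, "w") as f:
        json.dump(params, f)


def load_model(path):
    """Load previously saved parameters back into memory."""
    with open(path) as f:
        return json.load(f)


def predict(params, x):
    """Inference: apply the trained function to new input."""
    return params["w"] * x + params["b"]


# Hypothetical parameters, as if they had come out of a training run.
trained = {"w": 3.0, "b": 1.0}
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(trained, path)

# Later, possibly in a different process: load once, then run inference.
loaded = load_model(path)
result = predict(loaded, 2.0)
```

The point of the split is that training and inference can live in different programs; only the saved artifact has to travel between them.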
[00:07:47.16] No, I think another word that you might use to simplify things is just think of it as you need to wrap your model up as a software component… And just as whatever your software that you’re deploying may have a number of components that make it up, the models are also components; they’re components wrapped in whatever language you’re deploying in… So maybe while you’re training your model in Python, in TensorFlow or PyTorch or whatever you’re using… It may be that you’re deploying in C, or C++, or Java, or - I know you and I love Go as well… And you’re doing the inferencing, as opposed to the training, through that way. So you think of the model as a piece of that software component going forward. It’s part of deployment, and all the things that surround software engineering and deployment go into that.
Yeah. So when you’ve deployed models in this way a lot of times, what’s been the access pattern, or how have people interacted with the model? I know for me it’s been a lot of times integrating the model into some sort of API. We can talk about it a little more later, as related to some of the news… But essentially just where it’s integrated into kind of like a web service, where you would make a request for a prediction and get back a result. Have you seen other patterns? That’s the one I’ve seen most often probably.
Yeah, it’s always in the form of a service, using the term loosely. I’ve seen web services used most often on the server side, where you may not be constrained by your connectivity, and stuff… A lot of times though, if your deployment target is an IoT device or a mobile device, you still have an API, but it’s really operating as a function, to use the phrase you were using earlier. The API may not be a public API; it may just be one that your software component uses inside the group of software components that constitute your solution.
It doesn’t really matter, in my view, so long as you are essentially following the best practices of the environment in which you’re coding and what your deployment target is made up of.
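For readers who have not seen the "request a prediction, get back a result" pattern, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption for illustration: the `/predict` path, the JSON request shape, the port, and the hard-coded parameters are not from the episode, and a production service would add validation, batching, and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained parameters; a real service would load a saved model.
PARAMS = {"w": 3.0, "b": 1.0}


def handle_predict(raw_body):
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(raw_body)
    prediction = PARAMS["w"] * payload["x"] + PARAMS["b"]
    return json.dumps({"prediction": prediction}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (hypothetical) endpoint is served.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = handle_predict(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)


# The same function serves as an in-process API, with no network involved.
response = handle_predict(b'{"x": 2.0}')
```

Keeping `handle_predict` separate from the HTTP handler mirrors the point made above: on a server it sits behind a web service, while on a mobile or IoT device the same function can be called directly by other components.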
It makes sense, yes. That brings us right into some of the news related to this that came up this week. First, let’s focus in on this inference service, or server, side of things. One of the things that I saw come out this week was an announcement from NVIDIA that their TensorRT Inference Server was now open source. TensorRT - I think it’s been around a little bit, but this was the official announcement of the TensorRT Inference Server as an open source project.
This is a project from NVIDIA, and part of the goal of TensorRT, in my understanding, is to perform these inferences that we’ve been talking about, post-training, when you’re actually utilizing your model, and to do that in a very, very optimized way, maybe on certain specialized hardware, for example on GPUs, which NVIDIA of course is concerned with.
It was exciting to see this actually be open-sourced and available for the community. It seems like there is a bunch of great stuff in there. It also includes examples of how developers could extend TensorRT to do things like custom pre and post-processing, and integrate additional framework backends… So more than just TensorFlow, but like Caffe2, and others via the ONNX format that we’ve talked about here quite a bit, which is pretty cool.
I was excited to see this… I know that you’ve utilized GPUs probably more than I have, Chris… Have you ever tried to integrate the inference side of things on GPUs?
Yeah, working at some of the employers that I’ve had… And for our cases, we always have a project or service that we’re supporting, we’re always deploying, and so… TensorRT was really the first one that I got into at scale, and one of the great things about it is that it does a number of optimizations to your model, specific to deployment. You’re essentially taking your model and putting it through this process that NVIDIA has, where it optimizes it for inference and then deploys it.
[00:12:04.04] I’m not really surprised to see that NVIDIA has open-sourced their inference server, because they’ve been leading the way in a lot of areas, and forcing some of the other previous giants like Intel to play catch up for a while, but now we’re starting to see the market stabilize a little bit, and seeing more than one player out there… So if they want to continue to be the leader, open-sourcing their TensorRT technology is a very sensible thing to do to make it accessible.
I applaud the move on their part, and I wish they had done this earlier, when we were first learning it… Because you know, being open source now, we can figure out what our problems are on our own a little better, obviously, by going through the source code and not having to worry as much about bugs that are documented, and that kind of thing. It’s a great move on NVIDIA’s part.
I guess one thing to point out here - and correct me if I’m wrong, because I think you have more experience here - it seems like with TensorRT a lot of the focus is on optimization, not necessarily on the setting up an API to access your model…
…although I do see that they have this statement in the article about, you know, to help developers with their efforts; the TensorRT Inference Server documentation includes various things. I think there is a tutorial in there where they’ve illustrated how to set up a REST API with TensorRT, and we’ll link that in the show notes, of course.
I think that’s definitely a helpful thing, because at some points I’ve seen a bunch of – it’s hard for me, at least, when I see a bunch of stuff about optimization, but then I still struggle with the integration part, like we talked about initially… So I’m glad to see them at least have some examples in that regard.
Yeah, I think TensorRT started with those deployment optimizations, and that was kind of its foundation… But it’s definitely provided more and more tools for developers and dev ops engineers to be able to get this out into the real world, and we’re seeing a general push in the industry to do that, from these companies that are supporting with GPUs and other technologies to get that out. It’s getting easier and easier to use these, and TensorRT has definitely been a big part of that for NVIDIA.
Speaking of running inference on specialized hardware, you were mentioning to me right before the show about something that you saw from Amazon, right?
Yeah, Amazon - like we’ve seen with other providers - have announced that they’re launching their own machine learning chip. It’s not something they’re planning to sell; they’re gonna be driving some of the servers in AWS this way… In the article that I was referencing which was a CNBC article, they used the phrase “taking on NVIDIA and Intel”, but I think to some degree it’s them reducing their risk or dependency on specific vendors. I don’t think we’re gonna see vendors out of AWS entirely any time soon… But Amazon not only now has more tools in the toolset in terms of chips that support this type of work, but also it gives them leverage with those vendors in terms of the pricing.
It’s all good from my standpoint, in that I’m hoping that this drives prices down and it gives them a little bit of leverage, and NVIDIA, Intel and Amazon all end up lowering prices. I hope it doesn’t take another path from that.
Yeah, let me know if you think this is a good analogy, because I’m not sure that it is, but… All the cloud providers now pretty much have GPU support, and I think most of those are NVIDIA GPUs, but also Google has developed this TPU architecture, which is only available in Google Cloud. It seems like now Amazon is kind of doing – maybe not the same type of play, but doing some sort of specialized hardware that’s maybe only going to be available in AWS. Do you think that is kind of a similar play, or…?
[00:16:07.03] I do. If we go back to the episode where we had NVIDIA’s chief scientist Bill Dally on, and he schooled us all in GPUs versus TPUs, and ASICs and such, and all the different hardware possibilities here… He talked about the rise of ASICs, and you can think of the TPU (to paraphrase him) as almost a lighter version – a GPU has a whole bunch more to it, other than just doing the math necessary in a neural network. So I think you’re seeing these very specific chips coming out, with Amazon and with the Google TPU.
The GPUs have that same capability, but they also have a whole bunch more. But it seems to be that people really focus on that specialization of doing the matrix multiplication. It’s really kind of commoditizing the industry, because instead of trying to recreate an entire GPU competitively, they’re really focusing on this use case.
Yeah, but it seems to me at least - and I’m not a hardware expert, but it seems to me that all these people are coming up with all of these different architectures, including Intel having the Movidius stuff, and other people having specialized hardware… It seems like there’s just a lot of architectures to support now, and that does seem like a challenge.
Maybe these projects like ONNX are a way to mitigate that challenge, because now we might wanna train a model, and we do that (let’s say) in PyTorch or TensorFlow, but we may want to deploy the inference on one of many different architectures. I don’t know, it seems like there needs to be a central point for standardizing our model artifacts, and I’ve at least had some success with ONNX in that respect.
For those that aren’t familiar, we’ve mentioned ONNX on the show a few times - it’s the Open Neural Network Exchange format, which is a collaboration between a bunch of people, including Facebook and Microsoft, and Amazon, I think… But it’s still pretty rough, in some respects… If you’re trying to serialize a model from scikit-learn to ONNX, for example, there are a few rough edges there, at least in my history, at least with the docs… But it is a really great, ambitious project, and I certainly hope that they succeed, because I definitely see a lot of problems that could arise from trying to support all of these different architectures. It seems hard.
Yeah, I agree with you. I think ONNX was a fantastic first way of providing that commonality across these different technology platforms, and I think that there is still a lot of room, especially within the open source world, of producing other tools with a similar intent. Just as ONNX has provided us that common format, there may be a number of deployment tools that come out where a deployer can focus on learning that as kind of a standards-based approach, rather than all the individual stuff.
I know that in a prior company we were deploying to TensorRT, and something that I’ll bring up, which is the Snapdragon from Qualcomm… While the workflows had similarities, they were completely different workflows that we had to learn, and we had people on the team that kind of specialized in either approach, and stuff. It would be really great if you could target one workflow that would work across vendors in that way.
Yeah, abstract that away. Right before, just a second ago, Chris, you mentioned that you had worked with this Snapdragon before, which I’ll let you describe here in a second… But one of the other trends that I saw in the news and updates in the world of AI this past week was some stuff having to do with running inference, running models in the browser, on mobile, on client devices and IoT devices… This kind of idea of pushing models out of always being run in the cloud, in some service in the cloud, and more towards the “edge”, or the client devices. Is this a trend that you’ve been seeing as well?
[00:20:27.16] Yeah, I think it’s interesting… You’re seeing a lot of inferencing being pushed out to the edge, and I know that the specific use cases that I’ve dealt with have had to do with mobile devices that were kind of leveling up and getting a Snapdragon in them that we’ve deployed to, and also IoT. In the world that we’re in right now, you have lots of mobile and IoT devices that are not nearly powerful enough. I think with the recognition that inferencing is being pushed to the edge, you’re seeing a number of vendors starting to sign up with Snapdragon or similar types of technologies, basically low-power inferencing engines that can be deployed to inexpensive hardware on the edge, with very limited computing resources.
I think you’re going to see that type of thing all over the place, and I think that’s a given at this point, where your inferencing workload is distributed between the cloud and the edge, as it makes sense.
I think the big question now is whether or not there’s enough use cases of doing actually training on the edge, and whether or not that becomes a thing. I don’t think that’s really taken hold; there’s certainly lots of conversations around it, but I haven’t seen it personally in industry, actually being deployed in a production sense.
In the cases where you’re talking about, when you were using this Snapdragon thing, the neural processing engine, the motivation for pushing that inferencing out to a mobile - or it sounds like in your case an IoT device, maybe a sensor or something like that - what was the motivation for that? Was it connectivity, was it efficiency, or timing? What was the primary motivation?
It really depends on the resource environment that you’re deploying into, and also what the performance parameters are of actually operating on whatever–
So by resource environment you mean the actual resources on the device that you’re deploying to, the CPU or something?
Yeah, and there can be a number of cases. An example that I had personal experience in was in speech recognition and natural language processing, where you don’t have time or you may not have an environment equipped with the right network connections to pass to the cloud and then pass back. There’s latency involved in that. If you’re in an environment where you simply don’t have time for that two-tenths of a second of delay, or whatever it is that you’re dealing with… In some cases there are speech recognition technologies where the use case requires that you start processing before you’re even necessarily done speaking a sentence… So you may have already processed the first part of the sentence I’m saying right now before I finish this second part; it may be that the latency issues get in the way. I’ve seen some very specific constraints around that in industry.
There may be some situation where you can go either way, where you can have it be cloud-based, but I think as inferencing becomes easier and cheaper on the edge, you’re gonna see it more and more, to where instead of it being specifically a constraint, you’re gonna see “Where does it make sense to put this, from a cost-benefit analysis?”
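The latency argument above comes down to simple arithmetic: a cloud call pays for the network round trip on top of compute time, while an on-device call pays only for (usually slower) local compute. The sketch below makes that comparison explicit. All of the numbers and the `viable_targets` helper are hypothetical, chosen only to illustrate the reasoning; real figures depend entirely on the network, the model, and the device.

```python
def total_latency_ms(compute_ms, round_trip_ms=0.0):
    """End-to-end latency: compute time plus any network round trip."""
    return compute_ms + round_trip_ms


def viable_targets(budget_ms, cloud_compute_ms, round_trip_ms, edge_compute_ms):
    """Return the deployment targets that fit inside the latency budget."""
    targets = []
    if total_latency_ms(cloud_compute_ms, round_trip_ms) <= budget_ms:
        targets.append("cloud")
    if total_latency_ms(edge_compute_ms) <= budget_ms:
        targets.append("edge")
    return targets


# Hypothetical scenario: a 200 ms budget on a poor connection.
# Cloud: 20 ms compute + 250 ms round trip = 270 ms (over budget).
# Edge: 120 ms compute, no network hop (within budget).
poor_network = viable_targets(
    budget_ms=200, cloud_compute_ms=20, round_trip_ms=250, edge_compute_ms=120
)

# Same budget on a fast connection: 20 ms + 40 ms = 60 ms, so both fit.
good_network = viable_targets(
    budget_ms=200, cloud_compute_ms=20, round_trip_ms=40, edge_compute_ms=120
)
```

When both targets fit, as in the second case, the choice becomes the cost-benefit question raised above rather than a hard constraint.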
I’m thinking back to that – way back at our episode three, where the team at Penn State was deploying this app for African farmers that would classify plants… I’m guessing - I don’t know, but I’m guessing that there’s probably connectivity issues for the devices when they put them out in the field, which is literally the field, like the farming field, in this case… So I imagine that they can’t necessarily rely on inferencing in a cloud environment, because they simply just can’t connect.
[00:24:10.01] I think there’s this one issue of maybe just not being able to connect and having to run that on a device, but of course, there’s issues with that. I remember them talking about inferencing really – if I remember right, kind of draining the battery of the device, and that sort of thing… So I know there are constraints here; I don’t think you can totally just export everything right now to these low-powered devices and expect things to work out great… But there are some encouraging signs.
It could be that when you run training, you run it on a big, beefy server in the cloud, and the reason why you do that is because you have to process a ton of data; maybe you’re processing 200 terabytes of data or something like that… But it doesn’t include sensitive data, or something; maybe it’s anonymized in some case, but then if you transfer that model over and run it in someone’s browser, and then you’re running the inference in their browser, you may be processing their particular data… You’re processing the feed off of their webcam, for example. And if you’re doing that, obviously that could be very sensitive data, so one thing you could do is transfer all of that data up into the cloud, do your inferencing in the cloud, but then you’re essentially taking possession of all of that sensitive data, whereas if you run the model actually in the browser and do the inferencing there, then the user’s sensitive data actually just stays on their device, so you can kind of totally – maybe not totally, but you can avoid many of these kind of privacy and security-related issues in terms of how and what data you’re processing where.
Yeah. And there’s other considerations… A while back in an episode we were talking about the general data protection regulation (GDPR) in the European Union, which though it’s only officially applied there, many organizations are applying it globally, so they don’t have to support multiple business approaches and processes… And it may very well be that by doing the inferencing in your browser, for instance, instead of passing it up to a cloud, you’re able to fit within particular regulations in a given country, where you’re not actually moving the data. The model can be deployed widely, but the data has to stay where it is, and therefore that might be the only option, or one of the only options that you have, short of having servers in every jurisdiction that you’re gonna operate in.
So there’s a strong use case going forward from a regulatory standpoint for being able to just do it right there in the end user’s browser and let them keep the data private. It never moves, it takes the whole regulatory concern - at least that aspect of it - out of the picture.
Yeah… I think there are - with everything that we’ve talked about before, and I guess everything related to this - always trade-offs, right? I was talking to a friend of mine who is at a startup, and part of their startup IP and really the secret sauce of what they’re doing is in their machine learning model, right? But then if you take that model and then you push it out to someone’s client device and run it in their browser, of course there’s always the opportunity for – you’re releasing that model out into the wild and people can maybe just take it and look at View Source in the browser and figure out how to get your model and utilize it, and all of that…
[00:28:25.02] I know that he was concerned about those risks, but it’s probably – I don’t know, in my mind maybe the benefits outweigh the costs, because in the same way, there have been a lot of papers that have shown even for doing inferencing in the cloud, if you’re exposing some service that does inferencing for image recognition or something like that, it only takes a certain number of requests to that API to be able to mock or spoof that machine learning model, and actually create a duplicate of it.
So I guess there will always be those trade-offs, but there is kind of this transfer of the model to the client’s device, which it probably has some trade-offs there, but also these models aren’t super-small, and if you wanna update them over time, maybe there are some storage, or battery, or other sorts of issues going on there. I’ll be interested to see how people deal with those tradeoffs and what ends up becoming the driving force there.
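The point about duplicating a model from nothing but its API responses is easiest to see in the simplest possible case. The sketch below is a toy under strong assumptions: the "secret" model is purely linear, its weights are made up, and `make_secret_api` and `extract_linear_model` are hypothetical names. For a linear model with n inputs, n + 1 well-chosen queries recover every parameter exactly. Real models are nonlinear and need far more queries and an approximate fit, so this shows the principle only, not a practical attack.

```python
def make_secret_api():
    """Simulate a prediction API whose parameters the caller cannot see."""
    weights = [0.5, -2.0, 4.0]  # hypothetical secret parameters
    bias = 1.5

    def api(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    return api


def extract_linear_model(api, n_features):
    """Recover a linear model's parameters using n_features + 1 queries."""
    # Querying the all-zeros input returns the bias on its own.
    bias = api([0.0] * n_features)
    weights = []
    for i in range(n_features):
        # A one-hot input isolates a single weight (plus the known bias).
        probe = [0.0] * n_features
        probe[i] = 1.0
        weights.append(api(probe) - bias)
    return weights, bias


api = make_secret_api()
stolen_weights, stolen_bias = extract_linear_model(api, n_features=3)


def clone(x):
    """A duplicate of the secret model built only from API responses."""
    return sum(w * xi for w, xi in zip(stolen_weights, x)) + stolen_bias
```

This is why keeping a model behind a cloud API reduces, but does not remove, the exposure that comes with shipping it to a client device.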
To go back full circle, when we talk about these deployment technologies such as NVIDIA’s TensorRT, or the Snapdragon neural processing engine - which is called Snappy for short - the optimizations they make will literally change the architecture of the model that you’ve trained when you’re deploying… And there’s a number of techniques that they apply to optimize that. That’s part of that deployment of models out.
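The episode does not say which techniques each tool applies, so as one illustrative example of a deployment-time optimization, here is a sketch of reduced-precision quantization: storing weights as 8-bit integers plus a scale factor instead of as floats. The weights are made up, `quantize_int8` and `dequantize` are hypothetical helper names, and this simple symmetric scheme is an assumption for illustration, not a description of how TensorRT or SNPE implement it.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one scale factor."""
    largest = max(abs(w) for w in weights)
    # Guard against an all-zero weight list, where any scale works.
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical trained weights.
weights = [0.82, -1.27, 0.003, 0.5]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The price of the smaller representation is a bounded rounding error.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The trade is smaller storage and cheaper arithmetic in exchange for a small, bounded loss of precision, which is the kind of change that makes the deployed model differ from the one that was trained.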
The way I see it - it’s great to have all these choices and options that are finally coming into being in the software engineering world. Over the years, the evolution of software has given us many choices for client-side and server-side, and how we’re gonna choose to distribute workloads, and fortunately, we’re seeing that same evolution happen fairly quickly… There’s already a roadmap on that from the software engineering world. We’re seeing that being applied to data science and to AI technology specifically fairly quickly at this point. We’re measuring it now in weeks and months, instead of years or even decades, the way it took in software engineering.
I think having different ways of deploying a given model in the days ahead is gonna allow us to best serve our customers in that way… Choice is good.
Yeah, choice is good in the sense of cost, too. Like you’ve already mentioned, if there’s more choices out there for this type of specialized hardware… I know this has been a big win for Intel’s chips that are in drones, and you can plug in via USB stick, and stuff… It just allows people to do fun things really quickly with deep learning, and also functional things that are really crucial to certain products. I think that you ultimately win as a consumer, right?
I’ve kind of stopped – well, part of me still wants to buy a big GPU workstation, which I probably will never do because I don’t have all the money, but the other side of me says “Well, at this point it doesn’t matter”, because I can get any sort of specialized hardware for doing this stuff in the cloud, and moreover, I can go and buy one of these chips that I can integrate into my Raspberry Pi or another fun device, and just build some fun projects… And when I need more compute power, then I just spin up more on the cloud, so… Yeah, I’m glad that I don’t have to keep that saving going for a huge GPU machine that’ll sit in my office… Although it’d probably be good for heating.
[laughs] Just through employers, I’ve had the privilege of having access to DGX-1’s, at this point DGX-2’s…
And those are machines from NVIDIA, right?
[00:32:13.05] Yeah, those are super-computers from NVIDIA, and also the work station, which is essentially half of a DGX-1… At least that’s what it was; the specs may have changed. Those are all very expensive, but those are for training at scale very complex models, and it’s great to see – I think right now we’re seeing so many players getting into the space with ASICs, and TPUs or the equivalents… There’s now choice in hardware, and that is really commoditizing the entire field.
I think it’s becoming very reasonable to get into deep learning for small projects, the way we do in software engineering, where you might go to work and have a primary, large-scale project you’re working on for your employer, but then you come home at night and on weekends and work on something that’s really passion-driven… And I think that is becoming more and more viable for data scientists who are really into deep learning, and for software engineers who are getting into deep learning. I think we’ll continue to see that. I still think we’re gonna have incredibly expensive AI super-computers. The DGX-2 is substantially more powerful and more expensive than the DGX-1 was. We’re seeing a breadth of what’s available out there.
Yeah, and turning now from all of that news and great stuff about inference and hardware to some things that will help us as we build those passion projects, or try to figure out how we can do inference at our company or on our new project - we’ll kind of turn now to the part of Fully Connected where we share some learning resources. In particular, we’re gonna share some with you today as related to this topic of inference. One of the ones that I really like, that I think if you’re new to this whole idea of what happens after training my AI model - maybe you didn’t know that there was something that happened after that, maybe you didn’t know about this whole idea of integrating models into APIs… This article is called “Rise of the model servers”, which sounds very scary, actually…
It sounds like a movie, doesn’t it?
It does, it should be made into a movie… But it’s from - sorry if I mispronounce the name - Alex Vikati. It’s on Medium, and it says “Rise of the model servers. New tools for deploying machine learning models to production.” I just found this to be a really good summary article in terms of first telling what a model server is, which we’ve kind of already discussed here, but she goes into a little bit more detail… And then she just goes through and gives you five different common choices for this, which includes TensorRT, which we already discussed, but it also includes something that I’ve used before, which is a model server for Apache MXNet, includes TensorFlow Serving, Clipper and DeepDetect.
She goes through and talks about each one, but also gives you a link to the various repos and the papers that are relevant. It’s a good jumping off point if you’re new to this whole side of how to do inference or set up inference servers.
Yeah. Another thing to note is I know we’ve talked about TensorRT - NVIDIA has some great tutorials and references on their Developer Blog (devblogs.nvidia.com), that you can get into and learn about that. Since I also mentioned Qualcomm’s Snapdragon and the Snapdragon neural processing engine, their SDK, which you can find at developer.qualcomm.com, has a lot of good material on how you can jump into that.
Those are two vendor-specific sources that I know that I personally have used quite a lot over the last few years.
SNPE (Snappy), I didn’t get that acronym until right now. I’ve never…
[00:35:57.12] That’s a good one. I mean, it’s not immediately obvious to me, but still a good play on their part. That’s a catchy one.
Chris, we’ll talk to you later.
Sounds good, Daniel. I’ll talk to you later.