We recently met up with Cormac Brick (Intel) and Mike Del Balso (Uber) at O’Reilly AI in SF. As the director of machine intelligence in Intel’s Movidius group, Cormac is an expert in porting deep learning models to all sorts of embedded devices (cameras, robots, drones, etc.). He helped us understand some of the techniques for developing portable networks to maximize performance on different compute architectures.
In our discussion with Mike, we talked about the ins and outs of Michelangelo, Uber’s machine learning platform, which he manages. He also described why it was necessary for Uber to build out a machine learning platform and some of the new features they are exploring.
Cormac, thanks for joining me here at O’Reilly AI. It’s great to have the chance to talk to you. I know you just got out of your talk a little bit earlier… You talked about “Portability and performance in embedded deep learning - can we have both?” I wanna dig into that a little bit more later, but first I’d like to hear – I know you work and help the Movidius group at Intel, and I’d love for you to let the audience know what Movidius is if they haven’t heard about it, what you’re doing and what you’re working on now.
Yeah, thanks Daniel. Good to talk. My name is Cormac Brick. I lead VPU architecture at Movidius, part of Intel. VPU for us is a Visual Processing Unit, and that’s the key engine we have in our product line; I kind of lead that architecture.
At Movidius we’re very passionate about machine learning and computer vision at the edge. This is something we’ve been at for a long time, going back 5-6 years, even before we were part of Intel, and we have multiple products now in the field. We’ve learned a lot as a result of all of that interaction with customers over the years, and the goal of the talk this morning was to really kind of reflect back some of that knowledge, as in what have we learned about tuning neural networks for embedded silicon, and then also tuning embedded silicon for neural networks… To kind of just reflect back what some of the realities are when you go to take a network to the edge, what’s really required to make that run really well.
Awesome. Just to dig into that a little bit deeper, when you’re talking about customers that are tuning neural networks for the edge on things like VPUs, which you mention, what are some of the customer use cases around this, and people that have found a lot of value in going down that road?
Yeah, sure. At Movidius we have customers who are engaged heavily in things like digital security and smart city-type use cases, really making more intelligent commerce; that’s one big use case. We’ve also shipped a lot of products on drones, that’s another use case, as well as a lot of things around robotics and smart devices and camera devices as well. There are Google Clips products on the market now that use our Myriad 2 silicon. A lot of the DJI drones have used the Myriad 2 silicon as well, and there are things like – you can wave at your drone using your hands to control it, and then put out your palm and the drone can land in the palm of your hand, so… Really compelling use cases have been enabled through our silicon through the use of both vision and AI working hand in hand.
[00:04:07.15] Awesome, yeah. And just to kind of confirm that, I was actually at GopherCon last week and one of the keynotes - I think on the second day, or something - they used a drone with a Myriad chip in it to do some facial recognition. It was some cool stuff.
So let’s dive a little bit more into what you talked about. In these types of use cases where you’re wanting to run your neural network in a drone, or in a camera, or whatever it is, explain a little bit the tension between portability and performance that we’ve seen in the past, and the state of it now.
Sure. If you go to arXiv, or if you go to NIPS, or ICLR, or CVPR, leading academic vision conferences, you’ll find there’s a lot of work being done to optimize neural networks for things like ImageNet or MS COCO, or kind of academic data sets. And that’s awesome in terms of pushing the envelope of the field and advancing the science, and it’s moving super-fast, right? So then typically when embedded engineers will start off on a problem, they have access to that sort of research and these sorts of models, and then they’re gonna wanna do something that’s gonna work for them and their device, right? And one of the things that they would find is a lot of the models that are available out there were tuned on ImageNet, which is great at recognizing 1,000 classes of images, and it can differentiate one sort of whale from a different type of porpoise, and this sort of stuff; very fine-grained classification on specific tasks…
Important problems, yeah.
Yeah… Not so much in the real world, right? We have different problems to solve. So then in the real world we may care about “Hey, my robot wants to be able to recognize 100 common objects found in the home” or “In this security camera we want to be able to recognize these different types of objects that are happening.” So yeah, different problems… And often those problems are simpler than the thousand-class problem of an ImageNet. So one of the things we were talking about this morning is using techniques like model pruning, and sparsification to – if you’re doing what we called domain transfers, so you go from your thousand-class problem, say if you were taking ResNet-50, and you’re then retraining that for your home robot which wants to recognize 100 images, you’ll find that you can get away with a much simpler network, with less representational capacity to solve that 100-image problem than the one you started off in the 1000-image problem.
So we were sharing some results and some techniques specifically around channel pruning, which is a very powerful technique when you are doing domain transfer to a simpler problem domain, and also looking at techniques like sparsification, which is introducing more zeroes into a neural network, because that’s [unintelligible 00:06:38.26] on platforms that support memory compression of neural network models. It’ll enable those models to run much faster in bandwidth-limited devices such as those typically found on the edge.
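As a rough illustration of the two ideas Cormac describes here, the sketch below zeroes out the smallest-magnitude weights (sparsification) and drops the weakest output channels of a layer (channel pruning). It is a minimal sketch in plain Python, not the method from the talk and not Distiller's API: the function names are hypothetical, a layer is assumed to be a list of rows with one row per output channel, and ranking channels by L1 norm is just one common heuristic.

```python
def sparsify(weights, sparsity):
    """Zero the smallest-magnitude weights (magnitude pruning).

    `weights` is a list of rows; `sparsity` is the target fraction of zeros.
    Ties at the threshold are all zeroed, so the result can end up slightly
    sparser than requested.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # how many weights to zero
    if k == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_channels(weights, keep):
    """Keep the `keep` output channels (rows) with the largest L1 norm.

    Returns the smaller weight matrix and the indices of the kept channels,
    in their original order.
    """
    norms = [sum(abs(w) for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [list(weights[i]) for i in kept], kept


# Hypothetical 3-channel layer, used only to show the calls.
layer = [[0.9, -0.05, 0.4], [0.01, 0.02, -0.03], [-0.7, 0.6, 0.1]]
sparse_layer = sparsify(layer, sparsity=0.5)
small_layer, kept_channels = prune_channels(layer, keep=2)
```

In practice a pruned or sparsified network is retrained afterwards to recover accuracy, and the extra zeros only pay off on hardware or runtimes that compress or skip them, which is the point being made about bandwidth-limited edge devices.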
Awesome. Let’s say that I’m working on one of these robotics problems, I’m using a neural network and I want to pursue some of these methods to prune it down or optimize it for that setting or for that architecture… What’s the process and the barriers that I would face as of now going into that and what’s the state of the usability of these tools, and that sort of thing?
That’s a great question, because for sure, we were presenting a lot of work this morning saying “Hey, we were able to take a network and do pruning, and quantization, and sparsification, and go from 8-bit weights to 4-bit weights and this sort of stuff…” But you know, straight up today, it’s pretty non-trivial to repeat the results that we were showing this morning, right? To bridge that gap, we have a network in Intel that’s part of the AI Products Group - as part of the Intel AI Products Group there is an open source project called Distiller. It’s one of the resources listed in my slides, I think on the final slide, and I believe they’ll get posted to O’Reilly at some point…
Yeah. So there’s a link to something on GitHub called Distiller, and there one of the things we’re doing is if you went back maybe 12 months ago you would have found “Oh, this is an awesome quantization technique that somebody published, some grad student published a PyTorch fork or something with this… And then here’s something else that was available in TensorFlow for quantization, and here’s something else that was available in a different framework…” What we were doing is really kind of taking all of those techniques that are available in a fairly fragmented way across the internet and trying to put them under one roof in a way that’s a little bit easier to access. That was the goal of the Distiller project, to show that… And it’s an ongoing project at Intel within AIPG to have this kind of set of tools. So they’re available in PyTorch, and that’s great, because PyTorch can export to ONNX, which is then widely available.
In addition to the work we’re doing, it’s entirely appropriate though to give a shout-out to the work the TensorFlow team are doing. Under TensorFlow Contrib there’s a bunch of useful tools there on both quantization and on pruning as well… And there’s a pretty strong ecosystem there, also showing a variety of techniques.
Okay, yeah. So it is at least to a point where I could get a model off of some repository - maybe in PyTorch, or whatever - and have some tooling that’s publicly available to prune that down for certain architectures.
What about prepping the model for maybe certain specialized hardware? You mentioned VPUs, and I know there’s a lot of other people pursuing things around, of course, GPUs, but also FPGAs, and other things… What is that state of the art…? Are these pruning methods and all of that tied into that world, or is that something totally separate?
That’s also a good question, and it was one of the goals of the talk today, to show that “Hey, here’s four key techniques that you can use, that will work well on any hardware, and on some hardware it will work extra-well… But if you employ these techniques, you’re not going to hurt your model’s ability to run across a broad range of silicon.” And those techniques specifically are kind of model pruning, sparsification, quantizing a network to 8 bits, and then doing further quantization on weights to use kind of a lower bit depth.
So if you employ those four techniques, you will still have a model – you know, if you take a model and you represent it in ONNX, or in TensorFlow, you will still have a model that can work well on a wide variety of devices, but on some devices it’s gonna work extra-well, right? …because of different abilities to run quantized models at varying degrees of acceleration, and also different silicon will have varying degrees of (let’s say) weight compression technology. Even in extreme cases, for sparsity, there’s some silicon out there that can process sparse networks directly and in an accelerated fashion.
So against a variety of silicon you can employ these four techniques and get really good results across a range of silicon, and even better results on some other silicon. That was the core point.
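To make the quantization half of those four techniques concrete, here is a minimal sketch of symmetric linear quantization of a list of weights to 8 bits and then to a lower bit depth. It assumes a single scale per tensor and round-to-nearest, which is only one of several schemes; the function names are hypothetical and this is not how any particular toolkit or piece of silicon implements it.

```python
def quantize(weights, bits=8):
    """Map float weights to signed integers in [-qmax, qmax] with one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, used only to compare the two bit depths.
weights = [0.5, -1.0, 0.25, 0.0]
q8, scale8 = quantize(weights, bits=8)
q4, scale4 = quantize(weights, bits=4)
error8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, scale8)))
error4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, scale4)))
```

The 4-bit version has a much coarser step, so its reconstruction error is larger; that is the accuracy cost traded for smaller weights and less memory bandwidth.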
To answer the second part of your question, in the final slide we were making the point as well that hey, if you set out with a single network and you know the piece of silicon you’re running on, absolutely there’s other techniques you can employ to really fit that piece of silicon as best as you can, to really make this one network shine on this combination of this network and this silicon… And there’s been some very interesting work published on that in the last couple of months, and it’s a pretty hot research topic now, showing how using – you may be familiar with AutoML, right? So being able to use those types of techniques to refine a model or to learn a model that works really well on a particular version of silicon, with these types of performance claims and tradeoffs. That’s a pretty active area of research, that’s pretty interesting.
Awesome, awesome. I know that one of the things that I’ve appreciated as I’m hacking on things at home is that a lot of the stuff that you’ve come out with through Movidius makes it really easy to experiment with neural networks on a lot of different types of devices, through the Neural Compute Stick, and other things… I was wondering if you had any interesting stories or customer experiences that you’ve heard about of people enabling new sorts of things with these devices.
[00:12:22.07] Yeah, we’ve really enjoyed the experience of launching the first version of the Neural Compute Stick based on Myriad 2, and it was great to get out there and meet lots of developers… And also, when we launched that - we announced it some time before, and we really launched it then at CVPR last year… Yeah, it was great to see what everybody was doing, but also to kind of show them, “Hey, AI at the edge is possible.” If you go back 15 months or two years ago, people really associated AI with the cloud, right? So our first goal was to break down those perceived barriers, and for more people to be able to use AI and to see “Hey, AI at the edge is possible.” That was our initial goal, and it was a great experience, very enjoyable talking to all the developers.
A couple of things we’ve seen - we’ve seen people use this, one of the software ambassadors for Intel used this to do a prototype water filter, so taking the guts of a microscope, putting that up to a camera, into a Raspberry Pi, with a Movidius Neural Compute Stick connected, and being able to show that you could actually use this to detect water impurities, so to have an entirely offline water impurity detection device that could be used effectively on premises, at the edge, with no cloud connection, or anything like this… Super-cool idea, and we were able to show that that’s possible.
Equally, we’ve had people putting them on a drone to detect sharks in the water, also doing prototype medical imaging to detect melanoma on skin, also kind of driven by image classification. Yeah, so those are just a few things, but there’s been a lot of other fun projects posted on GitHub. I don’t have a link to our models site and example site, but I can provide you with them for the blog page also.
Awesome. Yeah, we’ll make sure that gets in our show notes, for sure. Well, I appreciate you taking time again. To wrap things up here, I was wondering - from your perspective, since you’ve been working in this space for a while, what can we look forward to over the next couple of years with performing AI at the edge? What are you excited about and what do you think we’ll see over the next couple years?
I think we’re definitely going to see a lot more silicon become available, both from Movidius Intel, and also from a bunch of competitors. I think that’s gonna be really interesting, as inference silicon – those kind of metrics business people would track, like the number of ops per watt we can deliver, or the number of ops per dollar we can deliver… And we’ll expect both of those metrics to progress at really fast paces over the next number of years. And if I look at what people are able to do with the first version of the Neural Compute Stick, with the capabilities that has, and – well, I can’t disclose product roadmaps, but some visibility of the types of things we’re gonna see, in terms of the volume of compute that various people can bring to market, at much lower price points and much lower power points… I’m really excited to see how that’s gonna play out, and the types of things people are gonna do with that. I think it’s gonna be a very exciting space to watch in the next few years.
Awesome. Well, thank you again for taking time, and enjoy the rest of the conference.
Thanks for joining us, Mike. It’s great to chat with you and meet you here at O’Reilly AI. I’ve heard about Michelangelo, this ML platform that you guys have developed at Uber, and I’d love to hear a little bit more about it, but first, give us a little background of who you are and how you ended up where you are.
Yeah, thanks. Happy to be here. I currently am the product lead for ML Infrastructure at Uber. That encompasses a lot of things, most notably the Michelangelo platform. A little bit of background on me - I’m an electrical engineer by training, and out of school I worked at Google; one of the places I got my ML chops, so to speak… Which is weird to say. I worked on the ads team at Google, specifically the ads auction group, and I was the product manager for all of the ML signals that go into the ads auction there… So these really real-time, high-scale, super-productionized ML systems that predict if you’re gonna click an ad, and if this ad’s gonna be relevant, and stuff like that. That’s where I learned how to do ML right, and probably best in industry in terms of productionized machine learning.
Then about three years ago I joined Uber, where we started the Michelangelo – which is not named after me in any way…
That’s a shame.
Yeah, people ask me that question all the time… We started the Michelangelo platform, which helps data scientists and engineers across the company build ML systems, prototype, explore ML systems, build them, and then deploy them into production and serve predictions at scale.
If you’re in a company that’s trying to build up their AI presence within the company, why would they need an ML platform? Why isn’t Jupyter Notebooks everywhere just fine for people?
That’s a good question. The state of Uber’s ML stuff about three years ago was that a lot of people were trying to do that. There were a lot of people – you know, grad students learned how to build their ML models in their grad school classes and whatever, and they have their own ways to do it, everybody has their own… I use R, I use Python… And what we saw was that people were either trying to productionize an R model and run an R runtime in production at low latency, which is just very challenging - and people will cringe when they hear that today… Secondly, you will see data scientists that did have engineer support - they would build up these bespoke towers of infrastructure on a per-use-case basis, that would tend to be less well-built, just because they had lower resources, but duplicative of different pieces of infrastructure that people would build to serve these models in production across all the different ML use cases the company has… And then kind of the scariest is people just wouldn’t get started at all, because some people wouldn’t have a way to get their models into production.
So we saw the opportunity to build a common platform to help people have a unified way to build models, and to (this is the trickiest part) put those same models that they had prototyped into production, to make those predictions… And along the way, bring a lot of data science best practices, build into the system reproducibility, common analyses, versioning and all that kind of good stuff, that is kind of like these data science best practices that aren’t yet really well established. We have a lot of really well-established software engineering best practices that everybody knows - CI/CD, version control and stuff like that… That stuff is not as well appreciated in the data science community, and it’s just because a lot of this work is new. It’s not like these guys don’t understand the importance of it, but it’s just like the best processes and the best patterns for building this stuff have not yet – we have not really converged on those yet.
So we’ve spent a lot of effort to focus on where we think this stuff is going to go, and to help build the tools to empower data scientists to do the right thing from the beginning.
That’s really hard to say. I would say we probably have more than – so this platform supports machine learning use cases across the company… Everything from fraud-related things, to predicting how long it’s gonna take a car to get to you, to even ranking dishes in the Uber Eats app… All of the main ML stuff runs through this platform now. And this is just like an interesting kind of platform development challenge - we have a lot of people who kind of use it, and they’re like “Hey, I kind of wanna build an ML thing”, and they dabble and explore a couple little models they wanna make, but maybe they never end up fully deploying that model to production…
So it’s kind of tricky to say like “How many actual use cases do you have on this system?” We know it’s well over 100, but it’s hard for us in the platform to say “Is this something that this human is just using this as an experiment, or is it fully productionized and deployed across the whole company?” That’s an area that we’ve just under-invested in a little bit, but we think there’s a lot more to do there.
As you’ve seen people start to use the system, are there features of it that kind of surprised you in the sense of how people relied on them, or things that people needed that you didn’t expect that they would need, or other things?
Yeah, that’s a really good question, and I’ve been reflecting on this a lot recently… You know, I’m the product manager, so it’s kind of my job, but… [laughs] The thing I would say that has gotten disproportionate adoption - given maybe even our under-investment into this, where you still could do a lot more in this space, but our users just adopted this overwhelmingly and they love it - is our feature store, which is part of the platform. Common problems for managing features for ML workflows are that you have to clean your data and transform your data, and combine it all, and also historically, into a training data set, so you can train your model. But then once your model is created, how do you do all those same transforms in the same way, the same pre-processing to that data in real-time when you deploy your model? So there’s kind of this dual type of ETL that happens in different computing environments, that’s really tricky, and–
Yeah, possibly on a variety of resources…
Yeah. And we see a lot of vendor solutions here, but I feel like we don’t see anybody really tackling that kind of stuff, and I think it’s partially because it’s not sexy at all to work on that stuff, and also just because it’s super-hard to do it properly. We’ve provided some nice ways for people to define their feature transforms to the platform, and then be confident that those transforms will happen consistently across both computing environments, real-time and offline…
But I think the other interesting thing is we saw – let’s take the Uber Eats world, for example. They probably have more than ten different models that they use to rank dishes and whatever they do. A lot of those models use the same kind of features, and before this feature store, data scientists didn’t have any insight into “Hey, other people that were working on similar problems, what kind of feature pipelines have they built?” And then when this feature store came along, now when a data scientist wants to start a new model, they can just look and see “What features exist that are relevant for me? Let me just start my model exploration process with the X features that are most relevant to this problem from the beginning.” So there’s a whole new element of collaboration, visibility, feature sharing that was previously not there. I really don’t see solutions in that space in the industry today either, so I think that’s a really promising area.
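The two feature store ideas Mike describes, defining a transform once so that training and serving compute it identically, and letting others discover existing features, can be sketched in a few lines. This is a toy illustration only and not Michelangelo's actual API: the registry, the decorator, the tags and the example features are all hypothetical.

```python
FEATURES = {}  # hypothetical in-memory registry: name -> transform and tags


def feature(name, tags=()):
    """Register a transform under a name so every caller reuses the same code."""
    def register(fn):
        FEATURES[name] = {"fn": fn, "tags": set(tags)}
        return fn
    return register


@feature("prep_minutes", tags=["eats"])
def prep_minutes(record):
    return record["prep_seconds"] / 60.0


@feature("is_peak_hour", tags=["eats", "rides"])
def is_peak_hour(record):
    return 1 if record["hour"] in (12, 13, 18, 19, 20) else 0


def build_row(record, names):
    """Compute the named features for one record.

    The same function serves both paths: mapped over historical records it
    builds a training set, and called on a live request it builds the input
    for a prediction.
    """
    return [FEATURES[name]["fn"](record) for name in names]


def find_features(tag):
    """List registered feature names relevant to a problem area."""
    return sorted(name for name, spec in FEATURES.items() if tag in spec["tags"])


history = [{"prep_seconds": 600, "hour": 12}, {"prep_seconds": 900, "hour": 15}]
training_rows = [build_row(r, ["prep_minutes", "is_peak_hour"]) for r in history]
live_row = build_row({"prep_seconds": 300, "hour": 19}, ["prep_minutes", "is_peak_hour"])
```

A real system also has to run these transforms in separate batch and low-latency environments and keep the stored values fresh, which is the hard part described above; the sketch only shows why a single shared definition removes the duplicated pre-processing logic.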
[00:24:05.10] Cool. Yeah, I look forward to hearing more about that, and definitely if you publish anything about that, I’ll be happy to post that on the show links here.
The other thing I was curious about, just from the fact that you mentioned before that the incentives for data scientists are kind of different and not always aligned with producing production-ready models and all of those things - how do you build up a team to build an ML platform where really you kind of need software engineering experience to be able to build something that’s production-ready, but you need the knowledge and the expertise around machine learning to be able to understand what to build so it’s gonna be relevant to the people you’re building it for?
I think one of the nice things is that we’ve had a little bit of – the leadership in our organization has been relatively forward-thinking to be willing to fund the development of an ML platform much earlier than I think is common in the industry, and that’s allowed us to get it wrong a couple times before we got it right… But we feel like we really got it really right now… And there’s like a tension between data scientists wanting nimbleness and flexibility throughout their exploration and prototyping stages, and if you think of any productionized system, it’s super-stable, so how do you accomplish both of those constraints? It’s a challenge.
Some of the design philosophy that we’re taking - and this is always developing - is we’re trying to allow data scientists to work within our system using the tools that are most relevant for them. We’d love for them to work in Jupyter Notebooks, and write all their models the way they normally would, where we can provide some helpful APIs for them - for example the feature store stuff to pull in their data, so they don’t have to reimplement a whole bunch of work that already exists in terms of enterprise intelligence that’s already been done. But after a certain point, when the prototyping stage is complete, if you think of this machine learning lifecycle, where it’s like “Now I wanna actually use this in production”, and maybe it doesn’t mean you’re gonna launch it to the whole company and you’re done with the project; it could just be like “I wanna experiment with this on live traffic.”
We focus on making it relatively low activation energy to take your prototype and transform it into something that can go into these productionized, well-engineered, hardened systems, that we can be confident will be stable from a systems perspective… And we still wanna give data scientists the ability to monitor these models that are in production for not just systems issues like whatever applies to typical microservices, but also the data science monitoring: how accurate is this model over time, are there any model drifts, stuff like that.
So there’s a story for data scientists throughout the lifecycle, and a story for engineers throughout the lifecycle, and it balances… And the challenge is “How do you balance between those at the different stages, taking into account all of the priorities for both stakeholders throughout?”
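The data science monitoring mentioned above, tracking how accurate a deployed model stays over time, could look roughly like the sketch below. It is a minimal example under stated assumptions, not a description of Uber's tooling: the class name, the fixed baseline, the window size and the tolerance are all hypothetical, and it assumes true outcomes eventually arrive for each prediction.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag a drop."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline    # accuracy measured at deployment time
        self.window = window        # number of recent predictions to keep
        self.tolerance = tolerance  # allowed drop before flagging drift
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one prediction matched the outcome observed later."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        """True once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.window:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.9, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record(predicted="late", actual="late")
```

Waiting for a full window before flagging avoids alerts from the first handful of predictions; a production setup would typically also watch input feature distributions, since labels often arrive late.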
Awesome. Yeah, that gives some great perspective. Well, to kind of end things out here, are there places online where people can find out more about what you guys have done, and maybe also some things that you’ve put out there that you might wanna share?
Yeah, that’s a good question. We’ve published a blog post about Michelangelo - I think October 2017 - and it’s pretty easy if you just search “michelangelo ml platform” on Google, you can find that. We’ve published a lot of other pieces about related ML work we’ve done, and I think we’re likely to in the near future open up the [unintelligible 00:27:34.04] a little bit more on Michelangelo, so stay tuned.
Cool. Awesome! We look forward to that. Thanks for joining, and enjoy the rest of the conference.
Thank you, I appreciate it.