Practical AI – Episode #69

Escaping the "dark ages" of AI infrastructure

with Evan Sparks, co-founder and CEO at Determined AI


Evan Sparks, from Determined AI, helps us understand why many are still stuck in the “dark ages” of AI infrastructure. He then discusses how we can build better systems by leveraging things like fault tolerant training and AutoML. Finally, Evan explains his optimistic outlook on AI’s economic and environmental health impact.

Sponsors

DigitalOcean – The simplest cloud platform for developers and teams. Whether you’re running one virtual machine or ten thousand, DigitalOcean makes managing your infrastructure easy. Get started for free with a $50 credit. Learn more at do.co/changelog.

Notes & Links

Transcript

Welcome to another episode of the Practical AI podcast. My name is Chris Benson, and I’m a principal AI strategist at Lockheed Martin. With me, as always, is my co-host, Daniel Whitenack, who’s a data scientist with SIL International. How’s it going today, Daniel?

It’s going really good. It’s Thanksgiving week here in the States, for those that are listening from the States, so a little bit shorter week. I’m working through Wednesday, so it’s a good week, and I feel like I’ve been reasonably productive… What about you?

Same for me. I’m working through Wednesday, but I’m looking forward to having a long weekend ahead. Do you have any special plans for Thanksgiving?

Well, just Thanksgiving dinner, but then I’m gonna help out my wife, who has a candle business, and Cyber Monday weekend is pretty insane for them. It’s a company called Antique Candle Co. and they’re gonna ship out an insane number of orders… So I’ll probably be packing boxes with candles, which will be a nice break from staring at a screen, and something completely different.

No AI in that one.

Not as of yet, although… So I help them with some marketing and Facebook ads stuff, and obviously, in advertising it is interesting to come from the AI perspective, because you see certain things, like in Facebook ads, where it’s talking about optimization and learning… As you kick off the ad, there’s a learning phase, where it’s figuring out how to optimize the placements, and the cost, and all that… So it’s interesting to think about it from that perspective, for sure.

Fantastic. We need to include a link to your wife’s business in the show notes, and…

I certainly will. Shameless plug.

There you go. And in case there are any AI people who wanna jump into candles… I guess for me, I’m just taking a breather next week; I’m at Carnegie Mellon University for an AI conference, to do a panel, and…

That sounds great.

…and then finishing up the week with two things in a row. I’ll be in Philadelphia, doing an AI and Ethics talk as a keynote at an ethics conference…

…and then I’m finishing Friday night in Austin, where the final Alpha Pilot, which we’ve had an episode on - the world championship race will be there. At the end of that race, that evening, we’re gonna hand out a one-million-dollar check to the winner.

Exciting stuff.

Yeah, pretty big deal. If anyone’s interested in hearing more about that, we have an Alpha Pilot episode from not long ago, and you’re welcome to tune into that… But turning to today, we have a fantastic guest. We have Evan Sparks, who is the co-founder and CEO at Determined AI. Evan, welcome to the show!

[00:04:06.15] Thanks so much for having me, guys. It’s a pleasure to be speaking with you today.

Pleasure to have you on the show. If you could just start us off giving us a little background about yourself, how you got to where you’re at at this point, before we dive into Determined AI.

Yeah, absolutely. As it pertains to my career in machine learning and AI, I really got my start in that space fresh out of college, in quantitative finance. This was the mid-2000s; I was working for an asset manager based in Boston, where we were applying machine learning to the stock market, to pick stocks and trade client portfolios.

I did my Ph.D. in physics, and this was before the data science hype, but the rumor I always heard from people who got out of academia was like, “Oh, you can go do all this cool math stuff in finance”, but I never quite figured out how to do that.

Yeah, it definitely was a very common career path, and it’s funny - probably a few years later everybody then went into ad tech, or something like that, and now it’s probably autonomous vehicles, or something. There’s always an interesting corner and a hot area to be doing this stuff, which is one of the things that I find super-fascinating.

So after a few years of that quant finance thing, I found that other people really liked looking at P/E ratios all day, and that wasn’t for me. I was much more interested in the technology problems we were solving. I ended up going to work for a startup in the NLP space called Recorded Future. We were taking the web and throwing it through this massive NLP engine, and building structured data products based on it… And trying to figure out how we sell that structured data to places like trading firms, but also the federal government, and so on. Ultimately, that company found a good niche in threat intelligence, basically trying to build predictive indicators of where cyber attacks are gonna happen, and so on, but again, with this same data-driven machine learning technology.

In many ways, the roles were pretty similar… One being in financial services, but the other being in this totally different startup environment. But always building models and driving forward data products. In both cases though I found I was spending much more time building and maintaining my own infrastructure than I was worrying about the modeling problems. And it was really the case in those days - this is around 2010-2012 - as Hadoop was becoming popular and so on, where as soon as I was tasked with analyzing a dataset that didn’t fit in the memory on my laptop, my world just collapsed. You were forced to figure out how to write MapReduce jobs and so on. I took that as kind of a good signal to go back and invent the world that I wanted to live in in grad school. So I had the good fortune to join the AMPLab at UC Berkeley right around the time that Apache Spark was born, and my co-founder at Determined AI, Ameet Talwalkar, and I got to work on it right away, building out the machine learning ecosystem around Spark. So we were among the designers and initial contributors to MLlib, which is the standard library for machine learning in that ecosystem.

The rest of my Ph.D. was really focused on “How do we give people tools to build machine learning applications and optimize them in a large-scale, distributed fashion?”

This is a slightly less formal question, but it must have been a perfect fit in terms of working on Spark and being named Sparks. I’m assuming Spark was not named for Evan Sparks…

No, absolutely not. [laughter] It’s funny, I sat next to Matei Zaharia, who was the creator of Spark. Spark was around version 0.3 when I landed in the lab… And the first few days we sat next to each other, there were these long, weird stares going on back and forth, until finally we broke the ice and made a joke about it… But yeah, it was kind of a fortunate coincidence from my perspective, I guess. There was a long-running joke that my real name was Evan Apache Sparks, but… [laughter] Not so much.

[00:08:18.13] So yeah, it was good timing, and honestly, the AMPLab was a great place to be for what I wanted to study, which is really thinking about “Where does this intersection of huge volumes of data and machine learning really get real, and how do we build out supporting systems to enable this?” Also, while at Berkeley, I met my other co-founder at Determined AI, Neil Conway, who was more from the pure distributed systems part of the world. He’d been a Postgres committer, and he worked on core Apache Mesos for a while, around distributed resource management… Ameet, on the other hand, was the more dyed-in-the-wool theoretical ML student; he’s now a professor at CMU, in the machine learning department. In some ways, you’d think of me as the person who takes what those guys do individually, figures out how to mash them together, and then hopefully figures out how to build interesting applications on top of that intersection of systems and machine learning.

So while at Berkeley - and I promise this is getting into what we do here at Determined - one of the big things that we saw, the big megatrends that were happening within Academia first, was this shift to deep learning as a primary way that people wanted to be doing machine learning, particularly in industrial settings. So it started with computer vision, and speech, and obviously more recently we’ve seen amazing advances in things like NLP and text… And this meant people retooling, learning how to use tools like TensorFlow, buying GPUs en-masse, figuring out how to take what had been a tiny corner of Academic machine learning and really make it into an industrially-viable technology… And stubbing their toe on a lot of serious problems along the way.

So you go from a logistic regression that trains on my Spark cluster in a couple of minutes to big, week-long training runs for large-scale image classifiers on massive GPU clusters, for example. You start to have a lot of design decisions baked into your modeling choices that you didn’t have before… Things like, you know, “How many layers should this architecture have? How does the model capacity relate to my training dataset?” and so on. And in ways that are not really intuitive, and end up being highly empirical.

So we saw that, and we also saw that the frameworks - the TensorFlows and the PyTorches of the world - are really good at the individual tasks they’re designed for, which is helping you describe what your model is and get it [unintelligible 00:10:42.29] a single or maybe several GPUs on a machine, but really bad at helping model developers through the rest of the workflow associated with getting one of these applications into production. Stuff that you guys have covered on your show before around data labeling and so on - we don’t do any of that at Determined AI, but there are other pieces of the workflow around hyperparameter optimization, architecture search, getting your models to train really fast across a wide variety of different hardware platforms, dynamically managing resources in the cloud, so that you can pay for the GPUs only while you’re really using them.

All of that stuff is handled right now on a manual basis, honestly, with Bash scripts and duct tape, in many cases… And people don’t really have a good way to support their more general workflows as they’re in this model development process. At Determined AI, that’s really the gap that we serve to fill.

How do we enable you to do the rest of the pieces of your workflow, while still using the tools that you know and love - your TensorFlow, your PyTorch, your Keras etc. - but make you much more productive as an individual engineer, but more importantly as a team of engineers - how do you share your results in a reproducible fashion, and how do you make sure that I can get the same model out of my infrastructure as you do? At Determined AI that’s really what our mission is.

[00:12:06.25] I’m curious to dive into a few pieces of that, but you mention in one of the blog posts on Determined AI about people still kind of living in the dark age of AI infrastructure, where certain larger companies have built sophisticated AI-native infrastructure for their own use, but everybody else is struggling. I’m curious if that dark age that you’re seeing is due to the fact that, like you say, there’s all these other pieces of the AI workflow - that might be data pre-processing, model deployment, model optimization, data labeling… Is it that there aren’t good tools for those other pieces of the workflow, or is it that they don’t play well together in a sort of all-in-one workflow? …or just people haven’t had enough time to develop standardized methodologies around these things. What do you see as the main contributor there?

I think it’s a little bit of both. You hit the nail on the head - in many cases, there are individual tools and point solutions to some of the problems that you mentioned; there are tool kits for model compression, there are services and open source libraries for just hyperparameter optimization, and so on… Even sometimes full companies built around these things. But in our view, what ends up being a result of that is that you get these tools that are isolated, and aren’t designed to work well with one another. And more importantly, you then miss broader opportunities that might exist around optimizing the entire workflow, if you can kind of step back and look at that… Rather than individually “How do I make this particular piece of the puzzle go absolutely as fast as possible?” Sure, you eliminate that bottleneck, but you might still be completely bottlenecked on ETL, or data collection, or training time, for example.

So you have to be careful, as an organization, about where you’re investing your time and your resources in terms of making those things better. We think that a more holistic design - that is one where the pieces are kind of designed and know about each other - opens the door for certain types of optimizations.

To give you an example, the resource manager that is built into our product at Determined AI is totally AI-aware. It’s aware of the fact that when you’re running your jobs on our system, all of the jobs are somehow related to training or running inference on deep learning models. And you can start to make a bunch of interesting assumptions about the workflow that you couldn’t if this was just general-purpose compute.

For example, the idea that these things are iterative, and that they have intermediate state, like model weights and state of the optimizer, that can be used to checkpoint and understand where the computation was, and then reschedule it, say, to run on another device… Now we have that kind of design in the resource management section, but then when we’re designing our hyperparameter tuning algorithms, for example, and implementing them, we can take full advantage of knowing what that internal scheduling layer looks like, and use properties of that scheduling layer that we couldn’t if we were just running this as a black box job on something like Spark, or Kubernetes, or whatever. And that power of these components being designed with one another in mind allows us to do this job much more efficiently, in a much more fault-tolerant, resource-aware kind of way than we would be able to otherwise. If you’re spending 90% of your time starting up the cluster and getting it done, that’s a lot of wasted cycles for your GPUs, that your data scientists really wanna be putting to work, finding good models and solving your problems.
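The checkpoint-and-reschedule idea Evan describes can be sketched in plain Python. This is an editorial illustration of the general technique - an iterative job persisting its weights and optimizer state so it can resume on another device - not Determined AI’s actual API; all names here are invented:

```python
import json
import os
import tempfile

def save_checkpoint(path, step, weights, optimizer_state):
    """Persist the iterative training state: step counter, model weights,
    and optimizer state (e.g. momentum buffers)."""
    state = {"step": step, "weights": weights, "optimizer": optimizer_state}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file

def resume_or_start(path):
    """On reschedule (possibly on a different machine), pick up where we left off."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weights": [0.0], "optimizer": {"momentum": [0.0]}}

ckpt_path = "ckpt.json"
state = resume_or_start(ckpt_path)
for step in range(state["step"], 10):   # toy training loop
    state["weights"][0] += 0.1          # stand-in for a gradient update
    state["step"] = step + 1
    if state["step"] % 5 == 0:          # checkpoint periodically
        save_checkpoint(ckpt_path, state["step"], state["weights"],
                        state["optimizer"])

print(state["step"], round(state["weights"][0], 1))
```

The atomic rename is the detail that makes this scheduler-friendly: a job preempted mid-write never leaves a corrupt checkpoint behind, so the resource manager can always safely restart the work from the last complete one.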

[00:16:00.01] I’m curious - you mentioned a more holistic view of AI infrastructure, and I know that something that can happen, because there are so many pieces to this, is that you end up with scenarios in companies where you have a data engineering team that’s in charge of all of this pre-processing and getting datasets ready, and then you have the modeling group, and then you have the deployment and app integration people… Do you see that trend disappearing as things are more tightly and better integrated together? …or do you think it’s reasonable that a data scientist could take something all the way? I guess could and should - should they be doing that?

I think it depends on the company and the scale of the application that’s under development. For example, if you’re building a self-driving car, that’s probably not a job for a single data scientist. [unintelligible 00:17:04.01] I’d call it a generational moonshot, if you will. And there it makes perfect sense that you’re gonna have a massive team of people just worrying about data labeling, and data ingest, and ETL. Another set of people just working on the perception pieces of the job. Another set of people just working on maybe a different component around path planning, and so on.

There, in those scenarios, you really wanna think about “Okay, what are the various teams and what are the personas of the users of a broader machine learning platform? What do they care about, and how do we facilitate coordination and communication between those teams?”

In other cases, companies have done a really good job of cleaning up their data, putting it into massive data warehouses, and even making their feature catalog self-serve - the kind of thing where a data scientist says “Hey, I’m looking for a fraud model for mobile purchases in South-East Asia. We’ve decided we’re losing enough money in that particular area that a specialized model on this particular part of the world makes sense.” In those cases I do think that proper infrastructure can enable a data scientist to go from start to finish all the way, and ideally you wanna get that person to the point where they don’t have to work directly with a data engineer to get the features flowing through the system, and so on.

In my view, almost more importantly - or a place that we see people get tripped up - is around deployment and monitoring of those models. We see people often taking models that are built in PyTorch or TensorFlow or whatever, and completely rewriting these things in C++ or Scala or whatever, because that’s what fits into the production serving environment. That side of things - we see these deployment engineers… That’s a job I would love to see go away in the common case, if the infrastructure gets better. You want data scientists to be able to get to the point where they’re confident that it works well enough on test sets, and maybe even start to A/B test it, and then hit a button and deploy it more broadly [unintelligible 00:19:09.26] is definitely the central thing we need as an industry in order to make these technologies more viable and successful.

Break

[00:19:25.04]

Evan, I’d like to ask what are some of the unique challenges that are related to team interactions that you’re seeing, in terms of sharing data, sharing GPUs, and other aspects of jointly utilizing AI infrastructure? Could you speak to some of those challenges for us?

Yeah, I think from our perspective the data piece is one that every organization faces, particularly organizations who are dealing with sensitive data. And that is something that we’ve seen users kind of figure out on their own. They have a versioned, [unintelligible 00:21:36.23] access control system on their primary stores of data - at least the interesting data; oftentimes the data that contains PII, and that sort of thing - they really tightly regulate who gets access to those data resources and when, as they should. From our perspective, it’s really about integrating with those various kinds of authentication mechanisms and supporting security on those data stores. So we do that out of the box.

The second and third pieces, I think, that are harder for organizations - that most people don’t really have an answer for - are, first, resource sharing. The rude awakening that many people run into with GPUs in general is that they’re really expensive. You’re talking about spending upwards of 150k on a DGX-1, which is one of NVIDIA’s latest servers, filled with V100s. One of those might be good for two data scientists… But in order to enable your team to really be productive, you need several of those kinds of servers. And we see people doing really immature things with these systems. We see people managing them with either static allocation, meaning Joe gets GPUs one through four on this box, and Kyle gets five through eight, kind of forevermore… Or they’ve got some kind of Google Calendar system set up. These are some really sophisticated organizations that we run into, where that’s the way they’re managing this expensive resource.

Do you think that’s just because of the mixed background of people working on this sort of technology, that a lot of people are coming from science, or maybe a non-computer science or non-software engineering background, or do you think it’s more than that?

Yeah, totally. I think that’s a big piece of it. And honestly, people who are really good at thinking about convex hulls, and the right shape of your loss function, and so on, probably shouldn’t be wasting their time, honestly, thinking about the right way to do resource management. That problem has been solved in a bunch of different domains, and that should be a layer of abstraction; that’s one that we provide to folks. There are other solutions to this problem as well - some of the cluster resource managers that I mentioned earlier, like Kubernetes, or we see people using queuing systems like Slurm, from the HPC world; those things all have their drawbacks. But in general, this is a problem that modelers don’t wanna be thinking about. And more generally, I think we need better abstractions for these folks.

[00:24:04.11] That’s certainly a challenge… I’ve been at two large organizations, one that I’m still at (Lockheed Martin), where we have many DGX systems within the enterprise, and we are from kind of an AI-oriented high-performance computing context trying to make these resources as broadly available as possible… Kind of conceptually, how do you think about that? Obviously, you will see organizations that start off doing this “You get a GPU, and you get a GPU” and all that, but that doesn’t scale against the workloads; certain teams only need one GPU at a time, and it may not take very long, and others might need dozens for a much longer period of time, and everything in between. Conceptually speaking, how do you approach differentiating between users and the various differentiated workloads that they’re having to contend with?

We love to see people that try and plan for this sort of thing. They try and get a sense of “Okay, I know I have this data volume coming in next year. I know roughly speaking it’s gonna take me this long, on this many GPUs, to train my models. Let’s set aside budget and bring those resources on-prem, or secure them with long-term leases on one of the cloud providers, for the most part.” Now, that does a good job at kind of helping you plan for your baseload… But then, as always, there’s gonna be things that come up, like towards the end of the quarter, or a new model family comes out, or a new project takes really high priority that you’ve just gotta ship… In which case we see real benefits to bursting onto cloud resources.

Within the context of our system, that’s a core feature that we offer. We call it Elastic AI Infrastructure. The basic idea is that if the system is configured for it, and there’s budget within the organization and so on, you can do that dynamic provisioning of cloud resources, spilling work over onto them; we handle the data transfer and other aspects of that planning for you. And then as the workload goes down, those resources are released and the organization can save money.
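As an editorial sketch of that “spill over, then release” logic - not Determined AI’s implementation; the function and parameter names below are invented for illustration - the core decision is just how much queued GPU demand exceeds free on-prem capacity:

```python
def cloud_gpus_to_lease(queued_gpu_demand, on_prem_free,
                        gpus_per_instance=8, max_instances=16):
    """Spill only the demand that exceeds free on-prem capacity onto cloud
    instances, rounded up to whole machines and capped by a budget limit."""
    overflow = max(0, queued_gpu_demand - on_prem_free)
    instances = -(-overflow // gpus_per_instance)  # ceiling division
    return min(instances, max_instances)

# Baseload fits on-prem: no cloud spend.
print(cloud_gpus_to_lease(6, 8))     # 0

# End-of-quarter burst: 40 GPUs of queued work, 8 free locally.
print(cloud_gpus_to_lease(40, 8))    # 4 instances of 8 GPUs each

# Queue drains again: everything can be released.
print(cloud_gpus_to_lease(0, 8))     # 0
```

Re-evaluating a rule like this as the queue changes is what gives the “pay only while you’re really using them” behavior: instances are leased during the burst and handed back as soon as demand falls below on-prem capacity.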

So we think it’s a combination of having good planning, but also maintaining some flexibility in your systems and in your processes that are required to really help AI scale within the enterprise.

I know one of the things that I’ve talked to people about as they’ve talked about this particular problem is the fact that the data transfer as you’re trying to scale new GPU nodes in the cloud or something, if you have to transfer 200 GB of data very frequently, that could be a downside. Are there ways around this data management piece while still keeping things elastic?

Yeah, so when we see people in hybrid cloud and on-premise environments, we like to take a look at what their infrastructure is for replicating that data, and we’d like to see it be continuous, where the copy of the data that lives on the cloud and the copy of the data that lives on-premise are maintained in a way that they’re not exactly identical necessarily, but very close, or there is a path for them to become identical very quickly. So that sort of incremental process ends up being important.

The other side of things, I’d say, is - you know, with all this discussion about just how big the datasets have gotten, and how much data you need to fuel deep learning and so on - we are mostly looking at customers where the upper bound on the size of the training set they’re dealing with is on the order of terabytes. And that is a lot easier to manage and transfer and move around; it’s still hard, you don’t wanna do it a hundred times a day, but it’s easier than moving petabytes, which is the scale that a lot of people in the Hadoop space and so on will talk about. So that gives you a little bit more flexibility, and in our experience data transfer being the big bottleneck is often the exception, not the rule.

Yeah, so it’s good to hear that terabytes is small data now. [laughter] It’s only when we get to petabytes, I guess…

[00:28:07.02] I’m just kind of curious - I’m pretty fascinated, as you’ve taken us through the approach, I’m curious, as you’re looking out at the competitive landscape, as you see different organizations tooling up, everything from the giant companies like Google and Microsoft and Amazon and such, to smaller startups in the space, how do you think about yourself in a competitive advantage mode? What do you really think differentiates yourself from those out there? How do you think about that in your head?

I think there are a couple of key things. One, we’ve got some pretty unique expertise on the team, in this space. These are problems we’ve been thinking about really deeply, both in the academic but also a professional setting for, collectively (the team) dozens of years. And we’ve got a track record of delivering some really popular and influential technology in the space.

The other thing I’d say is I think the cloud vendors are there to build their platforms to help monetize their hardware. The GPUs they’ve invested in, they wanna get people using, and so on. So all of it is – you know, Google pushing Google Cloud, or Amazon pushing their cloud, and so on… Where we differentiate ourselves is by being really neutral to the vendor. We will give people access to the best, cheapest, correct technology for their particular workloads… And you’re already seeing signs of vendors getting custom hardware for these particular tasks.

Google has TPUs now, Microsoft just announced a partnership with Graphcore, and you can bet that Amazon is working on AI-specific hardware… There’s gonna be a bigger menu of hardware choices available to help you solve these problems down the road, and we think that developers, in the same way they don’t wanna be worrying about the resource management and the calendar system, definitely don’t wanna be worrying about reprogramming their applications, and figuring out which chip is best for this version of my language model, and so on.

We think that a layer of abstraction from a systems level can offer that kind of flexibility. So you submit your job to us, we figure out what the best hardware to run it on is, we go acquire that for you, your job gets built and run, and then those resources are released - that basic idea I think is something that we can do and we’ll be able to do better than the larger cloud vendors, because we won’t have these exclusive ties to one or the other.

That kind of leads me a little bit into a next question, which is around automation, and I guess more specifically around AutoML methods. I see AutoML mentioned quite a few times on the site, and also – I mean, there’s been kind of a general trend of AutoML platforms being released. Google Cloud AutoML, or H2O Driverless AI… There seems to be a lot of focus in this area. I was wondering - it probably makes sense to people… Like, one problem that AI people are gonna have is managing their GPU infrastructure, but maybe people think that the hyperparameter tuning and the modeling side of things is kind of their baby, and they don’t wanna mess with things like that.

What do you see as some of the major advantages of automating some of that piece of things, and utilizing some of these AutoML methods to automatically figure out architectures, automatically figure out the right hyperparameters, or automatically do other things…? What role do you see that playing in the future of AI infrastructure?

Yeah, so I think the way we think about it right now is that you’ve got these experts who are highly trained in their particular fields. Maybe they’re really great at understanding the physics of solar flares, or understanding how robotics work, or whatever it is… And yet, they’re spending a lot of their time doing highly tedious tasks.

[00:32:04.27] So looking at the tail end of the log files, figuring out what the loss looks like, deciding “Is this an area I wanna keep investing in, or should I try a radically different model architecture?”, that sort of thing. And then writing the same 50 nested for loops to tune over my parameters, over and over again, when there are better algorithms out there for this stuff. Either they don’t know about them, or they don’t have the time or interest in implementing them… They don’t quite realize it’s easy [unintelligible 00:32:32.06] to miss the fact that much of this work could be totally automated away, or at least partially automated away.

Our view is really that we wanna give these practitioners power tools. Instead of saying “We’re gonna build a robot that builds a house for you”, let’s take a carpenter and equip him with a power hammer, and a circular saw, and so on. That’s the phase we think we’re in when it comes to AI development. If you can equip experts with tools - again, new layers of abstraction that they can reason about; move from fiddling with the knobs individually to reasoning about search spaces and budgets, around how many GPU hours you wanna put into solving a particular problem… And then letting the system pick the right algorithm for hyperparameter optimization, or the right way to approach that problem. We see really terrific gains.
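For readers curious what a better algorithm than nested for loops looks like, here is an editorial sketch of successive halving, the core idea behind methods like Hyperband and ASHA from Ameet Talwalkar’s line of research. The code is a simplified illustration, not Determined AI’s implementation; the search space and loss function are toys:

```python
import random

def successive_halving(configs, evaluate, budget=1, eta=2):
    """Evaluate every config on a small budget, keep the best 1/eta,
    give the survivors eta times more budget, and repeat until one remains."""
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))  # lower loss wins
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta  # survivors earn more training epochs
    return configs[0]

# Toy search space: 16 randomly sampled learning rates.
random.seed(0)
space = [10 ** random.uniform(-4, 0) for _ in range(16)]

# Fake validation loss: minimized near lr = 0.1, improves with more budget.
def evaluate(lr, budget):
    return abs(lr - 0.1) / budget

best = successive_halving(space, evaluate)
print(best)
```

The budget framing is the point: instead of training all 16 configurations to completion, most of them are discarded after a cheap look, so the total GPU hours spent are a small multiple of one full training run.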

We’ve had customers tell us that they were able to replicate what had been a two-month process of manually tuning hyperparameters and selecting model architectures in a single overnight run of our system… And that’s leveraging best of breed algorithms from active learning, developed primarily by my co-founder, Ameet Talwalkar, around hyperparameter optimization and architecture search. That to me – sure, if you could do that 50 times a year, I’d be printing money right now. But even if you can save somebody a couple of months, a few times a year, that ends up being really powerful in the way that they get their work done and how quickly they ship their applications… And again, they start thinking about the data problems and the modeling problems that they have, and not so much “How do I write out this infrastructure?” and that sort of thing.

I know one of the things that Determined AI is working on has been a lot about making AI work reproducible and being able to track experiments. Within the larger body of literature in AI we’re always hearing about explainability and transparency and such in AI… So I guess what I’m asking is why do you think this is important - to have this reproducibility built into AI infrastructure going forward? What kind of benefits do you see it offering, and what do you think might be missing in terms of the things that we are tracking, or parts of the conversation that haven’t really been addressed yet?

Yeah, I think that if you told a software engineer that their code wasn’t gonna be tracked, and that even if their code was tracked, they were gonna check it out from GitHub and try and build the system, and there was only like a 2% chance that they were gonna get the same artifact out at the end of the day as their peer who downloaded the same repo that afternoon, they would look at you like you were completely crazy, right?

You’re right.

But that is very much the state of reality and the world when it comes to machine learning practice… And it’s because we have all this stuff under the hood that we need to track and get just right in order to get our algorithms to converge to the same level. It doesn’t help that the optimization problems we’re solving these days are [unintelligible 00:35:28.02] so there’s a bunch of stochasticity embedded in them, and so on… But the idea that I need to collect and understand every random seed that lives anywhere in my system, I need to understand what are the right hyperparameters for this particular run, what are the settings of the optimizer, and so on… And how is my model even initialized in the first place? Those are all necessary ingredients. I also have to keep track of what my data is.
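A minimal sketch of what “collect every random seed and hyperparameter” can look like in practice: record the seed and settings in one manifest before training starts, so the run can be replayed later. The function name and fields below are illustrative assumptions, not a specific library’s API.

```python
import json
import random

def make_run_manifest(hyperparams, seed=None):
    """Pick (or accept) a seed, seed the RNG the run will use, and return
    a JSON-serializable record of everything needed to replay the run."""
    if seed is None:
        seed = random.randrange(2**32)
    random.seed(seed)  # a real run would also seed numpy/framework RNGs here
    return {"seed": seed, "hyperparams": hyperparams}

manifest = make_run_manifest({"lr": 0.01, "optimizer": "sgd", "momentum": 0.9})
blob = json.dumps(manifest, sort_keys=True)  # store next to the model checkpoint
```

Replaying the run is then a matter of loading the blob, re-seeding from it, and restoring the same hyperparameters and data version.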

[00:35:55.01] Now, once you have built a solution or a system for ensuring reproducibility across runs of different machine learning models - and this gets to your point of why this is important… Now you have the kernel of something that can be used to enable very direct and repeated collaboration among data scientists. You can say “Hey, download my version of the model”, and you can reproduce it exactly. Okay, great. Reproducible - done. That’s cool, reproducible builds. But now I can also use that to say “Hey, why don’t you extend my model? Try training it on a different dataset. Try running it on 64 GPUs and make sure that it converges in the same way.” And I can begin to sort of riff with my colleagues on the next great idea. And I think that’s sort of the dream.

It’s one thing for a single developer to be able to continue to innovate, but once somebody has a good idea and now you can broadcast that idea to the entire rest of the organization, and everybody incorporates that into their solutions, now you’ve got a flywheel going that can really help an organization accelerate.

And again, we see these kinds of best practices and properties emerging at places that are really sophisticated in their AI infrastructure - the bigger companies, the Googles of the world and so on - but that hasn’t hit the mainstream yet, because our tools don’t have support for that… So that’s one of the main things that we try to drive at Determined AI.

Alright, Evan, I’d like to kind of switch gears a little bit here… We’ve been talking a lot about practical things around infrastructure, which I think is great, because this is Practical AI after all, and those things are super-important… But I was also curious to hear some of your thoughts on another subject. I saw that you wrote a recent blog post about AI leadership and positive impacts on things like the economy, on human labor, and other things… I was wondering if you could share a little bit about the motivation behind that article and why you thought some optimism needed to be brought into that conversation.

Yeah, it’s funny - the company is headquartered in San Francisco, and as I get outside the San Francisco AI bubble, or whatever you wanna call it, at dinner parties with friends outside of this world, a common theme that comes up is “Isn’t AI all about automating jobs away? Isn’t it all about taking away my livelihood?” And you know, it’s scary - even people in skilled jobs are looking at “Hey, is an algorithm that is really good at text summarization gonna replace the need for the training programs in my law firm, of an army of freshly admitted attorneys doing discovery work?” and that sort of thing. And the answer is “Maybe.”

When I think about technology, I like to look back on what technology has done for the economy over time, and how this story has played out previously. In the blog post I use an example of how Japan recognized that their population demographics [unintelligible 00:38:53.12] in the ‘80s and started plowing a lot of money into robotics. And of course, now they’re a world leader in robotics. But it was in service of planning for a world where the majority of the population was gonna be over 65, and building out infrastructure to support that.

I think a similar view needs to be taken of AI here. Look at the industrial revolution… We’ve been automating things for a century and a half at this point, and probably longer than that, depending on how you wanna think about it. It always does lead to short-term job displacement, but in the long run, quality of life and standard of living across the globe have risen dramatically. So I think we need to take that view of technology as a whole - we have to be careful about what it does in the short term to people, and make sure that we’ve got social policies in place to help folks out… But it’s good to be optimistic; these technologies can enable things that felt like science fiction ten years ago to be real, like the self-driving cars we see on streets, and so on… And they can also really help in a bunch of otherwise unexpected ways, around things like environmental health.

[00:40:06.07] We’ve got a customer in the waste management space that specifically uses AI to do recycling much more effectively. We’re also working with folks in pharmaceutical drug discovery who are using AI to find cures for diseases. So there are ways these technologies can be used broadly for social good, and that was really the motivation behind the piece I put together.

Yeah, it’s really great to hear that, actually. As a brief tangent, Daniel and I are both very focused on using AI for good, and we talk a lot about it during various episodes. Daniel is focused on making more of the world’s languages available within AI, because there are so many languages out there that are not getting attention from technology, and I focus on animal welfare issues, and such… So I love your optimism in this space.

Turning to the next thing - obviously, with the potential for AI to continue to increase productivity at large, despite some of the bumps in the road for society that you already addressed, and given that there is tremendous concern right now about privacy issues - how do you look at that dynamic tension between productivity and privacy? Are they always at odds with each other? Are they mutually exclusive in the context of AI, or do you see a more optimistic path where you can be productive and preserve privacy at the same time?

It’s a really interesting question in a broad area… With my recovering academic hat on, I think it’s a really interesting question from a fundamental research perspective, where we can formally study whether there is fundamentally a privacy/productivity trade-off, and first try to answer that question… And then if there is indeed this trade-off, maybe there are ways we can come up with that will give us precise control over that trade-off as we make it.

An example I like to talk about is federated learning, where users could potentially remain completely in control of their data and it stays on their edge devices, and yet the collective wisdom of all of the users - through AI and things like homomorphic encryption and so on - could be used, in a differentially private kind of way, to help update models that globally make use of lots of users’ data without leaking individually private, sensitive pieces of information.
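A toy sketch of the federated averaging idea described here (without the homomorphic encryption or differential privacy layers): each client computes a model update on its own data, and only the updates - never the raw data - are averaged into the global model. The single-parameter model and names are illustrative assumptions.

```python
def local_update(w, data, lr=0.1):
    """One gradient step on a client's private data for the model y = w*x.
    The raw (x, y) pairs never leave the client."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_round(global_w, client_datasets):
    """Each client trains locally; only the resulting weights are shared
    and averaged (the FedAvg idea, minus encryption and DP noise)."""
    updates = [local_update(global_w, d) for d in client_datasets]
    return sum(updates) / len(updates)

# Two clients whose private data both follow y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.0), (3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)  # w converges toward 2.0
```

In a production system the averaged updates would additionally be encrypted and noised, which is exactly the trade-off space the research is exploring.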

I don’t think this stuff has been completely figured out, which is why I think it’s still a really interesting research area… But I’m hopeful that as consumers demand that their data be kept private and so on - which I think we’re seeing a lot of, and look no further than GDPR in the European Union as evidence of this - that we will start to have to get clever with how we navigate that trade-off space. I’m really excited – I watch the research coming out in the field pretty closely, because I think there’s some really exciting stuff happening.

Yeah, I know that in the most recent versions of TensorFlow and a bunch of other projects there were certain things around privacy… And of course, you have things like federated learning, like you’re talking about. I was wondering - as we get near the end of our conversation here, in terms of practicalities for AI practitioners, whether that be someone working on some of their first AI projects, maybe as part of a startup, or at a larger company - what are some of the best things we could implement to help our workflows? What’s the biggest bang for the buck? Maybe that’s looking into things like AutoML, or maybe that’s implementing experiment tracking… Where do you think people should start changing their workflows first to make the biggest impact?

I think if you’re in the early days of your project and just kind of getting your feet wet with the technologies, my advice would not be to go try an AutoML solution off the shelf. It might work for you, but you’re gonna be in a position very quickly where you don’t understand what’s going on one layer of the stack beneath you… And as data problems come up, or the next model needs a new tweak or something like that, you might be at a loss, and you might be stuck.

[00:44:32.06] Instead, what I tell people is invest heavily in your data provenance, tracking and versioning, to make sure that you’re in a spot where you can go back and replay the past exactly as it was at that point, and build your models in that particular way… And begin to invest in tracking and understanding your workflows from a code, data, and models perspective. That is some level of experiment tracking.

The other thing I’d say is start simple. Start with the simplest model that could possibly work and solve your problem. That will do two things. One, maybe your problem is really simple and you don’t need a fancy 50-layer convolutional neural network with an LSTM bolted on the side to solve it… which is a good thing to learn. But at the very least, it gives you a baseline: “Okay, this is the baseline, the signal-to-noise place I need to be. I need to make sure my models are at least as good as this.” And it gets you in the habit of targeting a metric that you can use to evaluate whether or not your model is good enough. I think that’s a really important lesson for people getting started.
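The “simplest model that could possibly work” can be as plain as a majority-class baseline - a sketch with hypothetical labels, just to show the habit of establishing a number that fancier models must beat:

```python
from collections import Counter

def majority_baseline(train_labels):
    """The simplest classifier that could possibly work: always
    predict the most common class seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common

def accuracy(model, examples, labels):
    return sum(model(x) == y for x, y in zip(examples, labels)) / len(labels)

train_labels = ["spam", "ham", "ham", "ham", "spam"]
baseline = majority_baseline(train_labels)
# Any fancier model must beat this number to justify its complexity.
score = accuracy(baseline, ["a", "b", "c", "d"], ["ham", "ham", "spam", "ham"])
```

If a 50-layer network can’t clearly beat `score`, the problem may be simpler than the architecture assumes, or the signal may not be there at all.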

Yeah, and I think those are amazing tips… In terms of the experiment tracking one - I think you’re right on the money. That’s a huge benefit that people can have. For people that are maybe not coming from a software engineering background - maybe they’re not quite ready to invest in the full Determined AI solution - what would be some practical ways for them to track experiments? Is it initially just a matter of metadata and naming things correctly, or getting into good version control habits with GitHub? Where do you see people struggling the most, and what are some simple ways they can benefit themselves?

Yeah, so I would say for sure get used to using software version control tools for your code, and version the models that you’ve got. For data, Amazon S3, for example, offers a versioned data store; you can turn versioning on for your bucket and start using the version numbers as you’re pulling data off of it.

And then the last piece, honestly (or a big piece), is around metrics. In the early days those can be recorded through some pretty ad-hoc processes - structured log files where you write down what you think are the key parameters of a particular experiment or run… So think of it as maybe a JSON blob that records the keys and values you care about, and store that somewhere where you’re sure you can get access to it, and so on.
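A minimal version of that structured-log idea - one JSON blob per run, appended to a file. The field names and file path are illustrative assumptions:

```python
import json
import time

def log_run(path, params, metrics):
    """Append one experiment record as a JSON blob to a structured log.
    One line per run keeps the file greppable and easy to load later."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def load_runs(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

log_run("runs.jsonl", {"lr": 0.01, "layers": 3}, {"val_acc": 0.91})
runs = load_runs("runs.jsonl")
```

Even something this simple makes “which hyperparameters produced that number?” answerable months later.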

There are also open source projects out there like MLflow tracking, which can help facilitate this, and give you dashboards around this as well. So that might be another place that I’d recommend people check out if they’re interested in another open source option in this area.

Awesome, yeah. That’s great. I should also mention we had Joel Grus on the podcast; we’ll link his episode in the show notes as well. He talked a good deal about responsible AI development practices, bringing some of that expertise from software engineering into the AI research and development workflow… So we’ll definitely link that.

And I guess to close out, for listeners who might not necessarily have all the skills and infrastructure in back-end engineering, and they’re wanting to kind of level up, maybe they’re even a little bit intimidated by diving into this new area - do you have any other ideas to close out with on how they can level up those infrastructure skills?

There are a number of great online resources. It’s funny, I’ve never really thought about that side of things needing to be leveled up. In fact, that’s kind of why we provide the software platform that we do, to try and keep people from worrying about that…

That’s fair enough.

But yeah, I think the various cloud providers do a good job of providing education around things like Kubernetes and so on, that can be helpful as you’re thinking about what’s the modern way of building out this infrastructure… But I don’t have specific resource recommendations in mind right now.

No worries. Well, Evan, thank you so much for coming onto the show and telling us all about Determined AI, and infrastructure. It was a fantastic conversation.

Sure thing. Great speaking with you guys, thanks so much for having me.
