Practical AI – Episode #56

Worlds are colliding - AI and HPC

Get Fully-Connected with Chris and Daniel

In this very special fully-connected episode of Practical AI, Daniel interviews Chris. They discuss High Performance Computing (HPC) and how it is colliding with the world of AI. Chris explains how HPC differs from cloud/on-prem infrastructure, and he highlights some of the challenges of an HPC-based AI strategy.

Transcript

Welcome to another Fully Connected episode of Practical AI. In these episodes, Chris and I keep you fully-connected with everything that’s happening in the AI community. We’ll take some time to discuss some of the latest AI news and dig into learning resources to help you level up your machine learning game.

I’m Daniel Whitenack, I’m a data scientist working with SIL, and today is a special Fully Connected episode. It’s always Chris Benson (my co-host) and I that do these episodes together; we kind of chat back and forth. We are talking about what’s going on in the AI news, and some of the things we’re seeing, and one of the things that I was seeing was some more mentions of HPC clusters in the AI context. It turns out that Chris Benson, my co-host, is somewhat of an expert in this area, and works very closely in this area at Lockheed Martin… So I thought today we could just take some time and I could interview Chris a little bit about HPC clusters, and we could discuss what they are, how they’re being used, what the future is, and all that. Are you ready for that, Chris?

I sure am. I’m looking forward to this episode; it’s turning things on its head just a little bit here.

Yeah, you get to have some empathy for our guests, and figure out what that’s like. You can give me some pointers about my interview skills afterwards.

Okay, I’m nervous for the first time in forever, Daniel.

[laughs] Alright, we’ll see how it goes; see if either one of us crashes and burns as we’re doing something slightly different… But I think we’ll be alright. Maybe just to start things out, could you remind maybe new listeners, or those that haven’t listened to the intro episodes - remind us what you’re doing, where you’re working, and how you ended up crossing ways with the HPC world?

Okay, so I’m a principal artificial intelligence strategist at Lockheed Martin, and I work directly for the chief data analytics officer, whose name is Matt Tarascio. We do a lot of things for the company at a corporate level, supporting the four big business units. They are each entities unto themselves. You can go out to Wikipedia and check Lockheed Martin; we have Missiles and Fire Control, we have Aeronautics, we have Space, and we have Rotary and Mission Systems. Those are the four business units, and they are all doing incredibly cool things. I’m not trying to sell them, it’s just that it’s a really interesting environment in which to work… So our team tries to support that.

[00:04:28.25] One of the things of many that we are doing on our team is supporting the high-performance computing efforts, along with other teams; it’s not just us doing that. But we are very involved in high-performance computing strategy on how to support all the different people. Lockheed has something on the order of 110,000 people, give or take.

So it’s a large company, with a diverse set of things that different people do throughout the company. So what we’re trying to do - we have lots of HPC capability already within the company, but we are reassessing how we’re doing it and providing support. So what I can certainly talk about is all the types of decisions that have to be made. I obviously won’t be talking about Lockheed’s specific decisions and how we’re implementing them, because it’s proprietary knowledge, but we are kind of neck-deep in all the different decisions, and how do you do this in 2019… And this is a changing space. I’ll turn it back over to you, and then we can dive forward.

Yeah, sounds good. I’m not sure if you knew this, but I had a very brief interaction with the HPC community back after – well, I started out in academia, and in academia, if you’re in any sort of computing research field, oftentimes you’ll interact with HPC clusters… But then after my undergrad I did an internship with the National Center for Atmospheric Research in Boulder, Colorado. They operate several – I’m not sure actually at the moment what they operate, but at the time they operated several big supercomputers, and I was doing some benchmarking of vector computing on a new IBM POWER6 computer.

It’s been a while for me, so I’m interested to kind of – like, I have some concept back in my head of my interactions with the cluster at that time, and what people were doing, but I’m sure it’s just vastly different now, so I’m really interested to hear how things have progressed and how that intersects with AI.

Maybe to begin with, on that subject, could we define what is high-performance computing or HPC? How is it different than some things that might also be in people’s mind, like cloud computing, or something like that?

Just like defining AI can be harder than one might expect, because of all the diversity of opinions on what it is, high-performance computing is also undergoing quite a transformation at this point in time… So I would suggest that I will offer my take on what it is, and there will certainly be people out there in the audience that will disagree with me on this, but… Just as if we were defining AI.

So there are applications in the world that need a tremendous amount of computing resource, more so than you’re typically finding from either on-prem or traditional cloud resources… And you need to be able to scale up with a processing capability that is often done massively in parallel, to be able to tackle a computationally-intense problem. So when I talk to people that have been in this space for a long time - and I’ve only been in it for several years now, not only at Lockheed, but at previous employers - the nature of the field has changed a lot. If I talk to people that have been doing things like simulation… Obviously, Lockheed builds platforms for our customers in various environments - space, underwater, or whatever - and that requires a lot of simulation.

[00:08:00.18] So if you talk to people that have been running high-performance computing clusters for a period of time, what they’re trying to do is say “I need to take maybe a new vehicle or something, and try it out in a simulated environment to solve problems, and figure out while it’s still in this state what it needs to be able to do, and what the problems are, and stuff like that”, before you get in the real world with the real device, and have it not working the way you’re expecting.

So the traditional way has been having these massive clusters of CPUs, and it’s been incredibly expensive to do that historically. You often saw that in government-sponsored laboratories that were associated with government programs. Off the top of my head, things like the Livermore Computing Center, the Lawrence Livermore National Laboratory, they maintain… And there are others out there. But that’s what I think people traditionally think of - being able to say “I need to apply 10,000 or 20,000 CPU cores to a problem in massive parallel to work through it.” That’s how I see it historically, but it’s not really how I engage it personally.

Maybe we can dig into a couple of these jargon things that you mentioned. When you were talking about a cluster - we’re talking about a certain number of compute instances, whether those be virtual machines, or physical nodes, that are working in concert to do something. Now, I’m sure people are also maybe thinking like “Oh, well in the cloud I can have a Kubernetes cluster, I can have a bunch of instances, or on-prem I can buy a bunch of servers and hook them together”, but am I right in – one of the elements of an HPC cluster is really that the nodes are tied together in a specific way, even hardware-wise, that makes them for example communicate very differently maybe than a standard on-prem infrastructure for running web servers, or something like that? They can communicate in a very efficient way, and also handle very large amounts of data. Is that one of the differentiators between let’s say a bunch of on-prem servers running websites, and a high-performance computing cluster?

Yeah, I think so. And as I try to answer this, I wanna acknowledge that the CPU cluster side and the software stacks that go in there are not my area of expertise. I’m bringing kind of the AI/ML perspective, where I’m much stronger, talking about things that you and I often do, Kubernetes and stuff… But part of it, I’ve discovered, is really cultural. There are software stacks that are applied to tie these clusters together. They tend to be closer to the hardware in a lot of the cases I’ve seen, kind of generalizing across the use cases where people put together different architectures.

You’ll sit down at a terminal and do a virtual desktop, but the virtualization across clusters is very close to the hardware to pull that together, and so the traditional view of that is very different from how we look at it in the AI world these days. It brings some challenges into the case, that as AI/ML is becoming part of this - and we’ll talk about that, obviously, in a few minutes - you have very different paradigms on how these clusters are constructed and how you interface with them. I’ve learned the hard way - there’s really not a great one-size-fits-all across all the use cases, and so if you have all those use cases… Maybe the Lockheed people don’t; maybe they are fortunate that they have a particular specialty they’re addressing, which reduces the total scope of what they have to do… But if you’re addressing many different types of use cases, then it can be a struggle to be able to do that.

[00:11:50.06] We were talking about Kubernetes… If you look at more of the traditional CPU side and what I’ve just described, about being able to get close to the hardware, there is a technology called Singularity, which is kind of Kubernetes-like, but it’s – and I’m gonna give a completely non-technical, squishy definition… It’s containerization, but it’s not quite to the extent you think of when you think of Docker and Kubernetes; it’s an open source project that is a lot more like – it’s containerization, but it’s a lot closer to what we traditionally think of as VMs. And that’s one of the popular technologies I’ve seen in this space, that people are looking at. And it’s not the only one, but being open source, it’s a good one to talk about, since we tend to advocate for open source solutions on this show here.

So that’s one where that community is culturally trying to take advantage of containerization, but probably not the way some of us who have come from traditional Docker/Kubernetes in recent years would think of. It’s not quite to that point.

What I am hearing a little bit - I’m just trying to break it down for my simple mind, I guess… Let’s say I have instances in the cloud, right? There would be ways for me to spin up a huge number of instances, run some Python thing on all of them that communicates between all of the nodes, and all of that… But really what I’m doing is I’m spinning up these sorts of generic environments, that are really geared towards a wide set of applications - from web servers, to data processing, to databases, to whatever it is I can run on those instances.

They’re meant to be generic, right? Whereas kind of what I’m hearing is that an HPC cluster, from the start, you say “Well, this cluster - I’m gonna build this so that it can run massively parallel, data-intensive applications at scale.” What do I need to put into this cluster to make that happen? I guess that could include things like specialized connections between the nodes, it could include specialized hardware, it could include specialized software setups, specialized queuing systems and job scheduling, specialized ways of dealing with containers and virtualization… So it’s really the amalgamation of all of those things together, that are really geared towards the specific use of the cluster, I guess. Would that be accurate?

It would be, and most of that is gonna be outside my area of expertise, because we have other amazing people on the team that know that stuff inside out. I am learning that. And when I say that I’m talking about the CPU side of the equation - there are schedulers, and I mentioned Singularity, which does scheduling and does containerization, and it’s designed to take advantage of all those processors across the cluster. So in that way it’s similar to the world that you and I more often operate within, but it’s not exactly the same. And if you currently try to apply more of a traditional CPU-based simulation paradigm, it doesn’t work well in a Kubernetes cluster… Because that was one of the first things I learned as I explored these things - “Why aren’t we using that? Why wouldn’t somebody use that?”

There are reasons that are largely beyond my expertise that I was given, and I trust those experts. I have learned that they really know their stuff. So we have that represented on our team, and I kind of leave that alone.

One of the things that I have really been focusing on myself is more on the AI/ML side, which looks a lot more like the environment that we’re used to, in terms of largely - not even specific to our company, but in general - there is an expectation in high-performance computing now that AI/ML use cases are now requiring that level of computation. So we’re seeing this kind of rapid race up the curve, where originally people would say “I have a GPU to run things on”, and then they would say “I have a small cluster, that’s either on-prem”, or there are now cloud providers that are providing that… And as our industry on the AI side and machine learning side becomes more sophisticated and our models are becoming more complex, the need to drive that computation for highly complex use cases is really shooting up.

And also, it’s interesting that a lot of the traditional simulation side, that would traditionally have been done on a CPU-based cluster - you’re seeing some of that come over into a GPU world at this point. So the AI/ML workload perspective was really not part of high-performance computing until fairly recently… And now you’re seeing those two worlds merge right now, so it’s a very fast-moving field.

Break

[00:16:36.20] So when we say something’s going to run on an HPC cluster and it’s massively parallel in processing, massive amounts of data, could you give us a perspective on what approximately the scales we’re talking about are? I know you mentioned a certain number of CPUs. Could you give us a perspective on how big are these clusters that we’re talking about?

Sure. On the CPU side, a large use case can consume tens of thousands of cores to run simulations in tremendous detail, and be able to do all the parallel computation that’s required of that. People from our side, with our bias, tend to think “Oh, well that’s gonna be eclipsed, and the world goes GPU”, but there are many use cases that are not necessarily specifically optimized for GPU. We are seeing some cross-over there, and there are companies out there that are in the GPU space, NVIDIA being one of them, that are basically trying to pull traditional CPU-based use cases over into the GPU world. You have to do that assessment of what that means to your organization and the projects that you’re involved in. It’s kind of funny - so you can get to that level on the CPU side.

On the GPU side, it’s interesting that as HPC is really addressing the artificial intelligence and machine learning space at this point, then you get into a situation where you can almost consume - for really sophisticated training techniques - a tremendous amount of computation. So it’s really not always about just “I have X number of GPUs. Okay, that’s my requirement.” Going forward, in training we have concepts like mass hyperparameter exploration, where you’re trying to find optimal sets of hyperparameters for your AI model, and you’re training them in parallel, varying hyperparameters, so that you can find the various performance gains and optimizations to do that. That’s one way where you essentially can absorb all the compute that’s available to you. And then there are other things like deep reinforcement learning; we’ll get into things like large-scale self-play, where you are allowing the agents to run, and going through that training cycle of deep reinforcement learning also in parallel to speed up, and to also find different avenues through that.

And then at the end of the day, those are kind of served by auto-scaling anyway, so it’s less of “Well, I have X number of GPUs and I’m gonna run with that over a given period of time. That meets my requirements”, and more like “If we’re gonna do something like this, how much capacity do I have right now?” It may be that in my prior effort, with a slightly different approach, I only needed a certain number of GPUs, but if I’m for instance gonna jump in to doing this mass-scale hyperparameter exploration, I might try to suck in every GPU I can to get through that, so that I can get through it in minutes or hours, instead of days or weeks or months. So I guess the elasticity necessary in your high-performance computing cluster becomes very important, so you have to have strategies that can accommodate those types of use cases.
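To put rough numbers on the "minutes or hours, instead of days or weeks" point, here is a back-of-the-envelope sketch in Python. The trial count, the hours per trial, and the GPU counts are made-up values for illustration, not figures from the conversation, and the sketch assumes every trial occupies exactly one GPU and takes the same amount of time.

```python
import math


def sweep_wall_clock_hours(num_trials: int, hours_per_trial: float, num_gpus: int) -> float:
    """Estimate wall-clock time for a sweep where each trial uses one GPU.

    Trials run in "waves": with N GPUs, N trials run at once, so the sweep
    needs ceil(num_trials / num_gpus) waves of hours_per_trial each.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    waves = math.ceil(num_trials / num_gpus)
    return waves * hours_per_trial


# Hypothetical sweep: 1,000 configurations at 2 hours each.
for gpus in (1, 8, 100, 1000):
    hours = sweep_wall_clock_hours(1000, 2.0, gpus)
    print(f"{gpus:>5} GPUs -> {hours:>7.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions the same sweep drops from months on one GPU to a couple of hours when a thousand GPUs can be borrowed at once, which is the elasticity argument in miniature.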

[00:20:37.28] I’ll definitely say that it’s impressive, like you said, the amount of compute that’s needed even to train a single model, in certain cases, and certainly to explore hyperparameter spaces… For those that might not understand this whole idea of hyperparameter optimization - if I’m gonna train my neural network, I have to make decisions and put in user-defined parameters; these parameters that are not set through the training process, that are things like the number of nodes in this layer, or my learning rate, or a dropout, or something like that. So there’s all these parameters, and one way of figuring out how to best set those parameters to get the best models is to just try a whole bunch of them… Which obviously takes a lot of computational power, but you are kind of exploring that whole space.
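The "just try a whole bunch of them" approach can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the search space values are invented, and `train_and_score` is a toy stand-in for a real training run that returns a fake validation loss. The thread pool is also only a placeholder; on a cluster each trial would more likely be dispatched as its own job on its own GPU.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search space: names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
}


def train_and_score(config: dict) -> float:
    """Stand-in for a real training run; returns a fake validation loss.

    A real version would build a model from `config`, train it, and return
    a validation metric. This toy function is lowest at
    learning_rate=0.01, hidden_units=128, dropout=0.25.
    """
    return (
        abs(config["learning_rate"] - 0.01) * 10
        + abs(config["hidden_units"] - 128) / 128
        + abs(config["dropout"] - 0.25)
    )


def grid(space: dict) -> list[dict]:
    """Expand the search space into every combination of values."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def run_sweep(space: dict, max_workers: int = 8) -> tuple[dict, float]:
    """Score every configuration concurrently and return the best one."""
    configs = grid(space)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(train_and_score, configs))
    best_loss, best_config = min(zip(losses, configs), key=lambda pair: pair[0])
    return best_config, best_loss


best_config, best_loss = run_sweep(SEARCH_SPACE)
print(f"{len(grid(SEARCH_SPACE))} trials, best: {best_config} (loss {best_loss:.3f})")
```

Three values for each of three hyperparameters already means 27 training runs, and the count multiplies with every hyperparameter added, which is why an exhaustive sweep can soak up whatever compute is available.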

I’ve just read – you might have seen this too, Chris… There was an article recently that showed how some of the large-scale language models that are being trained now - training one model contributed as much carbon input to the atmosphere as running five cars for their entire lifetime of use, which is just like… I don’t know, putting it in terms of something that hits home real-world, something you interact with daily, rather than petaflops, or something like that - it just really hits you that this is significant in very technically interesting ways, very impactful ways, and in a positive sense, but also potentially there’s side effects there as well.

There sure are. Anyone who’s listened to the show for long knows that you and I are both incredibly socially conscious people in terms of how we perceive the world, and the kinds of choices that we make… So this is definitely a weak spot in dealing with providing massive amounts of computation within a reasonable time period, that needs to be addressed.

I remember when that came out, about the running five cars for a year, or whatever it was, and I was a little bit stunned… So it’s one of those things that we need to figure out.

Yeah, definitely. Let’s maybe turn a little bit towards the HPC for AI, and how these worlds are colliding… Because I remember, for example, when I did that internship that I mentioned, the primary applications that I was working with were climate modeling applications. I know people have used these sorts of clusters for quite a while for these climate models… I know also in grad school when I was doing computational chemistry calculations, where you’re basically trying to calculate properties of materials based on what you know about how the physics work for atoms and molecules…

So we were kind of submitting jobs to HPC clusters at that time, in a couple different places around the country… And so I know that there’s been this history of HPC clusters being used for these large-scale (like you said) simulations and scientific computations, and those sorts of jobs. But I also know recently I was talking to – I live in the same town where Purdue University is, and I was talking to one of the data scientists that works for Purdue University, and he was saying that now they have sets of nodes, and they’re continually buying more that are specifically geared towards AI applications. So I know that this is happening, and obviously, you’re working in this area, so could you describe maybe why HPC is relevant for AI, and maybe when would I want an HPC cluster for doing AI, versus maybe just spinning up some stuff in the cloud, and vice-versa?

[00:24:22.08] Yeah, it’s a level of really where a use case requires the horizontal scale of what a cluster provides… Because you’re still using the same GPUs for that, but the question is – the cluster gives you the advantage of saying “I can go get the latest NVIDIA GPU, or any of their competitors”, and be able to say “Okay, I’m gonna go do this for my project.” You can be a student and go do that, run it [unintelligible 00:24:46.21] or go to these cloud services where you say “Okay, I’m gonna lock into a really good GPU there, and use it.”

In industry though there are use cases - and I haven’t only run into this at Lockheed, I’ve run into it previously as well… You’ll be dealing with a solution that may not only have challenging models to train, but in many cases you have many models that are working together, that are collaborating, where each model is narrowly performing a particular task in terms of its inference, that it’s performing with great accuracy, but because of the scope of what you’re tackling and the problem set, there may be many of those types of tasks, and you have different models that are applied to each one, but they have dependencies across them.

They’re not standalone in the sense of each model may be only attending to its own input and inference and output, but some of those inputs may be from other model outputs, and stuff… So you may have to manage quite a few models that are interrelated. And those relationships matter as much as just the construction and training of the model itself.

One of the advantages of having a cluster is if you’re dealing with a complex use case like that, where you have these tight dependencies between different models, then it may not be just retraining one, it may be that one model and how it’s performing and what it’s doing and the choices you make there affect other models that are highly dependent on it, and you may change what kind of hyperparameters you’re using, and stuff like that. They can alter the inference itself, but you’re having to look at it from a system perspective, instead of just a model perspective.
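One way to picture the "system perspective" is as a dependency graph of models, where changing one model means retraining everything downstream of it. The sketch below is only an illustration: the model names and their dependencies are hypothetical, and nothing here reflects any particular organization's pipeline. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later) to put the affected models in an order where each one is retrained after its inputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each model maps to the models whose outputs it consumes.
DEPENDS_ON = {
    "detector": set(),
    "tracker": {"detector"},
    "classifier": {"detector"},
    "planner": {"tracker", "classifier"},
    "reporter": {"planner"},
}


def models_to_retrain(changed: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Return `changed` plus every model downstream of it, in a safe retraining order."""
    # Invert the graph: for each model, which models consume its output?
    consumers: dict[str, set[str]] = {name: set() for name in depends_on}
    for model, upstream in depends_on.items():
        for parent in upstream:
            consumers[parent].add(model)

    # Walk downstream from the changed model to collect everything affected.
    affected, frontier = {changed}, [changed]
    while frontier:
        for child in consumers[frontier.pop()]:
            if child not in affected:
                affected.add(child)
                frontier.append(child)

    # Order the affected models so each one is retrained after its inputs.
    order = TopologicalSorter(depends_on).static_order()
    return [model for model in order if model in affected]


print(models_to_retrain("tracker", DEPENDS_ON))
print(models_to_retrain("detector", DEPENDS_ON))
```

In this toy graph, touching the model at the root forces all five models to be retrained, so a single iteration costs several training runs rather than one. That multiplication is where the horizontal scale of a cluster starts to matter.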

So clusters can be really effective when you’re iterating on those types of things, and you’re having to get a whole lot of training done for every iteration, and then be able to go back, and you don’t wanna wait weeks or months because it’s not realistic in the real world. Without the cluster, the problem that you’re trying to solve would not be practically doable in the real world.

It’s super-interesting, and I definitely see the advantage there. But maybe it’s my cheapness, or the fact that I work for a non-profit, or something, but I’m thinking like “Oh, it seems like there’s so much risk…” Like you were saying, no matter what comes out from NVIDIA - the latest GPUs, the latest accelerators, types of software that you can run on certain architectures - all of that is pretty much available very quickly in the cloud; so you can have access to that sort of thing very quickly, in a flexible way… And it kind of brings about a little bit of fear in me if I think about “Oh, we’re gonna choose to invest in a specific architecture, and build out this huge cluster”, which I’m guessing takes a ton of time, it obviously takes a ton of money, and then technology is progressing so quickly that like – how are you not scared that you build this thing and then it’s obsolete in a year? How does that work in a company or in your strategy?

[00:27:56.08] That’s a great question. Your HPC strategy has to accommodate that natural refresh, that natural progress… Because you don’t wanna buy into a technology and expect just to leave it there. So it’s not something that you just go do and walk away from. You’re gonna do it in many phases, that accommodate changes in what’s available to you… And you try to look ahead and structure that automatically. Then you try to take advantage of what you’re trying to accomplish.

For instance, some of the more typical principles you’re gonna find in HPC strategy that you’re trying to accommodate is you’re trying to remove barriers to innovation, and you’re trying to allow with one of these clusters the ability to develop anywhere, with a consistent user experience, and deploy wherever you need, based on your different use cases. You need a solution stack that is consistent with what people would expect to find, whether it’s inside your own organization or external to that, as you bring in new talent… And be able to allow that stack to evolve over time, and refresh. And it needs to support things like Agile development, and iterations, and such as that.

And then what I’ve just alluded to is that user experience is really crucial, because you can put a huge, amazing cluster, that you’re investing many millions of dollars into, but if your user experience is a very bad one, either you drive users away and they seek out other alternatives, or you reduce their productivity and you reduce their ability to rapidly work on the problem set that they’re trying to… And all those things hurt your organization.

So taking all of those into account, you have to think about – you know, some of the more obvious things are like on-prem versus cloud, in terms of how you’re structuring, and hybrid is another popular thing that people are talking about… So how does your population of data scientists - how do they use it? Are they tending to do it just during certain hours, or maybe will they have a week of intense usage and then they’re not doing much model training in the weeks to follow? Or do you have a consistent level of training requirement available around the clock, seven days a week, that you’re doing? How do those moments spike over time?
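The questions about usage patterns can be made concrete with a small sketch that compares a steady demand trace against a spiky one on a fixed-size cluster. The traces and the 40-GPU cluster size are invented for illustration, and the function name is hypothetical; a real assessment would start from measured usage.

```python
def utilization_report(hourly_gpu_demand: list[int], cluster_gpus: int) -> dict:
    """Summarize how a fixed-size cluster would cope with a demand trace."""
    served = [min(d, cluster_gpus) for d in hourly_gpu_demand]
    unmet = [max(d - cluster_gpus, 0) for d in hourly_gpu_demand]
    hours = len(hourly_gpu_demand)
    return {
        "peak_demand": max(hourly_gpu_demand),
        "mean_demand": sum(hourly_gpu_demand) / hours,
        "utilization": sum(served) / (cluster_gpus * hours),
        "unmet_gpu_hours": sum(unmet),
    }


# Two made-up 8-hour traces with the same total demand (320 GPU-hours).
steady = [40] * 8
spiky = [0, 0, 0, 160, 160, 0, 0, 0]

for name, trace in (("steady", steady), ("spiky", spiky)):
    print(name, utilization_report(trace, cluster_gpus=40))
```

With identical total demand, the steady trace keeps the hypothetical cluster fully busy, while the spiky one leaves it idle most of the time and still short at the peak. That contrast is one reason the on-prem, cloud, and hybrid question depends so heavily on how the data scientists actually work.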

I’d love to dig into that user experience… I should clarify - the main reference I have for this is back when I was doing this stuff in my internship, and other things… And just for reference, the user experience with that was like “Okay, here’s what I’m gonna do. I’m going to log into my home space in the cluster. The first thing I’ve gotta do is compile this Fortran code against whatever sorts of things are on the cluster, which I’m not totally sure what’s there…” Basically, it would take me days to get things configured correctly.

Then I’d have to submit a job to a job queue, wait for that job to get queued on the cluster. Then it would get queued, and however many hours later I would get a notification that I had oversubscribed memory, and my job crashed, or something. Then I’d queue it again, and go again… If I compare that to my workflow now, where I’m thinking about “Oh, well I’m gonna write some Python, I’m gonna push it up to a node in the cloud, and I’m just gonna use an S3 client to pull down some data from an object store… And then I run my job, there’s a GPU on there.” That workflow is just so vastly different from what I remember from the HPC world.
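The waiting described here follows from how a batch queue hands out nodes. As a toy illustration only, the sketch below simulates a strict first-come, first-served queue; the job names, node counts, and runtimes are hypothetical. Production schedulers such as Slurm add priorities, backfilling, and resource limits, none of which this sketch attempts.

```python
import heapq


def fifo_start_times(jobs: list[tuple[str, int, float]], total_nodes: int) -> dict[str, float]:
    """Return the hour at which each job starts under strict first-come, first-served.

    `jobs` is a list of (name, nodes_needed, runtime_hours) in submission
    order, all submitted at hour 0. A job never jumps ahead of an earlier one.
    """
    running: list[tuple[float, int]] = []  # min-heap of (end_hour, nodes_held)
    free = total_nodes
    now = 0.0
    starts: dict[str, float] = {}
    for name, nodes, hours in jobs:
        if nodes > total_nodes:
            raise ValueError(f"{name} asks for more nodes than the cluster has")
        # Wait for running jobs to finish until enough nodes are free.
        while free < nodes:
            end_hour, released = heapq.heappop(running)
            now = max(now, end_hour)
            free += released
        starts[name] = now
        free -= nodes
        heapq.heappush(running, (now + hours, nodes))
    return starts


# Hypothetical queue on a 4-node cluster.
queue = [("climate_run", 4, 2.0), ("benchmark", 2, 1.0), ("chem_calc", 2, 3.0), ("rerun", 4, 1.0)]
for name, start in fifo_start_times(queue, total_nodes=4).items():
    print(f"{name:<12} starts at hour {start}")
```

Even in this tiny example the last job sits in the queue for hours before it runs, so a crash shortly after start means paying that wait again.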

Even the data side of things - in the HPC world I remember there was literally a tape silo; for those that don’t know what a tape silo is - it’s like a data store where things are written to these physical tapes, and there’s a little robot arm in there that when you submit a job and you say “I wanna attach this storage”, the robot arm goes over and grabs the tape, and puts it into something… I still don’t know exactly how it works. It connects that to your node, so you can have access to it.

[00:32:09.14] So these two worlds - the workflow side of things, the user experience side of things, at least in the back of my mind and how I think about it - seem so vastly different. Are there ways to bridge that gap now, to where like “Oh, I can spin up a Jupyter notebook and run a job on a cluster”? I don’t actually know what’s possible.

Yeah, probably a good place to start for that answer is looking at the cloud providers that we are familiar with, and creating that user experience for the best possible workflow for a typical user is something they are spending an enormous amount of time on. All of them - Microsoft, Amazon, Google, NVIDIA with their cloud - they all have approaches.

Speaking only for myself, I think my favorite of all those, having used multiple of them, is probably Google Colaboratory. It gives you kind of a free Jupyter notebook environment that you can use, within costs. If you’re doing something as a project on your own time, and not necessarily working with clusters, and stuff. But the ability to just get into a notebook and do that with just individual GPUs - they’ve really done a good job of making that seamless.

One of the challenges right now for companies that are doing it - one of them is a proprietary solution by Domino Data Lab, I know they’re out there, and there’s others as well, that are trying to take that kind of simplicity in your workflow and apply it to large-scale clustered environments. In my view, I haven’t seen anyone that I thought was perfectly there yet. There’s nobody that I’m seeing do that so far that I think is doing it with the simplicity that I love, like I said, in Google Colaboratory. So I’m hoping to see that – particularly, I would love to see open source solutions that will allow us to get there… But I really think if you don’t support your users well in that way, you’re just gonna reduce your productivity and increase your cost.

Okay, so we’ve talked about what an HPC cluster is, how the experience differs, what the scale of these things is… Could you describe some of the AI use cases? You mentioned reinforcement learning, self-play, these other things… I’m curious if there’s particular – well, I guess you’ve mentioned hyperparameter tuning as well. If there are particular types of AI problems or parallelism that fit really well in an HPC setting, and maybe also if there are types of AI workflows that would not fit well in an HPC setting.

Sure. I think it’s less about the specific application, and it’s more about how you’re combining different models across. You may have a problem where you’re building a model, and it can be any of the things - it can be a convolutional neural network (CNN), or it can be a generative adversarial network or whatever, and you can do those without clusters. I think the place where the cluster becomes very advantageous is when you are combining a bunch of those together. You could almost think about it as like using LEGOs, and you have different LEGO parts that are each representing a different building block, and you put those together to build your little LEGO house, or whatever you care about.

And when you’re trying to iterate on those issues where you’re combining a bunch of different models and there’s a lot of dependencies between those models - I guess in that way, as I’m hearing myself talk, it’s not so different from enterprise-scale software development in general. It’s not even specific to AI. You develop a model architecture to solve the problems, and so it’s not one at a time. It’s that massive horizontal parallelism that you need to iterate. And when you find a use case that needs that massive horizontal scale to iterate effectively and in a timely manner, that’s where the cluster really helps.
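As a rough illustration of the horizontal parallelism described above, here is a minimal Python sketch that evaluates several model configurations concurrently instead of one at a time. Everything in it is an assumption made for illustration: `run_trial` is a hypothetical stand-in for a real training job, the scoring formula is made up, and local threads stand in for cluster nodes. On an actual cluster each trial would more likely be a separate job handed to a scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_trial(config):
    """Hypothetical stand-in for training one model variant.

    Returns a made-up score that peaks at learning_rate=0.01, layers=4.
    On a real cluster this would be a full training job on its own GPU(s).
    """
    learning_rate, layers = config
    score = 1.0 - abs(learning_rate - 0.01) * 10 - abs(layers - 4) * 0.05
    return config, score


def sweep(learning_rates, layer_counts, max_workers=4):
    """Evaluate every combination concurrently and return the best one."""
    configs = list(product(learning_rates, layer_counts))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each config is independent, so the trials can run side by side.
        results = list(pool.map(run_trial, configs))
    return max(results, key=lambda item: item[1])


if __name__ == "__main__":
    best_config, best_score = sweep([0.001, 0.01, 0.1], [2, 4, 8])
    print(best_config, round(best_score, 3))
```

The point of the sketch is only that the trials do not depend on each other, so adding workers shortens the time per iteration. Dependencies between models, which Chris also mentions, would need an ordering step that is not shown here.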

[00:36:06.20] The other place would be if you’re a large organization or a cloud provider that is serving many different teams of people working on problems. They all are using a certain amount of your capacity, and you’re trying to accommodate many different use cases with many different characteristics in how they’re using the resource. That’s where a cluster comes in, and being able to provide it in the form of a cloud - I don’t necessarily mean a cloud provider like Amazon, Google, Microsoft; it can be an internal cloud. You need to be able to handle those to make sure that all the teams in your organization are able to be productive when they need to be productive, without you being the single point of constraint on them.

That certainly has a lot to do with why any large organization is going to invest in these - being able to ensure that all of their business units are never constrained by compute resources; they may be constrained by their problem set or whatever, but the compute is not the issue.

There are cases where companies will have a sort of on-prem infrastructure that’s more generic, a cloud-like environment; their own infrastructure where they’re trying to enable generic workloads. But it sounds like a lot of the things that you’re talking about - or at least there are a good number of organizations, like Lockheed or others, that are specifically building clusters that are geared specifically towards AI. I’m guessing that you’re not going to build out a thousand GPUs, or whatever; I don’t know what the scale of the GPUs is in these sorts of clusters, but… You’re not gonna build out that sort of thing for just generic workloads.

You’re making a commitment for the long-term to really invest in AI applications with this cluster, right? What sort of pressure does that create in terms of – I’m assuming once you have that cluster in place - and I don’t know what the scale of the investment is, but I’m sure it’s amazing… What sort of pressure does that create to really be squeezing everything you can out of that cluster? AI is so dynamic, and it’s changing so quickly; how do you guess “Oh, I’m gonna need this scale of cluster for all of these AI sorts of problems that we’re solving”, when the types of models are changing so rapidly, and all of that?

Well, first of all, we have a lot of great strategic partnerships out there with other organizations that have similar interests, and in some cases similar scale in terms of what they’re trying to address. So it’s not only building out the infrastructure, but you have to buy the hardware, and you have to make an estimation on what you think your GPU utilization might be. If you don’t have a lot of history in that, that can be a real challenge.

I think every organization I’ve ever been a part of over the last few years, or talked with, has had to tackle that. I don’t know that there’s a great way of doing it, but I think part of the challenge in answering that question is that to some degree if you do a good job of it, “If you build it, they will come”, to use a Field of Dreams quote there… If you have a great infrastructure that suddenly increases people’s productivity, then whatever your historical thing has been in terms of utilization and uptake on your systems, you’re very much likely to have an uptick on that when you provide a great way of engaging on that.

[00:39:40.26] So you have to accommodate your success factor, of “Wow, I’m meeting everyone’s expectations, and now it’s almost getting the better of me if I’m not careful.” And then you have to make sure that not only is the hardware refreshable, but the software is extendable, too. If you think how fast any one of the things that go into this – if we talk about a Docker/Kubernetes stack, and you think about all the advances in Docker and Kubernetes, and that they are constantly evolving, because they’re in such widespread use… And those are not specific to an AI workload, because they have public stuff out there.

NVIDIA has a production-grade AI platform that they use internally on their own massive, massive stack that they have for their self-driving car stuff, which is called MagLev. You can google MagLev and there’s some information out there on it. Not a whole lot, because it’s not a product that they sell; it’s an internal thing. But I know that’s how they approach – within the context of being inside a Kubernetes cluster, it’s how they approach all the AI-specific workflow tasking that has to happen to make something work well. Google has their approach, and Amazon and Microsoft - they all have their approach; we may eventually see some good open source solutions to be able to do that, but… It’s not just a hardware thing.

We tend to get locked up into – when you think of HPC, it’s just the hardware; but you have to have a stack which is enabling all the things you have to do in your workflow to get it done. So it’s quite a lot to think about, especially since you have to allow each part of the stack to grow at its own pace.

Yeah, that’s a great explanation. I guess maybe it’s a turn to a more forward-looking question - what excites you about the future of AI on HPC, or what trends are you seeing in terms of HPC usage for AI applications, and what of those are the things you’re most excited to follow and be part of?

I think it’s really tied into how I see AI itself going forward, because the AI side of HPC – you know, the HPC is the enabler for what you’re trying to do with AI models… And so what we’re seeing an explosion of over the last few years is – just in the time we’ve been doing this podcast, it used to be people talked about building a great model to solve their problems, and now you’re seeing production cases where you might have many models working in a solution, and you’re seeing the commoditization and democratization due to these… Oftentimes open source software is really driving the cost down.

So just as people’s desire to solve more and more complex problems with a variety of neural network solutions working collaboratively together is increasing very rapidly, I think you’re gonna see that the providers are doing that; you’re gonna see all the major cloud providers certainly moving from “Let me grab a GPU” to clustering as a service. The way you think of it - you may have service agreements that accommodate a baseline of a number of different GPUs, with different types of elasticity models in there.

If I go and look right now, I’m not up to date on the latest offerings, in all cases, and I need to see – some of them may already be starting to do that at this point, but that’s gonna become really common, I think… Because certainly all the large organizations, like the one I belong to - this is becoming a standard part of many things. A year or two ago, doing neural network development was still a bit of a specialization, and I think what we’re seeing is that it’s now becoming part of system and software development in general, and becoming a very standard skill, that’s expected to be part of any solution going forward.

I think the HPC space is rapidly expanding and advancing, to be able to accommodate this exponential AI growth that we’re all experiencing.

Given that, I know there’s probably people out there that are excited about this, and they would eventually love to be working with Lockheed, or other large organizations - maybe even government organizations - that actually have the ability to build out these large-scale clusters. Or maybe there’s people working in academic research; a lot of times academics use HPC clusters that they get grants for, and that sort of thing.

[00:44:03.07] Assuming people don’t have the funds to set up their own HPC cluster in their basement, which I’m guessing there’s very few of those people out there, are there any ways to learn (at least a little bit) about some of the ideas around HPC, that people could dig into in terms of a learning resource?

Yeah, I think we’re just getting to where this updated version of HPC is starting to develop, now that AI is part of that story… And I know that I ran across a Udacity course that is being done - it’s High-Performance Computing by Georgia Tech. I have not taken the course myself, but it’s one of their nanodegree programs, and I think that would probably be a good place to start.

And not everything – just to point out, you can have a smaller cluster that you can build, and you can get several HPC servers, your own solution, and network them and try out some of the software that’s becoming available… And, like I said, that makes more sense if you’re going to be running models around the clock. If you’re very occasional, like “I need a lot once in a while”, then cloud providers probably are a more economical way to go. Or some hybrid in between, with a baseline of training that you’re looking at - you can set up a much smaller version in your organization than what a Fortune 100 company might be doing. So it doesn’t all have to be at a massive scale, tens of millions of dollars or more, to be able to do that.

As the cost of hardware has been driven down, try something small; mess around with it. Take these courses; there’s new information coming online all the time if you’re googling… See what the various vendors are offering, because a lot of great information comes by looking through the various vendor sites.
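The trade-off mentioned above, between running models around the clock on your own hardware and renting cloud capacity occasionally, can be sketched as a simple break-even calculation. This is a minimal sketch under stated assumptions: every number below (hardware price, lifetime, operating cost, cloud rate) is hypothetical and not taken from the conversation, and the model ignores things like staffing, depreciation schedules, and reserved-instance discounts.

```python
def monthly_owned_cost(hardware_cost, lifetime_months, monthly_operating_cost):
    """Cost per month of owning: hardware spread over its lifetime, plus upkeep."""
    return hardware_cost / lifetime_months + monthly_operating_cost


def breakeven_hours(hardware_cost, lifetime_months, monthly_operating_cost,
                    cloud_rate_per_gpu_hour, gpu_count):
    """Hours of use per month above which owning is cheaper than renting."""
    owned = monthly_owned_cost(hardware_cost, lifetime_months,
                               monthly_operating_cost)
    cloud_cost_per_hour = cloud_rate_per_gpu_hour * gpu_count
    return owned / cloud_cost_per_hour


def cheaper_option(hours_per_month, **params):
    """Return 'own' or 'cloud' for a given monthly usage level."""
    return "own" if hours_per_month > breakeven_hours(**params) else "cloud"


# Hypothetical figures for a small 8-GPU server (assumptions, not real prices).
params = dict(
    hardware_cost=144_000.0,       # purchase price
    lifetime_months=36,            # refresh cycle
    monthly_operating_cost=800.0,  # power, cooling, hosting
    cloud_rate_per_gpu_hour=3.0,   # rental price per GPU per hour
    gpu_count=8,
)

if __name__ == "__main__":
    print(breakeven_hours(**params))
    print(cheaper_option(600, **params))  # near-continuous use
    print(cheaper_option(50, **params))   # occasional bursts
```

With these made-up inputs, heavy near-continuous use favors owning and occasional bursts favor the cloud, which matches the rule of thumb in the conversation. A hybrid setup would size the owned baseline near the steady workload and rent the rest.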

Yeah, I’d also like to advertise a little bit… If there’s anyone listening to the podcast who is in college, or studying computer science, studying sciences or AI in college, and you’re looking for an internship after college, and you’re kind of interested in this high-performance computing space, I mentioned that I did that internship at NCAR, and that was such an amazing time for me to be able to be hands-on with these just amazing super-computers, and getting to really be hands-on with some of the latest technology on that front… Even getting to participate in a high-performance computing conference, and other things…

That internship program is called SIParCS (Summer Internships in Parallel Computational Science), and they have it every summer. It’s still going on. So if any of you university students out there are looking for something to do, I’d highly recommend that. I’ll link that in the show notes as well.

It was really great to have this conversation, Chris. Things are changing so much, and it’s really nice to orient myself with a little bit of this technology; I’m excited to see where it goes, and how you’re involved with it over time. Thanks for indulging me and letting me interview you a bit today.

I appreciate it. I’m gonna take this awkward guest hat off and put my co-host hat back on… Which is a much more comfortable hat for me. Thanks for doing this, Dan.

Yeah, definitely. We’ll talk to you soon.

Thanks a lot. Bye!
