Practical AI – Episode #138

Multi-GPU training is hard (without PyTorch Lightning)

featuring William Falcon, creator of PyTorch Lightning & CEO of Grid AI


William Falcon wants AI practitioners to spend more time on model development, and less time on engineering. PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research that lets you train on multiple GPUs, TPUs, CPUs and even in 16-bit precision without changing your code! In this episode, we dig deep into Lightning, how it works, and what it is enabling. William also discusses the Grid AI platform (built on top of PyTorch Lightning). This platform lets you seamlessly train 100s of Machine Learning models on the cloud from your laptop.





Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. This week we have a really exciting show. I’m pumped to talk about this. We have William Falcon with us, who is creator of PyTorch Lightning and CEO of Grid AI. Welcome, William!

Well, thank you guys for having me. I’m really excited to chat with you.

We are as well, and I think I might have even mentioned this to Chris on our Slack channel, but I saw you on Twitter when Grid AI was launched; there was a screencast of “These are some things that you can do with Grid AI.” It was one of those moments – I don’t know if you’ve ever seen a Kelsey Hightower demo in the Kubernetes world, or something like that… It was one of those moments where things just sort of snowballed, and then all of a sudden you were running models on all of these GPUs in the cloud with very little effort. It was pretty cool, so I’m excited to dive into that at some point.

Yeah, I’m excited to share it.

Yeah, cool. So maybe before we get to there, let’s maybe start at PyTorch Lightning. People might have heard of PyTorch, they might have heard of Lightning… I know Lightning kind of shows up in my Twitter feed quite a bit… Could you just give us a little bit of context for what PyTorch Lightning is, and how people can use it, where it might fit into people’s workflow?

[04:14] Yeah, so I’ll talk a little bit about my experience to understand the motivation behind it… Because my sense from speaking to people in the community is that we’ve all had very similar problems and thought about very similar approaches. The difference is we open sourced this, and a lot of people started contributing to it.

So I started out as a software engineer, and I was working in finance, and before that I was an undergrad and I was starting to do research… And I’d been working as a software engineer, and when I got into AI research, it was in computational neuroscience; we were trying to take neural activity from the brain and trying to reconstruct what generated that. That was in the context of eyesight, basically. And so what happened there is none of us were really big engineers in deep learning. We weren’t experts. So I started training models, and back then I was using Theano, which is a very old framework… And I remember the first time we got something running on the GPU, and it was magical, because suddenly my time went from months to a few days, and I was like “Great.”

The research continued, and what I found myself doing over and over back then was I’d have an idea about something that wasn’t quite that. In neural decoding it’s basically like translating a sequence of signals into something - an image, or another signal. So it’s a translation problem, in essence. So you could do things like GANs, auto-encoders, you could do things like regressions… Many ideas. And I would wanna try a few different approaches with my teammates, and we’d have to copy the code over and over again. You would either fork the project and then copy that code over, and then if something new came out, like multi-GPU training, you’d have to then write it into all the code that you did. So suddenly, you’re maintaining ten different files that are all doing the same thing. And I started abstracting that into like a joint class.

I think all of us kind of do this at some point… And I think at that point I had been using scikit-learn for a while, so I loved their fit and all those methods, so I was like “Okay, whatever, we’ll just call it fit and do that.”

And then I transitioned to TensorFlow, because we needed to get into multiple GPUs and it was really hard to do in Theano. So that took our training time dramatically down.

Then I continued working on it for a while, but the problem is it just continued. Every time I wanted to do something new, you had to copy that code over. And then new things came up all the time. There was just a different way of training, and it was really hard to go back and copy and paste all that stuff.

I left that project for a bit and went into the startup world, and spent a few years putting NLP models into production. And there it was less about focus on training and more about deploying models. So I was just like, cool, quick baseline and then just put that thing in there and see what happens. I was less concerned about solving a very unique problem and more about “Hey, I have the data here. I don’t care what the model is, I just wanna see some results.” So we got that working, and ended up scaling that to a company that got acquired, and that was basically using NLP to help low-income, first-generation citizens figure out how to pay for college over text message, which was really cool.

From there, I started my Ph.D. and kind of like started that research flow again. Then, coming from the startup world, I was like “How do I bring that speed and agility to research?” Because we all know this, and I think [unintelligible 00:07:37.01] talks about this - we all know this first-hand, but the outcome of doing anything with AI nowadays is honestly a function of how fast you iterate through ideas. Because 90% of your ideas are gonna fail, and then one or two are gonna work, and then you’re good to go. So literally, just how fast can you power through those ideas is probably the single biggest predictor if that thing is going to work or not. I knew that, and I wanted to bring that ability to my Ph.D. research. I was like “Hey, maybe I can finish this thing in three years, as opposed to six, or whatever.”

[08:06] Ambitious. [laughs]

Yeah… Looking back now, it’s not a good idea, but yeah, that was the goal.

I know the feeling.

Yeah. So I took my code from my undergrad days and kind of brushed it off, and then at that point I’d already switched to PyTorch, so I was like “Okay, well let me just rewrite this thing in PyTorch and see how it goes.” So I started working with, again, NLP at that point, and then we moved into audio research to do speech synthesis, and so on. And all of that using the same code.

It was interesting, because the first code was for NLP, and then I modified it to work for audio, and then vision, and so on… And then eventually - I don’t think it was quite there, at that abstraction level yet, because I was still having to do a lot of bespoke code, but then I don’t know what happened… I guess over the winter something clicked, and then the trainer got factored out, and then it just became obvious that at that point you need to separate the model from the hardware. So that’s what Lightning became.

Then I open sourced it, and joined Facebook, and I researched that summer as an intern at FAIR, and continued my Ph.D. research. And there you have a giant cluster, and I was like “Okay, if I have Facebook resources, what can I do?” [laughs] I’m very ambitious in terms of trying to do research ideas, so we were trying to scale up massive datasets on the cluster as much as we could. I was consistently training 500 GPU models, that kind of stuff, all the time at FAIR with this framework. And people noticed, because the cluster - there was like a handful of teams across Facebook that was using a cluster that efficiently, but the rest of the teams weren’t, because it takes a lot to do training at scale.

And so I started working with those people, because they’re experts at this… So we embedded a lot of those practices into Lightning, and then ended up with a framework now that can do really scalable training. And then at that point there was some adoption internally, then adoption externally, and then it just kind of took off after that. But I came at it from “How do I move really fast through research, knowing what I know about putting models into production as well, and knowing what I know about doing research as well?” Having both requirements made it really interesting.

What’s really cool now is that it’s evolved into – you know, my vision really was you and I, all three of us, are going to code the exact same thing, in our own projects. We’re gonna code half precision, we’re gonna code stochastic weight averaging, we’re gonna code whatever new thing comes up. But why waste that effort? That’s not the job. The job is to – you know, if you’re Lockheed Martin, predict metal whatever. Like, find deficiency in materials; I don’t know what you guys do there, but… [laughter]

That’ll work, that’ll work.

I think that’s exactly what Chris does. I assume. [laughter]

So that’s the goal. The goal is not to figure out how to implement stochastic weight averaging, right? So what’s cool now is that – I think we’re approaching 500 contributors, but these are all top researchers and Ph.D.s all over the world who implement these things and put them into papers… And then within a few hours it’s ready and available for everyone. So do you have to know how half precision works on GPUs? You don’t. But you just know that it’s gonna save you memory. So it’s been basically turned into a community project, and my vision was really “Can we build the world’s research lab? Can we all have access to top researchers and resources?” And that’s what’s happened so far.

I noticed as you’re kind of going through the story, it seems like as you progressed over those years through the different aspects of your own life, and you’re kind of looking at the same problem through multiple lenses, as you’re going from software development, and then you’re doing research, and then you’re at Facebook doing research, and the scales are changing… It seems very much like you were scratching your own itch, but having the benefit of taking into account multiple perceptions of that problem, so that you ended up having a very rich understanding of what was needed and how it could satisfy multiple user groups. Do you think that’s a fair assessment, or am I missing the boat? It seems like it was a really smart way of building a robust project from different perspectives, all rolled into one.

[12:07] Yeah, I think that’s right. Like I said, none of this was ever because I was trying to build anything for anyone else. I was trying to make myself move fast in research.

I think once other people started using it, they gave me the perspective there, and they put those constraints… I mean, Lightning is not where it is today because of me, it’s there because of the community. There’s no way I could have ever created this by myself. I think I could see the idea and see the templates, but a lot of my job has been to guide the community and maintain standards, maintain usability… I care a lot about user experience, and I don’t wanna remember a lot of stuff. So it’s just been a lot of guidance there as well. But at the end of the day, it’s the community that’s done a lot of this.

But I think holistically having to focus on a lot of domains has made it super-general, because doing NLP is very different from vision, and it’s very different from reinforcement learning, and meta learning, and so on… And it’s not obvious to know where they overlap. So it’s been kind of a research project really, in the long-run - how do you factor out deep learning code and make it interoperable? Yeah, so that’s been an interesting journey so far.

You mentioned when you were introducing the motivation behind Lightning the idea of decoupling models from hardware… And I noticed, even just if I look at the repository for Lightning, you talk about PyTorch Lightning is just organized PyTorch, and it’s organized to sort of decouple science from engineering. So you’ve got this model side and the hardware side. Could you dive into that a little bit more and talk about the specifics of what does it mean if I’m using Lightning, what does it mean that my model is disentangled or decoupled from the hardware? …both practically, in terms of how I write the code, and what happens once I hit Fit, like you’re talking about.

Yeah. So I think if you’re working at a company - or any team really, even research - if you’re working with multiple people, you need the ability to share code. And if you’re at a company, or even university lab, you wanna share code across teams. And that’s really hard to do without something like Lightning. Because what happens is people tend to intermingle a lot of stuff, like data, model and hardware into the same files. Well, one team may not have GPUs, or may have different types of GPUs, or may only be using CPUs, or your production requirements mean that you can only use CPUs for inference. So there are a lot of constraints there. And I guess if you’re not thinking about it how we are, from the abstract level, you won’t really realize that a lot of the reasons why a lot of that code doesn’t operate together is because you’re mixing the hardware with the model code. And that’s something that took us four years probably to get there, to see those, to have these insights…

And what that means is that we can factor out deep learning code into three major areas; well, at least four, I guess. And we’ll find more; it’s ongoing research. So one is training code - this is anything that has to do with linking your model to the machine specifically; so how do you do the backward pass… You know, a backward pass in distributed training is very different from just on CPUs… At least technically speaking. What happens if you have half precision there? What happens if you’re using stochastic weight averaging? What happens if you have truncated backprop steps, right? There are a lot of details that go into it.

So all of that is handled by the trainer. And this is the stuff that you’re gonna do over and over again. It doesn’t matter if you’re doing audio, or speech, or vision, you’re always gonna have a backward pass, you’re always gonna have a training loop, and so on. The model is the thing that changes. The model is not just – I like to think about models… In Lightning we have this concept of a module, and to me a Lightning module is more of a system.

We can think about a model like a convolutional neural network, or a linear regression model. Just like a self-contained module. Today’s models are actually not models. We need a new name, because there’s something that doesn’t exist, and I think it’s the Lightning module, which is a system, because models now interact with each other. Like, what do you call an encoder and a decoder working together to make an auto-encoder or variational auto-encoder? They’re not models; it’s collections of models interacting together. Same for transformers.

[16:07] So that’s really what the Lightning module is about - you pass these models into it, and then how they interact together is abstracted by that. And I think that’s a missing abstraction that was not there, which is why people were jumping through so many hoops, to be like “Oh, well how do you do GANs? How do you do this other stuff?”

So it’s important to decouple that, because now I have this single file that’s completely self-contained, that I can now share with my team across in a different division, and their problem might be completely different, with a different data set, and they don’t have to ever change the code on that model; all they have to do is change what hardware they’re using and then what the dataset is. As long as it conforms to the API that the model is expecting, it works. So it makes code extremely interoperable.

I think people come to Lightning because they wanna train on multiple GPUs and so on. And under the hood we have this API called Accelerators that lets you do that. But that’s only a very small part of it. I think once you get into it, you see that the rest of it is the ability to collaborate with peers, and be able to have reproducible and scalable code.

Thank you for the great introduction to what Lightning is, and how to think about some of the abstractions that you’re working with. I’m wondering if you could maybe share a little bit – I’ve seen some different stories online, but I was wondering from your experience with the community that’s working with this, could you provide any sort of stories around how people have been able to scale things up with Lightning? Maybe in your own work, or maybe stories that you like to highlight.

There are a lot of companies and labs using Lightning. You can get on GitHub and see that for yourself. I don’t know the exact numbers, but it’s definitely in the thousands, like a few thousands of them. And they go from pharma, to retail, to anything you can think of.

Today what’s interesting is that – you know, when I run into these people, because we’re coming to work with them on Grid (some of them), it’s interesting to hear the use cases. Stuff that I would have never imagined, because I’m not a company doing this kind of stuff. So that’s why I made a joke about Lockheed Martin, but I’m sure you guys are doing much more advanced stuff… Unless I’m building planes, there’s no way that I know to do that, right?

So what’s cool is just like, it’s been super-flexible. I think there are public cases that we can talk about. There are blog posts by big companies like NVIDIA, Facebook and so on, about how they use Lightning; you can read that. I think something that we do specifically in the community is we really like to protect our partners, because this is a community, and we wanna keep people’s work fairly private as well… So I won’t get into too many details. I’m just pointing you to open sources that you can look at, and how they use it. But these are big projects as well.

[20:05] There are probably about 3,000 projects now that use Lightning that you can literally just go to see them. So the companies that have open sourced their work, you can see what projects they’re working on. It’s everything from video prediction, to segmentation, to NLP, to summarization, to classification… We integrate really well with basically most frameworks out there. So if you use anything that’s PyTorch-based, it’s very likely going to work with Lightning right off the bat.

Now, in terms of scaling – we’ve done it internally, but we’ve also heard from the corporate partners that they’re training things on… Yeah, I guess the number - there’s no real limit so far; I guess it’s whatever PyTorch supports…

[laughs] However many GPUs you can get your hands on…

Yeah… That’s a big part of Grid now. With Grid and Lightning you can literally type in a thousand GPUs, and if you have the Amazon quota - great. [laughs] And we can give you as many as we can as well, but there’s no limitation. You just have to run it. I know it sounds crazy, but you literally just have to run it, and then it’ll just work.

So it’s just a function of the compute there. A month ago we did a collaboration with Microsoft. Microsoft has this library called DeepSpeed, which is really cool… Facebook has one also, called FairScale. Basically, it lets you scale up models dramatically by helping you use CPU memory efficiently, and the way you shard gradients, and the way you shard parameters across GPUs really helps…

So we were able to train a GPT model… I remember it was like 20 billion parameters, or something like that. We have a case study for that. So just for context, the original GPT-3 was – I don’t remember; it was like… Let me see here. 160 billion parameters, or something like that. I don’t wanna misquote you numbers, but basically, whatever the original GPT-3 was, I think it was like one third of that, with only eight GPUs. That’s crazy. I don’t think anyone in industry needs that much; I haven’t seen people use that much… So I’m just saying, that’s a pretty good lower bound.

175 billion. Or at least that’s what Google is telling me on a search…

So you were very close.

And you said you were running that on eight GPUs?

Yeah, the A100s. Only eight of them.

Oh, wow.

I mean, it’s A100s, so they’re much bigger than V100s, but we’ll be doing more tests. That was for deep… And what’s cool about it is if you’re just using Lightning on your trainer, you just say – I think it’s like “plugin = deepspeed”. Like a string called “deepspeed”. Just by doing that, you get that out of the box. That’s the kind of stuff that we embed into training. So do you have to know how to do that? You don’t. But now you get that benefit.

I wanted to real quick pop in one thing before we start moving on on this. There are some people that are listening that may not – they may even be not PyTorch users; they might be TensorFlow users, but they’re thinking about switching… You know, we always get into conversations… How does the workflow look like when you’re integrating PyTorch Lightning into your workflow? You’re using the rest of the ecosystem… Could you at a high level, just for those who haven’t used it, and maybe not have something directly that they’re going “Oh yeah, I’ve done similar to that. I can just add Lightning into that” - what that looks like, what that savings, why is it called Lightning for them… They’re kind of going “Oh, there’s this thing that may really help me.” Can you kind of just top off a little bit of a workflow and how I go from the beginning to getting something productively deployed, and what that looks like, for somebody who hasn’t seen it before?

Yeah, absolutely. Wait, so I’ve found the blog post; it was actually 45 billion parameters that we scaled it up on eight A100s.

You can look it up, it’s called “Accessible multi-billion parameter training with PyTorch Lightning and DeepSpeed.”
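To put the numbers quoted above side by side, here is a small back-of-the-envelope calculation. It is only a sketch under stated assumptions: 2 bytes per parameter for 16-bit precision, weights split evenly across the eight GPUs, and gradients, optimizer state and activations ignored, so real memory use would be considerably higher.

```python
# Back-of-the-envelope numbers for the figures quoted in the conversation.
# Assumptions: 2 bytes per parameter (16-bit precision) and an even split of
# the weights across GPUs. Gradients, optimizer state and activations are
# ignored, so this is a lower bound on memory, not a measurement.

GPT3_PARAMS = 175e9        # GPT-3 size quoted in the episode
TRAINED_PARAMS = 45e9      # size quoted from the blog post
NUM_GPUS = 8
BYTES_PER_PARAM_FP16 = 2

fraction_of_gpt3 = TRAINED_PARAMS / GPT3_PARAMS
weights_gb = TRAINED_PARAMS * BYTES_PER_PARAM_FP16 / 1e9
weights_gb_per_gpu = weights_gb / NUM_GPUS

print(f"fraction of GPT-3: {fraction_of_gpt3:.2f}")
print(f"fp16 weights: {weights_gb:.0f} GB total, {weights_gb_per_gpu:.2f} GB per GPU")
```

Under these assumptions the 45-billion-parameter model is roughly a quarter of GPT-3's size, and its 16-bit weights alone come to about 90 GB, which is why sharding parameters across devices matters.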

We’ll link it in the show notes.

Yeah, that sounds good. Okay, so basically it’s how do you adopt Lightning into your workflow, right? I mean, obviously, if you’re coming from not PyTorch, then you would just start with Lightning. There’s a very simple readme there… I would say copy-paste that readme. There’s an MNIST example on there, and you just run it.

People will say “Oh, but where are the advanced examples?” and my point is that “That is the advanced example.” All you have to do is change the data and it’ll still work for ImageNet. [laughs]

[24:15] That’s great.

That’s the beauty of it. There’s no different example for that. I mean, we’ll put it in if you want, but at the end of the day, just change your data and set GPUs to 64 and you’re good to go. So that’s the easy part. So if you’re coming outside of PyTorch, then you can do that.

If you’re coming from within PyTorch, then what people tend to do is when they start a new project, they’ll either start it on Lightning directly, or they’ll convert their existing projects into Lightning. So it is really a refactor on your PyTorch project. You basically take your main loop code, which usually looks something like, you know, you initialize the model, you set a bunch of flags, you set some sort of argparse arguments, and then you download some data, and link it somehow. It’s all boilerplate. Then there’s two loops in there, which is like for epoch in epochs, for batch in your data loader, and then you start training.

So literally, everything up to that for batch in your data loader is deleted, so it’s gone. Then the only thing that you need to track is what’s in there; we call that the training step, which is the meat of what you want. I mean, think about when you’re doing work, that’s what you spend your time on. So that goes into this function called the training step. Then the training step goes all the way from taking your batch into generating a loss that you return, with a gradient attached (some graph). It could be a few lines. Usually it’s only a few lines, because that’s most of what you’re doing. Now, the model that you left at the top, that one you can keep it separate and just pass it into a Lightning module and just use it. You know, self.model=model. Or you can define that model within the Lightning module. So you can literally copy-paste the layers and all that into the Lightning module if you want… Because the Lightning module is an nn.Module at the end of the day.

That gets you basically most of it. Then you need to find your optimizer and bring it into a function called configure_optimizers(). Then you just return it there. You’re gonna link up the parameters through that as well. So that’s three methods - that’s your init, that’s your training step, and that’s your configure_optimizers.
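The shape of that refactor can be sketched without any deep learning library. The example below is a toy in plain Python, not Lightning's actual API: `LinearRegressionModule` and `ToyTrainer` are hypothetical names, and because there is no autograd here, `training_step` returns hand-derived gradients alongside the loss (in Lightning the returned loss carries its own graph). It only illustrates the split described above: the module holds the init, the training step and the optimizer setup, while the trainer owns the two loops.

```python
class LinearRegressionModule:
    """Toy stand-in for a Lightning-style module (hypothetical, not the real API)."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def training_step(self, batch):
        # The "science": turn a batch into a loss (mean squared error).
        # No autograd in this toy, so gradients are derived by hand and returned too.
        n = len(batch)
        errors = [(self.w * x + self.b) - y for x, y in batch]
        loss = sum(e * e for e in errors) / n
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / n
        grad_b = sum(2 * e for e in errors) / n
        return loss, {"w": grad_w, "b": grad_b}

    def configure_optimizers(self):
        # Return the update rule; here, plain gradient descent on w and b.
        def sgd_step(grads):
            self.w -= self.lr * grads["w"]
            self.b -= self.lr * grads["b"]
        return sgd_step


class ToyTrainer:
    """Owns the boilerplate loops so the module never has to (hypothetical)."""

    def __init__(self, max_epochs=200, batch_size=4):
        self.max_epochs = max_epochs
        self.batch_size = batch_size

    def fit(self, module, dataset):
        optimizer_step = module.configure_optimizers()
        loss = None
        for _ in range(self.max_epochs):                        # for epoch in epochs
            for i in range(0, len(dataset), self.batch_size):   # for batch in loader
                batch = dataset[i:i + self.batch_size]
                loss, grads = module.training_step(batch)
                optimizer_step(grads)
        return loss


# Made-up, noise-free data for y = 3x + 1.
dataset = [(x / 10, 3 * (x / 10) + 1) for x in range(20)]
module = LinearRegressionModule()
final_loss = ToyTrainer().fit(module, dataset)
print(round(module.w, 2), round(module.b, 2), final_loss)
```

Swapping in a different dataset or a different trainer configuration requires no change to the module, which is the point of the separation.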

Then the rest of that is optional after that. Forward - we don’t actually need it. We use the forward method for inference, right? So if you train a model and you – for example, an autoencoder. An autoencoder has two sides - an encoder and a decoder. The encoder maps some input into some space, an embedding, and then the decoder maps that embedding back into some space.

So an autoencoder can be used two ways. You can use it as embedder, basically, so you can take an image and get an embedding for it, and then do similarity search, and so on. So if you’re building like a visual engine or something, you would do that. Or you can use a decoder for sampling. You can give it a random vector and it’ll give you an image, for example. Or text, or whatever you want.

So depending on what your use case is, that’s how you’re going to implement the forward. Because the forward is what’s going to be called in production. You’re gonna call the model with the input to it. So we actually allow the model to be TorchScripted and put into ONNX for production use cases. It’s literally a function called .to_torchscript() or .to_onnx(), and then you’re good to go, and it does all the things for you. You just have to get the inputs, transform it, pass it through, and then do the returns. It’s very simple.
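The idea that the same trained system can expose a different forward depending on the use case can be illustrated with a deliberately tiny, plain-Python sketch. Everything here is hypothetical: `ToyAutoencoder`, `Embedder` and `Sampler` are made-up names, and the "encoder" and "decoder" are trivial scalings rather than neural networks.

```python
class ToyAutoencoder:
    """Toy system made of two sub-models, an encoder and a decoder (hypothetical)."""

    def __init__(self, scale=0.5):
        self.scale = scale  # a single made-up "weight" shared by both halves

    def encode(self, x):
        # Map an input vector to an embedding.
        return [self.scale * v for v in x]

    def decode(self, z):
        # Map an embedding back to input space.
        return [v / self.scale for v in z]

    def training_step(self, batch):
        # Training uses both halves: reconstruct each input and score the error.
        total = 0.0
        for x in batch:
            x_hat = self.decode(self.encode(x))
            total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
        return total / len(batch)

    def forward(self, x):
        raise NotImplementedError("pick a forward that matches the use case")

    def __call__(self, x):
        # What production code calls: model(input).
        return self.forward(x)


class Embedder(ToyAutoencoder):
    def forward(self, x):
        # Use case 1: return an embedding, e.g. for similarity search.
        return self.encode(x)


class Sampler(ToyAutoencoder):
    def forward(self, z):
        # Use case 2: turn a latent vector into an output sample.
        return self.decode(z)


embedding = Embedder()([2.0, 4.0])
sample = Sampler()([1.0, 2.0])
reconstruction_loss = ToyAutoencoder().training_step([[2.0, 4.0]])
```

The training step is identical in both subclasses; only the inference entry point differs.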

Now, there’s other stuff left – so that’s literally it. You just have to copy that stuff… And then anything else that’s left is usually around data, or maybe validation, or testing. The validation - we have a validation step and a test step as well, where you can just copy-paste that code in there, if you want a validation loop or a test loop.

For the data, you can leave it as is. You can just pass in the data loaders directly to Lightning. Or you can use something called the data module, which is a completely optional abstraction… But it basically captures your training, validation and test data loader into one class, and couples the transforms as well. Because what usually happens at big companies is that I’m working on – let’s say I’m maybe selling something. I’m selling clothing. So I have the dataset of our inventory, with images and so on, and then when I give it to you, you’re gonna be like “Hey, how did you transform the images? Did you crop it? Did you random flip? What did you do?”

[28:21] So unless they give you that code, then it’s gonna be a little bit hard, and we could mess it up. So the data module embeds all of that. So I just have to say “Here’s the data module for the clothing dataset”, and you just run it and you know it’s gonna be consistent across the board, no matter how you run it. So that’s an optional – I mean, I highly encourage abstraction, but it’s optional.
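A data module in this sense can be pictured as one class that owns the splits, the batching and the transforms. The sketch below is plain Python with made-up data; `ClothingDataModule` is a hypothetical name, the "images" are two-number lists, and the transform is a stand-in for real cropping or flipping. The loader method names follow the training, validation and test loaders mentioned above.

```python
import random


class ClothingDataModule:
    """Toy data module: splits, batching and transforms live in one place."""

    def __init__(self, records, batch_size=2, seed=0):
        self.batch_size = batch_size
        shuffled = list(records)
        random.Random(seed).shuffle(shuffled)   # fixed seed: same split for everyone
        n_train = len(shuffled) * 6 // 10
        n_val = len(shuffled) * 2 // 10
        self.train = shuffled[:n_train]
        self.val = shuffled[n_train:n_train + n_val]
        self.test = shuffled[n_train + n_val:]

    def transform(self, record):
        # The preprocessing every user must agree on (stand-in for crop/flip):
        # scale pixel values from 0..255 down to 0..1.
        pixels, label = record
        return [p / 255 for p in pixels], label

    def _batches(self, split):
        items = [self.transform(r) for r in split]
        return [items[i:i + self.batch_size]
                for i in range(0, len(items), self.batch_size)]

    def train_dataloader(self):
        return self._batches(self.train)

    def val_dataloader(self):
        return self._batches(self.val)

    def test_dataloader(self):
        return self._batches(self.test)


# Made-up "images" (two pixel values each) with binary labels.
records = [([i, 255 - i], i % 2) for i in range(10)]
dm = ClothingDataModule(records)
first_batch = dm.train_dataloader()[0]
```

Anyone who constructs the same data module gets the same splits and the same preprocessing, which is the consistency argument made above.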

That’s basically it… So if you do it, I would just recommend - don’t delete your project; just do the refactor first, put it into Lightning, run it once… When you do it with Lightning, you’re gonna be able to run it on your local machine with CPUs or GPUs. Take a batch of data from your dataset, or a single example, and overfit both models - your original code and this one - with the same seed and everything. And make sure you get the same results. Then once you get that, you know you’re good to go; you know you didn’t mess it up. At that point, you can go ahead and say gpus=128 and then off you go.
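That sanity check (overfit one batch with both versions, same seed, compare results) might look like the following. This is a plain-Python toy under stated assumptions: the "original" and "refactored" code are both small hand-written linear regressions, and every name here (`make_batch`, `original_script`, `RefactoredModule`, `fit`) is hypothetical.

```python
import random


def make_batch(seed, n=8):
    # Same seed gives the same batch, so both versions see identical data.
    rng = random.Random(seed)
    return [(x, 2 * x - 1) for x in (rng.uniform(-1, 1) for _ in range(n))]


def mse_and_grads(w, b, batch):
    # Mean squared error of y = w*x + b, with hand-derived gradients.
    errors = [(w * x + b) - y for x, y in batch]
    loss = sum(e * e for e in errors) / len(batch)
    grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
    grad_b = sum(2 * e for e in errors) / len(batch)
    return loss, grad_w, grad_b


def original_script(batch, steps=50, lr=0.1):
    # "Before": parameters, loop and update rule all in one function.
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, grad_w, grad_b = mse_and_grads(w, b, batch)
        w, b = w - lr * grad_w, b - lr * grad_b
    return mse_and_grads(w, b, batch)[0]


class RefactoredModule:
    # "After": the module only scores a batch and applies an update.
    def __init__(self, lr=0.1):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def training_step(self, batch):
        return mse_and_grads(self.w, self.b, batch)

    def optimizer_step(self, grad_w, grad_b):
        self.w -= self.lr * grad_w
        self.b -= self.lr * grad_b


def fit(module, batch, steps=50):
    # The loop now lives outside the module.
    for _ in range(steps):
        _, grad_w, grad_b = module.training_step(batch)
        module.optimizer_step(grad_w, grad_b)
    return module.training_step(batch)[0]


batch = make_batch(seed=42)
initial_loss = mse_and_grads(0.0, 0.0, batch)[0]
loss_before = original_script(batch)
loss_after = fit(RefactoredModule(), batch)
print(initial_loss, loss_before, loss_after)
```

If the two final losses disagree, the refactor changed behavior somewhere, and it is much cheaper to find that out on a single batch than on a large multi-GPU run.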

So it sounds like if I’m a PyTorch developer and I’m already using that API, I’m creating the layers of my model, I don’t have to throw out the way that I created that model. In some ways, I get to sort of delete a bunch of my code having to do with the hardware stuff, and some of the other training-related things, and I can keep my model and sort of refactor it into this PyTorch module, the Lightning module, and then call the trainer… And essentially then I now have less code, but my code is also more robust in that I can run that training on a whole variety of hardware, and that sort of thing. Am I basically summarizing that correctly, or anything you would change about that?

And it’s more readable. You can literally give it to your colleagues and then they know to go to the training step to see what’s happening. Otherwise, what do you do today? You’re like, “Hey, here’s this seven lines on GitHub…”

Yeah, it’s crazy.

They’re like, “Wait, where is what you’re doing?” Because most of it is boilerplate training stuff. Now you can be like “Hey, here’s exactly what I’m doing.” They’re like “Oh, you’re sampling the latent space before doing this thing. Oh, interesting.” It’s not mingled with all this other stuff, so it’s very easy to read as well.

I joke, but it is kind of like cleaning your house, I guess. Imagine – I guess roses, right? Maybe this is a good example. A rose - you have to cut it from the bush, and trim all this stuff, and then you get this bulb at the end, which is what you care about. It feels like that. It’s like, no one’s adding these other leaves because they want to, it’s because they have to. So when you refactor your code, it’s this sense of like “Okay, it’s a lot cleaner now. I just removed a lot of unnecessary stuff.” And also stuff that you’re likely to mess up. We test very thoroughly, and we have thousands of people testing this stuff. So did we mess up the backward pass? Definitely not. Did you mess it up? Hopefully not. [laughs]

Break: [31:11]

Okay, I wanna kind of circle all the way back to where our conversation started, because I wanna get back to that cool demo that I saw on Twitter about Grid AI. Maybe you could just give us a little bit of sense of what Grid AI is, kind of how it came about, how it’s maybe connected to the Lightning community (if at all), and then we can get into some of the details about what it enables.

So as you saw from my story, I care a lot about reproducibility and speed of iteration, and something that I thought a lot about as we were doing research and building Lightning was - in a corporate setting, you would want to scale this stuff up on a lot of compute, and you have cloud resources, and all these different things. So the requirements for training at scale in a company are very different than just like on a Google Colab, or a Kaggle. It’s just a very different world.

It’s funny, because deployment also goes into that. People are like “Oh, here you go. You deploy it on this thing.” It’s like “Well, yeah, but most real machine learning systems are not just an API.” So we know that – I mean, a lot of us build these models, we’ve all been at companies before as well at scale, so we know exactly the pain points there… So the thing that kept coming up is like “Cool, Lightning is letting me do all this, but I’m still having to do all of this cloud stuff. If I ask for 32 GPUs on Lightning - yeah, Lightning will do the thing, but you need to give me the 32 GPUs.” And giving you the 32 GPUs - that’s a lot of work, to do it consistently and at scale and cheaply, so that you don’t have to burn resources.

So what people end up doing generally is they build these ad-hoc internal solutions. They put together Bash scripts, or things… They string together samplings of a platform. And they’re great, and yeah, you’ll get things running, but you won’t be able to just scale them down immediately. You won’t be able to have really fast build times because they’re highly optimized. You wanna have real-time logs, you wanna have real-time metrics, you wanna have real-time integrations. So all of these bells and whistles, when these things happen internally - they usually get pushed away, because they’re not a company priority… Because they shouldn’t be. You know, you’re building airplanes, you’re not building machine learning platforms. So you’re normally not going to put the effort into making all the things that we care about as researchers and data scientists and machine learning engineers in there. So it’ll just kind of make your life a lot harder.

So it’s about “How do we bring that whole experience and encompass that model development cycle in a scalable way, for the needs of companies and even big labs?” Because most serious AI labs - they’re training things on very large scales as well. Because training is a big part of the picture; it’s not just the deployment. I think the deployment is interesting, but it’s a lot easier, because we’ve been deploying websites and things forever, but we haven’t been training for that long. It’s kind of a newer thing.

So that’s really the focus of Grid: to just completely eliminate the pain point that was left from using Lightning by not even having to deal with it. You just type in 32 GPUs and it just happens.

So I am wondering… There’s still a lot of people, I think - and maybe I have a misconception about this - that they think maybe training models on GPUs in the cloud is always gonna be more expensive than training on a sort of… Like, you’re gonna buy an on-prem server and do it in-house. Based on your experience with that and the current state of cloud providers and all of that, is that perception mostly driven by the fact that – and I feel very seen by the comment about, like, you have all these Bash scripts strung together; that’s my life, maybe… [laughs] But is it because that way of doing things is a bit inefficient and you waste a lot of resources, and that sort of thing? Where do you think that perception is coming from, and do you think it is accurate, I guess is my question.

[35:56] Yeah, I think you hit it right on the nail. If your system is inefficient, then it’s more efficient to have your own machines. Running on Grid means that we install your dependencies, everything you need to link up your data, in a matter of minutes, if not seconds. People don’t generally optimize their stuff in the backend to do that. So what they end up doing is they wanna run on the local machines because they don’t have to install their environments, they don’t have to do all this stuff again. It’s just there, and it’s repeatable, and things start immediately, so it’s a lot cheaper.

I’m not gonna say that running on your local stuff is not generally cheaper if you’re doing things 24/7, but you’re limited by bursting capabilities. So you’re never going to have – I don’t know how many GPUs AWS has, but it’s gotta be hundreds of thousands. So if you have to hit a deadline or do something really quick, and even go through ideas fast, if you’re buying your own GPUs, you’re gonna be limited by how many you have there. So it’s gonna be more like sequential model building, as opposed to asynchronous building.

So with Grid you can go spin up 200 GPUs, run for five minutes and shut them down, and you just got a lot done. Whereas on your own machines, even if you were to do it yourself on the cloud, you would probably not even get the models running for 20 or 30 minutes, while you spin up the machines and set up all that stuff. So I can take $100 on Grid and get more GPU minutes out of it than you would normally without optimal systems. So it’s just very optimized.

Now, I do think that people need to know about things – I mean, we do a lot to lower the cost, and I think one of those things is spot instances. Spot instances are machines that can be killed at any time by AWS, or whatever cloud provider you’re using. And then at that point you’re kind of done. But the nice thing about spot is that it will be like a 50% to 80% discount. So if a GPU costs $3/hour, it could be like 30 cents/hour to maybe a dollar an hour. It really depends.
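
[Editor's sketch] The spot-pricing arithmetic above can be worked through in a few lines of Python. This is only an illustration: the function names are hypothetical, the $3/hour figure is the one quoted in the conversation, and real spot prices vary by provider, region, and time. Applied literally, a 50% to 80% discount on $3/hour gives $1.50 to $0.60 per hour, while the quoted range of 30 cents to a dollar would correspond to a discount of roughly 67% to 90%.

```python
def spot_price(on_demand_per_hour: float, discount: float) -> float:
    """Hourly price after applying a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return on_demand_per_hour * (1.0 - discount)


def implied_discount(on_demand_per_hour: float, spot_per_hour: float) -> float:
    """Fractional discount implied by a quoted spot price."""
    if on_demand_per_hour <= 0:
        raise ValueError("on-demand price must be positive")
    return 1.0 - spot_per_hour / on_demand_per_hour


if __name__ == "__main__":
    on_demand = 3.00  # dollars per GPU-hour, the figure quoted in the episode

    # Price range if the discount really is 50% to 80%.
    for discount in (0.5, 0.8):
        price = spot_price(on_demand, discount)
        print(f"{discount:.0%} off -> ${price:.2f}/hour")

    # Discount implied by the quoted "30 cents to a dollar" range.
    for quoted in (0.30, 1.00):
        implied = implied_discount(on_demand, quoted)
        print(f"${quoted:.2f}/hour implies {implied:.0%} off")
```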

So I think what you’re saying is true, because I did the calculus myself, and in fact, I have a blog post on how to build your own GPUs for this reason. But that was only for [unintelligible 00:38:00.01] and it cost me maybe 6k to build that machine, which is great. Now, $6,000, if I’m paying full GPU prices, I’ll burn through that in like two weeks, for sure. But if I’m paying spot prices, then that changes the game. And not only that, but if I’m getting more training minutes out of that, that’s a lot better. And then you factor in depreciation and all this other stuff, plus maintenance - then it actually becomes a little bit competitive.
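
[Editor's sketch] The buy-versus-rent comparison above comes down to how many GPU-hours a fixed budget buys. The snippet below is a rough illustration under stated assumptions: the helper names are hypothetical, the $6,000 budget and $3/hour rate are the figures mentioned in the conversation, and the GPU count is an assumption, since the episode does not say how many GPUs would be running. It also leaves out the depreciation, maintenance, and electricity costs mentioned above.

```python
def gpu_hours(budget: float, price_per_gpu_hour: float) -> float:
    """GPU-hours that a budget buys at a given hourly price."""
    if price_per_gpu_hour <= 0:
        raise ValueError("price_per_gpu_hour must be positive")
    return budget / price_per_gpu_hour


def days_until_spent(budget: float, price_per_gpu_hour: float, num_gpus: int) -> float:
    """Days until the budget runs out with num_gpus running 24/7."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours(budget, price_per_gpu_hour) / (24 * num_gpus)


if __name__ == "__main__":
    budget = 6000.0  # dollars, the cost of the home-built machine
    num_gpus = 6     # ASSUMPTION: not stated in the episode

    # On-demand rate from the episode, plus two example spot rates
    # (50% and 80% off $3/hour).
    for label, price in (("on-demand", 3.00), ("spot, 50% off", 1.50), ("spot, 80% off", 0.60)):
        hours = gpu_hours(budget, price)
        days = days_until_spent(budget, price, num_gpus)
        print(f"{label}: {hours:.0f} GPU-hours, about {days:.1f} days on {num_gpus} GPUs")
```

At $3/hour the budget buys 2,000 GPU-hours, so the "two weeks" estimate is consistent with roughly six GPUs running around the clock; with a single GPU the same budget would last much longer.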

It does. I’m curious - and we’re talking a lot about the training… Could you talk a little bit about Grid’s deployment story and what that is? In my mind, one of the things – speaking for myself, I’ll be training centrally in the cloud and stuff, but at the end of the day I’ve gotta get my model or my system of models out there into something, often some sort of edge device, not cloud-based. Something that’s a physical thing out in the real world. Can you talk about how you work with Grid to effect that?

Yeah, so today Grid doesn’t support deployments. The thing that we like to focus on is making sure that we really nail certain experiences before moving on to other things. So we will support deployment at some point, probably very soon. But the thing is like, I don’t think that we’re fully optimal on the training side yet. I think we wanna provide a really world-class experience there.

So for our users today - you can now access artifacts, you can get model checkpoints and all that stuff… So the deployment – most users have a deployment system in-house already, so they can just take the artifacts and do their thing. So we’re not blocking any of that. And all of these things are URL-based, and if it’s Lightning - that’s very easy to do.

Now, we’re gonna make it a lot easier, for sure - kind of the way that we do things - but today we are laser-focused on training. But I will say, I think working with Grid at this stage is great, because I think companies will be able to help us influence that roadmap, and help us build something that they really care about as well… Because as soon as we start getting to deployment, we’re gonna do it our way, and we have a very special way of doing things, so we hope that we have the feedback from the community and users to make sure that we’re doing it in a really useful way.

[40:01] And how as a user of Grid – because this is really fascinating to me, because I’ve even been struggling to get some in-house GPUs, just with supply chain issues and all of those things… So running things on the cloud is something that we’re actively thinking a lot about, and doing it in an optimized way… Now, we kind of talked before about going, say, from PyTorch to PyTorch Lightning. Let’s say I’ve got my Python code, I’m using Lightning, it works great, and now I wanna run it with Grid on 100 GPUs in the cloud. What does that look like? Do I need to set up my cloud account, set up billing on that side, and then set up my Grid account and then use a Grid tool to connect them both? How does that whole flow work from that point?

That’s a good question. Generally, I like to think about what we’re trying to do like that leap between Windows machines to Mac machines, where things just work. Like, what is that Apple experience for machine learning. And to answer your question, it’s very easy. It’s not as easy as I want it to be today, but it will be. Basically, there are a few ways. We have three tiers of usage on Grid. We have the community tier, which is free. Literally, you’re just paying the AWS compute. There’s nothing in there; we’re just orchestrating stuff for you. But it doesn’t really work for teams and big companies, because there’s a lot of stuff that needs to happen. Then we have the teams and enterprise tier, so that you do those kind of things.

On the community tier, you literally have to do nothing. You just copy-paste the link to a GitHub file, you paste it into the UI or use the CLI, and you select how many GPUs you want, and you press enter, and you’re done. It’s that easy. Dependencies are automatically pulled for you, they’re inferred from the code that you have, your requirements, all that stuff. So we try to do as much as possible.

Yes, there will be times when that fails, and we will work with you to figure out what happened, and make sure that we get it done… But you know, dependency management is a big deal for everyone, and it’s a really hard problem to solve, so it’s gonna take us a while to fully solve that problem. But if you are at a company or a big lab, usually – we call that community tier. That’s gonna work great for side projects, and public data, and stuff like that; Kaggle, prototyping things… Sure, if your data is not secret, then it’s fine. It’s great for academics as well.

But if you have corporate data, then you’re gonna be in the teams and enterprise tier. There, what you end up doing is we basically link up your cloud accounts. So you just set it up through Grid, you’re passing credentials through there, and then those keys let us control resources on your behalf only as much as you allow us to, to make sure that we orchestrate everything on your cloud. So it’s kind of this hybrid on-prem vs. not on-prem. We also offer on-prem if people want it.

[42:47] So once you do that, you basically put in your cloud credentials in there, then you’re good to go. When you run stuff on Grid, instead of running on the Grid Cloud, which is a community cloud, you just select your cloud, whatever you named it, and then you just run on it. That means you can link up as many of these as you want as well.

As we kind of wind up here, one of the things that’s really struck me through the conversation is that you are a man of substantial vision. And as we kind of wind up, I’m really curious if you would kind of look out a little bit beyond just the next product cycle, and that kind of thing, into where you want to go, both with Grid, and where you see the larger industry going in general, in terms of trying to make this work a little bit better for people, and take the struggle out of it, that you clearly have been working on for a while, in various capacities? Could you tell us a little bit about what future you think we’re going toward, and how you would like to shape it?

You know, when I started in research, I was really disappointed that I had to do so much work over and over again, that other people were doing. And that I had to learn so much just to decode a little bit of neural activity. And the world that I would love to help bring to the table is a world where the person, the scientist, the researcher, the machine learning engineer, the person that has the knowledge of whatever they’re building - the doctor, the biologist, the mechanical engineer, you name it; the person who really knows their domain - can basically focus on that, and have machine learning and all of this cloud stuff just kind of fade into the background, and just be like Wi-Fi, just like your cell phone signal. You don’t think about it; you’re just working on your problem. So how do we take that leap? I think that’s what we’re trying to solve. Are we there yet? No, but we’re definitely on track.

I think that working with a lot of amazing companies and getting to make sure that we support their use cases is what’s gonna help us get there. So the person who builds the models, who has the ideas, the adopter - they can be the ones to actually train and deploy this stuff. Because at the end of the day, I think that deployment is literally just another training cycle; except the data is live and you’re not back-propagating into your model.

That’s awesome. Thank you so much for talking to us about Grid and Lightning. It’s been really wonderful, and like I say, we’ll put show notes on everything - the relevant links that we’ve talked about, in terms of Lightning and Grid; definitely check it out. Yeah, thank you so much for joining us, William. It’s been a pleasure.

Thank you, guys. This was a really fun conversation. Thank you.


Our transcripts are open source on GitHub. Improvements are welcome. 💚