Multi-GPU training is hard (without PyTorch Lightning)
William Falcon wants AI practitioners to spend more time on model development and less time on engineering. PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research that lets you train on multiple GPUs, TPUs, or CPUs, and even in 16-bit precision, without changing your code. In this episode, we dig deep into Lightning, how it works, and what it enables. William also discusses the Grid AI platform (built on top of PyTorch Lightning), which lets you seamlessly train hundreds of machine learning models on the cloud from your laptop.
Matched from the episode's transcript
William Falcon: Yeah. So I think if you're working at a company - or any team really, even research - if you're working with multiple people, you need the ability to share code. And if you're at a company, or even a university lab, you wanna share code across teams. And that's really hard to do without something like Lightning. Because what happens is people tend to intermingle a lot of stuff - like data, model and hardware - into the same files. Well, one team may not have GPUs, or may have different types of GPUs, or may only be using CPUs, or your production requirements mean that you can only use CPUs for inference. So there are a lot of constraints there. And I guess if you're not thinking about it how we are, from the abstract level, you won't really realize that a lot of the reason that code doesn't operate together is because you're mixing the hardware with the model code. And that's something that took us probably four years to get to, to see those, to have these insights… And what that means is that we can factor deep learning code into three major areas; well, at least four, I guess. And we'll find more; it's ongoing research. So one is training code - this is anything that has to do with linking your model to the machine specifically; so how do you do the backward pass… You know, the backward pass in distributed is very different from just on CPUs, at least technically speaking. What happens if you have half precision there? What happens if you're using stochastic weight averaging? What happens if you have truncated backprop steps? There are a lot of details that go into it.
So all of that is handled by the trainer. And this is the stuff that you're gonna do over and over again. It doesn't matter if you're doing audio, or speech, or vision - you're always gonna have a backward pass, you're always gonna have a training loop, and so on. The model is the thing that changes. And the model is not just - I like to think about models… In Lightning we have this concept of a module, and to me a Lightning module is more of a system.
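As a rough sketch of what the trainer owns, assuming a recent PyTorch Lightning API (flag names such as `precision` and the `StochasticWeightAveraging` callback have shifted between versions, so treat this as illustrative rather than exact):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import StochasticWeightAveraging

# Everything tied to the machine and the training mechanics lives on the Trainer:
# the loop itself, the backward pass, precision, and tricks like SWA.
# None of it appears in the model file.
trainer = pl.Trainer(
    max_epochs=10,
    precision=16,                                          # half-precision handled here
    callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)],   # SWA is a Trainer concern
)
# trainer.fit(model, datamodule=dm)  # model and data are supplied separately (see below)
```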
We can think about a model like a convolutional neural network, or a linear regression model - just a self-contained module. Today's models are actually not models. We need a new name, because it's something that doesn't exist yet, and I think that's the Lightning module, which is a system - because models now interact with each other. Like, what do you call an encoder and a decoder working together to make an autoencoder or a variational autoencoder? They're not models; they're collections of models interacting together. Same for transformers.
[16:07] So that's really what the Lightning module is about - you pass these models into it, and then how they interact together is abstracted by that. And I think that's an abstraction that was missing, which is why people were jumping through so many hoops, to be like "Oh, well how do you do GANs? How do you do this other stuff?"
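A minimal sketch of the "module as a system" idea, close to Lightning's own introductory autoencoder example (the layer sizes here are made up for illustration):

```python
import torch
from torch import nn
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    """Not one model but a small system: an encoder and a decoder,
    plus the logic describing how they interact."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        # How the sub-models interact lives here; nothing about hardware does.
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return nn.functional.mse_loss(x_hat, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```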
So it's important to decouple that, because now I have this single file that's completely self-contained, that I can now share with my team or across to a different division, and their problem might be completely different, with a different dataset, and they don't ever have to change the code in that model; all they have to do is change what hardware they're using and what the dataset is. As long as it conforms to the API that the model is expecting, it works. So it makes code extremely interoperable.
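A hedged sketch of that hand-off: the shared module above stays untouched, and a teammate supplies their own data through a hypothetical `MyDataModule` built on Lightning's `LightningDataModule` (the random tensors stand in for their real dataset):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset, random_split


class MyDataModule(pl.LightningDataModule):
    """Hypothetical team-specific data. The shared model never imports this;
    it only sees batches in the shape it expects."""

    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Stand-in data; a real team would load their own dataset here.
        x = torch.rand(1000, 28 * 28)
        full = TensorDataset(x, torch.zeros(1000))
        self.train_set, self.val_set = random_split(full, [800, 200])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)


# The shared model file is reused as-is; only the hardware flags and the data change.
model = LitAutoEncoder()                                  # the module from the sketch above
trainer = pl.Trainer(accelerator="cpu", max_epochs=1)     # this team happens to have no GPUs
trainer.fit(model, datamodule=MyDataModule())
```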
I think people come to Lightning because they wanna train on multiple GPUs and so on. And under the hood we have this API called Accelerators that lets you do that. But that's only a very small part of it. I think once you get into it, you see that the rest of it is the ability to collaborate with peers, and to have reproducible and scalable code.
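A small illustration of that point, assuming the newer Trainer flags (older releases used `gpus=` and `tpu_cores=` instead of `accelerator=` / `devices=`):

```python
import pytorch_lightning as pl

# The same LightningModule and DataModule run under any of these;
# only the Trainer construction changes.
trainer = pl.Trainer(accelerator="cpu")                               # laptop debugging
# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")  # one multi-GPU machine
# trainer = pl.Trainer(accelerator="tpu", devices=8)                  # a TPU slice
```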