Practical AI – Episode #194

Evaluating models without test data

with Charles Martin, AI Consultant with Calculation Consulting


WeightWatcher, created by Charles Martin, is an open source diagnostic tool for analyzing Neural Networks without training or even test data! Charles joins us in this episode to discuss the tool and how it fills certain gaps in current model evaluation workflows. Along the way, we discuss statistical methods from physics and a variety of practical ways to modify your training runs.



Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist with Lockheed Martin. How’re you doing, Chris?

I’m doing very well, Daniel. How’s it going today?

It’s going well. I’ve been training quite a few models recently, NLP models for question answering and other things… And one thing that always comes up in that is “How long do I train this thing? Am I overtraining it? Am I undertraining it? How do I test it appropriately? Am I testing it right? What else should I be doing?” All of these thoughts are running through my mind… And I’m pretty excited, because today we have joining us Charles Martin, who is an AI and data science consultant with Calculation Consulting. And this is basically one of the things that he thinks about a lot and builds tooling for… So welcome, Charles.

Hey, great. Thanks for having me, guys.

Yeah. Well, I think the thing that I saw of your work that really interested me was this WeightWatcher tool, which is an open source diagnostic tool for analyzing neural networks without the need for access to training or even test data… Which is super-interesting, and I want to get into all the details about that… But maybe just describe for us kind of pre WeightWatcher - what led up to WeightWatcher? What were the sort of motivations that were going through your mind, and maybe the things that you were encountering in your own work that led you to think about this problem?

Sure, sure. So I do consulting in AI, and I had some clients working with me to do text generation. So this was years before GPT, and before all these amazing diffusion models existed. We were training LSTMs to generate text, for things like weight loss articles, and reviews on Amazon, and stuff like that. And I realized that as I used these models, I couldn’t really evaluate them… Because if you’re training like a classifier, like an old SVM, or XGBoost, you can look at the training accuracy. But if you’re trying to design a model to generate text or some other natural language processing problem, like say designing embedding vectors for search relevance, it’s really hard to evaluate whether your model is converging or not.

Now, I had studied statistical physics of neural networks when I was a postdoc in theoretical physics, so I knew that there are techniques from physics that make it possible to analyze the performance of these models, and to estimate how well they’re performing. And what I realized is that nobody in the machine learning or AI community really knows about this stuff, because it’s from the early to late ‘90s, when a lot of this research was done… And the people doing AI and machine learning are not theoretical physicists; they’re computer scientists. They don’t know about this work. So I said, “You know, I’d–

Except for you and Daniel there.

[04:27] Well, yes… You know, it’s a very broad field, and there’s so many people doing AI now that it’s really fun, because there’s so many different backgrounds… And I was at a conference, maybe ten years ago, maybe nine years ago, and I met an old friend of mine, Michael Mahoney, who’s a professor at UC Berkeley. It was at MLConf, it was run by the guys at that time who were doing – oh, what was the name of that company…? They had a recommender system, a recommender product. They were Turi AI. They were eventually acquired by Apple. And I was talking to Michael, I said, “You know, there’s a lot of theory around deep learning that is very similar to what we see in protein folding.” And my advisor was actually - him and his student, John Jumper, developed the first version of AlphaFold. So what happened was Google acquired AlphaFold… Excuse me, they hired John Jumper, who was this student from Chicago, and basically souped up his thesis. And that’s where AlphaFold comes from; this amazing technology from DeepMind that can predict protein folding.

So there was a lot of theoretical work I had done as a postdoc, and I was talking to my advisor about some of the stuff they were doing in protein folding way back before AlphaFold was released. And I thought, “You know, I think I’d like to try my shot at doing research again, and see if I can develop some theory that would allow me to understand why deep learning works.” And that project - it’s been about seven years now of research, and that’s led to the WeightWatcher tool.

Cool. So it’s probably very typical for people to think about “Oh, I’m gonna evaluate my model. I have a test set.” But could you describe a little bit about two things - one is like why, from your perspective, at least in certain situations, a test set doesn’t give you the indication of behavior or performance of a model that you’re wanting? And then how that connected to these things from the physics world.

Right. So let’s say we’re training a model to generate text. There’s no test set, right? You have to read the text, and ask, “Okay, does it look human or not?” And that’s sort of where the first problem came, is that there are many problems in generating things. Another would be - let’s say you’re doing search relevance. So I’m trying to predict what somebody wants to click on. I have clients like Walmart, for example; we built these systems for them. It’s very expensive to run an A/B test. So you can test things in the lab, and you can like make a model, like an SVM model, to predict what people would click on. But you don’t really know how it’s going to perform until you put it in production. And there are all sorts of biases that exist in the data… Because there’s like presentation bias; people tend to click on things that are in the first element, and that screws the model up.

So there are many cases… Another good example is in quantitative finance, when you’re trying to predict the stock market. And you have models where you would like to train some neural network to learn something about how the news predicts the market, but if you train it directly on the market, you’ll overfit it. Always. So you have to have some way of evaluating whether your models are converging properly or not, without just looking at the test sample; a lot of data is out of sample, or you can’t really evaluate it without human judgments, or it’s very expensive.

That would kind of imply that we’re probably seeing a lot of practitioners running into these kinds of issues over time… And, you know, in a lot of cases, if you look over the last few years as everyone’s kind of ramped up in the space, and has been learning how to do different types of deep learning training… Do you think that in terms of those accuracy issues that a lot of practitioners are kind of missing it altogether?

Or do you think they know that it’s there, and they just don’t know how to solve it? Can you give us the lay of the land with it?

[08:04] Well, let me give you an example. There was a recent paper that came out of Google DeepMind, on the scaling properties of very large language models. And it showed that what we thought we knew about large language models from two years ago, from OpenAI, a paper that they wrote, was totally wrong. They misunderstood how the scaling properties work. And the question is things like, “When you have a model and you’re trying to train it, should you be trying to optimize the hyperparameters, or should you be adding more data?” You can think of it like in that sort of very crude sense; you’re trying to train these models, and essentially, what was happening at OpenAI is they’re training these large language models and they didn’t realize that they should be adapting the learning rate to the dataset size. And when you change that, when you adapt the learning rate to the dataset size, you get very, very different results than if you don’t. And we know that a lot of these large language models, like BERT, for example, are just not properly converged. There are a large number of layers that are simply under-trained. I think that basically, with the theory that people were using, there’s no way to look at a model and ask “How close are you to convergence?”

If you think about something like an SVM… Let’s go back - I’m an old guy; let’s go back 10-15 years ago; we’d run SVMs. There was something called the duality gap. You can look at the duality gap in an SVM and you can ask how close are you to the bottom of the – it’s a convex optimization problem, and you can tell how close your solver is to actually being at the optimal solution. You can tell that. That’s theoretically known. So it’s somewhat puzzling that now you have sort of deep learning, people understand that deep learning is sort of like a convex optimization, or a rugged convex optimization, because they know you don’t have local minima, and there’s an issue that there are lots of saddle points, but no local minima… And yet, there’s no theory which tells you whether you’re converged or not. And so it’s like, “What’s going on?” So people are trying to solve this. And I think this is where you start training a model and you don’t know, have you trained it enough? Do you need to train it more?

Let me give you a really practical example. We have a user who’s using WeightWatcher to train semi-supervised models to determine whether the land you own qualifies for carbon credits, right? So they’re trying to use AI to help with climate change. And one of the biggest problems they have is how much data should we add to the model? We have a model, we have data… Acquiring data, acquiring good, high-quality, labeled data is very, very expensive. You could easily spend millions of dollars on a dataset. Maybe 10 million. I know self-driving car companies will spend easily $10 million on a dataset. So it would be nice to know, given the model that you have, if you add more data to it, will it help? So we can answer that question with WeightWatcher.

If you can kind of talk a little bit about some of the underlying – because you’re pointing out that there’s a lot of opportunity for people to not be optimal in their approaches, and kind of miss some of that… So it almost raises kind of a bigger issue that we may have as a community if that’s the case, in terms of like, how do we solve some of those problems in the large? Aside from the specific tools, what are you thinking in terms of how should people approach these problems differently?

Well, look, I think the first thing you have to ask is “I’m beginning to train a model. Is my model big enough? Is it small enough? Do I really want to spend millions of dollars doing brute force hyperparameter tuning? Should I be tuning it?” Here’s a basic question that comes up with every client. I have a model… Forget about deep learning, SVM. Should I add more data, or should I add more features? Let’s say you have XGBoost. Should you add more data, add more features, or do more hyperparameter tuning? It’s all expensive. What direction do you go? And these are difficult problems. If you add more data, is the data the right quality? Is the data mislabeled? Are there duplicates in the data? Is the data too similar to the data you’ve already added? Is it too different from the data you’ve already added? Basic questions that we just don’t have a – very, very basic, broad-level questions that we have almost no answers to. Everything is brute force.

[12:07] If you want to train a neural network, you go out and you get Weights & Biases, or you go to Google Cloud and you just spend a fortune on hyperparameter tuning? Do you really have to do that? Or isn’t there something better you could do?

Here’s another example… When we started this project, there were maybe 50 open source pre-trained models. Open source models, right? The VGG series, ResNet, things like that. You go to Hugging Face now, there are over 50,000. Which one do you pick? Should you pick BERT, or something else? Everyone uses BERT. BERT is highly under-optimized. If you compare BERT to XLNet, XLNet is much, much better. Not only do the academic papers show that XLNet performs better on at least 20 different metrics, you can use WeightWatcher - I have a blog post - and you can see that it’s just night and day between XLNet and BERT. But is it worth the money to spend to try to optimize XLNet? Why does everybody focus on BERT? Because it is a cute name, and it’s made by Google. I mean, you know… It’s really hard to know which model to pick. And these models are very hard to improve.

So there are a lot of just broad open questions like this. Which model do I pick? How much data should I add? How do I evaluate the quality of my data? Do I really need to do brute-force searching on everything? If I put something into production, how do I know if the model doesn’t break? I don’t know if you guys worked in production environments; I work in environments where things break every six weeks. Thanksgiving comes, the model is broken. Christmas morning? Model is broken. How do you monitor these things?

So I think machine learning and certainly AI is in the infancy of engineering, certainly compared to where we are in software engineering. We’re 20 years behind software engineering.

So Charles, it’s interesting, these scenarios that you bring up, because it’s definitely something that happens. I mean, sometimes in an actual real-world setting, like with my team, it’s like we have what data we have; what model is appropriate that fits that level of data, right? Or maybe you have a whole bunch of data and the question is, “Do I need all of it for this model that I’ve already kind of decided on?” or all of these sorts of things, and then you get to the training questions that you’ve brought up.

I’m wondering if you could just give us a sort of high-level overview of – because I think the main thing, if I’m understanding right, the main kind of tool that’s come out of this train of research that you’ve been working on is the WeightWatcher tool… Could you just give us a kind of broad overview of what the tool actually functionally does, and where it fits into a researcher or a developer or a data scientist’s workflow?

Sure. So the tool can be used both when you’re trying to train models, AI models, or you’re trying to monitor them in production. From a training perspective, the tool gives you insights into whether your model has converged. And it does so on a layer-by-layer basis. So I’m not aware of any other technology that allows you to look at the layers of a neural network and ask “Has one layer converged, and has another layer not converged?” So there are cues you can look at. You can look at something called the alpha metric, which is the amount of correlation in the model. And if the alpha – usually, if you have a computer vision model, your alpha should be down around two. In natural language processing, transformer models, alpha should be between three and four. If your alphas are larger than that, chances are the layer is not properly trained. You can then visualize each layer, and you can look at the layer, its correlation structure, and that correlation structure should be fairly smooth; it should be linear and smooth on a log-log plot. If it’s choppy, or has sort of a strange shape to it, something’s wrong. If your layers have lots of rank collapse, or lots of zero eigenvalues, something’s wrong. We’ve identified something called a correlation trap, which in the language of deep learning would be you didn’t clip your weight matrices; you didn’t regularize that layer correctly.
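The "rank collapse, or lots of zero eigenvalues" cue can be illustrated with a few lines of NumPy. This is only a minimal sketch and not WeightWatcher's own code: the helper name `rank_collapse_report` and the relative tolerance `rel_tol` are assumptions made for illustration.

```python
import numpy as np

def rank_collapse_report(weight, rel_tol=1e-6):
    """Count singular values that are numerically zero relative to the largest.

    `rel_tol` is an illustrative threshold, not a value taken from WeightWatcher.
    """
    sv = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    cutoff = rel_tol * sv.max()
    num_zero = int(np.sum(sv <= cutoff))
    return {
        "num_singular_values": len(sv),
        "num_near_zero": num_zero,
        "collapsed": num_zero > 0,
    }

rng = np.random.default_rng(0)

# A dense random 64x32 layer: all 32 singular values are well away from zero.
healthy = rng.normal(size=(64, 32))

# A deliberately rank-deficient 64x32 layer built from a rank-4 factorization.
collapsed = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))

print(rank_collapse_report(healthy))
print(rank_collapse_report(collapsed))
```

The second matrix has only four independent directions, so 28 of its 32 singular values fall below the cutoff, which is the kind of gross defect described above.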

[16:35] So you can use the tool during the training of a neural network to monitor the training, you can find the layers that are basically broken, they’re not trained correctly… Think of it like you’re building a house, and there are cracks in the bricks. You put a brick in and it’s cracked, you need to replace it. You can adjust regularization up and down on the layer, you can adjust learning rate up and down on the layer… You might find that when you’re training a model, some layers are beginning to – they’re well trained, and they begin to overfit, so you might want to freeze them. So you can freeze – people talk about early stopping; I talk about early freezing. So you might freeze some of the early layers and let the later layers converge.
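As a rough illustration of this per-layer triage, the sketch below sorts layers into three groups from a table of alpha values. The function `triage_layers`, the layer names, and the cutoffs `low=2.0` and `high=6.0` are all assumptions chosen for the example; the right ranges depend on the model family, as discussed above.

```python
def triage_layers(alphas, low=2.0, high=6.0):
    """Group layers by their alpha value.

    `low` and `high` are illustrative cutoffs, not official thresholds.
    """
    report = {"possibly_overfit": [], "looks_ok": [], "possibly_undertrained": []}
    for name, alpha in alphas.items():
        if alpha < low:
            report["possibly_overfit"].append(name)       # candidate for early freezing
        elif alpha > high:
            report["possibly_undertrained"].append(name)  # may need more training or data
        else:
            report["looks_ok"].append(name)
    return report

# Hypothetical per-layer alphas, e.g. copied out of an analysis data frame.
example_alphas = {"embed": 1.7, "block1.attn": 3.4, "block2.attn": 3.9, "head": 7.8}
print(triage_layers(example_alphas))
```

A layer flagged as possibly overfit would be a candidate for freezing, while a possibly under-trained layer would be left to keep training, perhaps with adjusted regularization or learning rate.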

So WeightWatcher allows you to do all of this – it’s very much a… You have to do it by hand, you have to go in and visualize it and see what’s going on, but it allows you to inspect your models to determine whether they’re trained correctly.

It also allows you to look at models in production. So if you’re deploying AI models in production, and you know, maybe you’re retraining your models regularly. It would allow you to – it gives you like a warning flag, like a model alert system that would tell you “Hey, you broke this layer.”

We have an example in our paper - we have a paper in Nature, where we show that in one of the Intel systems they applied a data compression algorithm to compress the model to go on the hardware, and they screwed up one of the layers. And you can see this with WeightWatcher. We’ll flag it for you.

So as you’re deploying models in production, it can monitor them for you. And remember, it doesn’t require any data, so it’s a very light touch, very simple integration to integrate into your ML/AIOps monitoring pipelines. I think of it as sort of like an AI uptime tool. It gives you like an early warning.

So this is how you use the tool… You can use it during training to make sure your models are converging well; or if they haven’t converged properly, you go back and fix them. Or you can use them after training in production, to monitor for problems.

So I was trying to think of analogies in my head while you were talking, and you gave a good one in terms of the house and the cracks… One of the things I was thinking about - you mentioned BERT earlier, which no doubt in the time when BERT came out, it was quite an advancement, and many people have built amazing things on BERT… But I was thinking about that and where we’ve come from there, and also thinking about – my wife owns a manufacturing business, and they’ve got this principle in manufacturing about “Find the current biggest bottleneck in your process, address that… And as soon as you address that, there’s going to be a next biggest bottleneck that you address next.” And you kind of just keep working your way through.

Sure, sure.

So I’m wondering… BERT, obviously, was a good advance, but then you can analyze that model and see maybe where the next biggest sort of offending area is, and kind of address that. And I was also thinking about the tool that you were mentioning - all the things you could do with it. You could probably analyze your model in development for years, fixing all sorts of things, and doing all sorts of things, right? But at some point, you have to ship your model, right? So maybe there’s this process of - and I’m wondering about your thoughts on this - using the tool to find sort of these like worst-offending parts of your model, addressing those, and maybe like at a certain point, you get to a point of diminishing returns, or something like that, right?

[19:55] Yeah, this is a coarse-grained tool. It’s not meant to go in and study epoch by epoch and try to fine-tune exactly what’s going on. I’m really glad you brought this up, Daniel… Because you know, you work with the academics and they only want to use it as a regularizer, they want to optimize the loss… No, no, no. That’s not the point. It’s an engineering tool. It’s an engineering tool. It’s designed to go in and find out where the cracks are. So if you’re – I don’t know if you guys know about the Millennium Tower in San Francisco…?

My little nephew, he’s all into construction, and he’s always talking about “They’ve gotta tear the Millennium Tower down. Tear it down! Junk it!” Because it has – they built this tower and it’s like the Leaning Tower of Pisa. It’s tilting. And if you go into the basement of the Millennium Tower… And this is like, you know, condos, like multimillion-dollar condos… I think probably Marissa Mayer may own a condo there… I mean, it’s ridiculous. They built this thing, and downstairs you look and there are cracks in the steel. It’s like, “Guys, it’s gonna fall down. It’s cracked.” And it’s like, this is what WeightWatcher does - you go into your models and ask, “Are there gross problems that should not be there?” This layer is overtrained. This layer suggests that the data is mislabeled. This layer has a correlation trap. This is what you’re trying to do.

And frequently, in engineering, you’re under time constraints. So you’ve got to get this thing out and into production, and you want to make sure it’s not crazy. And it allows you to – WeightWatcher allows you to detect problems that you cannot detect in any other way. And that’s the key, it allows you to find a major problem.

So one of the things I was wanting to ask you… Because you said something a moment ago, and kind of circling back to that, that I’m very curious about… To bring me and other people in our audience along that may not be as familiar with that - I often rely on Daniel’s expertise on this, and I want to rely on yours in this… You mentioned when we’re talking about kind of testing those layers, as you did, going back to the alpha, and you specified ranges of around two for the vision models, and three to four for like natural language models, and stuff… I’m assuming that that’s one of the mechanisms that you’re using in the software. Can you talk a little bit about what are the other mechanisms that are there along with that, and maybe how alpha is used? If somebody is not familiar with that concept, what is it about alpha that’s identifying that, so that they understand that a particular layer might be brittle, in the sense of it’s not fully converged? How are you approaching that? Kind of bring us along to try to catch us up with you on how you’re thinking about that.

Like, why does it work?

Yeah, why does it work? What is it about alpha and other things that you’re using in the software that yield that level of insight that you’re describing?

So what we know from – where does deep learning work? Deep learning works on natural things; natural images, voice, texts, things that are really part of the natural world. And the natural world exhibits a multifractal structure. If you look at a tree – I don’t know if you remember L-systems in computer science, or Mandelbrot’s work… Most natural systems have – or just think about texts, Zipf’s law, power law structure in texts and documents. All natural data has a power law structure, a fractal structure to it. And the way neural networks learn is they learn the multifractal nature of the data. And that’s why they work so well on things like text and images, and why they don’t work great on tabular datasets.

So what you’re doing is there are correlations in the data. The data is correlated. You’re trying to learn the correlations. And frequently, you’re trying to learn very subtle correlations you couldn’t find in some other way, using some simple clustering algorithm, or an SVM, or something like that.

So what we’re doing is we’re measuring the fractal nature of the data, and every layer of a neural network gives you some measure of the fractal properties in that level of granularity. And so alpha is like a measure of the fractal dimension. And what we know is that it measures the amount of correlation in that layer. In other words, you’re learning that the data is obviously not random. It can’t be random. You’re trying to learn patterns.

[23:57] So what we’ve discovered empirically - and there’s some deep theoretical reasons for this, but qualitatively, what’s happening is you’re learning the natural patterns in the data. And those patterns - they have to be there. So if you’re looking at text data, and you start seeing alphas around six, or seven or eight, the layer hasn’t learned the correlations. It just didn’t learn anything, and it’s just sort of there. Or it learned it, but the correlations are so weak that it’s not really contributing anything. And we know that many of these models have these extra layers, they’re way over-parameterized…

So that’s what’s happening. And if the correlations – if there are strange or spurious correlations - there are things that cause alpha to be small for spurious reasons. Like, you didn’t regularize your layer correctly, and so there’s a giant weight matrix; you didn’t clip the weight matrix, and so the regularizer failed. So it can detect the difference between when there are problems with the optimizer, and when there’s actual natural structure in the data. It allows you to distinguish between these two. That’s what it’s doing.

Am I correct, just for clarity’s sake, in terms of – when we say like “It’s doing this without the test data, or the training data”, really you’re doing these calculations and you’re detecting these parameters, these metrics based on the weight matrices, right? Is that correct?

Yes. Only on the weight matrices. You don’t need to look at the data.

So in that case, the tool itself, in terms of how people would run it - because it’s doing these matrix calculations, is it necessary… Like, could you speak to like the computational aspect of it? Am I gonna spend five hours waiting for WeightWatcher to analyze my model? Or is it going to happen in five seconds?

The current model right now – it depends on the size. It runs a singular value decomposition on each layer. So that’s a high-memory, CPU-intensive task. It’s not optimized for GPU. So you’d run it on a normal CPU. It does require some memory. Most layers aren’t too large, so it could take anywhere from a couple of minutes to an hour, depending on – you know, if you’re trying to run it on GPT and you have 1,000 layers, it’s gonna take some time, right? If you just have a few layers in your model, and you’re training like a small model, it’s very, very fast. Generally, you would hope that it is faster than an epoch in training, but it’s not GPU-optimized.
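To make the data-free idea concrete, here is a minimal sketch of how a power-law exponent could be estimated from a single weight matrix: take the singular values, square them to get the eigenvalues of the layer's correlation matrix, and fit the tail. It is not WeightWatcher's implementation. The function names, the fixed `tail_fraction`, and the use of the simple maximum-likelihood (Hill-type) estimator are assumptions made for illustration; a real fit would also choose the start of the tail more carefully.

```python
import numpy as np

def layer_eigenvalues(weight):
    """Eigenvalues of W^T W, obtained as squared singular values of W."""
    sv = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    return sv ** 2

def powerlaw_alpha(eigs, tail_fraction=0.5):
    """Maximum-likelihood power-law exponent for the largest eigenvalues.

    Uses alpha = 1 + n / sum(log(x_i / x_min)) over the top `tail_fraction`
    of the values. Fixing the tail by a fraction is a simplification.
    """
    eigs = np.asarray(eigs, dtype=float)
    eigs = np.sort(eigs[eigs > 0])
    k = max(2, int(len(eigs) * tail_fraction))
    tail = eigs[-k:]
    x_min = tail[0]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

rng = np.random.default_rng(0)

# Sanity check on synthetic data drawn from a power law with exponent 3.
synthetic = (1.0 - rng.random(5000)) ** (-1.0 / (3.0 - 1.0))
print("alpha on synthetic power-law data:", powerlaw_alpha(synthetic, tail_fraction=1.0))

# The same estimator applied to one (random, untrained) weight matrix.
weight = rng.normal(size=(512, 256))
print("alpha on a random layer:", powerlaw_alpha(layer_eigenvalues(weight)))
```

Only the weights are touched here, which is why no training or test data is needed; the cost is dominated by one SVD per layer.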

So one of the things we’re working on, that I’d like to do if I commercialize the product, is to make a version that’s very, very fast. You would distribute all the calculation across the nodes and it would come back to you. So that’s the kind of – so this is an open source tool, but it runs a simple SVD calculation. So it’s a little compute-intensive, but again, my theory on this is that if you’re training small models, it’s pretty fast. If you’re training really, really big models - well, chances are you have the compute resources anyway. And you’re not renting a GPU for it. You don’t need a GPU to run it. So that’s sort of the takeaway.

Well, Charles, when I first saw the tool, I was very interested in it, and I did take time to go ahead and just pull it in one of my notebooks and look at one of my own models, because I did want to get hands-on with it. It was a question answering model based on XLM-RoBERTa. And I analyzed it with WeightWatcher… I did not do every single thing that you describe on your repo, because I’m still dipping my toes…

That’s great. Did it run?

It ran, yeah. It’s a PyTorch-based model, it ran. I didn’t time it, so I don’t know exactly how long… But I did find out – at least I found out… According to WeightWatcher, ten of my layers are under-trained, so…

That could be.

Yeah. So I at least found that out. So could you speak a little bit about the tool itself? So you mentioned how people can integrate it in their workflows… Could you mention a little bit more about the open source project and how people – like, if I’m like I did, and I want to do this on one of my models, how would I go about doing it and how easy is it to get it running on a model?

Well, this is just – it’s a tool I’ve been writing in my spare time, based on my research; there’s no funding for any of this. I published with UC Berkeley, but they’re not funding any of this. They just kind of help me out a bit. I’ve written it all myself. It’s all open source. One of my staff guys helped me out early on. It’s pip install weightwatcher. The way it’s written now, you probably need to have both TensorFlow and PyTorch installed in your environment. If you want, I can make a version that doesn’t require both of those. No one’s asked yet.

One of the challenges I have with the tool is that I have 60,000 downloads, and I have no idea who’s using it. So if you’re using the tool, let me know, so I can help you. I don’t know what you’re doing with it, and I’m not gonna – I don’t want to end up in feature creep, and design features in the wild. I need to know what you’re doing. So if you tell me, I’ll help you.

We have a Slack channel. You can go on Slack, and you can ask me, and I’ll help you. But basically, it’s pip install weightwatcher, and you just give it a model, you say ‘watcher = WeightWatcher(model=myModel)’ and you say ‘watcher.analyze()’. That’s it. And it will return a data frame with quality metrics. If you say watcher.analyze(plot=True), it will generate a bunch of plots. It’s meant to be – I’ve been running it in a Jupyter Notebook. That’s how I run it. In principle, you could run it in a production environment. Again, it’s not even an alpha one tool yet; it’s still like 0.56, 0.57… So if you do that, reach out to me; we can make a version that’s more stable if you need to run it in a production environment. But I’ve mostly been using it in – it runs in the Jupyter Notebook, you get a data frame, you analyze the data frame… You run a Google Colab notebook, you say plot=True, it gives you a bunch of plots… If you add some other options, it will give you more plots, and then you analyze the plots.
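In code, the usage described here looks roughly like the sketch below. The wrapper function `analyze_model` is a name introduced for illustration; the calls inside it follow the project's README, but the tool is still pre-1.0, so argument names and data frame columns may differ between versions. The import sits inside the function so the snippet loads even where the package is not installed.

```python
def analyze_model(model, plot=False):
    """Run WeightWatcher on a trained model and return (details, summary).

    `model` is assumed to be a trained PyTorch or Keras model object.
    """
    import weightwatcher as ww  # requires: pip install weightwatcher

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze(plot=plot)     # data frame, one row per analyzed layer
    summary = watcher.get_summary(details)   # averaged quality metrics
    return details, summary

# Example usage (assumes `my_model` is defined elsewhere in the notebook):
# details, summary = analyze_model(my_model, plot=True)
# print(details[["layer_id", "alpha"]])
# print(summary)
```

In a notebook, setting `plot=True` renders the per-layer plots inline, which matches the by-hand, visual workflow described above.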

[30:30] So let me ask you a question that’s kind of a follow-up to what you and Daniel were just talking about… If you’re looking at the workflow – so Daniel said there were like ten layers that had not converged sufficiently… How does that change the workflow? For someone who hasn’t done what Daniel’s done, and gotten hands-on, someone just listening, talk a little bit about what they were doing before, versus the workflow they’re doing now, now that they have the insights that WeightWatcher’s bringing to it. What does that look like for the practitioner?

Well, here’s the first thing, and this is exactly what happened with one of Michael’s postdocs, and students. Go back and look at the regularization. Did you add enough dropout on your layer? Are the learning rates too large? Do you not have enough data? Is your model just too big? Are the earlier layers converging if the later layers are not? Maybe you should freeze some of the earlier layers and give the later layers time to converge. Maybe you need to run it longer; you need to run SGD longer. Maybe you need to adjust some of your hyperparameters, because the model isn’t getting tuned. Try to adjust your hyperparameters so that alpha goes down, not up. Those are the kinds of things you need to do during training.

Yeah. Maybe you could also mention the workflow… I find it very interesting what you’re saying about like the workflow of potentially using this within the training loops as well, like as you’re training the model…

So one thing you could do is definitely run your model, like I did, and then look at it afterwards and see, “Oh, I need to do something about this or that.” And then, of course, probably the harder part of the problem is connecting with like, “Okay, does that mean I do one of those things you just mentioned, or another one of those things you just mentioned?” But what about that workflow in the training loop? How might that work? Maybe some people have heard of certain things related to optimizing, not doing brute-force hyperparameter tuning, but doing some sort of AutoML type of stuff, or something… People have thought about these things… So when you’re pulling WeightWatcher into the training run, how would you think about that being used?

If you want to give Google Cloud a million dollars to do AutoML, and then have them own your models for you, and feed them back to you, knock yourself out. I don’t want to do that; I don’t want to be trapped. That’s what the AutoML offering is. It’s an offering to blow millions of dollars. Or if you want to get some tool, like H2O, and auto-tune a model, and then find out it doesn’t scale, and then you have to redo it… We’ve had clients with that problem.

I think there’s this wider field though of, I guess, meta learning, and kind of learning on that… And I don’t know if the WeightWatcher stuff would fit into that larger space of research, I guess, but…

Look, what are you trying to do? Like, what does it mean to be optimal? If being optimal means that your alphas are close to two or three, then you should adjust your hyperparameters such that the alphas go down. That’s what you do. Now, doing that analytically, typically, doing what are called analytic derivatives, meaning you try to compute the gradient from that - that’s somewhat difficult (it could be done), because you have to compute the eigenvalue spectrum, and then you have to fit it, and then you have to figure out the derivative. And that’s a very complex, nonlinear calculation. It’s very iterative. It could be done numerically, or it could be done analytically with some work. It’s a lot of work… I would love to have VC funding like Hugging Face to do that… But I don’t. It’s just me. Me and you. So you just try to tune your parameters; if alpha goes up, go the other way. If you turn your learning rate up, and you find your alphas are going up, turn the learning rate the other way, and hopefully they’ll go down.
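
To make the "compute the eigenvalue spectrum, then fit it" step concrete, here is a minimal sketch of the fitting half only. It uses the standard maximum-likelihood estimator for a continuous power-law tail with the cut-off xmin chosen by hand; this is an assumption for illustration, and the tool's own fitting procedure is not described in this conversation and may well differ. The function name fit_alpha and the synthetic data are hypothetical.

```python
import math
import random


def fit_alpha(eigenvalues, xmin):
    """Maximum-likelihood exponent for p(x) ~ x^(-alpha) on the tail x >= xmin."""
    tail = [x for x in eigenvalues if x >= xmin]
    if len(tail) < 2:
        raise ValueError("need at least two eigenvalues at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; the exponent is undefined")
    return 1.0 + len(tail) / log_sum


# Synthetic check instead of a real layer: paretovariate(a) draws from a density
# proportional to x^-(a + 1) for x >= 1, so a = 2.0 corresponds to alpha = 3.0.
rng = random.Random(0)
samples = [rng.paretovariate(2.0) for _ in range(20000)]
alpha_hat = fit_alpha(samples, xmin=1.0)
```

With a real layer, the input would be the eigenvalues of the layer's correlation matrix rather than synthetic samples, and choosing xmin well is a large part of the difficulty.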

[34:06] Obviously, it’s a complex optimization problem, because you have 100 layers, you have 100 alphas, and so you’re trying to tune different layers, you’re trying to tune your layer learning rates, and your amount of dropout, and the amount of momentum… So in principle, you could try to do that algorithmically, in a way, using like a Bayesian type approach, where you try to get your alphas to go down on every layer. In principle, you could do that. It’s a complex optimization problem, but that’s what I would recommend. And I think it’s theoretically well-grounded. I mean, the point is that you want to learn more correlations.
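
The "if alpha goes up, go the other way" rule can be sketched as a one-dimensional search. This is a much simpler stand-in for the Bayesian-type approach mentioned above, not an implementation of it: it tunes a single learning rate against a single averaged alpha, and toy_mean_alpha is a hypothetical placeholder for "train with this learning rate, then average the layer alphas".

```python
import math


def toy_mean_alpha(lr):
    # Hypothetical stand-in for a real train-then-measure step; lowest at lr = 1e-3.
    return 3.0 + (math.log10(lr) + 3.0) ** 2


def tune(lr, measure=toy_mean_alpha, factor=2.0, rounds=30):
    """Keep moving lr in one direction while alpha falls; reverse and shrink otherwise."""
    best_lr, best_alpha = lr, measure(lr)
    for _ in range(rounds):
        trial = best_lr * factor
        alpha = measure(trial)
        if alpha < best_alpha:
            best_lr, best_alpha = trial, alpha  # alpha went down: keep going
        else:
            factor = 1.0 / math.sqrt(factor)  # alpha went up: go the other way, smaller step
    return best_lr, best_alpha


best_lr, best_alpha = tune(0.1)
```

A real version would have one alpha per layer and several hyperparameters at once (layer learning rates, dropout, momentum), which is what makes it the complex optimization problem described above.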

Typically, what I’ve found is that it’s a good tool for newbies, because you get into a model, you start doing something, and things are totally wrong. And you can go in and fix some problems. Okay, now we’ve fixed it; we’ve found – like, what did we not do? Like, I didn’t put the proper regularization on these layers. Let me add regularization and try again. And you can see that, okay, that’s much better.

So from a newbie perspective, it’s a very good tool, because it helps you get started. Now, it does work – keep in mind, the tool works at the end of training, not in the early stages of training. You’ve got to let the thing bake for a while. And once it’s about halfway through training, then you can start looking at things… It’s got to have some correlations. But this is what it’s for.

Typically, trying to do large-scale meta learning would just mean you’d have to integrate the tool into some sort of process that allows you to look at the alphas, or look at more details in the layer - the shape of the spectral density, the number of spikes, the alphas, the volume of the spectral density - and figure out how to tune from that. I mean, this could even be used in a reinforcement learning situation, where instead of the reward being something that the agent takes, the reward is “Oh, I got a smaller alpha.” So I have rewards on every layer, and I sum the rewards in some average way to try to get the optimizer to work, even in situations where I don’t know what the reward is for a reinforcement learning situation.
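
A minimal sketch of that reward idea, assuming the simplest possible choice: each layer's reward is how much its alpha dropped, and the layers are combined with a plain mean. The function name alpha_reward and the numbers are hypothetical; how to weight layers against each other is left open in the conversation.

```python
def alpha_reward(prev_alphas, new_alphas):
    """Average per-layer reward: positive when a layer's alpha decreased."""
    if len(prev_alphas) != len(new_alphas) or not prev_alphas:
        raise ValueError("need two non-empty alpha lists of equal length")
    per_layer = [old - new for old, new in zip(prev_alphas, new_alphas)]
    return sum(per_layer) / len(per_layer)


# Two layers improved (alpha fell) and one got worse, so the mean is positive.
reward = alpha_reward([4.0, 5.0, 3.0], [3.5, 5.5, 2.0])
```

Because the signal comes from the weights themselves, it can be computed without looking at test data, which is the point being made here.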

Obviously, that would be nice in areas like you’re trying to trade in the markets, because you can’t take actions – that trade, you can’t trade on historical data and expect to learn from that. So this gives you a way of sort of doing things in a supervised or semi-supervised way, that doesn’t require peeking at the test data to optimize. And then that’s it… I hope that answers the question, but that’s sort of the idea. And there are lots of things people I think want to try. I think it’s great if you try them.

Yeah, I definitely appreciate you being transparent about where the tool is, and all that… And really, the possibilities that might happen with the tool. There’s a lot of opportunities to explore usage and further development.

Part of what I wanna do with the tool is build an open source community. I can’t do everything myself, and there’s lots of things to do, and if people want to get involved in the community, join the Slack channel. We can build things. That’s what open source is. And I think a lot of people may have ideas, and will be able to contribute in ways that we’ve just expanded. Again, right now to me the way you train neural networks now - it’s like you build a bridge, you drive a car over the bridge, you see if the bridge falls down. And you do it again and again and again. How many cars are you going to crash into the ocean until you get the bridge right? No, people don’t build bridges like that. You build bridges by having engineering principles. You understand, “Here are the engineering principles that go in, and this is the load it can take, and this is the wind shear,” and you try to build bridges that actually stay up. And right now, I think deep learning is so brute force; it’s like, you just spend as much money as you can, do as much brute force as you can, and if it doesn’t work, you try it again. And there’s no principles behind what you’re doing, and we’re trying to add some – and principles that are based in deep theory… Like, there are empirical rules of thumb, but there’s also deep theoretical reasons why they work, just like in any other field of optimization.

[37:50] I’m curious… I’m kind of going back to the engineering, talking about, as this matures, and trailing the software engineering world - but one of the decisions that we all make as engineers, that we’re doing, is kind of like as we’re creating open source community, and we’re trying to provide the value for that community that you’re talking about… Do you see the future as being community specifically built around WeightWatcher? Or is there an opportunity potentially to add the value that WeightWatcher is bringing, those new insights that you described, and roll them into some of the other existing communities? Do you have any opinions or thoughts about how you integrate this in for the value of the larger community?

Well, look, I think what I’d like to have is a community of people who are training models, and getting them to interact with each other. A lot of the people - like I said, it’s hard to get feedback; people are doing things in industry… And because they are constrained by NDAs, they can’t really talk about what they’re doing. And I think it gives people an opportunity to really get into this space and learn how training of neural networks works, without being constrained by your employer, or your contract, so you can really do stuff. That’s really a lot of what this is.

I think there are other communities doing things like people building hyperparameter optimization tools, or people building reinforcement learning tools… We’d be happy to integrate the tool in. The challenge is always you want to make a tool that is self-contained. If people fork the tool and begin changing it, it ends up – I don’t know if you guys know the story of Emacs. I was at Champaign-Urbana when this happened; they wanted to port Emacs to basically X Windows, and Stallman didn’t want to do it, and they forked it. You have XEmacs and you have Emacs. It killed it. Forking Emacs killed it, because you have the XEmacs crowd… And these guys went off and started Netscape, and they’re probably all retired now, or they’re sitting at the top of – you know, hanging out at the top of Google or Netscape… But this is the problem - you want to make sure you have an open source community. You don’t want – I mean, I want people to contribute and feel they can do things. If we fork it and it goes into other communities, it kills it, because now those contributions don’t come back. You end up in these sort of weird battles, and there’s no value in that.

What we want to do is help people, and if it’s necessary at this point, commercialize the tool… I would then turn it into something which we can support, like Hugging Face. Hugging Face is a lot of open source, but any sophisticated technology needs maintenance; you buy a copier machine - it’s not open source, because it needs maintenance. So even a tool like WeightWatcher needs maintenance. So I would love to be able to work with people who would like to put it into production, and develop it, and then if at some point we realize “Look, we really need to put a service contract around this, so that we can maintain it and solve some of the harder problems for you,” we’d be happy to do that. And I think that that’s really what we’re trying to do.

There’s also a lot of opportunity for scientific research. A lot of WeightWatcher has come from doing research in statistical mechanics and learning theory. We have papers in JMLR, Nature, ICML, KDD… There’s a lot of opportunity for students. We have one student who is at a bank, who just did his master’s thesis on WeightWatcher… And so there’s a lot of that kind of opportunity as well, and I think there’s a lot of room for improvement.

[41:00] As we get to the end here, I was wondering just quickly, as we close out - I know you’ve spent a lot of really valuable time investing in the areas that maybe people aren’t focusing on in the AI community, in terms of the training side of things, and ways to help them, and in those gaps… As you look forward to the future of where the AI community is going, what encourages you about the direction of things, or what’s exciting for you in the community right now?

For me, I’m a physicist at heart. I did theoretical chemistry, I did theoretical physics. You know, instead of some kind of runt of the litter, one of my colleagues went off and started AlphaFold, which solved a 50-year grand challenge. I have another who has started a company who’s going to label all the world’s translational medical data… So I’m used to – for me, this is an opportunity to really show that we can use theoretical physics in a way that can have a broad impact. You can use theory to build sophisticated engineering tools; and a connection between a lot of the deep sort of Cold War education I have, to build tools for engineers.

There’s a very famous statement by Carver Mead, who was a very famous electrical engineer from Caltech, who said, “Every useful experiment eventually becomes a tool.” Everything you can measure eventually becomes a tool, if you give it to an engineer. And so I would just like people to realize - you can do deep theory; there’s a lot of interest, a lot of fun and interesting stuff to do… And we can turn theory into tools that people can use, and build a community, and just have a broader impact.

I did AI in the ‘90s. People thought we were crazy, like “This stuff doesn’t work.” Nobody believed it, right? “Why are you doing neural networks?” People think neural networks were invented by computer scientists, but there’s been a whole group of theoretical physicists doing this stuff for years… And understanding sort of who we are, how the brain works, how we think, what’s actually going on up here. And I think it’s a very exciting time. And that’s why I’m doing this. I think there’s a lot we can offer from the scientific community. There’s a broad – I think there are really deep, broad connections between general science and what’s going on in AI, and that can connect back to the engineering world. And I think that there are big problems.

One of the things I’m really proud of with WeightWatcher is that there are companies using it to help climate change. I think it’s a huge problem. If you can use it to find some way to solve this massive problem we have, I think that would be fantastic.

That’s awesome. Well, I think that’s a wonderful way to close out. I really, really appreciate your perspective there. And yeah, thank you so much for taking time to join us, Charles. It’s been a pleasure.

Hey, I really appreciate it, too. I’m glad we were able to set this up, and I look forward to the podcast, and I really look forward to anyone who tries to use the tool. If you want to use it, please reach out to me. Let me know how it’s working. Complain to me if you don’t like it. I’m not going to fix it if you don’t tell me; I don’t know what’s wrong with it. I can’t fix what I don’t know is broken… And I would love that people join the community, and build something great together.

Awesome. Thanks so much.

Bye. Thanks, guys.

Thank you.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
