Practical AI – Episode #217
Accelerated data science with a Kaggle grandmaster
featuring Christof Henkel
Daniel and Chris explore the intersection of Kaggle and real-world data science in this illuminating conversation with Christof Henkel, Senior Deep Learning Data Scientist at NVIDIA and Kaggle Grandmaster. Christof offers a very lucid explanation of how participation in Kaggle can positively impact a data scientist’s skills and career aspirations. He also shares some of his insights and approach to maximizing AI productivity using GPU-accelerated tools like RAPIDS and DALI.
Sponsors
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Changelog++ – You love our content and you want to take it to the next level by showing your support. We’ll take you closer to the metal with extended episodes, make the ads disappear, and increment your audio quality with higher bitrate mp3s. Let’s do this!
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome to Practical AI | 00:41 |
2 | 00:41 | Christof Henkel | 01:15 |
3 | 01:57 | What is Kaggle? | 05:37 |
4 | 07:34 | How has Kaggle helped? | 02:04 |
5 | 09:38 | Deep Learning 5-6 years ago | 01:27 |
6 | 11:05 | What were the changes like? | 02:47 |
7 | 14:10 | Sponsor: Changelog++ | 00:59 |
8 | 15:09 | How Kaggle compares to real life | 06:47 |
9 | 21:56 | Any Kaggle highlights? | 02:08 |
10 | 24:04 | How to climb the Kaggle ladder? | 02:35 |
11 | 26:39 | Accelerated GPUs in Kaggle | 03:46 |
12 | 30:25 | Speeding up my process | 04:12 |
13 | 34:37 | What comes up in the discussions? | 02:59 |
14 | 37:36 | Getting started | 03:30 |
15 | 41:06 | What's next for Christof | 01:51 |
16 | 42:57 | Outro | 00:54 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host Chris Benson, who is a tech strategist at Lockheed Martin. How’re you doing, Chris?
Doing well, Daniel. How are you today?
I’m doing great. Chris, have you ever been called a grandmaster in anything?
No, but I really wish I had, because it’s a frickin’ cool name, man. Or title.
Weren’t you like a street fighter, or something? You were like a black belt, or something?
Oh, don’t go there… Something like that, 30 years ago… But yeah, once, when I was a kid. But you know what? I was never – I was never a grandmaster at anything. I was just trying not to get pummeled. Yeah, I was just trying not to hit the mat, and that’s it.
Okay. Well, today we have with us an actual Grandmaster, a Kaggle Grandmaster, Christof Henkel, who’s a senior deep learning data scientist at Nvidia, and a Kaggle Grandmaster multiple times –
Triple Grandmaster, by the way…
Yeah. In multiple of the different categories. So welcome, Christof. It’s great to have you here.
Welcome, Daniel. Welcome, Chris. Very happy to be here. Awesome.
Yeah. Well, for those that aren’t familiar with this concept of Kaggle Grandmaster, could you kind of give us the briefing on what exactly that means? And in the context of also Kaggle, what generally – I think a lot of people are familiar with that, but just in case, what is Kaggle, and what does it mean to be a Kaggle Grandmaster?
Yeah. So what is Kaggle - I would say it’s like a platform for machine learning in general. It started off as a platform for hosting machine learning competitions. That’s how it became popular. But in like the recent years, it also expanded for like being a platform for discussions, being a platform for sharing notebooks… They’re hosting millions of datasets, so they are trying to become really like the go-to community for every topic around data science. And it’s free to register for everyone, and they also provide some free resources, where you can run code, and try different stuff, and competitions.
On this platform, they introduce different tiers in order to gamify a little bit, so to incentivize users to post content, or to participate… So there are four different areas in which you can reach like different levels. There are like competitions, which is the most famous one, there’s also notebooks, where you just progress by sharing notebooks with others, and the progression is based on upvotes on your notebooks. Then there are discussions, which work in the same format, so you post an answer to a question, or you post an interesting topic… You can also post just memes, and generate upvotes in this way…
Then there’s datasets, so you can also post an interesting dataset, or a dataset you think might be helpful for others, and then people can upvote your dataset, and by this, you progress. And you basically progress by earning medals. There are like bronze, silver, and gold medals in each of the four areas, and then with these medals you can reach like different tiers. So you start as a novice, I think, then you’re a contributor, then an expert, then at some point you’re a master, and the very last stage is grandmaster. And to put that into perspective - of the 10 million users that are registered on Kaggle, there are 280 competition grandmasters. So it’s really like the elite of the elite, the top notch people in the area, I would say.
So I have to ask, because we were talking about it… Which of the three categories are you a grandmaster in, and what’s the fourth one that you’re not? And of course, I’m going to ask you when you’re gonna become a grandmaster in the fourth one.
Well, I’m a Grandmaster in competitions, and that’s the most difficult one.
Indeed.
Then I’m a grandmaster in notebooks, because I shared some high-value notebooks, and then I’m also a grandmaster in discussions, because I like to discuss stuff. That’s also why I’m here.
Okay.
But I’m not so fond of curating datasets and uploading datasets…
I can’t blame you…
That’s why I’m only a beginner in the datasets category.
That would be the one I would choose first. [laughs]
See, that’s Daniel. Daniel loves to do data crunching, and stuff. It’s sick; that’s terrible. But - so I understand, I give you a pass on not being a Grandmaster on the fourth one there.
What got you into Kaggle in the first place, and what was the journey like towards where you’re at now? Some people might just be jumping in on Kaggle, and like trying things, and they have a vision of how far this could go, but what was the journey actually like for you?
[05:48] I think it’s quite interesting, because my journey began right in the last month of my PhD. So I did a PhD in mathematics, and in the last few months, so after I sent out everything and I just was waiting for my defense, there was suddenly some free time; and also free weekends. I wasn’t used to that during the PhD. And I was also always curious about the AI topic. Back then - it was five, six years ago - it was not so hyped as now, but it was a rather niche area, with neural networks, and so on… So I was just curious about that; I watched some YouTube videos, started a Coursera course on what are neural networks, and so on… And due to that, I quite quickly found out about Kaggle, and then just started with my first competition right away. And since then, I’m hooked in the system.
And how long has that been?
Six years now, I think. And during those six years, also my professional life progressed more and more towards machine learning and deep learning and data science. So six years ago, when I joined Kaggle, I was working as a risk analytics consultant. So I had nothing to do with machine learning, I had nothing to do with data science. I programmed a bit on risk models, so I had some background in like R programming, or MATLAB, but I’d never used Python before. And then due to Kaggle, also my professional career shifted towards machine learning and deep learning. Until right now, I’m working as a deep learning data scientist at Nvidia, which is like one of the top notch companies in this area.
Yeah, that’s like the gold standard of jobs in the AI world right there.
So do you feel like the experiences on Kaggle and your success there - in what ways did that kind of contribute to your own sort of career advancement, and also your understanding of what you wanted to do as your career advanced?
Yeah, it really had a lot of impact. So step by step, I moved into the position I’m right now. So when I started, I was doing Kaggle before and after work a bit; not too much, like half an hour after work, or before work, and on weekends. And then I made some – and of course, I did horribly on my first competitions, because I had no clue of anything… But the nice thing is that you really progress step by step. So in the first competition, you do horribly; in the next one you do badly, but not horribly. And then you progress more and more, until you become better and better.
I quite quickly realized that it’s a lot more fun in like machine learning and deep learning than in risk consulting, just because you can be more creative, I would say. I moved within the consultancy company - I was lucky that they also had like a data science team, so I moved to that data science team there, and I had my first synergy effect between Kaggle competitions and what I learned there, and what I was using in projects. So I could use my skills in the project, and I could also use skills I gained in the projects in Kaggle competitions.
But that was kind of five, six years ago; there wasn’t much deep learning in the industry, especially in the insurance industry, where the focus of my consultancy company was. So I was not challenged enough, but I wanted to do more and more in this field, and also my skill set grew more and more, so I decided to quit this job and founded my own deep learning consultancy, just to have like even more synergy between projects and Kaggle.
Tell us a little bit about what that was like in those days… Because as we’ve grown up with deep learning over the last few years, I would guess that at least in the beginning it was a little bit challenging to land engagements maybe. Or did you have them from the start? Because I know for me early in that phase, about the time Daniel and I started the podcast, people were like “Deep what?” So did you have any challenges in those early days, that have obviously evaporated as the world has taken this on?
[10:10] Certainly. Not only in terms of projects… So people, especially the decision-makers, I would say - they were really cautious about the, let’s say, possibilities you can do with deep learning. Especially like five, six years ago there weren’t any resources around… So I talked with customers about what amazing things you can do with deep learning, and then they didn’t have a single GPU they had access to. So that’s really like two worlds clashing against each other. So there were a lot of interesting and challenging problems around that.
But as soon as they basically gave me a chance, and I could do some prototype, and I can really show what you can do, then it was easy to convince them. But to get to this point, especially as like a young startup, a young consultancy startup, that was quite difficult.
So I definitely want to get into many things later on, but I’m also thinking about these people out there that are maybe inspired by your journey, and wanting to get involved in Kaggle, and other things… I’m wondering if you can share a little bit about – you and Chris were talking about perceptions around deep learning that have shifted; also during that time the tooling around deep learning has shifted, and like the accessibility of it – thinking about four years ago, if I was to train a deep learning model for a Kaggle competition, versus being able to do that now… How have you seen that shift over that time period, in terms of this sort of ability for people to - I guess people use the word “democratize”, or whatever… The ability for people to hop in and do something advanced like that very quickly.
There are like two aspects, I would say. One is like software-wise and framework-wise. There has been a lot of progress there. So when I started, it was still like TensorFlow zero point something, which was working, but it’s really like low-level programming. So there was nothing like an RNN layer, or a transformer layer. You needed to code everything from scratch. But that also helps a lot for understanding the things. So I think nowadays people don’t really understand the granular aspects of deep learning, because you just do something like “model.fit”, and they don’t have any clue what’s happening behind the curtain.
So certainly, it’s easier nowadays to train a model just by using these higher-level frameworks. Just to name a few, there’s not only stuff like Keras, there’s PyTorch Lightning, there’s like a lot of different frameworks you can use, which are really high-level and accessible for beginners… And there’s also a lot of training material for these frameworks, so a lot of tutorials… So it’s really easy to train a simple model for a simple task. But also in terms of resources, I think they are more beginner-friendly, because on Kaggle, for example, five years ago they didn’t give you any resources. There was no Google Colab… So you basically had to have your own GPU at home, you needed to build your own desktop machine or something, or spend your own money on cloud resources… But now, for beginners, you can get access to Colab, which gives you a free notebook to experiment, you get some free resources on Kaggle, there’s a lot of student credits and student programs… So it’s really easy to start your data science journey, I would say. And there’s also a lot more material online where you can really teach yourself, I would say.
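For readers who haven’t seen what “just calling model.fit” looks like in practice, here is a minimal, self-contained sketch using Keras as one example of the high-level frameworks Christof mentions. The toy data and model architecture are made up for illustration; this is not code from the episode.

```python
# A minimal sketch of the high-level "model.fit" workflow discussed above,
# using Keras. The data here is random toy data, purely for illustration.
import numpy as np
import tensorflow as tf

# Toy data: 1000 samples with 20 features, binary labels.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# All of the training-loop details (batching, backprop, metric tracking)
# are hidden behind this single call - which is exactly the point being made.
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2)
```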
So Christof, as you were kind of leading in, talking about your entry into the world of deep learning and your career shift to accommodate that, and you’re talking about kind of learning from Kaggle competitions, and engaging in that, and then it was increasingly applicable in your professional life, can you talk a little bit about how that happens? Like, when you’re thinking about a Kaggle competition, and you’re now working in a job in this field, how did the two relate? How are Kaggle competitions relevant to solving real business problems, in a real job, and getting that synergy? What is that like? What is the connection between the two like?
I would say there are a lot of synergy aspects. So doing a Kaggle competition is really very similar to doing a project at work which is about building a first prototype. So in a Kaggle competition you get a problem which you’re not familiar with, often from a different domain; it can be from biology, it can be from astrophysics, it can be from chemistry, it can be Bengali language, sign language… So many different problems that you have no clue about when you start. And then you have like three months’ time to find the best possible solution, and also compete with other data scientists.
This prototype project character is very similar. So you have like this three-month time window. Then you have a collaborative part. In Kaggle you can also form teams, so you can participate in competitions in a team, which is very similar to working in a team in your job, with all the ups and downs, I would say, of working in a team, under pressure often… So Kaggle competitions can create quite some pressure, more pressure than you might feel in your day to day job. So you also get used to working efficiently with others, in terms of coding, in terms of reading their code, in terms of structuring the project… So really, all aspects of project management are also important. And also things like optimizing runtime, and optimizing code structure. You wouldn’t think that it’s quite important, but I think it’s quite important also for Kaggle competitions, because recently, they run the competitions on restricted hardware; so you just submit your code, and they will run your code on their infrastructure, using their Kaggle notebooks. So you need to have your code in a way that it’s kind of production style; that’s also what you would do in a project - so you would develop ideas, and so on and so forth, but at the end, you will want to productize your code, and you need to think about all these MLOps problems as well. And you also train those skills in Kaggle competitions.
[18:08] So I really like the parallel between the two worlds… That said, I’m gonna say that two things are really different between Kaggle and a real-world project. The first thing is data acquisition. That’s a very big topic in the real world, but a very minor topic in a Kaggle competition, because you already have your training data. Of course, you sometimes can expand your training data by looking for more data online, but in general, you already have like a fixed training set you can work with… Whereas in the outside world, in the real world, that could be the main problem - just to acquire some data.
And the second thing is the definition of the metric. So in Kaggle, people are evaluated based on some metric, and this metric is predefined before the competition starts. Whereas in the real world, that can be a discussion which takes ages between the data scientists and the business, and just creating a metric that is representative of the business problem can take a lot of time. You don’t have these issues and discussions in Kaggle.
I’m curious, as you were describing that, I have an idea that came to mind… So recognizing the limitation that you already have data provided, and recognizing the fact that the metric is well-defined in a Kaggle competition, and both of those are kind of optimal situations compared to the business world… But from the perspective of an organization out in the world, any organization that is keenly interested in data science, and stuff - would forming Kaggle teams or participating in Kaggle teams be a good recruitment tool? Because if you can find people that are performing well on teams in that capacity, it doesn’t check every box for what the business role is doing, but it kind of gives you a sense maybe of “This might be someone who could fit in with us.” We’re gonna throw the messiness of datasets and the messiness of metrics on top of that, but what do you think of that idea? Is that something that people might be thinking about in terms of trying to build data science teams for their organizations?
Certainly. I think that would be a great idea if people did this. And some companies already use Kaggle as a hiring tool. So in order to run a competition, those competitions are sponsored by someone. And there are sometimes companies who will sponsor the competition, but also tell the participants that they are hiring, and that if you finish in a top spot, you can apply for a position there. So getting a position is kind of part of the winning prize, sometimes. So they already see that Kaggle is very good for finding good candidates. But as you said, you could also - and Kaggle nowadays even offers the concept of a community competition, where you host the competition by yourself, without any Kaggle interference - you could do this as kind of an assessment center for filtering potential hires, to see how they interact on a problem, or see how they work together.
Normally, Kaggle competitions are like three months or so, but there are some formats, for example Kaggle Days, which is like a conference type of thing - they host like these conference-specific competitions, and they just go like one afternoon. And people get like a simple data set, and they have one afternoon to get like a good solution. And you could definitely see how this would benefit an assessment center, for example, because they really see the whole range of skills people can bring to a company.
I have to ask, of the sort of competitions and the notebooks that you’ve contributed to Kaggle, maybe the discussions too, what are some highlights for you? Like, of all the things that you’ve done, what are some highlights of the things maybe either you’re most proud of, or that you would like to highlight?
[22:16] What I’m most proud of certainly are the Google Landmark competitions. So there’s a competition which was hosted three times, yearly, by Google, and it’s about classifying popular landmarks. So you have a dataset of 5 million images, so it’s really large-scale, and in these 5 million images you have 80,000 classes, so 80,000 different landmarks, and you need to classify between those landmarks.
And the difficulty, especially there, is that for some landmarks you only have one or two images, which makes them quite complex to classify. And another complexity is that some landmarks look quite different from different angles. You can think of a museum, for example - people take a picture outside of the museum, people take pictures within the museum, and you still would classify it as the same landmark.
So the competition is quite tricky, and I was able to win it three times, and two times of that without a team. So just solo. And that’s something that’s even harder in a Kaggle competition. So without participating within a team, but soloing; that brings a lot of additional, let’s say, mental stress… Because you don’t have a team you can talk about your problem with; you’re just like isolated, working on a problem for three months, with like high pressure, and so on and so forth… So that brings another level of like mental component to the game. So I was quite proud that I could win two of those, or win three competitions, and two of those without any team.
So I’d like to follow up on that… If you’re talking to people out there that might be either already participating in Kaggle, not at the level that you’re at, or thinking about jumping in, what are some of the attributes that you – and I want you to take a moment and harp a little bit on yourself, I’m asking you to, and say what are you bringing to the competition that really has given you an edge in getting to that Grandmaster level, and being so competitive at that level? Do you have anything that you can offer people that are kind of maybe a little bit intimidated by it, or they’re trying to think “How can I level up a little bit?” What would you say?
I mean, I definitely have some analytical thinking just from my study of mathematics, because the whole study is there to basically learn how to think efficiently, how to solve problems efficiently… So that definitely helps. And coming from the natural sciences, in the broader sense - a sense of solid experimentation is very important. So really having a clean workbench, so to say, logging your experiments, following up on ideas, and so on… So really thinking like a researcher in the natural sciences, and following your experiments in a clean and reproducible way - that’s also quite important. But I think what really pushed me to the top level is the curiosity about different domains.
Even like top people, they tend to, let’s say, “lean back” and do what they’re good at, and not expand and learn further. But I would say one more edge that I get is that I really try a lot of different ideas, in different areas; I try to explore very different competitions, very different domains. And in the end, every now and then I can leverage something that you would think has nothing to do with the other, but you still can apply some of those ideas and concepts.
[26:15] For example, you can transfer knowledge from audio classification to biology, or to astrophysics, or from NLP to computer vision, and vice-versa. So there’s a lot of synergy people wouldn’t think about, and therefore it’s quite helpful to explore as different domains as possible.
You alluded to this a little bit in what you were saying about how it used to be – with Kaggle competitions maybe you had to build your own machine with a GPU in it to sort of operate in that world. Now there are good resources with GPUs… But I’m wondering, from your perspective both as a competitor and a Grandmaster, but also as a really senior data scientist at NVIDIA, how do you view GPU acceleration as playing an important role in Kaggle competitions? Probably most people think about it in terms of training a model, but how do you think about that more holistically, in terms of the accelerated process that’s key to performing well in competitions?
So certainly, GPU-based programming and calculation is like the bread and butter of training any model nowadays, but also, especially at NVIDIA, they are looking more and more into moving other parts of your data science pipeline onto the GPU, just to make it faster. And especially for Kaggle competitions, the speed at which you can run your stuff and try ideas is very important. So when a lot of people at a top level compete against each other, one of the edges you get is when you can do more experiments than the others. You’re bound by, of course, your ideas, but most of the time I’m not running out of ideas - I’m running out of time, and the competition ends. So as long as I can run more experiments than other people can, because I have a more efficient pipeline, or I can run more parts of my pipeline efficiently using GPUs, that gives me an edge. And some examples of this are like data pre-processing – or let’s start even one step earlier… The first step is just data loading; just loading your data frame for doing anything can be GPU-accelerated, and then it’s just like 100x faster. So every time you’re working on the problem, you get a 100x speed-up just in the step of loading your data. And that’s what RAPIDS, for example, is all about. RAPIDS is an NVIDIA tool stack which is all about accelerating those parts which are not training the model, but are what is normally handled with Pandas, for example. So they have a part which is called cuDF, which is basically Pandas on GPU. They have something which is cuML, which is basically scikit-learn on GPU. So things like clustering, all this stuff you can do on GPU nowadays.
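To make that concrete, here is a minimal sketch of what the cuDF/cuML workflow Christof describes can look like. The CSV file and column names are hypothetical, and this is an illustration of the idea rather than code from the episode.

```python
# A rough sketch of the RAPIDS idea described above: cuDF mirrors the pandas API
# and cuML mirrors scikit-learn, but both run on the GPU.
import cudf
from cuml.cluster import KMeans

# Load a data frame directly on the GPU (drop-in for pandas.read_csv).
df = cudf.read_csv("train.csv")

# Typical pandas-style operations, executed on the GPU.
per_user_mean = df.groupby("user_id")["value"].mean()
print(per_user_mean.head())

# Clustering with a scikit-learn-like API, also on the GPU.
km = KMeans(n_clusters=8, random_state=0)
df["cluster"] = km.fit_predict(df[["feature_1", "feature_2"]])
```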
Other examples… For example, NVIDIA DALI - that’s a tool especially for image processing, but they also support audio and video. An example there would be decoding of JPEGs. People wouldn’t think about that, but something like having a JPEG on your disk - just loading the JPEG involves a decoding step, which basically decodes the JPEG format. And this can already be done on GPUs, and can be accelerated by GPUs, and it also gives you a significant speed-up during your training, during anything which uses the images.
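Here is a small sketch of what GPU-side JPEG decoding with NVIDIA DALI can look like; the directory layout, batch size, and image size are placeholders, not details from the episode.

```python
# A minimal DALI pipeline along the lines described above. device="mixed" moves
# the JPEG decoding step onto the GPU; "images/" is a hypothetical image folder.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def image_pipeline():
    # Read encoded JPEG bytes and labels from disk.
    encoded, labels = fn.readers.file(file_root="images/", random_shuffle=True)
    # Decode the JPEGs with GPU acceleration ("mixed" = CPU parsing + GPU decode).
    images = fn.decoders.image(encoded, device="mixed", output_type=types.RGB)
    # A typical follow-up step: resize on the GPU before feeding the model.
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()
```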
[30:05] So there’s a lot of different steps in your pipeline that you can accelerate, and that’s what all accelerated data science is about. So NVIDIA tries to move the complete pipeline from loading the data to saving conclusions, results, all end to end on GPUs.
Yeah, that’s really interesting. And I’m guessing that some of the things that you’re talking about, like loading images, or loading data frames, or manipulating data frames maybe, doing certain operations, doing clustering… I don’t know that this is the case, but I would guess those things pretty consistently show up across competitions too, or in the real world you could think about them as showing up across many different business problems.
You were talking about your pipeline of processing, which I think is a really – I’m wondering if you can dig into that a little bit; not a specific pipeline, but how you think about solving a problem… Because most people might come to a Kaggle competition or a real world problem and say, “Okay, here’s my data, my main step is this sort of like training of the model, and maybe evaluation. How good is my model? Retrain it. How good is it? Retrain it.” How do you think about the data sort of pipeline around – you’re talking about running experiments… What does that sort of like data pipeline look like in your mind, and what are some of those reusable components, or things you find yourself doing over and over again, that are accelerated, that you’ve found accelerated ways to do those things using GPU tooling like Rapids, or this DALI?
It really depends from project to project, I would say, where it’s applicable or not. So I would say that Rapids, for example, is even more applicable to the real world, because there you might have way larger data frames, for example. So if you’re like a bigger company, you have like user data, or you have client data or whatever, because the Kaggle competitions often are packed into little problems that people can work on, and not at like this company size, large-scale datasets with like millions of users, or thousands of users… And things like Rapids especially shine in like these large-scale datasets.
For me, my pipeline is, I would say, modular. And that developed through the years, coming from the competitions. So of course, I try to reuse as much as possible, just to be efficient. So I have a really modular setup, where I have one part which is just the model training, one part which is about the storage of my data, one part which is about logging the experiments, and tracking results, and visualizing results, one part which is about the framework setup, so to say… So I use Docker, with a specific PyTorch image, to always have the same environment, and I also can replicate my experiments and use the exact same environment on different machines, so in the cloud or locally. That’s all things I learned during the years. So it’s a little bit complicated to explain the whole pipeline now on the podcast… I actually gave like a one-hour presentation two weeks ago just about this topic. So it’s pretty difficult to condense into a few sentences.
It’s hard without a diagram, for sure… But it’s super-interesting to me - like, the things you’re talking about that you’ve made modular I think are things people operating in a real world data science environment eventually need to make into sort of like components that work within their team, right? My team, we love using, for example, Streamlit to do some data manipulation, visualization, interactive stuff on the other end, and we reuse a lot of those components. And we have certain models, multilingual models that we train over and over. So we’ve got modules around that, and then like pre-processing, and other things.
[34:17] So it’s interesting how much what you’re talking about overlaps with the efficiencies you gain over time as a data science team operates together, and they learn how to make their own processes more efficient. So I think that that’s really interesting.
I have played around with RAPIDS a few times, and it is really cool. I’m just looking at the latest stats here on the RAPIDS website, and it’s talking about performance on a 300-million-row by two-column data frame, with like the highest speed-up being for group-by operations - like 80 times faster than not using RAPIDS. So I don’t know how much time that saves you, but also, like you were talking about, if you are doing experiments over and over, and you want to rapidly do experiments, even if that saves you something small-ish, like in minutes, right? A couple of minutes… You’re able to do things much faster and automate things; your automation goes faster, you can learn things much faster, and reduce that cycle time. Although I’m also assuming for many people, for their data, it might be more than a minutes-long speed-up potentially, on some of those operations.
So yeah, I don’t know, when you’re helping people - and you mentioned the discussion groups, and the notebooks that you’ve worked on on Kaggle… Is this something where you’ve seen like light bulbs come on for people when they’re saying like “Oh, I’m trying this group by operation, or something, on this data, and it’s taking me like 15 minutes every time I run through this”? Is that something you’ve been able to bring in those discussions, and notebooks, and such on Kaggle?
Yeah, certainly. So loading data frames is a good example. 80 times sounds like not that much, I think, but it’s like one minute versus two hours. That’s the scale you’re talking about - loading your data frame in two hours, or loading it in one minute. That’s like an 80x speed-up difference. And especially on Kaggle, those discussions get a lot of traction, because for your inference you actually have a time limit of like nine hours. So people try to get as much stuff into their submissions as possible. So loading data frames, manipulating data frames, loading images, all this stuff - if you can speed it up, people will very, very gratefully adopt whatever you give them to speed up their stuff. And that’s only the inference side. It’s even more true for training, because as you said, my day to day is doing a lot of experiments, and those speed-ups accumulate. So the very first thing I ever do in a competition, like the first two weeks or so, is just optimize my workflow. So I optimize all the runtime, optimize how I load my things, accelerate all the pre-processing, post-processing, whatever I have in my pipeline, so I can then leverage the remaining time with the most perfect setup, or the most perfect code. Because then I can just run more and more experiments.
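As a rough illustration of how you might check these speed-ups on your own data, here is a small, informal timing sketch comparing Pandas and cuDF. The file name and column names are hypothetical, and real numbers will depend entirely on your GPU and data size.

```python
# An informal way to compare data loading and group-by timings between
# pandas (CPU) and cuDF (GPU). "big.csv", "key" and "value" are placeholders.
import time
import pandas as pd
import cudf

def timed(label, func):
    """Run func(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

pdf = timed("pandas read_csv", lambda: pd.read_csv("big.csv"))
gdf = timed("cuDF read_csv", lambda: cudf.read_csv("big.csv"))

timed("pandas groupby-mean", lambda: pdf.groupby("key")["value"].mean())
timed("cuDF groupby-mean", lambda: gdf.groupby("key")["value"].mean())
```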
So I’m curious, because as you have been talking about optimizing and being able to do all of these iterations on your experiments, there are people out there, including myself, that are thinking whether they are wanting to jump into a Kaggle competition, they’re psyched up because they’ve been listening to how you’ve kind of mastered this process… Or they’re working for a company and they are trying to get their own systems better and better, and early teams really struggle with that.
[38:07] And so either way, with you talking about you’ve done, and Daniel is jumping in and talking about it - there are people that want to be there with you; they want to at least get on that path. Do you have some concrete recommendations on somebody who’s at the beginning of that, and they’re like “Okay, I’m doing data science, but my God, it’s taking me a long time to get through each iteration, and I’m listening to this Grandmaster just cranking out productivity so fast.” What are a couple of specific things that you would say, “Go do this, and that, and that”? …recognizing that they’ll find their own path forward, and they’ll make their own adjustments, but how do they get on that path to begin with?
The first thing - and I’ve told this several times - is just to start your very first Kaggle competition. So you go to kaggle.com, you look at the ongoing competitions, which is like 15 to 20 ongoing competitions… And just choose any topic you find interesting; you don’t need to be an expert in this topic, you don’t need to even know about the domain, or something. But just starting is like the first step.
And as soon as you start, just by the sheer amount of knowledge which is shared within the forums and the notebooks, you will see that you will learn very, very efficiently how to improve your code, how to improve your skill set, and you get immediate feedback on the leaderboard, for example, or in discussions; if you add a comment and it doesn’t make sense, then people will tell you. If it does make sense, people will also tell you.
And the leaderboard is an objective way in seeing your performance and seeing your progression. So that’s the very first advice I would give someone - try to find an interesting competition and just start. There’s basically nothing to lose; you just can gain knowledge. As I said, you will perform poorly on your very first competition, no matter where you come from. But just starting is like the first step.
And as you start, I think the best advice is that you start simple, as simple as possible, and just try to progress from that. You start with a very simple model, with a subset of the data, or with images which are downsampled to a low resolution, just to find like an efficient pipeline and to work on your code… Because all this is like an investment for the future, and all this gives you an easier setup to work on and to improve on.
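As one possible way to put that “start simple” advice into code, here is a small sketch that subsamples a dataset and downsamples images with PyTorch/torchvision. The paths, image size, and subset fraction are placeholders, not anything prescribed in the episode.

```python
# "Start simple": train on a small random subset and low-resolution images first,
# just to get a fast end-to-end pipeline running. Paths and sizes are hypothetical.
import torch
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

# Downsample aggressively so each experiment runs in minutes, not hours.
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

full_dataset = datasets.ImageFolder("data/train", transform=transform)

# Take a random 10% subset of the data for quick iterations.
indices = torch.randperm(len(full_dataset))[: len(full_dataset) // 10]
small_dataset = Subset(full_dataset, indices.tolist())

loader = DataLoader(small_dataset, batch_size=64, shuffle=True, num_workers=4)
```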
Yeah. Really good advice. I think that part you talked about, about like spending a couple weeks optimizing the sort of inputs, outputs and those portions of your pipeline, so that you can really put a lot of your focus on fast iterations on the model, or that middle bit - I think that’s really, really good advice.
This has been a really fascinating conversation. I have a long way to go to be a Grandmaster, that’s for sure… But as we wrap up here this discussion about accelerated data science, and the Kaggle competitions, what are you excited about sort of looking to the future? You mentioned that you’re curious about all of these sorts of different domains, you’ve worked on a lot of different problems… What really excites you right now, as you look towards the future, in terms of things that you want to try, or just in general, things that you’re excited about in terms of the tooling, or the community around what you’re involved with?
I would say in the short-term I’m definitely excited about or interested in how AI will support my work. So something like GitHub Copilot or other natural language models which help me code. I haven’t tried them much, but I think that in the near future, or the short-term, those tools will support our everyday life in some way. But I’m even more excited in like the long-term prospects, like what will happen in 10 years, in 20 years. And that’s really exciting. Because if you think back like 10 or 20 years in terms of AI and what systems could do, and where we are right now, and you extrapolate that into the future, that will be very exciting and amazing, what will happen then.
Yeah, yeah. I think that’s a great way to wrap things up. Thank you so much for joining us, Christof. Really looking forward to following your progression and the things that you work on in the future, and the great things that continue to come out of NVIDIA. So thank you for your work, and thank you for taking time to join us.
Thank you for having me.
Our transcripts are open source on GitHub. Improvements are welcome. 💚