Practical AI – Episode #296
scikit-learn & data science you own
with Yann & Guillaume from :probabl.
We are at GenAI saturation, so let’s talk about scikit-learn, a long time favorite for data scientists building classifiers, time series analyzers, dimensionality reducers, and more! Scikit-learn is deployed across industry and driving a significant portion of the “AI” that is actually in production. :probabl is a new kind of company that is stewarding this project along with a variety of other open source projects. Yann Lechelle and Guillaume Lemaitre share some of the vision behind the company and talk about the future of scikit-learn!
Featuring
Sponsors
Timescale – Purpose-built performance for AI Build RAG, search, and AI agents on the cloud and with PostgreSQL and purpose-built extensions for AI: pgvector, pgvectorscale, and pgai.
WorkOS – A platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their application. Add Single Sign-On (Okta, Azure, Google, Microsoft OAuth), sync users from any SCIM directory, HRIS integration, audit trails (SIEM), free magic link sign-in. WorkOS is designed for developers and offers a single, elegant interface that abstracts dozens of enterprise integrations. Learn more and get started at WorkOS.com
Shopify – Sign up for a $1/month trial period at shopify.com/practicalai
Notes & Links
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | Welcome to Practical AI | 00:34 |
2 | 00:35 | Sponsor: Timescale | 02:17 |
3 | 03:05 | What is :probabl? | 04:29 |
4 | 07:33 | Stewarding open source projects | 03:39 |
5 | 11:13 | What is scikit-learn? | 02:06 |
6 | 13:19 | The data science landscape | 03:43 |
7 | 17:14 | Sponsor: WorkOS | 03:22 |
8 | 20:50 | Scikit's role with general purpose models | 04:36 |
9 | 25:26 | Further development | 04:40 |
10 | 30:07 | :probabl's open source relationship | 05:10 |
11 | 35:31 | Sponsor: Shopify | 01:32 |
12 | 37:19 | Fun & interesting use-cases | 07:23 |
13 | 44:42 | Getting started for new devs | 02:31 |
14 | 47:13 | Future of scikit & :probabl | 04:02 |
15 | 51:15 | Outro | 00:46 |
Transcript
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the CEO at PredictionGuard, and I’m joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
Doing very well today, Daniel. How’s it going?
It’s going great. I was saying that I’m really pumped to be talking about something that’s near and dear to my heart over many, many years… Because today we have with us Yann, who’s the CEO at :probabl, and Guillaume, who’s an open source engineer at :probabl. Welcome.
Thanks for having us.
Well, Yann and Guillaume are working on data science that you own, including projects like Scikit-learn, which is, of course, very near and dear to me, along with other data scientists all around the world. So Yann, if you could, since you’re coming from the CEO perspective, help us understand a little bit, maybe for those that have heard of Scikit-learn or some of the other projects that you’re involved with, but they haven’t heard of :probabl… If you could give us a sense of what is :probabl. As you mentioned in the lead-up to this conversation, it’s a slightly different kind of company, that came about in different sorts of ways than other types of startups… So yeah, if you could give us a little bit of context, that would be great.
Well, very glad to be on the show with you today. :probabl is a company that is typically known as a spinoff from a research center in France called Inria. And Inria is the place where this technology, Scikit-learn, has been developed over the past 10, 15 years. Not many people know that, and the project has been somewhat protected and sort of incubated within that research center. And after all that time, as you know, Scikit-learn has been adopted, or even probably participated in creating the field of data science, because it is applied math, and essentially has created a sort of paradigm for how data scientists approach data science, typically through two functions: fit and predict. And the French government has a national strategy for AI, like many, many countries. And the government decided to double down on Scikit-learn. And they came up with a budget, they entrusted the research center with that budget, but then they also asked for the project to break even at some point. And the team said “Okay, break-even is fine, but we don’t do that in the research center. We don’t break even. So why don’t we call an entrepreneur to try and help us figure it out?” And they called me.
I have a track record as a software engineer and an entrepreneur in tech for the past 25 plus years, but I’m not a data scientist. So I did my due diligence and I sort of dug deep to find out what this project was all about under the hood. Is it any good? Is the community any good? And of course, Scikit-learn is this quite amazing jewel of a technology, that every data scientist on the planet uses. I discovered that it was downloaded 1.5 billion times cumulatively, 80 million times a month, 22% in the US, only 3% in France… So this is a project that is used all over the world, and :probabl is essentially the spinoff that takes all of the team, including Guillaume here, from the research center, and turns it into an open source company that inherited the mission that was initially given to the research center. And the mission is to build a suite of open source technologies, including Scikit-learn, but above and beyond Scikit-learn as well, for data science. So the scope is large, the mission is noble, and this is what we’re building, essentially.
So :probabl is a one-year-old company, that has already started doing many, many things, and Guillaume is the representative here for Scikit-learn, this technology that is used, again, by every data scientist on the planet.
Well, this has brought up a lot of interesting questions on my end, and I really love the part of your pitch, and at least how you framed it on your website and in your materials online about data science that you own, and the open source side of this… Which I know from experience, there can be some interesting challenges around finding business models that really work with open source technologies… And we’ve seen technologies where companies start with a posture towards open source, and then gradually become more closed over time.
[00:08:09.27] So I’m wondering, from the leadership perspective, it sounds even in the way that this company was formed, that there is a posture towards stewarding Scikit-learn and these types of projects… But from your perspective, what is your posture towards stewarding these projects in the open source side? And how do you view the business element of this to make it sustainable in the longer term?
So that is the hard question, but it is the one that is important here. Scikit-learn is a technology that is, again, applied math. It’s not rocket science, but it’s applied math, and it’s quite intricate. The thing is, the scientific community uses it day in, day out, and everyone depends on this.
Typically, when I discovered the scope of the project and the mission that was entrusted to the research center, I realized that this project is bigger than me, number one. Number two, the mission is to actually create more open source. In other words, in 2024 it’s even more acute. Typically, big tech keeps on amassing so much power, concentration, and we could argue that they do not distribute as much as they should. So that’s not a judgment, but it is a fact. And Scikit-learn is precisely the contrary. It actually enables so many companies to do data science. So with that in mind, before creating the company, we decided to craft a sort of architecture for the company that would respect that. And so before we created the company, before Guillaume joined as a co-founder, before I even incorporated the company, we had a template that actually created the governance, the shareholding structure, and also leveraging a new law in France that allows us to do a sort of B-Corp. So a company with a mission where the mission is clearly stated in the bylaws. And that mission is to create open source for data science.
So in a way, we’ve created a sort of constrained environment that is unlike many companies, because it’s by design. This company by design has created guardrails so that the governance cannot take this company too far on the right, let’s say, proprietary technology, or even changing the license. That’s not in the cards. And we’ve created a sort of mechanism where if we do not uphold the mission, then we can actually lose some of the assets, such as the brand. We are the official brand operator, but the brand belongs to the research institute still. So there are many mechanisms, trigger mechanisms that force us, including shareholders that we would bring in, to actually bind with the mission long-term.
Gotcha. You’ve raised so many questions for me that I want to ask… I actually want to take just a moment and kind of go back, because it occurred to me as we’re talking about this… For some folks listening who may have never even used Scikit-learn - they might have heard the name and stuff, and you talked about it being applied math… Could you guys expand on that a little bit for somebody who hasn’t had a chance to ever actually utilize it themselves, in terms of what it’s doing, and kind of catch them up to us in the conversation a little bit? And then I’m going to pepper with a few more questions, because you’ve got me really interested there. You hit so many topics on that last.
Yeah, so maybe I can give a bit of background. So Scikit-learn’s tagline is basically “Machine learning in Python.” So let’s go back to the, let’s say, statistical roots. So the simple answer is we try to do predictive modeling. So we try to use mathematics on data to say something about the future, to give an answer to a specific question, to a specific paradigm.
[00:12:12.08] The big difference with generative AI or deep learning is that the statistical methods that you have there are simple steps. So they are fundamentals. And deep learning models build on those, but are just much more, let’s say, costly to train, costly at the inference stage, and not in the same scope as well. And Scikit-learn is the de facto choice when you have tabular data, so Excel spreadsheets, data structured this way… So that’s the de facto way of training, to be able to eat those spreadsheets and give back some labels or some regression, let’s say.
And whatever is like image, or NLP, this is like, let’s say, deep learning and transformers. It’s more in that area. So we are more back to what was machine learning a few days ago, but that have many, many, many applications, basically.
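As a rough illustration of the fit/predict workflow on tabular data described above, here is a minimal sketch. The dataset (scikit-learn's bundled iris table) and the choice of a random forest are assumptions made for the example; neither is named in the conversation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A small tabular dataset: 150 rows, 4 numeric columns, 3 class labels.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The two core steps: fit on known rows, predict labels for new rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predicted_labels = model.predict(X_test)

# Fraction of held-out rows that were labeled correctly.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping the classifier for a regressor follows the same two calls, which is the uniformity the speakers refer to.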
Well, considering how incredibly popular and foundational in the data science world, could you kind of give me a little bit of a landscape view? And I’m not sure which of you would be the right one to answer, so you guys pick between yourselves… But a little bit about kind of how that fits into the data science landscape, with AI coming in, just so that with people listening, they can kind of go “Ah, I see how it fits into the many organizations and tools that are out there.” How do you think about that for that? And then I’ll get back a little bit more to the organizational stuff that Yann was talking about a few minutes ago.
So maybe I can answer partly, which is by giving use cases. And to see that, for instance, with a partner as well that we worked with over the years, to give like “Where do you find machine learning?” And for instance, machine learning can be found in healthcare, where you want to know if a drug works or not, then if you want to find diseases as well in some type of data… It could be as well like fraud detection in banks, in insurance, predictive maintenance, and all those applications that we have had for many years.
So let’s say the use cases are very, very large, and what brings Scikit-learn on that is that this is not – I mean, it’s not for one of the use cases. I mean, it was thought from the beginning to be general enough so that you can apply to any of those use cases, and to come back to, let’s say, classification and regression problems, let’s say, or unsupervised learning as well, that you can apply the tool anywhere in that field. So maybe, Yann, you have something more to add?
Perhaps also at a macro level is to say Scikit-learn does a lot of things, including deep learning. But to be frank, when you want to do deep learning, typically you’d go to PyTorch, or TensorFlow. But for everything else, Scikit-learn. In other words, in the great AI family of algorithms, there is machine learning, and within machine learning you have deep learning. Within deep learning, you have other categories of algorithms, such as transformer-based models, that lead to LLMs. So it’s basically Russian dolls of sorts, and Scikit-learn is the biggest provider of algorithms in the machine learning space.
[00:15:42.05] And in fact, if you look at the downloads, typically Scikit-learn is downloaded as many times as PyTorch and TensorFlow combined, which is crazy, because now everyone is talking about LLMs, of course, but also deep learning, because deep learning is currently in a spring state, not quite a winter yet. So of course, deep learning and gen AI is a wonderful breakthrough. That being said, I like to simplify sometimes the 80/20 Pareto distribution. So I had the intuition that 80% of the use cases out there use Scikit-learn when it comes to machine learning. And people actually tell me “No, Yann, you’re wrong. It’s more like 90%, 95%.” Because in terms of technology, that is robust, that is tried and true, that is used to actually turn a profit or return on investment.
Banks and insurance companies, right? Guillaume was mentioning fraud detection. Fraud detection typically uses Scikit-learn. And that actually saves money. Banks would be losing money without that. So it is actually quite essential. But again, it’s applied math, so Scikit-learn is only a facilitator to this category of problems.
Break: [00:17:03.10]
Yann, you’re kind of already going there, and I love the direction that you’re going with this, but I think maybe I could tee up a softball for you here, because I’m personally passionate about the answer to this question, and you probably have a better view on it. But there might be people out there maybe listening to this podcast who are thinking “Well, now that we have gen AI, we have large language models, I could put in a prompt to one of these models to do fraud detection, or to find entities in text, or to make some prediction of a classification.” And sometimes that works. And so maybe there’s people thinking “Well, there’s these general-purpose large models out there… How does that change the way that something like Scikit-learn plays in an industry?” And I personally would argue and think that this actually makes scikit-learn more valuable, if anything, rather than less valuable, in terms of the ways that it can be combined even as a tool that’s orchestrated with Gen AI models… But I’m curious your perspective on this from the business side, and maybe Guillaume has some ideas on the technical side.
Yes. So Scikit-learn typically is this one technology that is patrimonial. In other words, it belongs to everybody. In fact, there’s another stat when you look at the figures that are public, by the way, the number of dependencies. So Scikit-learn is actually used by nearly 900,000 projects on GitHub. So there’s nearly a million projects that depend on scikit-learn. And there’s a new law that I discovered recently, someone mentioned the Lindy effect, which means that something that’s been used long enough will remain important for long enough.
So I’m not saying that Scikit-learn will go the way of COBOL, but Scikit-learn is here to stay, and we are with the community, the guardians of that. So we’re going to make sure that Scikit-learn remains there forever for companies that actually need it in a stable version. And of course, Guillaume and the team are building up new features as we go. So there’s a dedicated effort, and I should say that we have carved out – nearly 10 people in the team are doing only that, contributing to scikit-learn and the other associated libraries.
Now, your question, Daniel, is whether Scikit-learn will be obsolete in, say, a number of years because general-purpose technology has made it irrelevant in some ways. Number one, Scikit-learn is extremely frugal. It actually works on CPUs, and it is well-controlled, well-understood. It’s actually quite predictable in some ways, whereas deep learning is usually known as a black box, where it’s really, really hard to introspect. And so scikit-learn does produce for certain categories of problems things that are actually working quite well, more so than large language models for sure today, and more so than any sort of deep learning-based technology that we understand today.
[00:24:16.18] Now, it is possible that with additional data, additional training and techniques, and even evolutions on the transformer-based model, we could improve and probably render obsolete Scikit-learn. But to us, and Guillaume and I, we talk about that, and with the team we also experiment with other LLMs, and we are also trying to figure out how we can use these new technologies to actually help our first persona, and that is the data scientist.
So we are a technology provider to help data scientists, and increasingly so the data scientists in enterprises, because we will be creating value-adding services and solutions so that we can generate revenue to sustain our mission. So the goal for us is to actually project ourselves while contributing to open source, but also create a sort of business value proposition not dissimilar to Red Hat… Because that is the closest type of company that we identify with in terms of spirit.
To that point that you’re making right there, I’d like to get back to something that you said earlier, that feels like you’re kind of tying back to it anyway there, and that’s that you talked about the mission to create more open source, and the mission that – you’re trying to create this environment that you’re describing by design, you said… And with scikit-learn here to stay for the long haul, it’s going to be something that is not going away soon; it’s solving such a high percentage of the problems. Could you describe a little bit about kind of what you’re thinking around that in terms of further developing this particular set of software and the ecosystem around it, so that we have the benefit for many years to come? How are you approaching that?
So the company is built with multiple business units, if you wish. That’s a big word for a startup, but we have multiple revenue lines and multiple activities, even within the open source team, which is dedicated. So Guillaume perhaps can elaborate on some of the other libraries that we support, that complement Scikit-learn. So that’s one way to answer the question. But also, we are building a new product, which I call reversible SaaS. So we are building a product that will provide additional value to data scientists, and the goal is to create a sort of – I don’t want to use the term Copilot, because that is too close to LLMs… But it is the spirit. We are building a companion to augment the work of data scientists, all the way to teams. So that is an additional product on top of Scikit-learn, because Scikit-learn just works. So we don’t want to change that.
And contrary to a company that would build a SaaS solution with a proprietary approach, we want to say “Okay, whatever you guys use is fine. We need to find a way to add new value.” And some of it will be open source; fairly modular. But for those companies that have more money than time, that need more service than be on their own, we’ll have a solution for you, and we’ll make your life easier.
Data scientists are a new breed. It’s a new type of job. It’s not been around for very long. And in a way, when I talk to people – so I’ve been in code forever, and you know this, the developers, when they get hired, they are turnkey in some ways. They have their Git environment, and they know how to peer-code, and that’s all pretty standard. But when you talk about data scientists, it’s actually quite artisanal. It’s an art and a science at the same time. And you’re manipulating two objects: actual code - but data scientists are not coders - and you’re manipulating actual data. It’s not code, it’s patterns.
[00:28:14.12] And so data scientists have a difficult task, which is to combine these two things and create value for the enterprise. And then they talk to business units and they’re like “What do I do with this model? How do I put it in production?” So there’s a huge conundrum to solve, and that’s what we’re going to do, additionally, to building open source, that are modules that people can use. Maybe, Guillaume, you can elaborate on some of the other libraries that are key to actually help.
Yeah. So within :probabl we have the open source team – so we worked for many years on Scikit-learn already, but we see the importance, and as a community, we see the importance of putting models into production, and as well getting closer to the data sources. So we are working on libraries that should make those come together. So for instance, we have a library that is more on the MLOps side that is called Skops. We work a bit to make persistence more secure in some way. But we look as well at how to bring databases, like the SQL world, closer to the machine learning models. So how can you transform data with states, with different tables, and how you can be in [unintelligible 00:29:27.25] We’re caring so much about like SQL, for instance, and how you can bring this into Scikit-learn.
And within scikit-learn as well we want to improve whatever is visualization, evaluation, inspection of models, which is on top of just training an algorithm. So we want to augment all those aspects beyond those, and either it’s in Scikit-learn or it’s a library connected to Scikit-learn, let’s say. The one before is called Skrub, by the way. So it’s like scrubbing data. So Skrub and Skops are two libraries that we look at.
So as we’re kind of talking about the libraries now, you have this robust open source contributor community built up around Scikit-learn and the various projects within it… How does :probabl work with those? How have you guys set up that relationship? What does the governance look like on that? Because you have both your core team that you alluded to earlier, that’s working at :probabl on this, but you also have that larger open source community. How does that all work? Can you kind of tell us how that’s evolved? I imagine it’s quite mature by now.
And that’s the point. The maturity means that, by design, we decided to not affect the license of Scikit-learn. We’re not branching it out, we’re going to care for it. And so the governance of Scikit-learn being so sane already means you don’t touch it. If it ain’t broke, don’t fix it.
So the governance is unchanged. The center of gravity was at Inria, the research center, but also involving people all over the world. I don’t know, Guillaume, how many contributors? Maybe 200?
Or even more. I think in a year you have more than that. You have maybe 300, 400. The core team is, let’s say – half of it might be around France, around Paris, around :probabl, but then there’s another half [unintelligible 00:31:28.26] each person around the world that contributes almost every day, let’s say, by communicating with the community. And as Yann mentioned, we didn’t want to change that. Nothing changed in that regard. So the only thing that we actually did more - we did more to bring transparency. So to explain to people – so now that we’re in :probabl, we feel that because we’re a private entity, we need to communicate what are we doing, and what is our roadmap, and which community items are we going to work on… Just to bring more trust, such that we don’t go in the dark and that nobody knows now what we’re doing.
[00:32:12.04] So we tried really to pay attention every six months to mention which of the items that are defined by the community - these are not defined by :probabl. But from the items, which one we have the capacity to work with the human resources that we have, let’s say, at hand. So we really want to show that.
And then by design, the open source team that is full-time on Scikit-learn and other open source libraries, means it’s a cost center to the company. So that cost center is by design, and we know that’s a cost we have to cover. So we will cover it through different types of activities… For instance - and this was something that was done in the past, where brands were sponsors. So either they hire someone that becomes a core developer, and they’re naturally sponsoring someone to build up this technology, or they were giving money as a donation to the research center. But now that the team is with us, we are translating this into a contractual sponsorship framework. So brands who want to contribute to Scikit-learn and help us compensate for salaries will get something in return - exposure. And if they actually put more money into it, then we’ll have conversations around the roadmap. Find a way to make it converge in a win/win kind of way. Because Guillaume, for instance, can say “This brand wants us to do something, but it makes no sense for the community.” Then we won’t take their money for a sponsorship type of business. However, if companies want to pay us to do a certain type of paid-for software, we’ll look at it. But that’s a different branch of the company.
So we’ve really clearly separated, and by design we know there’s a cost to it… And that cost is actually - if we are doing well, it’s compensated by the fact that we have done good by the brand. In other words, hopefully the community will actually resonate with what we’re doing, and so they’ll pay us back by actually appreciating what we’re doing, which will carry the message further.
So we think that there is a self-fulfilling prophecy if we actually keep adding value to the whole scheme, as opposed to removing value. And I will not name certain projects that have chosen a different way. But on the other hand, going back to the governance of the company, when a company flips and becomes VC-funded, or only VC-funded, VCs require a sort of return on investment that is too radical. And so that sort of forces a change of posture vis-a-vis the community and the licensing scheme. In our case, we’ve actually created a structure that is balanced in terms of shareholding groups, and so we will ultimately have - or that’s the goal of the structure and the architecture - as much money from public support as from private support. So it’s, again, sort of balanced.
Break: [00:35:19.19]
So as we come back out of break here, I want to turn to kind of a fun question for you, and I’d like each of you to take a swing at it, because it’s not specific to being the CEO, or doing the technology itself. If each of you could describe kind of a cool use case, something fun, or interesting, or that’s really captured your imagination with Scikit-learn, and kind of share that with the listeners in terms of something that just kind of really took you as your thing. I’d love to hear – I’m expecting it to be a bit different coming from each of you and your different roles, but I’d love to hear kind of how you see that, and what’s the thing that sticks out in your mind.
Guillaume, you start, because I have to think about it now.
So it’s a very technical one, let’s say, but… So during my PhD I was doing classification, which is something that – I was trying to find people that have a specific type of cancer, prostate cancer, versus people that didn’t have it. And inside that space, you had one fairly specific problem, which is called imbalanced data. And it’s what introduced me basically to Scikit-learn, because I had that problem and I was using Scikit-learn for the specific issues and how to tackle those types of issues. And what is really funny is that – so how I got introduced to Scikit-learn [unintelligible 00:38:39.26] for instance, with the developers, and I developed one library which is called imbalanced-learn, that is merging as well with Scikit-learn, and is compatible in some ways… And for many years, I maintained that package even when I was maintaining as well Scikit-learn. And over the years, year after year, we did everything by the book, basically, in that library. We implemented the algorithms that were inside the literature, and everything was fine… Until, as part of Inria and now :probabl, we had time as well to educate ourselves and to try, through the documentation of Scikit-learn, to explain some concepts to people.
And by doing this, we found out that most of the research there didn’t look at the problem properly. And by communicating with other core devs, we just found out that a huge part of this thing was just wrong, and that you should look at it in another way… And then it’s pretty funny, because with this, we’ve found some useless stuff that was, for instance, inside imbalanced-learn. But then now we have better content, we went to conferences to explain these problems, and people start to tell us “Oh, yes, actually, that’s right.” And it’s fun that you come and say that whatever you were doing 5 years ago or 10 years ago is actually obsolete, or not good… We don’t expect from there. And it’s something that I find very fun when you do open source, because you are just here to contribute to something and just to bring the best of what you do to everyone, and everybody will be thankful for that. And you are not defending your own, let’s say, scientific paper. That’s all what is true. And for me, that’s one experience that comes from my PhD, from eight or nine years ago, up to where I am now. And then I see an evolution where I was with very good people, and you could correct errors that you made in the past [unintelligible 00:40:38.05] that will benefit everyone afterwards, because that’s landing inside the documentation of Scikit-learn, or even inside the library, and then everybody will just – let’s say a million users will be affected and say “Oh, actually, that’s good.”
And this is something that if I would have stayed in Academia, for instance, probably it wouldn’t have happened, because you wouldn’t have time or be critical enough, because you would have been in the [unintelligible 00:41:04.23]
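To make the imbalanced data problem mentioned above concrete, here is a small sketch of why such datasets need care. It does not reproduce the findings Guillaume alludes to; the synthetic 95/5 label split and the use of a most-frequent-class baseline are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 95 negatives and 5 positives (for example, rare disease cases).
y = np.array([0] * 95 + [1] * 5)
# The baseline below ignores the features, so a constant column is enough.
X = np.zeros((100, 1))

# A "model" that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

# Plain accuracy looks strong even though no positive case is ever found.
plain_accuracy = accuracy_score(y, y_pred)
# Balanced accuracy averages the recall of each class, exposing the failure.
balanced_accuracy = balanced_accuracy_score(y, y_pred)

print(f"accuracy: {plain_accuracy:.2f}, balanced accuracy: {balanced_accuracy:.2f}")
```

The gap between the two scores is the kind of trap that makes the choice of evaluation metric matter as much as the choice of algorithm on imbalanced problems.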
It’s good.
I might be the CEO, but I do have the imposter syndrome… Because Scikit-learn is so impressive. Day in, day out – I mean, that team… And Guillaume is very humble and very discreet, but the amount of knowledge and the amount of technicality that is trapped inside this library is mind-blowing. And you haven’t met the other members of the team. It’s pretty much very, very hard to compete in terms of the amount of CPU cycles that go in there. So Scikit-learn is the gift that keeps on giving, in some ways, and the team is just out of this world, and nice, and it’s just a pleasure to work with that team all the time.
[00:41:55.02] Now, the more I discover scikit-learn, the more I find it amazing because of what the brand means to people. And so last week - and today, actually - we just released, and if you allow us to actually put the link in the notes…
Of course, absolutely.
…we released the very first official Scikit-learn certification program. And what’s amazing is that we – so this is the first time, so we’re doing it step-by-step, and the system works, people can register, they can pass or fail the test… But without advertising, we had like within a couple of days 600 registrations, all over the world. A lot from India, actually, because people in India, they do work also remotely for other clients across the world, and so they do need a stamp of approval to showcase their ability to provide a service. So very interesting that this brand almost instantly can promote a sort of service that is value-adding. So that’s the one thing.
But then on a more technical level, I fell in love with one new feature that came out with 1.5 of scikit-learn, developed by another co-founder and core developer, Jeremy… And that is the callback feature. Why? Because Scikit-learn in fact is a platform - it is a platform - and the callback feature allows us to provide extensions, if you wish, where people can hook into the inner workings of Scikit-learn as they are building new models. And in fact, I find that to be essential, because we are entering an age of liability with regards to AI. Companies need to be able to introspect. They need to actually find out why the model is producing such and such results. And so introspection is critical.
And as I said earlier, deep learning is sort of a black box type of approach, which I love, by the way. Again, in 1992 I was building deep learning models in the middle of the winter of AI at the time. But Scikit-learn is actually quite introspective, quite transparent, frugal, as I said… And so callbacks are yet another feature that provides actual introspection into how we build models. Because - talk about insurance companies, fraud detection - you’ve got human beings at the end of the spectrum being handled by algorithms. And so that is critical. And I think we fulfill a very important need with these features. So again, Scikit-learn, the gift that keeps on giving, and I’m impressed every day, with a bit of an imposter syndrome, because that team is just so powerful with this tool.
And speaking of this team, Guillaume, I’m going to throw a question at you… People out there have been listening to this, they’re kind of going “Okay, I want to dig into this.” So you’re going to get some new developers that are going to come… How should they engage? How should they find, and get started, and the projects to develop? What’s a good onboarding path for those developers?
Probably the best onboarding path is, if you have the chance that inside your local community there are some people who run what we call first-time contributions to open source, or coding sprints, go speak to those people, because they will help you to get on board. But then, if you are behind your computer and you don’t know where to start, that’s where we have documentation that describes what we call contribution… Because contribution is not only coding. It could be speaking, debugging, documenting, organizing sprints, and those types of things. And we describe what we consider as contribution, and how you can help basically, and where you can help.
[00:45:47.16] Of course, the natural thing is to come and code, and then we explain to you how to start with that. So this is on the documentation channels, the documentation webpage. And afterwards, everything is online and public. So there’s nothing private. We have different channels of communication. The main one is GitHub, and it’s going through the issue tracker, or the pull requests, depending on which side you are… And the core developers will be available, I would say, 24 hours a day, because we are around the world. So if I’m sleeping, somebody else, in Australia or in the US, can probably just answer you. And then we’ll just give you feedback. And that’s where your journey starts.
You should not be shy, and you should not be scared of making a mistake, because we are not judgmental. We all started at that stage of saying “I don’t know what I’m doing, and I need to ask people what I should do”, and that’s a normal step. And afterwards, you just grow with the community, and then the community brings you over.
But the most difficult thing is the first step, engaging and saying… So I have the imposter syndrome as well, but people say “I don’t want – I mean, those very skilled people, they will never want to speak to me.” And that’s not the case. So just come and just try your best, and then people will just communicate with you, for sure.
Great guidance there. As we wind up, I’d like to get, for each of y’all, for both :probabl and for Scikit-learn, kind of what you think about for the future. And I’ll let you define what time span the future is, and whether it’s a few months or a few years out… But I’d really like to wind up… Paint us a picture of when the duties of the day have finished and you’re just relaxing, and you’re thinking about what’s possible going forward… What do you think about?
I’ll go with the mission. The mission is bigger than me, bigger than us, and so that’s why the governance creates a self-sustaining model. So of course, it’s not trivial. There’s a lot of work to achieve the mission long-term. But that mission ends up with an IPO. In other words, this company is not meant to be sold or wrapped up. The goal is to do an IPO, so that this company can carry on with the mission, and allowing people to invest, and be part of that story. And that’s why earlier Daniel asked a question about investors, and all that. So we do have 70 individual investors, including people who were contributors or are contributors to Scikit-learn, who don’t have the chance yet to be employees, full-time, of the company.
So the goal is to create this sort of dynamic vehicle, and if we look at the North Star, there is no such company today that is the provider of open source machine learning technology. That company does not exist, and we aim to be that. Because we need that in an age where there’s too much concentration within just a handful of players. That’s not okay.
It’s not okay for the global South, it’s not okay for Europe, which is lagging behind, but it’s not even okay for the US. The US may have big tech, but that’s not okay as a single model. We need people to own their data science. That’s why that is our tagline.
That was good. Guillaume, what are your thoughts?
Yeah, so maybe more on :probabl, I’m really thinking that we have a mission, let’s say, to help more data scientists. But I will speak more about Scikit-learn and the ecosystem. So for me, the mission is - we should stay focused on what’s happening out there, and make sure that Scikit-learn is still relevant. So we have the foundational models, that’s fine. But we need as well to understand where this is deployed and how this is used, because we can make such progress that [unintelligible 00:49:52.10] for instance, make it easier to bring databases to Scikit-learn, or to bring Scikit-learn models into production, and to reduce friction and everything. And as well, bring value on understanding the model. I mean, we are speaking about the AI Act as well in Europe now… So I’m sure there are plenty of, let’s say, areas where we can really have an impact, and then there’s technology that moves very fast. So for instance, before we knew Pandas, now this is Polars. So we need to move like in a fraction of a second, saying “How do we deliver value to the user that just makes a switch, and still can use Scikit-learn? Can we accept those things?”
And then we have to make this audit of what’s happening. So it is difficult to say where we will be in five years, because in five years, we have all those things that can, let’s say – we have the full chain of machine learning that probably will be here, so we should be aware of whatever moves very fast around us to stay relevant.
That was well said, too. Gentlemen, you guys have done a fantastic job of teaching the rest of us about this… And thank you very much for coming on the show today.
You’re welcome. Always a pleasure.
Our transcripts are open source on GitHub. Improvements are welcome. 💚