Practical AI – Episode #287

Pausing to think about scikit-learn & OpenAI o1

get Fully-Connected with Daniel & Chris


Recently the company stewarding the open source library scikit-learn announced their seed funding. Also, OpenAI released “o1” with new behavior in which it pauses to “think” about complex tasks. Chris and Daniel take some time to do their own thinking about o1 and the contrast to the scikit-learn ecosystem, which has the goal to promote “data science that you own.”


Sponsors

Assembly AI – Turn voice data into summaries with AssemblyAI’s leading Speech AI models. Built by AI experts, their Speech AI models include accurate speech-to-text for voice data (such as calls, virtual meetings, and podcasts), speaker detection, sentiment analysis, chapter detection, PII redaction, and more.

Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes.

Speakeasy – Production-ready, enterprise-resilient, best-in-class SDKs crafted in minutes. Speakeasy takes care of the entire SDK workflow to save you significant time, delivering SDKs to your customers in minutes with just a few clicks! Create your first SDK for free!

Notes & Links


Join the Practical AI community — it’s free! Connect with us in the #practicalai Slack channel. The community is here for you as a place to belong, to bounce ideas around, or to get feedback on a conference to attend or a talk you’d like to give.

Chapters

1. 00:00 Welcome to Practical AI (00:37)
2. 00:48 Sponsor: Assembly AI (03:26)
3. 04:21 Filtering through the noise (06:13)
4. 10:33 Probabl seed funding (03:24)
5. 13:57 scikit-learn (03:54)
6. 18:01 Sponsor: Fly.io (03:06)
7. 21:29 OpenAI o1 (04:20)
8. 25:49 Latency vs reasoning (10:25)
9. 36:33 Sponsor: Speakeasy (00:53)
10. 37:30 o1's place in workflows (06:05)
11. 43:34 Probabl's manifesto (03:01)
12. 46:35 Learning resources (02:30)
13. 49:05 Outro (01:05)

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another Fully Connected episode of the Practical AI podcast. This is Daniel Whitenack. I am the CEO and founder at Prediction Guard, and I’m joined by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin.

In these Fully Connected episodes we try to keep you fully connected with everything that’s happening in the machine learning and data science and AI worlds, and hopefully share some things with you that will help you level up your machine learning and AI game. How are you doing, Chris? It’ll be fun to catch up on a few things that have been happening over the past couple weeks today.

There’s so much going on, oh my gosh.

Always, yes.

We’ll have to pick and choose what we have time to hit here.

How are you hearing about AI things these days, Chris? Maybe that’s even something that people might be interested in knowing. We’ve talked about this a few times on the show, but there may be new listeners who kind of got into the show after the gen AI stuff, and they’re trying to figure out where to keep up with news, and learn things…

The question’s the inverse. It’s like “Where are you not hearing about it?” Because we’re getting it from every angle.

Or how do you filter through the noise?

That’s the issue there - there’s so much noise now. There’s so many – I know both of us have our kind of workflow for how we’re consuming new things going on out there, and have for many years as we’ve been doing the show… And I think there’s so much more that is coming in, and the quality varies hugely. I know we’re going to talk about a few topics today, and I know that on at least one of those, to be seen in a few minutes, some of the info is kind of straight down the line, gives you the facts, and other info is very hypey. We’ve talked a lot about how hypey things are. But I think that’s one of the challenges with this field - it’s moving so fast, and it’s just one of the dominant topics in mainstream media now, and you have a lot of folks out there reporting on it, some of whom know something about it, some of whom don’t know as much. So yeah, the filtering is the big challenge.

It’s been slightly different for me, I think, in the sense that - well, one, I’ve founded a company, so my attention is maybe slightly distracted with one thing or another, or maybe exposed to things in different ways… But also, I think my habits online have changed even over the past year or so… Whereas I was maybe more - not always posting, but looking for things on Twitter, or X as it is now, and seeing a pretty active AI community there, of course… Well, you as well; we both kind of had different journeys into this, but definitely I had this sort of background in the data science thing, and a lot of the data science discussion was, at least in my world, happening on Twitter or X.

Then the whole AI world kind of took over, and also Twitter changed to X, and things are still happening there to one degree or another, but I sort of started ignoring it a little bit. And so at least one of the things that we’ll talk about today I kind of looked at and learned about through LinkedIn… Which, if I’m completely honest, I always hated kind of scrolling through LinkedIn throughout the years.

Well, it’s very corporate-y, yes.

[00:08:13.06] But yeah, maybe it’s just the people that I’ve learned to follow or look for things from, or… Yeah, I don’t know. It’s hard to find the proper channel through which you’re getting a good signal to noise ratio, but I’ve found it a little bit better there recently. I don’t know if you’ve seen a similar thing.

No, I think you’re very right about that. LinkedIn’s gotten better about surfacing good information in that way. And it used to be – once upon a time I had my workflow sources, and then you’d see things kind of showing up on LinkedIn in the hours and days afterwards. And I also used Twitter, but I’ve moved away from X as well. It’s just not giving me what I’m looking for most of the time. And so I’ve put together a bunch of new sources and aggregators, with some filtering on that… But yeah, LinkedIn is definitely on the upswing, versus where it was maybe two years ago.

Yeah. And I don’t know, I guess I’m lucky enough to be part of a few Slack channels and Discords now, with either my coworkers, or other collaborators, or different Discord channels - whether that be Latent Space, or the MLOps community… These surface things as well, and people pass them around… And so yeah, I would definitely encourage people to think about… You know, we have a Slack channel with the podcast - you can find that at practicalai.com/community - but there are also really great ones out there with Latent Space, the MLOps community… And of course, collaborate with your coworkers, figure out where they’re finding good info… But yeah, it is, I think, a little bit more fragmented now, which makes it hard to pick apart some of those news stories.

I’m going to say, people need to join our Slack community, because we’ve had some really good conversations and questions and suggestions arise there… And the folks who are participating really know their stuff, and they’ve pointed me to a few things in recent months that have been really useful. So I’m spending more and more time taking pointers from it.

Yeah. Awesome. Well, one of the first things that I wanted to mention on the show here was something that I did run across on LinkedIn, which I had just missed until it was posted recently, related to a recent seed round of funding. There’s this company, Probabl, which had an announcement of a round of seed funding… I guess they would classify it as seed funding. As I’ve learned, these phases of funding are somewhat strange, and all have different definitions depending on where you’re at… But yeah, this is a company, Probabl, which is - I’ll use their exact words, because they’ve chosen them well - the official operator of the scikit-learn brand. And they talk about this funding representing “a powerful step forward in our mission to help professionals truly adopt our slogan, ‘Own your data science.’”

So a few interesting things there… I am really happy… Of course, people are probably saying “Well, why aren’t you talking about OpenAI o1 as the first thing?” We’ll talk about it later, don’t worry. But this one I thought was really interesting, and –

Foreshadowing.

[00:11:58.11] Hey, I’m excited to talk about something that intersects something other than OpenAI and the next-gen AI model. But scikit-learn, of course, has a huge place in my heart, from years of operating in data science. So to see a brand or a company that’s really putting an effort behind that brand of scikit-learn, advancing it, and also advancing the slogan “Own your data science” into the future - one, thinking about owning as in tools that you can operate internally and privately, with your own data, but also tools that are around data science. So maybe there’s a future for data science after all.

I think there is. And I love – the reason we put this first is because as huge supporters of open source and people being able to kind of control their own destiny with data science, and machine learning, and AI technologies, I love having these companies that are out there, supporting open source… And it gives us options. And so if you just want to do the open source yourself, and maybe you’re just a data scientist working on your own side project or something, you can do it. Or if you’re a corporation and you’re looking to have dependable partners to work with around open source, you’ve got that, too. So in its own funny way, it makes it more accessible to a wider audience, and gives us those choices, and I just love it when companies are doing that.

Just as a two-second aside, it’s one of the things that surprised me with Facebook and their models, as they’re open-sourcing them, which the other big companies we’ve historically talked about aren’t. But back to Probabl - what are some of the things you see in this announcement that really get you going, Daniel?

Yeah. Well, I think for people that aren’t aware, and maybe just coming into the AI world with ChatGPT and all of that - scikit-learn has been around for a long time, and is a primary open source set of tooling for the Python community and the data science community that allows you to build a wide variety of models. So everything from neural networks, to decision trees, to random forests, to support vector machines, and clustering algorithms… Just a huge number of algorithms and metrics and evaluations and models, all within this very consistent, well-supported API - the widely used library that is scikit-learn. And so I’m excited to see that - you know, one could think “Well, maybe everyone that was interested in supporting and contributing to those things has kind of jumped ship to what’s fancy and shiny with LLMs and all this stuff”, but I’m encouraged that there is a strong backing behind this, and I think it’s needed moving forward. We’ve talked about this on the show - there is going to be a need for hybrid systems between traditional machine learning and statistical learning and generative AI, and there will still be a need for smaller, performant models in a variety of contexts. And maybe those are better, faster, more secure for a whole variety of things, as they’ve been useful for decades now. So I’m encouraged by what’s there.
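To make that consistent API concrete, here’s a minimal sketch - the synthetic dataset and the two example estimators are arbitrary choices for illustration, not anything specific from the episode - showing how every scikit-learn model is trained and used through the same fit/predict calls:

```python
# Minimal sketch of scikit-learn's uniform estimator API.
# Any estimator can be swapped in: they all expose fit() and predict().
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data stands in for a real problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (RandomForestClassifier(random_state=42), SVC()):
    model.fit(X_train, y_train)    # identical call for every estimator
    preds = model.predict(X_test)  # identical call for every estimator
    print(type(model).__name__, accuracy_score(y_test, preds))
```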

[00:15:53.21] Just so people know - and we’ll link to this announcement - they talk about what’s next for Probabl… So one of the things they mention is an acquihire, securing talent, but also the launch of an official scikit-learn certification program coming in Q4 2024. And then the release of a product. So obviously, this is a commercial company, and they will have something productized.

They say that they’re aiming at augmenting the work of data scientists in the pre-MLOps phase. So this is very interesting to me. I think it’s an interesting niche that they’re focusing on there. Not trying to cover or recreate MLOps tooling that’s already out there, but focus on the data scientist role in that kind of pre-MLOps period, which in my mind involves, of course, data munging and model selection, feature development, all of these sorts of things.

It’s funny that you bring that up, because I think that gets lost, with all the hype on the AI, and particularly Gen AI these days… There’s still so much data science going on. And I would suggest that just core everyday data science is still by far bigger in terms of being present in the number of organizations. It may not be a big, fancy, glitzy effort, but it’s pretty core to most organizations. And yet the AI hype tends to get all the press.

So it’s really good to see them kind of acknowledging that that section is still there, and that it does need support, and being willing to do that a little boldly, and stepping a little bit away from where everybody else is going to… So yeah, I hope they do really well in that capacity.

Break: [00:17:53.13]

Well, Chris, one of the things we started out talking about - Probabl and this funding that they have… Of course, this is related to scikit-learn, and owning the data science process with scikit-learn. They’re obviously very committed to the open source way of going about things. They have a long-term vision for people still to be able to maintain their own data science process, with their own data, with models that they own internally… That’s of course in contrast to some things out there - there’s a mix of this. I think there is a validity to people trying to create their own proprietary models, and that’s their way of owning certain things… But certainly, the models that people would think of right now as AI models are maybe those from OpenAI. And we’ve got another one of those over the past - whenever it was; I don’t know, all the days blur together for me. But this rumored - what was it? - Strawberry-codenamed o1 model, which people have been talking about… And certainly, we want to cover that on this news show. So yeah, o1, Chris - how has o1 struck you in its first days on the AI scene?

Yup. It’s an interesting animal, and it is entirely proprietary. And on this show, as we’ve noted, we usually give those the second spot, not the first spot. But it’s interesting - I’ve been using it some over the past few days… Some of the new features that it touts are advanced logical reasoning, and it has a capability where it slows things down in terms of processing… And so it’s not the same experience as the GPT-4o that we’ve been used to. I know on my iPhone, with GPT-4o, I’m talking at it these days, and it talks back… And it’s fairly conversational, due to the speed. You can’t really do that with this o1-preview as it’s currently released. It’s a little bit too slow for that. When I’ve tried, I’ve had to wait a while for my responses… But it’s taking a different approach.

Yeah, I guess it brings up an interesting question. So for people that are not familiar with the o1 model - it’s a model that really operates on this principle of thinking through a series of stages of reasoning before giving a final answer. In terms of how a completion would happen, or how you would structure a prompt in the past, some of this might have been called chain-of-thought processing… And so that takes a bit more generation - more text is generated, there’s a pause before the result… And this brings up a question - you were talking about this latency element, Chris, and here they’ve intentionally slowed things down. And I was wondering for a while, with Groq and very quick completions out of these models, what role latency would really have. At a certain point, as text is generated, our human minds can still only process a certain amount of natural text at a certain speed. And here, OpenAI has intentionally slowed down the generation process - which, I don’t know if you’d call it that; maybe it’s still generating at the same or a similar tokens-per-second rate, but it’s generating more. It’s generating this kind of chain of reasoning, or chain-of-thought steps, and then producing something on the output… Which I’m assuming is just a generation, and there’s then some special token that they train the model on in their prompts, which is like “Now give the answer to the user” - whatever that special token is, which we don’t have a ton of information about. But yeah, I don’t know… What are your thoughts on this latency, versus reasoning, versus also the human interaction element of this?
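As a concrete illustration of the hand-rolled chain-of-thought pattern described here - not OpenAI’s actual o1 mechanism, which is undocumented - here’s a hedged sketch using the openai Python package; the “FINAL ANSWER:” delimiter is purely our own illustrative convention:

```python
# Hand-rolled chain-of-thought prompting: ask the model to reason in steps,
# then split the visible reasoning from the answer on a marker we chose.
# The marker is our own convention, not anything OpenAI uses internally.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Solve the problem step by step, showing your reasoning. "
    "When you are done, write 'FINAL ANSWER:' followed by the answer.\n\n"
    "A train travels 120 miles in 2 hours. What is its average speed?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat model works for this hand-rolled pattern
    messages=[{"role": "user", "content": prompt}],
)

text = response.choices[0].message.content
# Separate the visible reasoning from the final answer on our marker.
reasoning, _, answer = text.partition("FINAL ANSWER:")
print(answer.strip())  # show only the answer, like o1's collapsed UI
```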

Before I dive into that, I want to note that having read quite a bit about the model over the last week or so, a lot is a bit speculative in terms of - you’ll read articles where people are saying “Clearly it’s doing X”, but they don’t really know, because OpenAI has not specified exactly how they’re doing it. So I want to note –

[00:26:18.14] Yeah, so anything we say here may or may not be accurate. This is our best chain of thought on the topic.

And I think it’s important to say that. We’re not speaking in perfect factual –

We have no direct line to the OpenAI technical team revealing their secrets.

Yeah. Nor do many of the authors that we’ve been talking about. So I guess my impression has been – it’s a little bit startling. The latency thing kicks in after you’ve been using the GPT-4o model for a while. And it forces you to start realizing that there are definitely different use cases for GPT-4o and the o1-preview. And I think that is notable, because it’s really the first time that OpenAI has offered a new model that everyone didn’t just instantly switch to as the thing to use.

It was kind of like - they said when we went from GPT-4 to GPT-4o, “Well, there may be cases where you go to either one.” But in practice, I saw people just going to GPT-4o pretty much nonstop at that point. Whereas with this one, the latency kind of forces you to change your ways. And they’ve also given guidance on the prompt engineering - the way you prompt the o1-preview is not the same anymore, because of what they’re doing behind the scenes, with the notion of possibly addressing your prompt multiple times concurrently in different ways and verifying the results against each other - all things that I’ve read, but that are unverified at this point in time. Because they’ve changed the way they’re processing on the backend, and the latency is now there, there might be different types of tasks that fit it. They tend to highlight coding, they tend to highlight math and other critical reasoning skills where you’d want to go to the o1-preview rather than back to GPT-4o… And they’re claiming that it’s more accurate. I’ve seen numbers like a 15% increase on complex reasoning tasks to support that. And there are some drawbacks, which we can talk about in a moment.

But I think I am still trying to figure out the tasks that I’m going to assign to each model in my own mind. And I’ll find myself stumbling a little bit on that. At this point, over the next week, I’ll probably be trying this stuff out when I’m coding - I don’t have a reason to do a lot of math in my day-to-day stuff on a constant basis… But for coding, I’ll probably be spending more time on the o1-preview, whereas I was using GPT-4o itself before that. So it’ll be interesting to see.

So with that difference, as you look at the two - the o1-preview and just the GPT-4o we’ve been using - I just want to throw out into the mix that at some point in the not-too-distant future, we’ve been led to believe, GPT-5, which is a much larger model, according to OpenAI, will be coming out. So it kind of leaves you wondering a little bit: which model that we’re on now is that going to be closer to? Are there two different sets of prompt engineering approaches that we have going forward at this point? There are a lot of unanswered questions.

Yeah. And it was curious to me, and maybe slightly revealing… I don’t know, it’s hard to read into everything that’s going on behind the OpenAI curtain. But the fact that the sort of big advance here was apparently some sort of RLHF preference tuning around this sort of chain of thought generation versus kind of a different… You could think of any number of things that could change.

[00:30:10.16] We saw this in the past - we saw a wave of mixture-of-experts models, which was a big change in the model. We saw, of course, changes in model size. We saw changes at a certain point in terms of how models were trained and aligned. But here, this is just sort of a different prompt set that is used in this RLHF process… And I think it’s intriguing and interesting that they’re applying it in a different way in the ChatGPT UI.

You might want to define the acronym as you’re going there…

Yeah. So this is all inference with my own chain of reasoning, but when they say “Oh, we’re ‘pausing’ the model in the generation. We’re allowing it to think.” I think that’s somewhat confusing, because obviously the model is not…

There’s a little marketing thrown in there.

Again, this is my own personal opinion, but the model is not thinking. It’s just generating text. The difference is that it’s generating text that is more explanatory, or exploratory - representing a series of decisions that lead toward accomplishing the goal or task that you prompted it with.

So when it’s pausing like that, what I’m assuming is that they have a pre-trained model. Maybe it’s GPT-4-something internally; I don’t know what they call it internally - the parent model. And then they’ve curated a set of prompts for the training set that complete with the full chain of thought, whether they synthesized that data or used humans to create it, or probably some combination of that… And then they fine-tune or preference-tune or align the pre-trained model to that prompt set they’ve curated, which includes all of this chain of thought stuff, using this process… I mean, they refer to RLHF, reinforcement learning from human feedback - this sort of reinforcement learning-driven loop to align the model. And then you get this sort of o1 model. That is the process I’m assuming happened on the backend. And when you’re in ChatGPT, it’s not like there’s some button, and they have it thinking, thinking, thinking.

They say that, yeah.

Really it’s just - I’m assuming it’s generating text. And then at a certain point - similar to people asking “These models just generate text. How do they know when to stop?” Well, they don’t know. They generate a special token, which is an end-of-sequence token. And then the computer program stops it from generating more text after that token is generated.

In the same way here, I’m assuming that after a certain amount of text is generated in the chain of thought, it generates a special token - that’s what I was referring to before - a sort of “now generate the answer” token. And then that’s how they control the UI.

[00:33:30.25] So all of that is inference on my part. Again, I could totally be wrong. But that really is a similar process to what we’ve been doing now for a couple of years with these models. There’s not anything fundamentally different about that process… And so I’ve found it interesting that the big reveal here was sort of a different model created with RLHF, which is basically what everybody else is doing. Maybe they’re using this interesting methodology in the UI… So I don’t know what that reveals. It could mean this is a cool thing to hold us over until GPT-5, which will be this fundamentally world-changing, different process, methodology, architecture model that’s going to be crazy. Or it could just mean there are very much diminishing returns here in terms of the methodologies that are available to improve this wave of models.
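The stopping mechanism hypothesized above can be sketched as a toy decoding loop. The end-of-sequence convention is real in GPT-style models, but the answer-delimiter token here is entirely hypothetical, since OpenAI hasn’t documented how o1 separates its reasoning from the final answer:

```python
# Toy sketch of the stopping mechanism described above. The model only ever
# emits tokens; the surrounding program decides what they mean.
EOS_TOKEN = "<|endoftext|>"   # real convention in GPT-style models
ANSWER_TOKEN = "<|answer|>"   # hypothetical delimiter, for illustration only

def generate(model_step, prompt, max_tokens=512):
    """Sample tokens until the model emits the end-of-sequence token."""
    context = [prompt]   # everything generated so far, fed back to the model
    answer = []          # tokens emitted after the hypothetical answer marker
    in_answer = False
    for _ in range(max_tokens):
        nxt = model_step(context)  # model_step: context so far -> next token
        if nxt == EOS_TOKEN:       # the program, not the model, halts here
            break
        context.append(nxt)
        if nxt == ANSWER_TOKEN:    # switch from hidden reasoning to answer
            in_answer = True
        elif in_answer:
            answer.append(nxt)
    return "".join(answer)         # only the final answer reaches the UI
```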

I think that that’s as good of an educated guess, not privy to their internals, as I’ve heard. So I think if you haven’t nailed it completely, you’ve probably nailed parts of it.

With this model being out there in the consumer market - on my iPhone, I use it a fair amount - I find it interesting that they’re trying to create a user experience that’s a little bit mystic. As you pointed out, it says “Thinking” while the delay is going on, and there’s an implication there, especially if you’re not like us, in the industry, talking about AI every day… If you’re somebody out there who’s just your typical average person consuming the technology, there’s an implication there. And especially when they use words like “advanced logical reasoning”, I think there’s a little bit of marketing hocus pocus being applied to a fairly mundane set of processes, as you pointed out, using technologies that have been around. They may be constructing how their models interact, as you pointed out, in a way that’s unique on their side, but it’s probably not revolutionary. It’s an evolutionary decision that they’ve made to try to make their model more accurate.

I find that a little bit questionable in terms of how they’re marketing it to the general public. A little bit worrisome. Some of the voices out there have been talking about this release and expressing concern again… The lack of visibility into what’s actually happening makes it really hard to verify how they’re approaching it.

Well, Chris, just to give people a sense of this o1 thing - and then I think we can get maybe a last impression from your end, and then maybe a summary… I was trying one out - so I think the idea, or at least part of the idea, with the way that they’re providing access to this model and promoting its usage, is for more kind of researchy, deep reasoning types of things. So the simple prompt that I gave was “Determine a new problem in physics related to density functional theory for my PhD focus. Write a concise summary for me.” The only reason I did that prompt is because that was the subject of my PhD research. So I’m like –

You know something about it there.

“Hey, this would have maybe been nice back in the day.” And so the UI experience, for those that have not tried it yet and are just listening - it kind of paused, sat thinking for about 10 seconds, and then it gave the executive summary that I asked for, but there’s a little drop-down that I can click… And it thought for 10 seconds, and the steps that it says it talked through are formulating a research problem, breaking down DFT applications, considering quantum embeddings, changing gears, and pioneering new DFT modeling. And it gives a little summary of those, which seems to be more generated text, and that sort of thing.

So the problem statement is relevant. The text is relevant to some of the things that I would know about in that case, but also pretty much what I’m aware of from what people have already been exploring in the past. It’s not like all of a sudden this model shocked me with a “Wow, that would be a really interesting and profitable area of research” in this topic that I’m aware of. But it was generally in an area that’s interesting. So I guess, kind of good-ish, but not mind-blowing.

And so yeah, that kind of brings up the part of what I’m wondering here around “Where is the proper place for this in my day-to-day workflow?” Maybe we just haven’t figured that out yet, because I can definitely see a lot of places where I’d be like “Well, I don’t want to pay up for this”, because it is going to be way more expensive than GPT-4o or GPT-4o mini, right? I could see a lot of places where I could apply GPT-4, GPT-4o mini, or any number of LLMs that would operate in a similar way. Where am I going to apply this in the software that I’m building? I don’t know. I haven’t really pinpointed that yet.

And I agree with you. I want to go back to something you mentioned, though, as you were talking your way through the example… And that’s that it gives you these kind of intermediary steps that it says it’s following along the way. And from my standpoint, I think that’s, once again, the marketing side trying to reinforce that reasoning message that they’re driving.

You know, as we talk about what’s an appropriate set of tasks to use with each of these models, and especially with this o1-preview that’s out right now, it warrants noting that there are a few drawbacks to the model, which we have not called out yet. One of them is that it still has a knowledge cutoff date, which is tied back to October of 2023. So its knowledge is almost a year old. And unlike some of the other models before it, which had internet access to go out and kind of make up for that and get more current information to add to what the model was trained on up until the cutoff date, this one does not have the ability to browse the internet. So that alone may change how you use it. Going back to what I was suggesting - if I’m coding, whether it’s Python code or Rust code or Go code, not a huge amount has changed in the coding world in that time in terms of what’s available library-wise, unless it’s just the latest, greatest thing to come out. And so I can probably use it really beneficially in a coding context there.

[00:42:11.24] But as we record this, for instance, on this model, just to talk about current events, a potential second - it sounds like a second assassination attempt on Donald Trump occurred today in the news. And if I wanted to ask the model to get information about that, this would not be the model. And I’m just pulling that one out simply because it’s a big news event that happened on the day that we’re recording.

And so in that case, if I’m curious to explore the story or whatever, using a set of models, I might have to go to some of the other models that have the internet access, and might be able to frame things that I’d be curious about for my consumption and my knowledge, whereas that current event would not be applicable here.

And then finally, I wanted to point out that we’re kind of used to being able to upload external documents into the GPT-4o series, from a RAG perspective - retrieval-augmented generation; that’s not available in the o1-preview. So that’s yet another limitation.

And when you add up the cutoff date, the lack of file upload, and the lack of internet access, those are some fairly substantial limitations on this model at the current time. So there truly are certain sets of use cases where you might go to different models. Current events - it’s not going to be this one.

Yeah. To kind of wrap up this section here, Chris, on o1, bringing it back to scikit-learn and the Probabl funding… I’ve found it interesting that Probabl on their website has a bit of a manifesto as far as their values… And maybe I can share these and leave it to you and the audience to see how they compare between this sort of scikit-learn ecosystem - open data science, own your own data science - and the OpenAI world of AI… So in their values, they talk about supporting the whole long-term ecosystem rather than individual stakeholder gain, openness rather than proprietary lock-in… I think that’s definitely an interesting one, especially - you know, I was talking to someone the other day even about this idea of model lock-in in the AI world, in this proprietary sense.

The other ones are interoperability rather than fragmentation, cross-platform rather than platform-specific, collaboration rather than competition, accessibility rather than elitism, and transparency rather than stealth. So if you’re interested in any of those types of values, definitely check out the data science community around scikit-learn and other projects, and give a shout out to what they’re doing over there. So I think it’s an important balance and cool effort to highlight in light of all the other crazy things happening in our AI world.

I think those are fantastic values just in general, and they’re very consistent with other open source - and kind of commercial support of open source - that we’ve seen from other companies that we’ve liked, whether they’re in the AI space or the software development space. And I find myself certainly feeling very comfortable and gravitating toward that. But I would actually conclude by saying - in my own experience working at some large companies, large corporations, and having a lot of friends and colleagues that I talk to at other big companies: they may use proprietary models to some degree, but nobody’s betting their business - at least in the conversations I have - on an entirely opaque, proprietary approach.

[00:46:13.13] So you see it there in use cases, but not for the big things. For the big things, they’re looking at open source models that they can rely on, and such. And I just wanted to call that out, because that’s been really notable to me over the last year or so - how strong that sentiment is. And I would say more power to it.

Yeah. And we do like sharing some learning resources and experiences here on the show, and in particular as related to scikit-learn and that community. If you’re interested in those things, you can take a look… There are some learning resources around DataCamp. So DataCamp has a “Supervised Learning with scikit-learn” course, which I think you can try out for free. I think Codecademy has a course there as well. There are probably innumerable blog posts around the internet in terms of scikit-learn and what you can do with it. So maybe if you’re coming from the gen AI world and have tried a bunch of things with OpenAI, you could also dip your toes a little bit into the data science world. We’re happy to welcome you in and try a few things with scikit-learn. They of course have great documentation and examples on their site as well.

I’ll also mention - if you’re hearing this podcast right after it airs, I think there’s still time for you - Purdue University, which I’m close to here, is working with a bunch of partners, including commercial partners, to run a Data for Good competition, which is close to Chris’s and my heart. They’re supporting the TAPS organization, which provides grief counseling and care for those that have lost family members who served in the military. And so if you’re interested in that, check it out. It’s for graduate students and undergraduate students. And there’s a huge - I think $45,000 in cash prizes, but also a bunch of free training in AI that you’ll get along with the effort. So I really encourage you - if you’re a student, that’s a great thing happening this fall that you could be a part of. So check it out. If you just search for Data for Good Purdue, the link will be there, and I think you can register. If you hear this right after it airs, go check it out and sign up right away.

Fantastic.

Awesome, Chris. Well, good to talk things through, and enjoy the rest of your week until we hear about o2 or GPT-5. We’ll see you next time.

There’s always another one coming. See you next time.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
