In this episode, Peter Wang from Anaconda joins us again to go over their latest “State of Data Science” survey. The updated results include some insights related to data science work during COVID along with other topics including AutoML and model bias. Peter also tells us a bit about the exciting new partnership between Anaconda and Pyston (a fork of the standard CPython interpreter which has been extensively enhanced to improve the execution performance of most Python programs).
Sponsors
SignalWire – Build what’s next in communications with video, voice, and messaging APIs powered by elastic cloud infrastructure. Try it today at signalwire.com and use code SHIPIT for $25 in developer credit.
The Brave Browser – Browse the web up to 8x faster than Chrome and Safari, block ads and trackers by default, and reward your favorite creators with the built-in Basic Attention Token. Download Brave for free and give tipping a try right here on changelog.com.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Transcript
Welcome to another episode of Practical AI. This is Daniel Whitenack, I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
I’m doing very well, Daniel. It is a beautiful August day… At the point we’re recording this we’ve just had a tropical storm come through – it was Fred, and so I’m just relieved that my house guest Fred decided to leave. I am joining this podcast from my [unintelligible 00:03:05.24]
[laughs] Well, speaking of interesting events, it has been one of those years, right? And if you remember last year, almost exactly a year ago, we were talking with Peter Wang from Anaconda about the State of Data Science Report that Anaconda puts out.
Indeed.
And we have Peter with us again. How are you doing, Peter?
Hi! Hi, Daniel. Hi, Chris. I am doing well, good to be here.
Excellent.
And Chris, I’m glad to hear that you are – well, I don’t know if you’re [unintelligible 00:03:33.17] or not. If you are, I hope that there’s no holes in it. If you’re not, I’m glad your [unintelligible 00:03:37.19]
[laughs] I’m above water, things are good.
Me too, me too. We had a really good conversation last time.
Yeah, for sure. How have things been for Anaconda over the interesting year that we’ve had?
It has been a really great year for us. Business is going well, we’re hiring and growing… We’re just firing away on all cylinders, so it’s been a really great time, and I’m very excited about the future.
Maybe just remind the listeners, the State of Data Science Report that you do… Just give us a sense of who contributes to that, what sort of audience is behind the graphs and the statistics that you gather.
[04:15] Yeah, so a brief rundown - it’s something we’ve been doing for many years now, but it’s a survey of our user base, but not limited to our user base. So we put a call-out, obviously, through the channels that we have the most access to, so it’s gonna be a lot of Anaconda users… But anyone’s free to take it. We put the call-out on social media and whatnot.
This year we had almost 4,300 participants from 140 countries. We ran the survey for about a month; it runs on the April to May timeframe. We had a third – not quite a third of our respondents were students…
Oh, that’s awesome.
…about 10% were academics, and the remainder, about 65%, were all practitioners, is I guess what we would call them. But they come from 140 different countries, the vast majority coming from North America, Brazil, and Australia, India, of course Europe… But there’s practitioners all over the world, so it’s really great to see that the data science is a globally-impacting movement.
Yeah. And you mentioned it’s your users, but my guess would be that the vast majority of data scientists that are practicing out there at least have come across or utilized Anaconda at some point… So that’s pretty cool that you’re sort of getting this slice across the whole industry. Has the geographic distribution of participants - has that changed over the years in terms of how international the audience is?
You know, we haven’t done a lot of deep dive into that, but it seems to be pretty consistent. It generally follows population densities, actually, for the most part, from what we can tell. India is always very strong. There’s a huge amount of data science happening there. South America is, I think – from what I can tell, South America is increasing a little bit more than the baseline… But really, it’s global. It’s everywhere. And there’s a lot of users in Africa, and in the Middle East… No, it’s really wonderful to see that the language and the tools have spread globally.
I’m sure there’s some questions that are carry-overs and are sort of always there… How (if at all) did you modify the survey and what you were asking this year in light of the pandemic and all the things that have gone on?
Yeah, we do carry over a lot of the questions… But then we also modify them a little bit. So a couple of years ago we had a lot more of a deep dive into what tools, what platform kind of things you’re using, and what is your orientation towards cloud, and have you heard of Kubernetes, things like that… This year we did ask about Covid, and we asked how it changed people’s budgets and the organization’s relative to data science… But this year we also did a bit more of a deep dive into what are the roadblocks to production, because that seems to be a topic that’s at the forefront of everyone’s minds now that people are learning data science, they’re building useful models… Now they have to get them into production.
And we ask those questions because we’ve heard different stories from different kinds of people in different job functions. So we ask that broadly of the folks. We can talk about that a little bit later, some of the deep-dive details there… It’s interesting; there’s actually some really interesting insights to be gleaned from that… But I think for me, one of the things that has changed year-to-year is we always ask people “What is your job function? What is your job level? Are you entry-level? Are you an individual contributor? Are you a director or VP?” And this year, the vast majority of our respondents are senior or principal manager/director, VP, C-suite… Less than 20% were either entry-level or other, which is very strange to me. Now, only about 2,600 of the respondents answered the current job level question… But that one had me scratching my head a little bit.
So just as you’re going through your thought process, what are you attributing that to, or what’s your theory about it?
[07:54] Well, a quarter of the folks identify themselves as being senior, a quarter are manager, and then 10% director and about 8% principals. So I think what’s happening is there’s a little bit of title inflation in data science roles… And some data scientists, to be retained and to not get picked off by big tech giants, they’re maybe getting some titles and promotions and whatnot, and that might be what’s affecting this. And it can also be that the teams are growing a little bit. You naturally have to bump up the title of the senior person in the team, as you’re hiring more entry-level people behind them. That’s my hypothesis, but I haven’t validated that.
Another thing that was really interesting is that – so we ask about your primary job function; not all the things you do, but really what is your title or what is your primary job function… And over the years, the number of people who identify primarily as data scientists in our respondent pool - that number has been going down lower and lower and lower. And we get more people from all walks of the business answering our polls, and they’re using Anaconda in their jobs, but their titles are cloud engineer, or data engineer, I suppose data scientist, there’s product managers, there’s ML engineers… There’s many other kinds of people. Sys admins… So data scientists this year - the number of people who had data scientist as their primary job function was only 11% in our respondent pool.
I think that’s the maturing of the industry is what it sounds like to me. Year by year, as you’re seeing more diversity in terms of job titles and different levels of people, and not just everyone’s a data scientist… Would you agree with that? Would you agree that data science is finally getting its fingers into every aspect of the global economy?
Yeah, I do agree with that. I’ve held the opinion for a long time that this can’t just be technology in its own little ivory tower. For data science and the next generation of predictive analytics to be impactful, it has to spread across the organization. Everyone has to gain literacy. That’s something we’ve talked about in the previous podcast. And I think this is a positive signal that that is happening. Business analysts, VP of XYZ, product managers, sys admins, cloud ops, DevOps people - all these people are learning some of these technologies, and it’s a really good thing.
Maybe it’s also related to the fact that over the years the data science tooling that we data scientists love and have been willing to put in the time to learn - in some ways, that is becoming better documented, easier to manage, sort of version- and dependency-wise… The tooling and the ecosystem is just a little bit easier to onboard into. I don’t know if – Anaconda of course is part of that ecosystem, but there’s a lot of teams that are really working on having better documentation, having better software engineering practices around the things they’re doing… So maybe engineers are a little bit less scared.
I remember my first data science position, DevOps Doug… If you’re listening, shout-out to DevOps Doug… It was just like a nightmare to get my stuff… Like, I would do this great thing in my Jupyter notebook and all of this, and then he’d have to build some Docker image with Pandas and all this stuff… And he hated it, because it took however long to build this image, and then it was super-bloated and huge, and… Maybe there’s just more understanding on the engineering side now, and better tooling… I don’t know, any thoughts there, Peter, in terms of this intersection of the tooling we use and this DevOps world and workflow?
I think I can say quite confidently there are more people using these tools and doing these things across the organization. I don’t know that I wanna go on the record to claim that they’re having a better time of it. [laughter] Or that it’s gotten easier. I think that you’re right, some of the tooling has gotten better… We keep pushing that boulder up the hill. We hope we’re making some progress… And we’re not the only ones; there’s lots of people. There’s the maintainer teams of individual projects themselves, as well as the broader Python community, and the core Python developers, [unintelligible 00:12:00.29] and people like that. But at the same time, the landscape has gotten more complex. There’s more kinds of hardware out. There’s more proprietary offerings with various cloud vendors. There’s more different variants of GPUs that come out every year, every generation being so much better than the previous one… But you can’t get rid of the old ones that you’ve bought last year. You’ve gotta keep using them.
[12:21] And there’s all these different – so the landscape is getting more complex even as some things are getting easier. And I think that trend will continue. So more people will be using this across the organization, there will be more people motivated to try to solve the problem… But at the same time, everyone’s busy, and these problems are at the infrastructure level, kind of below the radar or below the water line of what DevOps Doug or [unintelligible 00:12:47.08] are able to see… I’m trying to get the alliteration going there, you know…
That was good, I liked it. Keep going.
So in any case, I think the spread across the organization means more people are probably feeling some of the pain; however, it also means that businesses are taking this seriously enough that despite the pain, they’re still trying to roll forward with it. They’re not abandoning it and saying “This is a bad idea. Oh my God, we’re going back to just using SAS” or “We’re gonna stick with just Excel.” No, everyone has to do this now, and it’s just like “Well, let me jump into the ocean.” I’m sure it’s cold, but they’re all gonna do it, so…
You’ve made a point in there that I wanted to draw out for a second, and that’s the fact that you said this is gonna continue, meaning that the number of capabilities for doing DevOps and deploying and getting the things that we people in the data science world are interested in bringing to the world and bringing to the markets, and yet the world itself is getting much more complicated. We’re no longer always deploying onto some server in our data center, we’re deploying to edge devices and innumerable things… So do you think that’s just an indefinite trend? Because we don’t see the complexity going away any time soon.
Yeah, I think it’s gonna be indefinite. Well, nothing’s indefinite, but I think it’s gonna be for the foreseeable future, at least for the next five years, probably at least ten… If we kind of bump up a couple of levels here, zoom out to the 30,000-foot level above just the details of our survey results, I have maintained for quite some time that we are at one of those generational phase changes… You know, with the introduction of the PC, and personal computing, from mini computers and [unintelligible 00:14:20.25] and mainframes… That was one shift that happened almost 50 years ago at this point; and a lot of technologies we use are still ones from the early days of the PC. Software, hardware - you name it. Our programming models, our architectures, languages, operating systems - all of those things are inherited from the long shadow of the ‘70s. And what we’re seeing now, ubiquitous connectivity with supercomputers on-demand and rentable by the hour, and with now algorithmic capabilities that are far beyond what we’d ever conceived possible before - all those come together to create a new landscape that is completely different than sort of the [unintelligible 00:15:00.14] that’s been sort of a monoculture that’s persisted for about 30 years in enterprise IT… We’re now changing. So it’s at the point where these people, like principal, and senior, and even the C-suite CIOs, they may not even remember what it was like when they were cutting their teeth in the early ‘90s, when there were individual contributors… But we’re now back in one of those modes. You think deploying onto a variety of different serverless and Kubernetes containers is hard… Think about all the different kinds of sensor platforms for industrial automation. Think about when you have to deploy models that then take sensor input, make inferences, tweak models, and then actually have a cybernetic control loop remote from the big iron computer. How do you even unit-test something like that? 
Like, first you’ve gotta get the code running, and then you’ve gotta make sure the code is correct. How do you do those two basic things in that kind of deployment target? But you can’t not, because all your competitors are doing that.
So we’re entering this era of cybernetics, we’re just at the very beginning of it, and it’s gonna be completely different and so much more heterogeneous than the era of just personal computing, which settled up pretty quickly into x86… And it was actually x86 versus Mac, and PowerPC and the Mac… But it was mostly x86 and Windows and DOS on the business computing side. And it’s a sea change. So the changes will continue [unintelligible 00:16:19.22]
Break: [16:25]
So Peter, one of the things you mentioned was this question that you added in this year about how maybe budgets around data science have changed within a business due to changes related to the pandemic, and all of the global things that are going on. What were the results there, and what are some of your thoughts on those?
Yeah, so about a third of the folks said that their businesses decreased the investment. A quarter said the investment stayed the same, and a quarter said that investment increased. And then the remaining 12%-13% said they were not sure.
So the majority of people, it seems like their businesses kept their data science spend at the same level or increased, but a third definitely did say that their businesses decreased investment.
For my organization it was like, as soon as the pandemic hit – I mean, my organization deals with language-related issues all around the world, and all of a sudden that became really difficult and prioritized at the same time, because now we’ve got all of this health information that needs to go all around the world, in all of these different languages, people are more at home, so maybe connecting to them digitally is more important than in other venues… I don’t know how those translated across industries, but I’ve definitely heard a lot of people saying “We’re busier than ever.” It could be because they have less people to do the work, but I’m guessing it’s because some of these issues the data science really addresses well are those issues that are related to some of the things going on in the world. So I don’t know, do you hear those sorts of stories at Anaconda about people using Anaconda to really address some of these issues, and actually solving some of these complicated issues that we’ve been faced with uniquely over this past year?
Yeah, I mean it breaks – the way you’ve phrased the question, one could say “Well, some of the complicated issues are specifically medical in nature.” So in the area of genetic research, and pharma, and sciences, and all that, the Python data stack comes with the Python scientific stack, right? So that stuff gets used all over those places. So there’s sites that track the evolution of the genome of SARS-CoV-2, and that site uses a number of the open source tools in our toolbox… And there’s just so many epidemiological studies, and all these other things… Those are areas that our stuff gets used in, and we see them mentioned, we see references or shout-outs on Twitter…
But on a broader thing, from an industry perspective, I think you make a good point that some businesses saw an opportunity to shift their business model to accelerate certain things that are digital in nature, digital engagement being one of those areas where it’s like, “Yeah, you’re gonna have to do that or not have any engagement, because everyone’s locked down.”
[20:09] So those areas were areas that then by its very nature of being a digital engagement it creates so much data exhaust. So then of course you wanna analyze that, and of course you wanna use that to feed back into improving the product and increasing engagement. So there was a natural kind of baseline (I would say) tailwind for some of that stuff, but we already used it across a lot of industries. So there’s brick and mortar, or more physical domain businesses where - yeah, their businesses were unfortunately negatively impacted, and there wasn’t budget available, so they had to cut some staff, or…
We did drill in and ask, if the organization decreases investment, in what way did it do so? And half the people said “Well, we just lost some budget”, half the people said “Our team didn’t grow.” 40% of people said “Yeah, we actually laid off some people”, and then about a third said that they had various project timelines put on hold indefinitely, or for some extended period of time.
So that’s kind of the way that that came down. But on the exact flipside, the people whose organizations increased investment, that increased budget, they were actively hiring, they had way more projects, additional projects, and they could buy more tools. So maybe there’s not information there; it’s kind of what you’d expect.
What I’m hearing you describe, if I’m right - it’s kind of innovation being driven by the circumstances of this bizarre last year and a half that we’ve had, where people are recognizing that they are in a constrained environment and they can either rise to it or not. I know it wasn’t a specific question you were asking, but do you think there might have been any of that in the response, like the organizations that are seeing big results from data science in a productive way probably are investing, they’re innovating, they’re saying “If we can’t go to the office, we’re gonna find better ways of doing data, we’re gonna change our workflows, we’re gonna change our pipelines and bring value to our customers in a different way”? Do you think there’s any correlation between that kind of innovation-driven mindset and levels of investment and lack thereof? And you can speculate as well.
Yes, I would say speculatively, based on the anecdotes/anecdata that I have, I think there’s some paths that you can see people going down. So if it’s a business that’s just dabbling, or just getting started with using data science techniques, you can sort of see it “Oh, this was an elective, sort of an experiment, and we just don’t have experiment budget this year, so I’m sorry. We decreased our investment there.”
For others, the attitude generally that I saw in business was everyone kind of initially, at least in the Q2 to Q3 timeframe, that summer timeframe - everyone was sort of holding their breath to see what would happen, but no one really thought it was gonna be literally the end of the world. It was clear we were gonna have to get through it, and we’d find new modalities of working, of feeding people, of just being… Whether it’s pod schools, or whether it was camping spaced 20 feet apart, or something… People were finding out new ways to live. So with that mentality, businesses recognized that they’re gonna be data-driven, it’s just which project should they put those data scientists on.
So if you have some data scientists who’ve done some work and they’re familiar with your business and your data structures and your data management, it didn’t make sense to let them go only to onboard new people nine months down the road that didn’t have a clue, right? So I think in this way it was more of a – that would explain the 25% where they were just like on hold. “We’ll keep doing some of these things that we know are critical. Let’s not greenlight any new projects until we see how this thing lands.” That’s kind of the anecdata that I would say speculatively that I saw.
Something you mentioned a little while back was a focus on understanding why and how it’s hard for people to get things into production… Maybe seeing some trends and some discussions… And I’ve even seen some, you know, over the past few months, blog posts and other things talking about “Hey, we’re however long into this data science and AI thing, and it’s so hard to get things into production.”
[24:04] Right.
So I don’t know what you asked specifically in the State of Data Science Survey, but maybe you could share with us some of your thoughts on that front. I mean, we have been doing this for so long… Is it all due to those sort of complicated environments and targets that we’re deploying to, or are there other things at play here?
There are other things. We gave people a list of options they could check one or more of the things, and then when we looked at the data, we faceted it based on people’s roles. So the leading, or most popular answers for folks were – 27% of people said “Meeting IT security standards.” That was the most popular of all the responses. There was no single one that was the biggest among all cohorts, but that one was the most common and it certainly had the highest ranking.
And then right after that, 24% of respondents said “Recoding models from Python and R to another language was a roadblock to production”, and then 23% said “Managing environments and dependencies”, 23% said “Recoding models from another language into Python and R.” So this language recoding thing is interesting… I mean, I caught wind of this stuff eight years ago, when Python wasn’t taken seriously as a production language, and people were like “Well, it’s a scripting language. We’re a serious Java shop. You must recode all of your scikit-learn into Java stuff.” So I was aware of this kind of thing going on.
But for 23% of the respondents to say “No, this is a problem we have at our organization” - that seemed large to me.
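The faceting described above (a check-all-that-apply question, broken out by job role) can be sketched roughly as follows. This is only an illustration: the rows, role names, and option labels below are invented placeholders, not survey data, and `facet_by_role` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical responses, invented purely for illustration (NOT survey data).
# Each respondent has one role and may check one or more roadblocks.
responses = [
    {"role": "data scientist", "roadblocks": ["skills gap", "managing environments"]},
    {"role": "data scientist", "roadblocks": ["skills gap", "IT security standards"]},
    {"role": "cloud engineer", "roadblocks": ["recoding models", "IT security standards"]},
    {"role": "cloud engineer", "roadblocks": ["recoding models"]},
]

def facet_by_role(rows):
    """Return {role: {roadblock: share of that role's respondents who checked it}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for row in rows:
        totals[row["role"]] += 1
        # set() guards against a respondent listing the same option twice
        counts[row["role"]].update(set(row["roadblocks"]))
    return {
        role: {option: n / totals[role] for option, n in counter.items()}
        for role, counter in counts.items()
    }

shares = facet_by_role(responses)
for role, options in shares.items():
    top = max(options, key=options.get)
    print(f"{role}: most-checked roadblock is {top!r} ({options[top]:.0%})")
```

Because respondents can check several options, the shares within a role do not need to add up to 100%, which is consistent with the top answers quoted above each landing in the 23% to 27% range.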
I’d like to ask you a question about that… Like Daniel, I crossed both in the data world and in more of the software development world, and we see Python owns the data side of things… And yet, we see these other languages that have been on the rise for a while, Go and Rust and such, that are out there and you see containers and whole ecosystems being written in them… And I am finding in practice there’s this – I kind of move back and forth between my data mode and my software development mode, and there is that context shifting associated with that, and in some cases performance shifting as well… And having snuck into your report before, we got to this point and looked at your data… I was looking at the uptake on Go and Rust at the very bottom of that. Do you know the graph I’m talking about? With all the languages…
Yes. All the languages, right.
…and I was dismayed by that a little bit. I’m kind of wondering, and I’d love your insight - are those two going to come together over time, or are you seeing that in the longer trend? Do you think they stay the same and I just need to settle into the fact that we have specific purpose languages for specific functions and I need to own that? What would you advise me to do in my thinking going forward?
There’s definitely different families of languages trying to solve different kinds of problems. And every language design decision is a compromise, from what I’ve seen. So as you start making collections of compromises that are coherent in some way, shape or form, you mold a language for a particular set of use cases.
Python, by making so many design trade-offs for readability, ease of getting started and things like that - it was easy to get started, and a lot of people learned it, and it sort of has this executable pseudo-code thing/nature, which people like… So it got that. And then there were another set of design decisions that said “We should make the VM as simple as possible, so we can integrate with C libraries.” C and C++ interop was an important thing. Okay, well that’s a really, really big design decision to stick through for like 25 years. And if you do that, what happens is you end up being like one of the best languages to script or integrate or embed into a C/C++ runtime environment, which includes all of those numerical libraries that people have been developing for forever.
[27:48] So oops, you happen to be a really great scientific computing and numerical language all of a sudden, even though you are not anywhere near – I mean, Python was designed to be a friendlier Bash, and maybe a slightly more readable Perl… And so these collections of design decisions sort of put you into a particular niche, or maybe a very large niche.
So when you look at the design decisions behind Go and Rust, there are very sharp-pointed opinions as to “Rust is about that type safety. Let’s not have any more buffer overflows on streams. Let’s just not have that anymore. Surely we should get there in 2020.” So I think that design decision and optimizing for some of those usability and developer quality of life things - it put you into a particular spot.
Go is different. Go is like “We wanna be multi-thread out the wazoo, super-fast spin-up, and then we’re gonna vendor the world, make everything into a single binary… A really big binary, but a single binary.” So there’s just different design decisions that put you into different places. And for that reason, I think that it is more likely in the future for these things to interop with each other over APIs, or over data sets, or maybe over shared data abstractions like Arrow, or things like that… That’s probably the more likely long-term scenario… Because it’s about separations of concern of who’s writing the code. The person writing the infrastructure code to spin up kernels and containers and manage all these kind of low-level system things - their boundary of concern kind of ends there. Once you have a [unintelligible 00:29:12.23] process running, they don’t really care what you’re running in it. So they’re gonna write their infrastructure stuff in Go; it needs to be tight, fast, it’s gonna be all this great stuff that Go or Rust offers… But once you get up here into numerical, data science, “I don’t know what I’m doing, I’m writing a datascript in a Jupyter Notebook” kind of land, then ease of usability, and then the iteration cycle of trying different ideas - all of that becomes a dominant concern. And it’s a different design space.
There’s one other tiny little thing I wanted to append on, as you mentioned the concern about putting models into other languages for deployment purpose, for production.
Recoding them.
Could you address that a little bit, real quick?
Yeah, recoding them. Literally, taking Python code and saying “Nope, we as a shop are not going to deploy Python into production. You have to rewrite this as C++ or you have to rewrite this in Java or .NET, or maybe Rust or Go.” I have heard of some things being recoded in Go. I think the C++ thing is a lot of TensorFlow happens to go that way, because it has a C++ API, as well as the Python one…
[unintelligible 00:30:11.11]
Yes. Okay, right.
For inference.
Yeah, yeah. Because the inference stuff is more lightweight, so there’s no reason you can’t have a lot of frontends for that kind of thing. So I think that recoding – I don’t think anyone relishes having to do it, hence it’s considered a roadblock. But it’s a thing that people are doing, and something we should be thinking about “How do we make it so people won’t have to do that? What are the issues? Is it that they don’t know, IT does not really know how to deploy Python in a safe way that they can manage?” Some of our products certainly help with that, trying to give people a good, governed vendor of record to give them signed binaries that they can deploy to production… But that’s only one of the hurdles; there may be others as well. Cultural, knowledge gaps, things like that.
Just to follow up on that previous discussion about recoding models into other languages… Do you find speed of execution, efficiency, latency - these are sorts of things that people are quoting. And I wonder that because I often wonder myself “Am I really good at writing fast Python?” I’m not really sure I am. Like, I’m good at writing fast Python in the sense of I can code something up super-quick and get it to execute end-to-end… But the execution might be really slow. So I don’t know, do you see that as a trend in terms of – because I’m thinking, if people are recoding their things into C, C++, maybe they have that on their mind, or something.
That’s certainly one of the concerns, the performance aspect of it. When it comes to the numerical computing stuff in Python, the code, once it gets to the numerical part, it tends to run pretty darn fast. You can maybe improve it a little bit, but that’s not where your bottlenecks are. If you have a lot of pure Python code moving things around and you’re passing a lot of data back and forth, and you’re accidentally taking lots of memory as you move things around, then that’s where you get slowdowns. But the core algorithms themselves are highly optimized [unintelligible 00:32:04.27]
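To make the point about pure Python code versus optimized internals a little more concrete, here is a small sketch under stated assumptions. It uses only the standard library: the built-in `sum` and `map` (implemented in C in CPython) stand in for the role that compiled numerical libraries play in the scientific stack. The function names `dot_loop` and `dot_builtin` are made up for this example.

```python
import operator
import timeit

def dot_loop(xs, ys):
    """Dot product with an explicit Python-level loop and indexing."""
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total

def dot_builtin(xs, ys):
    """Same computation, with the iteration driven by C-implemented built-ins."""
    return sum(map(operator.mul, xs, ys))

# Integer-valued floats keep every partial sum exact, so both versions agree exactly.
xs = [float(i) for i in range(100_000)]
ys = [2.0] * len(xs)
assert dot_loop(xs, ys) == dot_builtin(xs, ys)

# Timings vary by machine and interpreter, so no expected numbers are given here.
loop_time = timeit.timeit(lambda: dot_loop(xs, ys), number=10)
builtin_time = timeit.timeit(lambda: dot_builtin(xs, ys), number=10)
print(f"explicit loop: {loop_time:.3f}s, built-ins: {builtin_time:.3f}s")
```

On a typical CPython build the second version tends to be faster because less work happens in interpreted bytecode, which is the same reason the numerical core of a model is rarely the bottleneck while the glue code around it can be.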
And it’s interesting - you’re a data scientist, and you might be concerned with performance… But when we look deeper at the respondents and we [unintelligible 00:32:14.18] by job, the data scientists are not the ones that predominantly identify recoding models as being a roadblock to production. Among data scientists, that is the next to least popular concern.
[32:29] The biggest concern the data scientists had was a skills gap in their organization, whether it was data engineering, or Docker, or something like that, and then managing environments and dependencies, and then meeting IT security standards is another one. Getting access to computing resources. Those are all the things. Recoding models wasn’t their big blocker.
I’m just curious, which roles were most worried about that?
What’s really interesting is it’s the ops roles. So cloud engineer, cloud security manager, cloud ops, MLOps people… When you look at the histogram of their responses of which things are impediments, all of them look the same. So if you were actually to do a cohort clustering based on the shape of the histogram of their pain points to production, all those four roles would look pretty much identical. And out of those four roles, MLOps, cloud ops, cloud security manager and cloud engineer - for them, skills gap in the organization was the least of their concerns; whereas that was the biggest concern for data scientists, that’s the least concern for them. For them, the biggest roadblock to production was recoding models from Python and R to another language.
So when you say the skills gap bit - is that the perception that the skills I have in my role are in deficit in our organization? …or inversely, the people on the ops side are saying “We have that.”
We were not specific in that. The little one-line response you can check there was “A skills gap in my organization.” And we didn’t ask if it was a skills gap like “They need more of me in the organization”, or if the organization or my data science team needs more of this kind of expertise. We just left it kind of open to interpretation, to say “Talent and skills gap is the biggest impediment.” And my read on this is that these folks who are in the MLOps, ML engineering kind of roles - they kind of know what they need to do; in their organization, in the IT organization usually where they’re housed, they kind of know what they need to do. They’ve got the skills. It’s just a huge pain in the butt to do some of these things that they have to do, chief among them - recoding the models from Python and R into other languages, or vice-versa. Those two are the top concerns.
And what’s really interesting is we also ask people - because it’s a hot hiring market right now, we ask people “What is your job satisfaction? How long do you plan to stay with your current employer?” And the MLOps, the cloud ops, cloud engineer folks - those are the least happy. They are the ones where I believe three quarters are saying that they’re gonna be looking for a new job in 6-12 months.
Interesting.
So maybe the moral lesson there is the more you’re making people recode models from a language to another, the more likely they are to churn. Or it could also be that that’s just a really in-demand role and skillset. But that being said, if they feel like that recoding thing is an impediment and is a frustration for them in their job, and also they’re a very in-demand skillset, you should maybe think of other ways to make them happy, to retain them.
Anyway, that was another really super-interesting find. It was stark; no other role, no other sets of roles – Do you think data scientists are in demand, maybe there’s a higher thing there? No, data scientists - 50% of them are like “Yeah, I’m either here for the foreseeable future, or I might start looking in 2-3 years.” 50%. But when it comes to the MLOps and cloud ops folks, 3% said that they would stay at their current firm for the foreseeable future, another 25% to 30% said that they will start looking in 2-3 years, and the rest of them were all within the next 6-12 months, or “I’m currently looking.”
[35:53] That’s crazy. So you mentioned this sort of contrast between what the data scientists were concerned with and what these cloud/MLOps/data ops people were concerned with… On that spectrum was this element of efficiency, and also recoding models… I know, if I’m not wrong, Anaconda has some sort of recent news in terms of some things related to optimized Python and efficiency… Do you wanna share that with the listeners?
Yeah, it’s very exciting news. Basically, we have hired the Pyston team… And for those of you who don’t know, Pyston is an open source alternative Python interpreter that runs on your unmodified Python code. It can go 20% to 50% faster as an interpreter…
That’s crazy.
Yeah.
…which is cool. Now, it’s [unintelligible 00:36:37.22] so like “What percentage of your code is pure Python code?”, and that’s the percent that we would be squeezing the air out of. If the rest of it is numerical code, then Pyston won’t help very much, because that’s already quite optimized… And if you wanna optimize that further, to fuse loops or things like that, then you would use something like Numba. And in fact, it was really our Numba Compiler project that led us down this path. We had many different kinds of users coming to us saying “Hey, Numba is great. I wanna do Numba for my whole program.” And we’re like, “No, that’s not what it’s for.” It’s there to hit the hot numerical loops, it’s to allow you to write Fortran(ish)-like element-wise stuff without having to go and break out C extensions for NumPy, right? That’s what Numba is good at.
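The "what percentage of your code is pure Python" point can be sketched with an Amdahl's-law style estimate. This is only a back-of-the-envelope model, not something from the Pyston project: the function name is hypothetical, "20% to 50% faster" is read here as a speedup factor of 1.2 to 1.5 on the pure-Python portion, and the 30% pure-Python share is an invented example value.

```python
def overall_speedup(pure_python_fraction, interpreter_speedup):
    """Estimate whole-program speedup when only the pure-Python share
    of the runtime gets faster (Amdahl's law).

    pure_python_fraction: share of runtime spent in pure-Python code (0 to 1).
    interpreter_speedup: speedup factor on that share, e.g. 1.5.
    """
    if not 0.0 <= pure_python_fraction <= 1.0:
        raise ValueError("pure_python_fraction must be between 0 and 1")
    if interpreter_speedup <= 0:
        raise ValueError("interpreter_speedup must be positive")
    # Time left after the change, as a fraction of the original runtime:
    # the numerical part is untouched, the pure-Python part shrinks.
    remaining = (1.0 - pure_python_fraction) + pure_python_fraction / interpreter_speedup
    return 1.0 / remaining


# Hypothetical program: 30% of runtime in pure Python, 70% in numerical libraries.
for factor in (1.2, 1.5):
    print(f"interpreter {factor}x -> program {overall_speedup(0.3, factor):.3f}x")
```

Under these assumptions, a program that spends most of its time in already-optimized numerical code sees only a modest overall gain, which matches the caveat in the conversation.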
But then, as we looked to see “Can we extend some of the ideas in the Numba optimization toolkit - can we extend those into broader program analysis?” and then we realized that it’s almost this different project… And Pyston is essentially a project in that vein… “Can we just make the interpreter itself much faster at a lot of these common things that people do?” And then there’s a 1% or 2% improvement here, 1% or 2% there, you start shaving off 1% or 2% all over the place, and you can start making something that’s quite fast… Again, without making people have to rewrite any of their code.
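The arithmetic behind "shaving off 1% or 2% all over the place" can be sketched as follows. The list of savings is invented for illustration, and the model assumes each saving applies independently to whatever runtime is left, which is a simplification.

```python
def remaining_runtime(fractional_savings):
    """Fraction of the original runtime left after applying each saving
    in turn, assuming the savings compound multiplicatively."""
    remaining = 1.0
    for saving in fractional_savings:
        if not 0.0 <= saving < 1.0:
            raise ValueError("each saving must be in [0, 1)")
        remaining *= 1.0 - saving
    return remaining


# Hypothetical: ten separate optimizations, each trimming 2% of runtime.
savings = [0.02] * 10
left = remaining_runtime(savings)
print(f"runtime left: {left:.3f}, overall speedup: {1.0 / left:.3f}x")
```

With ten 2% savings, roughly 18% of the runtime disappears, which is how many small interpreter improvements can add up to a figure in the 20% range.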
So we’re really excited about that… And of course, it’s an open source project, we’re gonna keep it open source; that’s kind of how we do… And yeah, I’m really excited about the team; they’re really sharp guys, and we’re really excited about what’s to come.
Congratulations.
Thank you.
Yeah, that’s awesome. And for those listeners out there that maybe they’re thinking they might wanna try something with Pyston… Could you describe, what do you have to change about your workflow as a Python developer to start utilizing Pyston? Where does it factor in and how does it change your workflow?
Well, it is an alternative Python interpreter… So instead of typing “python”, you type “pyston”. The website is pyston.org. It’s really, really simple; there’s good docs there that - basically, you just run Pyston on your code and there it is.
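One way to try this out is to time the same unmodified script under both interpreters. The sketch below is a hypothetical file (call it `bench.py`); the workload is made up for illustration, and the exact command name for Pyston depends on how it is installed.

```python
import sys
import timeit


def busy_work(n):
    """Pure-Python workload: string formatting and dict updates in a loop,
    the kind of code an interpreter speedup would affect."""
    counts = {}
    for i in range(n):
        key = f"k{i % 100}"
        counts[key] = counts.get(key, 0) + i
    return counts


if __name__ == "__main__":
    # Best of five runs, to reduce noise from other processes.
    elapsed = min(timeit.repeat(lambda: busy_work(200_000), number=1, repeat=5))
    version = sys.version.split()[0]
    print(f"{version} ({sys.implementation.name}): best of 5 = {elapsed:.3f}s")
```

Running it once as `python bench.py` and once as `pyston bench.py` gives a rough comparison for this particular workload; results on real programs will differ.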
So the goal of the project is to make it as easy as possible to just drop in a replacement interpreter. Now, of course, the elephant in the room is “What about all of those wonderful extension modules that everyone loves to use?” So we are looking into what it takes to make sure that all of that is covered well as well. There is recompilation necessary for some of that stuff, but we’re Anaconda, we’re pretty good at building libraries and compiling them… So yeah, we’re really excited about trying to deliver something very awesome there for people.
[39:08] That’s so cool. Yeah, I’m really excited to follow that and try some things out on my own. So I guess to close this out here as we’re wrapping up the conversation… Predicting the future, you’re always gonna be wrong; that’s my experience. But when you come back and see us next year for the State of the Data Science Report, any predictions for what we might be seeing over the coming year?
Over the coming year… Wow - yeah, lots of interesting things. Lots of interesting things. I think that the information warfare and technological warfare between U.S. and China is going to start having – the first few drops of rain are gonna start hitting our ecosystem from that. I do believe that. I think that – well, depending on the amount of political capital the Biden administration has to spend on these things, I think regulation of tech is going to certainly, of course, have implications in our industry, because so much of what people use these data processing tools for are around user behavior and data analysis of a lot of that kind of stuff. So I think as an industry, we’re going to have to get more political sooner rather than later.
Now, one of the other things that came out of the survey is that a lot of practitioners are concerned about ethics. They are concerned about bias. They’re not naive about this. Their upstream business stakeholders and [unintelligible 00:40:25.14] might be somewhat naive, but the heartening thing is that the practitioners at least, the fingers-on-keyboard folks - those folks are aware that it’s garbage in/garbage out, bias in/bias out. So I think we need to, as an industry, as a community, we should make sure we’re constantly aware of that, and that we’re intentional about our practices… So I think over the next year we’re gonna see incidents and we’re gonna see some of these kinds of things that really force us to have a conversation around data management, privacy, bias, ethics, use of proprietary APIs for prediction and what that means… A lot of these things. Yeah, that’s what I think is gonna happen over the next year.
Well, if any of that does happen, you’ll hear about it here on Practical AI next year, so stay tuned… Peter, it’s always a pleasure to talk to you. I really appreciate you taking time and the work that Anaconda puts into not only this report, but to the Python and data science ecosystem in general. You’re appreciated, and - yeah, I just wanna pass along that thanks, and keep up the good work.
Absolutely.
Thank you guys so much.