Everyone working in data science and AI knows about Anaconda and has probably “conda” installed something. But how did Anaconda get started and what are they working on now? Peter Wang, CEO of Anaconda and creator of PyData and popular packages like Bokeh and DataShader, joins us to discuss that and much more. Peter gives some great insights on the Python AI ecosystem and very practical advice for scaling up your data science operation.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How are you doing, Chris?
I am doing very well. How’s it going today, Daniel?
It’s going pretty good, yeah. I got a chance to get outside a bit over the weekend, even if it was just in my yard, doing mowing and yard work, and that sort of thing… So it was good to get away from the screen a little bit. What about yourself?
I tried to do some yard work myself, and for anyone who listened to us a couple of weeks ago, I had the broken rib, and I discovered it is not as healed as I was hoping it was, so I stopped doing yard work…
That’s not good.
…and took lots of medication, and thus I’m here and everything is fine.
Okay, good. The other thing I did was actually some people asked me about my AI workstation build in our Slack channel, so that’s up and going. I have that pretty much running 24/7, with some type of model training. But I have a colleague who – there’s two GPUs on it, so he’s running some stuff on the other one, and I tried to set up… I got a new router for my – I intended to put in this new router at my house and set my other router in bridge mode, and I had this really nice plan, and VPN access, and all this stuff… And all of that completely failed.
[laughs] Of course, you know?
You know, more network things in the future, but… Yeah.
It never works the first time.
Exactly, exactly. Well, speaking of very practical things, and also setting up environments and all of those things, we’re really excited today because - you know, Chris, we’ve mentioned so many times on the podcast Anaconda, or Conda, in all sorts of contexts, because it’s just a pillar of the data science and AI world. Ever since I’ve known about data science, I’ve known about Anaconda, so I’m really excited that today we have Peter Wang, who is the CEO at Anaconda, joining us today. Welcome, Peter.
Thank you. Thank you for having me on the podcast.
[00:04:01.02] Yeah, great to have you. Before we jump into all things Anaconda, it’d be great if you could just tell us a little bit about your background and how you eventually crossed paths with this data science world and ended up helping found Anaconda, and all of those things.
Yeah, I’ll try to give somewhat of an abbreviated version of the story, but…
Sure, that’d be great. It’s a lot to cover all at once.
It’s a lot to cover. It’s about 20-something years here. But I started – actually, my academic background is in physics, so I was doing…
Oh, same here.
Oh, great. Yeah, quantum information and quantum computing and physics, and when I graduated, I decided to go into the software industry; I joined a startup. I’ve always been coding; I’ve been coding since I was a very young child, and I’ve always loved it. But just given it was the dotcom era, I thought that would be a good time to try my luck at that whole thing. But anyway, long story short, I ended up in Austin, Texas, working at a Python-based scientific software consultancy, and that was kind of the early to mid-2000’s… And it was there that I really – since ‘99 it was when I first started getting into Python, but then over the course of the 2000’s, when I started doing a lot of work in the scientific Python, scientific numerical computing with Python, that was… Basically, I was using Python before NumPy existed; it was still Numeric, or Numarray, and those kinds of things… But basically, over that time I started seeing through more and more of my work and my consulting jobs that Python was being adopted in places that were outside of what you’d consider scientific computing.
We were initially thinking “Oh, this is a cooler alternative to MATLAB.” Then we started seeing it in business environments, started seeing it going to investment banks, and just see it being used everywhere to do financial modeling. I’m like “Okay, well this is very interesting.” So around the turn of the decade, I guess - the last decade, the 2010s - I think many of us in the community were having this realization, but personally as an entrepreneur I started realizing “Hey, with the joint evolution or transformational things of cloud computing, as well as big data, the big data creates a demand to do bigger analysis, if you will, and traditional SQL is not gonna cut it when you have the four V’s of messy, big data - SQL doesn’t cut it.” And then if you look at what hedge funds do, and hedge funds are usually kind of the leading edge of numerical modeling technology, they’re all doing very sophisticated kinds of predictive analytics, and they were choosing Python all the time. So there was that.
And then I realized with cloud computing - it meant that every company on the planet would be able to rent computers to do… Their data would end up in the cloud, and they would rent computers to do massive-scale, supercomputing-scale jobs, that would have been inconceivable before, because you’d have to ask IT, wait five years, and they might build your data center. So with these two combined transformational forces happening, I realized “A-ha! Python is actually a thing we should be pushing for data analysis.” So in 2012, when I founded Anaconda - of course, at the time it was called Continuum Analytics - I also started the PyData community movement, and that branding, if you will. It was kind of a branding exercise. Because all the tools are basically SciPy. Of course, you have Pandas in there and a few others, and statsmodels, things like that, but ultimately, it’s basically SciPy rebranded for a business audience…
But that pushing of Python for data analysis was something that me and a few other people - I’d say we were the pioneers, we were the vanguard, and everyone else was looking at us like we were weird. It was either Hadoop, or maybe it was gonna be R as the successor to SAS, but we came on the field and I was very vocal at that time like “Hey, Python’s the thing. Python’s awesome!” and then people would be looking at me weird. But now, I think we’ve proven that Python is a thing, and it’s a good, useful thing. Of course, it has its warts… But anyway, that’s kind of how I came to founding the company. And we created, actually, Anaconda Distribution as sort of like a thing we had to do. We were interested in creating distributed computing, and interactive visualization, and compilers, and optimizing all this, and that, and the other, but we couldn’t get any of that into people’s hands, because they were still struggling to install SciPy or Matplotlib. So we built a distribution called Anaconda to make that problem - a particularly nasty problem, even to this day - easier. So that’s just kind of continued to be a thing… So yeah.
[00:08:06.01] To take you back just a moment, I’m curious, just for a historical perspective, as a big advocate for Python at the time, as you’re looking at this and you’re kind of saying “Hey, it’s going this way”, I’m just curious, if you can take yourself back, what was it about Python then that was really driving that passion? Why was that passion not with MATLAB or with R or with some of the others? What was it that really grabbed you? I’m just curious how you got motivated on that.
Like I said, I’ve been programming a long time. I know a number of languages, I started with BASIC and Logo, as so many people do, children of the ‘80s… But I learned a lot of languages. In fact, my professional work had been in C++, and I was a huge performance nerd, and all that stuff. When I first clued into Python, it was version 1.5.2, and it was a Slashdot post, and I was like “I’m tired of all these Slashdot posts on Python. Let me finally take a look at what it is, because Perl 6 looks like it’s gonna take a little while to get here, and I might as well play with Python.” But when I started using it, I realized “Oh, my gosh. This is very nice. This is executable pseudo-code, and furthermore, I can use Python to script my low-level C++ graphics engine and be way faster than sitting there, beating my head into C++ templates,” which back at that time were not very well supported by any of the compilers… So I could get more abstraction, I could prototype very fast, and it was just pleasant to work in, and I felt like I could do stuff without slogging through a pile of syntax.
So that ease of use and that friendliness, even for a seasoned professional programmer like myself, was very nice… And what ended up happening over the course of the 2000’s was that you could see people who are not traditional programmers - and this is gonna be a very important thing as we talk about Practical AI and talk about the demographics of the next generation of practitioners - the people who made Python good for science, and data science and all this, they’re not professional software developers. So Jupyter was born as IPython, created by – well, I mean, Fernando, for instance, is an applied physicist… There’s a lot of physicists actually around the ecosystem.
Yeah, it’s a trend.
It’s a trend, it’s a thing. And then you’ve got Jake VanderPlas, another contributor to Scikit-Learn, the creator of Altair and whatnot; he was an astronomer. You’ve got Travis Oliphant, the creator of NumPy and my co-founder at Anaconda - he was an assistant professor when he made NumPy. So the tools that were built in Python for doing data science and analysis and things like that were built by people who could take what was there in the Python ecosystem, modify it so it’s fit for purpose, so it was pleasant to use for what they wanted to use a computing system to do. That’s very different; that sort of product development is inside the head, is coming from within the person making the thing.
For instance, Wes McKinney, the creator of Pandas, developed it when he was at a hedge fund. He’s like “This is the thing I need to bang around my data frames.” So that’s a motif in the Python ecosystem that took it from just kind of this cool, fun, easy-to-learn language that was born in Guido’s vision of computer programming for everyone, and a nicer scripting system than Bash… And really, the scientific Python ecosystem took it to this next level, of like “Okay, this is the numerical quantitative computing system that we all wish we had… And it’s unencumbered by 30 years of legacy crap; it’s on a language that’s very nice to use, easy and approachable… But under the hood, you open it up - this nice, little, approachable little Honda Accord. You open up under the hood and it’s this incredibly modular warp drive unit.”
You can bolt on things like SWIG… Really weird pieces of software that do incredible things; automatically generate wrappers for any of these other languages. So you can bolt on a Fortran, C++, we have a just-in-time compiler we built for it… It’s a really interesting, upgradable piece of kit, to use a British term… So I think that’s what’s given it the sticking power that has now – you know, now we’ve seen that evolve more and more, so now it’s this really cool community… It’s almost like a standard language that’s very modular. Those are some of the key aspects of it.
Sorry, that was a very long-winded answer, but you get me talking about true Python evangelism, and I just can’t stop.
[00:12:15.23] No, I totally relate. I remember it was late in my Ph.D. when I was working on a piece of scientific computing code in Fortran, and I remember that was the first time that I was introduced to Python, because I went to go see some of our collaborators, I spent three weeks with them working on some experiments… I was pair-programming with this guy, and he’s like “Oh, I’m just gonna run a few things”, and he had this Python script around running all of this Fortran stuff, and I remember just being stunned; I’m like “Oh… Why have I been doing this?! This guy has like a super-power of some kind.” [laughter]
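For readers who have not seen Python used as glue in this way, here is a minimal sketch of a driver script that launches an external program and collects its output. It is only an illustration: the "solver" is a stand-in (another Python process launched via `sys.executable`), not a real Fortran executable, and `run_solver` is a hypothetical name.

```python
import subprocess
import sys


def run_solver(n: int) -> float:
    """Run an external 'solver' process and parse the number it prints."""
    # Stand-in for a compiled executable (assumption): we launch another
    # Python process that prints the sum of squares of 0..n-1.
    cmd = [sys.executable, "-c", f"print(sum(i * i for i in range({n})))"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return float(result.stdout.strip())


# Sweep a parameter, the way a driver script might sweep over input files.
results = {n: run_solver(n) for n in (10, 100, 1000)}
for n, value in results.items():
    print(n, value)
```

In a real setup the command list would point at the compiled binary and its input files; the surrounding Python stays the same.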
But I’m wondering, maybe moving from there – you talked about how the numerical and scientific and data science tooling that was developed in Python, that was developed by these groups of (whether it be) scientists, or people in industry doing data analysis, or not really maybe the traditional programmer types… Do you feel like that contributed to some of the maybe struggle or inconsistency around managing environments, and installs, and how people managed all of their stuff? Was it just because people had a bunch of different views on those things, or what’s your perspective on why –
Why is packaging so terrible in Python, right?
Well, so I can answer that question, but first I would just have sort of a meta-critique of the question, which is - any system that does the number of things that Python does, I would assert, has similar kinds of software dependency issues.
Yeah, it can do so many things.
Right. But in the case of Python, think about what we were talking about - gluing to Fortran code. Alright, well, maybe Perl has nicer packaging; where does Perl glue to Fortran code, and where is that being used on a million nodes? It’s not. So I think there’s a thing to recognize - the reason why your duct tape has all this crap stuck to the back of it is because it’s sticky. It’s duct tape.
So Python as a glue language of course gets a bunch of cruft glued into it, and now we have a much harder problem to solve, because we do speak to so many different things.
Okay, now that being said, I think fundamentally one of the reasons why Python packaging is difficult has been that packaging was treated as a second-class concern by the BDFL. Guido clearly admitted very early on he just didn’t think about it that much and it wasn’t a big problem for him. It wasn’t exciting or interesting for him… And he’s not really apologizing for it, he’s just saying this is kind of the way it was. It was someone else’s problem to go and figure out, whether it’s the [unintelligible 00:15:19.00] whatever it might be, it was really not a thing he was super-interested in.
And then when we came along - actually, the very first PyData workshop I put together March of 2012, three months after we started the company, and Guido was working for Google at the time… He came by and stopped by, we were all very excited to see him; people gave him a lot of crap, like “Hey, when are we gonna get a matrix multiplication operator?”, but then we also asked him about packaging. We’re like “Hey, can you help us get the core packaging folks in the core dev to work with us on packaging in the scientific Python ecosystem?” Because it had just been a mess for a very long time. And his answer was “Look, it’s possible that your needs are so exotic that you should just go do your own thing. Don’t worry about it, just solve your own problem.” So we’re like “Okay…” So we did.
[00:16:06.23] You have to understand, by that time the people in the SciPy ecosystem had been fighting various multiple genealogies and multiple generations of Python packaging tools for ten years. It had never really been great. So by that time, we were like “Okay, let’s just solve this once and for all”, and we realized something very fundamental… And this is true for any system. It has nothing to do with Conda, this is true for any system that touches compiled code. And this is now the second part of my answer; the first part of my answer was that Guido didn’t really care, so it was kind of festering. The second part of it is that software development – basically, every single operating system that is a PC-based operating system sucks. So we have inherited the long shadow of 1970’s technical debt. If you’re on Linux - any of you guys use a thing called Docker, maybe? Just a little thing?
I love Docker.
Why did Docker come about?!
Docker - what’s that?
Because you even have RPM. What Linux distro doesn’t have a package manager? And yet, you use Docker, because the concept of having static linkage or dynamic linkage between software libraries, even though we have a tried and true, robust, dynamic linker system - it’s terrible. You go to Macintosh, so maybe you use Homebrew; that’s kind of the preferred package manager on there, and of course, Apple with the App Store will kill all of that third-party stuff… But for now, we have Homebrew. Well, you start building stuff with Homebrew - what if you wanna have multiple different versions of libraries, and you have to build framework builds of them, and those framework builds are incompatible with each other, and so then you’ll need access to the raw GL context. God forbid you need to do any of that stuff.
And have you ever converted them? Have you ever used them to convert from a notebook to a PDF, or something like that?
Did you know - fun thing - that it uses a thing called Pandoc underneath, which has an embedded Haskell compiler? So in order to ship nbconvert on Windows you have to go and build a Haskell runtime on Windows.
What a web of things…
What a web of things, right? [laughter] But this is the kind of nonsense that we end up having to deal with. I call it nonsense, but really, we just realized that it was more than just a C ABI linkage, whatever. And this was back in 2012, pre-Docker… But even Docker then on Windows - not a really great story, right? And ultimately, we don’t need Docker, actually. It turns out that if you build things in the right way, you can have incredibly portable, side-by-side installable native executables that work just fine… And that’s essentially what the Conda system is about. It’s about, rather than wrapping everything up in a hermetic sort of Docker environment, we create a very simple specification of “What packages do you want?” and then we have a recipe system that then has a build system behind it, that is able to build native binaries for every single platform, optimized for every single hardware version that we support… And that’s ultimately – you know, when it works well, it works well. When it falls over, then it can be a little bit harder to untangle exactly what the problem was… And we’re working on that, of course, but that’s ultimately the motivation there, and why it’s terrible. I think we inherited the long shadow of the ‘70s - C linker and loader, I’m looking at you - and then also we inherited some of Guido’s preferences in language development. And I’m not blaming him; I love him and I’m so grateful for what he’s done with the language, but… Anyway. You guys asked, so… That’s the honest answer.
Okay, so I’m gonna start off the next section, as we’re starting to dive into Anaconda itself, with the obvious question you’ve probably been asked a billion times - I get Python, Snake and Anaconda, but I am curious why specifically Anaconda, and if there were any alternative names that might have been fun that you could share with us.
Yeah… So I will give you the origin story - it was at a moment at that PyData workshop when we were looking at trying to promote a Pythonic alternative to Hadoop. At the time there was a distributed MapReduce system built around Erlang that had come out of Nokia, and it was called Disco. So I was like “Wow, this is nice. It’s a nice Pythonic MapReduce, and people really want MapReduce. We should get this to people.” And then I realized, I had this sort of moment of truth, I was like “Wait, we can’t even ship SciPy to people after 10-12 years. How are we gonna ship Erlang runtimes to people?” So I turned around and I looked at Travis; we were sitting in the back of the room and I looked at him and I said “If we’re really gonna do this, a bunch of stuff is gonna have to be rolled in - we need to create a new distribution of Python for big data… So we’ll just call it Anaconda, because it’s a big snake.” It was like – literally, it happened in a flash. There was no great deliberation about this. [laughs]
And when you’re saying – so just for listeners who are maybe a little bit newer to this world… When you’re saying a specific distribution of Python - what does that specifically change about your local environment that you’re using this distribution of? Is it just around using Conda instead of Pip, or is it a whole different Python interpreter? What are the specifics when you say “specific distribution”?
Yeah, so the specifics are that we built the Python interpreter itself, and we built the installer around it, and then all the packages; because when you build packages that have binary or C extension modules or C++, they have to be built with a compatible compiler set as what you built the interpreter with, otherwise you segfault. And then every subsequent package that has C dependencies needs to be built with the same compiler set, otherwise you end up with runtime segfaults, which is no fun for anybody.
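To make the "compatible compiler set" point concrete, here is a small sketch that asks the running interpreter how it was built, using only the standard library. The `build_info` helper is a hypothetical name, and the values will differ from machine to machine; it does not check extension modules, it only shows what they would need to match.

```python
import platform
import sysconfig


def build_info() -> dict:
    """Collect a few facts about how the running interpreter was built."""
    return {
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        # Compiler string recorded when the interpreter was built.
        "compiler": platform.python_compiler(),
        # Platform identifier for this build.
        "platform": sysconfig.get_platform(),
        # C compiler command from the build configuration; this may be
        # None on some platforms (for example, Windows builds).
        "cc": sysconfig.get_config_var("CC"),
    }


info = build_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

Comparing this output between two Python installs on the same machine is one quick way to see that they really are different builds.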
So we basically have created a normalized build system - it’s like a Lego ground/base plate that has equal-spaced studs, and is level, and then you can put everything on it. But you need that first base plate. So that’s what the Anaconda runtime really is. So it’s a Python interpreter that we’ve built… And then you can either get Miniconda, which is just that Python interpreter with the standard library, and the Conda package manager… Or you can get Anaconda, which comes with 250 or 220 packages pre-built as well, so a pre-populated base plate.
[00:23:59.11] But the idea of the Anaconda system is that using Conda you can then install packages into this that fit on that base plate and fit with each other. That being said, there’s nothing stopping you from using Pip. I mean, I use Pip. You can pip-install other modules in. But if they have C dependencies or precompiled binary components, it’s better to install those with Conda, because then you know that those are compatible. And especially as we’re talking about in the context of AI and ML and things like that, obviously accelerated hardware is a deeply important topic… So you wanna get the version of that package that’s built for your piece of hardware.
And sometimes when people go and build binaries and they make Pip wheels available, they have to build those with the lowest common denominator of hardware [unintelligible 00:24:41.11]. So you might be paying $3,000 for a Xeon processor, but only getting basically a $500 or $200 Xeon processor worth of capability, because certain flags are not turned on. And this is the kind of thing that is not important for a huge number of users, but really important for some users… And of course, as time goes on, more and more important for everyone.
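As a rough illustration of the "lowest common denominator" idea, the sketch below compares the instruction-set features a CPU advertises with the features a binary was built to use. Everything here is assumed for the example: the sample text imitates the `flags` line of Linux `/proc/cpuinfo` on x86, the baseline set is invented, and `parse_cpu_flags` and `unused_features` are hypothetical helpers.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def unused_features(cpu_flags: set, build_flags: set) -> set:
    """Features the CPU offers that the binary was not compiled to use."""
    return cpu_flags - build_flags


# Hypothetical sample; on x86 Linux the real text could be read from
# /proc/cpuinfo instead.
sample = "model name : Example CPU\nflags : fpu sse sse2 avx avx2 avx512f\n"
cpu = parse_cpu_flags(sample)

# Hypothetical baseline that a "runs everywhere" binary might target.
baseline = {"fpu", "sse", "sse2"}

print(sorted(unused_features(cpu, baseline)))
```

With the sample data above, the AVX family of features is left on the table, which is the gap being described.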
So that’s what that means - we can install Anaconda into a user land directory, you don’t need admin permissions, and you’re gonna install stuff that’s all self-contained in there; if you don’t like it, you can blow it away with one directory remove command. And it stands separate from your system Python and anything else. People use Conda with Docker all the time. It’s a very common pattern.
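A quick way to see which self-contained install you are actually running is to ask the interpreter where it lives. The sketch below is a heuristic, not an official detection method: it relies on Conda keeping a `conda-meta` directory inside each environment prefix and on `conda activate` setting `CONDA_PREFIX`. The `describe_environment` name is hypothetical.

```python
import os
import sys
from pathlib import Path


def describe_environment() -> dict:
    """Report where the interpreter lives and whether it looks like a Conda env."""
    prefix = Path(sys.prefix)
    return {
        "prefix": str(prefix),
        # Conda keeps its package records in 'conda-meta' inside the prefix.
        "has_conda_meta": (prefix / "conda-meta").is_dir(),
        # Set by 'conda activate'; None when no Conda environment is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # True when running inside a venv/virtualenv.
        "in_venv": sys.prefix != sys.base_prefix,
    }


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Because everything sits under that one prefix directory, removing the directory removes the install, as described above.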
Yeah, that makes sense. So a lot of these packages - you were talking about scikit-learn, or Pandas, or NumPy, or these other things… Generally, people think of this sort of landscape of open source data science tools - so what portions of the Anaconda system are open source? And when you entered into this journey, how did you go about navigating that open source landscape where there’s all sorts of things, that have all sorts of licenses? And building a product around open source is kind of an interesting thing, especially – I think it seems to be a lot of things like that are trending now, but I think when this whole field was getting started, there probably weren’t as many examples of businesses built around open source… So yeah, how did you all think about that, and how did that grow as you were starting out?
Yeah, so Travis and I are both ardent supporters of open source, because I think it leads to open innovation. So for me, that’s almost – open source is almost like a means to an end; it’s not an end unto itself. So if something’s open source, but it leads to single-vendor innovation, that’s still bad… And we have examples of that now, actually, in the burgeoning AI space. And I would encourage people to look beyond merely the astroturf “Oh, well this is open source.” Well, all the contributors come from one company, whether it’s a small one or a big one. It’s like, “Who’s gonna get in there? Are you gonna accept patches? Can we really fork it and you’re not gonna sue us for some thing?”
For me and Travis it’s been watching what the open scientific software ecosystem was able to produce. That collaboration was so generative and so amazing, we wanted to ensure that would endure. So in building a business and trying to build business models around open source, that was part of our entrepreneurial exploration of that. “Can we build a company, and have a good one, that fosters and sustains open source innovation?”
So everything in Anaconda is open source. The recipes are open source, the Conda package manager is open source… That’s always been the case from the very beginning. So we don’t make our money by holding back any of that stuff.
We started with a support and consulting kind of model, because again, Python for data science was kind of a new thing… And plenty of our projects were nascent, whether it was Numba, or whether it was any of these other projects. We had a lot of consulting demands for those kinds of things, so we did that.
As we shifted into a product-oriented mode, what we realized was that enterprise pains, especially enterprise pains addressed by software, they’re not really about proprietary closed source features; they’re about enterprises wanting to have roadmap transparency, having a vendor, a throat to choke when something goes wrong… You know, all these other kinds of things.
[00:28:00.15] Red Hat demonstrated how you can do this in a fairly sustainable way, so our first product that we actually shipped was a package server; you could have an IT guy that could say “You know what - I don’t want GPL packages coming in here, because… Legal. I wanna have all the data scientists internally be able to point to my internal mirror of the Anaconda ecosystem, but I get to blacklist/zero out all of these GPL packages. I get to set which versions are available in various channels, so the prod cluster that I manage will only ever get package updates from the prod channels, and I get to flip the bit on that one.”
But the devs, the data scientists who have the sandbox and they want the latest, bleeding edge version of something or the other, they can knock themselves out. And now they’re not complaining that I’m holding them back from their work.
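The policy described here (block certain licenses, pin versions per channel) can be sketched in a few lines. This is a toy model under stated assumptions, not how the Anaconda package server is implemented: the `Package` record, the `build_channel` function, and the package names and licenses are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    license: str


def build_channel(upstream, blocked_licenses, pins=None):
    """Filter an upstream index into a channel according to a simple policy."""
    pins = pins or {}
    channel = []
    for pkg in upstream:
        # Drop packages whose license string starts with a blocked prefix.
        if any(pkg.license.upper().startswith(b) for b in blocked_licenses):
            continue
        # If a package is pinned, only the approved version gets through.
        if pkg.name in pins and pkg.version != pins[pkg.name]:
            continue
        channel.append(pkg)
    return channel


# Hypothetical upstream index.
upstream = [
    Package("alpha", "1.0", "BSD-3-Clause"),
    Package("alpha", "2.0", "BSD-3-Clause"),
    Package("beta", "0.9", "GPL-3.0"),
    Package("gamma", "3.1", "MIT"),
]

# Sandbox channel: everything except blocked licenses.
dev = build_channel(upstream, blocked_licenses={"GPL"})
# Prod channel: same license policy, plus an approved version of "alpha".
prod = build_channel(upstream, blocked_licenses={"GPL"}, pins={"alpha": "1.0"})

print([f"{p.name}-{p.version}" for p in dev])
print([f"{p.name}-{p.version}" for p in prod])
```

The same upstream index yields two views: the sandbox sees both versions of `alpha`, while prod only ever sees the version someone approved.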
So that package repository server - we still sell that today; it’s a very popular product, that addresses a deep need that enterprises have. So that’s how we think about the product development. We also do have an enterprise machine learning platform, just as like a Domino, or an H2O kind of thing. So we’ve had that that we’ve been selling for a while… But I think that for us, looking at the growth of the packaging demand in the ecosystem, the package server for us is really kind of a no-brainer enterprise offering.
I’d like to actually follow up on that exact thing…
As you’re looking at organizations, companies, businesses out there that are trying to find their way into data science, and Anaconda being one of the major avenues on doing that, what is the value proposition that a CIO at some company should be looking at when they’re thinking about “Do I go Anaconda, or do I go some other route? Do we mix and match…?” Because that is a question that companies are dealing with every day right now… What should that CIO be thinking about when they’re trying to decide whether they wanna go with Anaconda or not?
Yeah, so there’s actually not very much that’s competitive with our package – we call it the commercial license of Anaconda… Because ultimately, what we’re solving is a very unique, but important problem, which is the software supply chain. Now you see a lot of companies – it’s actually shocking to me, because I’ve been involved in open source since ‘95, in the early Linux days, but even today, there are companies that are just starting to understand “Oh yeah, maybe we should figure out how to use open source in a governed way.” They’re starting to have that conversation at the CIO and the IT leader level.
When it comes to ML and AI, the ecosystem moves so fast… It’s a whole Wild West of things out there. Anaconda is basically the only company that is out there as your last outpost between civilization and the Wild West… So if you actually want a build of NumPy or of scikit-learn to go and run on your customer-sensitive PII HIPAA data, and not have it just come from some grad student’s server somewhere under their desk, you have to talk to a vendor; you have to have someone who will actually talk to your legal people, sign on some lines… We’re that. We’re it.
So we actually are compatible with a whole host of other – I talk about how our ML platform is competitive with things like Domino, or maybe a SageMaker, or some of these other things… But our package server and our commercial license for the Anaconda distribution are not competitive with those things. They go hand in hand with those things. In fact, we have a partnership with Red Hat, a partnership with IBM that we’ve just announced earlier this year, where our package server and those commercial license packages - that is going out to the world via those channels… Because again, there’s not much competitive with that.
So what the CIO should be thinking about is “How do I govern the software bits that actually run? This Docker, this three-gigabyte opaque binary that my data scientist intern just handed me - how good do I feel about running that in production? If I want actually some transparency into it, if I want repeatability…” An aerospace manufacturer was talking to me at PyCon a couple years ago, and he said “We have to demonstrate to the FAA that we can run these wing models 50 years. 50 years after the last plane rolls off the line.”
So it can’t be some Dockerfile with a run command, npm install this, or sudo pip update that. You’ve gotta have something that you can point to, and everybody up and down the governance chain feels good about. And right now, that topic of open source governance for ML/AI is not a broadly discussed topic, but of course, you guys as practitioners understand the importance of that… And especially as predictions and predictive models come under more regulatory scrutiny. So yeah, we have a relatively distinguished and unique offering in that regard.
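One small piece of "something you can point to" is a record of exactly which package versions were installed when a model ran. The sketch below builds such a list with the standard library's `importlib.metadata`. It is a partial picture under stated assumptions: it only sees Python distributions in the current environment, not compilers or native libraries, and `freeze_environment` is a hypothetical name.

```python
from importlib import metadata


def freeze_environment() -> list:
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that lack a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


lockfile_lines = freeze_environment()
for line in lockfile_lines:
    print(line)
```

Storing this output next to a model's results gives reviewers a concrete starting point for reproducing the run later.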
[00:32:28.12] Yeah, I’m curious - you mentioned a couple of times… And I guess this is a product of this shift in first the hype around data science, and everybody is doing data science… Now we’re kind of all shifting, we all wanna do AI, and if we do AI, then we get bigger salaries, and that sort of thing…
So how have you seen this shift towards AI and wanting to do AI things, as opposed to maybe just data science? How has that influenced the way that you’re interacting with clients, and the things that they wanna do, and the open source projects that you’re wanting to support within the Anaconda ecosystem? How has that shifted things and made you think about things differently, or the same…? What does that look like?
Well, without getting too snarky, it has made me very cynical about the tech business press…
But maybe I already maxed out my cynicism there already, you know…
Now it’s flipped around, because I used a signed integer, and now I’m in negative cynicism…
And the challenge is this - you know, as someone who kind of understands the technology in this space, I think we’re very close to some things that are really very close to what we could call AI. So there’s a substance there, but the second-order/third-order far-field wave of interest and hype is way bigger than what’s justified, I think… But at the same time, if we – I mean, GPT-3 is jaw-dropping. You look at some of these things, it’s like “Holy cow…!” And there’s deeper questions there. We have to think about – this technology, at an intellectual level, it’s like a nuclear-age revolution. We can’t just put this in everybody’s hands. We’ve gotta be serious about how we use this stuff.
But all that aside, I think that what people maybe sometimes miss in that business-level up-leveling of AI and all this kind of stuff is that it is a ladder of needs, or a Maslow hierarchy. If your business sucks at basic data management, if you can’t even run SQL queries over your stuff, you’re not gonna get to do data science. And if you can’t do data science, your AI is gonna be kind of crap. You’re not gonna do any real AI – you’re gonna spend a lot of money… No one’s gonna stop taking your dollars just because you have bad data structure. Because [unintelligible 00:34:40.07] will take your dollars all day long. They’re gonna produce some clickable BI chart and charge you a couple million dollars for it, and you won’t have gotten the value out of it. But then whoever has ever gotten fired for having a bad IT project? It’s just somebody else’s dollars, right?
So again, like I said, I have a lot of cynicism about how this stuff works… But to answer your question in a more serious tone - in steering the ship at Anaconda and looking at where we make our investments, you know, we support the development of some of the fundamental tools. So we invest in things like Dask, which is the next generation of distributed computing in Python; we support things like Numba, which gives us more performant – just across the board, it makes the low-level libraries very performant on next-gen hardware. So hardware manufacturers like Intel and NVIDIA partner with us to add improvements to the compiler. Pandas - we support fundamental development of things like Pandas.
So I think my thing is this - I’m fixated on empowering the practitioners, and helping them up-level their data literacy across the organization. So my investment is not gonna be at the cutting edge of the hype. The way I wanna steer the community, the way I steer my friends who are movers and shakers in the community is to really think about this - if we are to have this technology be something that’s transformative for humanity as a whole, then it cannot become an ivory tower, where there’s a few acolytes who know how to use a few privileged proprietary systems to go and tell the rest of us what the predictions are. This cannot be how this works. It has to be a democratized transformation of how every business, every person thinks about it.
[00:36:12.00] In fact, an underlying thing at Anaconda is that we wanna make sure that everyone gets data literacy. That’s why we’ll always have this free capability. We don’t charge for a few hundred extra rows on this library, or something like that. It’s always free and unfettered access, because I want every school kid in Bangladesh to be able to model quantitatively in a Jupyter Notebook why some hotshot politician enacted some policy. Everybody, everywhere. Math is empowering for everyone. And this is just computational math. So from that perspective, there’s a deep moral aspect to my mission, and to the mission at Anaconda.
Now, for AI/ML and data science, the transformation I’ve seen in the field is that - yeah, everyone talks about AI, but then when you get all the practitioners together, we all know to put the stuff on the side; we put the MBA speak to the side and then we all just talk about the real stuff. And it’s usually data engineering, it’s usually – well, software bits; who’s setting up the version environment, what version of Pandas are you using, GPUs etc. People are actually now starting to really model the hardware footprint as they’re approaching data jobs, which I think is fantastic. It’s what should have always been done. It’s actually a practice that IT has left behind for 15-20 years in the Java era. Now hardware matters again, vectorization matters again, and it’s a beautiful thing, so… Maybe I’ve sort of lost my point here. I’m just ranting… [laughs]
No, that’s really great.
It’s fine, yeah.
I think some of those things that I see, like you’re talking about, definitely resonate with me. I definitely think– like, when I first got into data science, the hardware wasn’t really… I wasn’t thinking a ton about it, and I was also, like you’re saying, shipping off that three-gigabyte software container to dev ops Doug, and he was figuring out and hating me because it took however long to build, and all of those things… So it’s good to hear you talk about some of those things, and I definitely see how a lot of these main components that you’re supporting are really fundamental.
I was just doing some speech recognition stuff end of last week and over the weekend on NVIDIA NeMo, and they were like “Well, you need to install Numba to speed up some of these data augmentation things for the speech files…”, which is like a main component of the thing; it’s really driving that. And that really influences the actual AI training and the quality of that… So these sort of preprocessing things and all of those are really fundamental.
I don’t know if this factors into your thoughts around packaging and distribution and that sort of thing. I know a lot of people are talking about these model hubs now - TensorFlow Hub, PyTorch Hub, Hugging Face’s model hub… You know, in addition to the code, there’s sort of – like, there’s the data, there’s the code, and then there’s these things that are kind of weird maybe in the software engineering world, that are like these different types of data, which are the serialized models that influence how this code runs… How do you think about that at Anaconda, and has that conversation been going on about packaging and distributing these things? …or are you mostly focused on the code at this point?
We’re still focused on the software supply chain, but of course, I’m very exposed to the kinds of dynamics you’re talking about, because I see these conversations happening in the practitioner ecosystem. There’s a really important dynamic that’s happening here. Without getting too hyperbolic and biblical about it, data science and machine learning - this represents essentially the transformation of the software industry. So for the last probably 40 years, since the dawn of the PC era, but even prior to that a little bit, software developers or software engineers have been able to think of themselves as a distinguished class, like “We do software.” The hardware people, they’re writing Verilog and taping things out, whatever that means, and making chips that get plugged in and powered on, and then we come in and we do our jobs. And then of course, that’s separate from the DBAs, or this other weird class of Oracle-licensed whoevers. They just sit there, speaking a bunch of weird SQL all day long.
[00:40:12.07] So this deconstruction of the information system into hardware, software and data management is deeply unnatural, and it’s actually something that was not the case. If you go back to Norbert Wiener, to the early cybernetics era, no one thought about it that way. You listen to any of the founding fathers and mothers of this space, it wasn’t like “Yeah, I’m gonna focus on hardware.” But the PC era, and then everything that came afterwards, on the server side, on microprocessors, all these things led to this deconstruction of an information system into these three primary axes… Yeah, I guess it’s a decomposition into three axes.
And what we’re seeing again now, with data science and certainly with ML and AI, is that we now have a synthesis again; we’re forced to do a synthesis. We have to understand the runtime. And actually, the runtime characteristics of your software are data-dependent. How weird is that? Imagine talking to the Java head 15 years ago to say “Well, okay, Mr. Java architect, check this out - if I pass in certain values out of this database, your code runs ten times slower.” That doesn’t happen, because you write a CRUD system that pulls a row, does some crap, and pushes a row, and it’s done.
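As a small illustration of data-dependent runtime, here is a minimal sketch that is not from the conversation itself: the same insertion sort does wildly different amounts of work depending on the values it is handed. The function name `insertion_sort_steps` and the input size are hypothetical choices made for this example.

```python
def insertion_sort_steps(values):
    """Sort a copy of `values`; return (sorted_list, number_of_element_shifts)."""
    items = list(values)
    shifts = 0
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        # How often this inner loop runs depends entirely on the input order.
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]
            shifts += 1
            j -= 1
        items[j + 1] = current
    return items, shifts


n = 200  # arbitrary size chosen for the illustration
_, shifts_sorted = insertion_sort_steps(range(n))
_, shifts_reversed = insertion_sort_steps(range(n - 1, -1, -1))
print(shifts_sorted, shifts_reversed)  # 0 19900
```

Same code, same input size, and the work ranges from zero shifts to n(n-1)/2 of them. A row-at-a-time CRUD path rarely shows that kind of spread.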
So this idea that “Okay, runtime performance –” So the hardware footprint is dependent on data… And additionally, correctness is value-dependent. Can you imagine writing a unit test, and you – now we have these for models; we have model tests. But prior to models, it was all just code. Can you imagine unit tests where it’s like “Well, one plus one is two, so my add function works, but it only works for even integers.” That’s weird, right? And yet, we know that when we build these AI systems, these models, their performance, their correctness is actually value-dependent. And this is a point that I don’t hear anyone else making, maybe because I come at it from a physicist perspective, and I think about deconstructing everything to fundamentals… But deconstructing the computation concept into fundamentals - for the last 40 years we’ve had value-independent processing.
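One way to picture value-dependent correctness is a test that reports accuracy per slice of the input space instead of a single pass/fail. The sketch below is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained model (it doubles its input but saturates at 100), and `accuracy_by_slice` is a made-up helper, not any library's API.

```python
from collections import defaultdict


def toy_model(x):
    """Hypothetical stand-in for a trained model: doubles x, saturating at 100."""
    return min(2 * x, 100)


def accuracy_by_slice(model, examples, slice_of):
    """Return {slice_name: fraction of examples the model gets exactly right}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for x, expected in examples:
        name = slice_of(x)
        totals[name] += 1
        hits[name] += int(model(x) == expected)
    return {name: hits[name] / totals[name] for name in totals}


# Ground truth is y = 2x over the inputs 0..100.
examples = [(x, 2 * x) for x in range(0, 101)]
report = accuracy_by_slice(
    toy_model, examples, lambda x: "small" if x <= 50 else "large"
)
print(report)  # {'small': 1.0, 'large': 0.0}
```

A single spot check such as `toy_model(1) == 2` would pass, which is the "works, but only for some integers" situation described above. The per-slice report makes the dependence on values visible.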
Jim Gray has written papers about this, and people have talked about this, but your average coder nerd doesn’t think about this at all. They’re like “I’m a software dev. I’ll learn this thing; I’m gonna learn Go this year, and I’m gonna do something else next year.” But the whole field of software is going away, it’s melding into – we might call it model development, we might call it something else, but I call it value-dependent or value-sensitive computing, and now your management of your upstream data is as important as managing the upstream code.
The previous approaches to data management don’t work anymore. But of course, checking in every row of a database into Git doesn’t work either… So we have to develop an entire new set of practices for this new industry. All of the previous components, those axes are important, but they can no longer be seen as separable. They’re now integrated. Anyway, thank you for coming to my TED talk about that… [laughter] But that’s the lens I look at all this stuff through. So it’s no wonder that we have model hubs emerging, but I think the management of those things and how we talk about versioning of data, the model performance and characterizing it - all that is a nascent and emerging area, and it’d be fascinating to watch how that really goes as it meets production in the real world.
Okay, so a few minutes ago we were touching on deployment, and I know Daniel made his dev ops Doug allusion there… So I actually wanna go back to that for a moment.
Some day I really hope dev ops Doug listens to this.
He’s gonna send you hate mail.
Just so you know, Peter, Doug was my first dev ops engineer that I worked with when I first started out at a startup, and he taught me all sorts of great things. Anyway, go ahead…
Yeah, we’re gonna get our first Practical AI hate mail from every Doug–
We should contact him and have him on the show…
[00:44:12.17] Okay… So here’s my question - I’m thinking about Anaconda, again, in the organizational structure, and thinking about having to put software together, and [unintelligible 00:44:20.00] one of the things that we’re seeing a lot now is using Python in doing the data science, and doing the modeling and stuff, but we’re seeing kind of a move toward deploying in other languages, where you may take a model and do that… And I’m kind of wondering what your thoughts are on how are you thinking – like, in a world where someone may, for performance reasons, maybe once upon a time, or maybe still now, they were deploying in Java or C++, but maybe they’re thinking of Go or Rust just for pure performance issues… How do you see it fitting into that in terms of pipeline? Do you think that’s not necessary? What are your opinions?
Wow… Yeah, that is a complex topic. So depending on which framework you’re using and what you’re doing with that, compiling down is always something that’s going to be a part of the Python ecosystem. Daniel mentioned Numba before… You don’t have to convince me about compile down; there are times when you just need to go lower…
What pains me though is this idea that there are things that – because Numba goes not down to C, it goes to machine code. We’re literally skipping a level; we’re going from Python to machine code. So in other places - I think in cases like TensorFlow, and there’s tools like JAX - there’s a whole bunch of stuff coming out now, where you can go from high-level Python to much lower-level runtime primitives. And I think that’s fine.
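For readers who have not seen it, here is a minimal sketch of what "compiling down" with Numba looks like in practice. `njit` is Numba's real decorator; the function `sum_of_squares` is a hypothetical example, and the `ImportError` fallback is an assumption added here so the snippet still runs (uncompiled) where Numba is not installed.

```python
try:
    # Numba's nopython-mode JIT: compiles the function to machine code via LLVM
    # the first time it is called with a given argument type.
    from numba import njit
except ImportError:
    # Fallback for illustration only: run as ordinary Python without Numba.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in 0..n-1, written as a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))  # 285
```

The source stays ordinary Python. Nothing is rewritten by hand in C; the decorator is the only change.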
I think when people are doing rewriting… So compile down is something different from a translational perspective than rewriting. And I know that, for instance, when people – just the other day someone was complaining to me about the fact that they built some models in Torch, and then they have to basically go to TensorFlow Serving and they have to rewrite everything in TensorFlow. That’s deeply inefficient. It shouldn’t have to be done that way. The problems to solve there are not monumental, I think. I think they’re mostly ecosystem tooling, and some of these kinds of things that we will solve in time… So I hope that Python compiling down is not an issue, and we’ll keep doing that.
Where people do feel a need to rewrite - I’m not sure what all they’re doing in the Python; I think one of the problems with Python’s explosive growth over the last however many years is that there’s simply not been enough instruction about idiomatic – how to think vectorally, how to do idiomatic things in Python, or do things in Python in an idiomatic way that’s faster. I just see all sorts of code in the wild, that’s just like “Ugh…” You don’t have to do it that way, but you don’t have the time to educate everyone… But maybe we should. Because then what happens is you end up in organizations – businesses move at their own cadence, and now you’ve got a data science team that’s all relatively green, they write some code, it’s slow, the IT team/software dev team is like “Well, it’s Python. We know it’s slow. Let us rewrite it in Go, because I really, really like to use Go.” And they do it, and they don’t have to, and here’s the cost of doing that - now your iteration and your cycle time is way slower. Now if something goes wrong in production, you need to get two people on the line, and then “Where did the translation go wrong?”
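To make the point about thinking in vectors concrete, here is a small sketch assuming NumPy is available. Both functions standardize a list of numbers (subtract the mean, divide by the population standard deviation). The function names and sample data are hypothetical; the first version is the element-by-element style often seen in the wild, the second is the idiomatic array form.

```python
import numpy as np


def normalize_loop(values):
    """Element-by-element version, as often written by newcomers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


def normalize_vectorized(values):
    """Same computation expressed on the whole array at once."""
    arr = np.asarray(values, dtype=float)
    # arr.std() defaults to the population standard deviation (ddof=0),
    # matching the loop version above.
    return (arr - arr.mean()) / arr.std()


data = [1.0, 2.0, 3.0, 4.0]
print(np.allclose(normalize_loop(data), normalize_vectorized(data)))  # True
```

The two agree on the result. On large arrays the vectorized form pushes the loop into NumPy's compiled code, which is typically the first thing to try before anyone proposes a rewrite in another language.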
For me, back to that point about “What is the mission?” The mission is to make data science literacy widespread, and to empower everyone to ask questions of their world, and to be able to use all this powerful infrastructure. If to do that they have to go and hire a dev team to rewrite their stuff from Python into Go or Rust or God knows what, then we failed in that mission.
So I don’t have a language bigotry of like “It must be Python everywhere, all the time”, but I would like to make sure that – Bret Victor has this concept of immediate connection. I want the data scientists, when they’re in a Jupyter Notebook or in a dev environment, when they’re doing data exploration, I want them to be able to feel like they can roundtrip, and that’s on their own terms. That’s a really important thing. So I think hopefully in time we’ll make sure that remains a possibility for most cases.
Great explanation, by the way. Thank you.
[00:48:01.02] And I think that kind of leads into something I wanted to ask… You started talking about some of the pain points that still exist between maybe data scientists and engineers, or maybe this sort of gap in data literacy, and these things… I was wondering, as you – and I also know that Anaconda does a State of Data Science survey, and of course, you deal with all sorts of people throughout the industry… I was wondering if you could talk a little bit about maybe certain things, looking back from maybe 2012 till now, that you see as really encouraging things in terms of data science tooling, and where we’ve come, and then maybe a few things that are still really open challenges that we haven’t been able to solve yet.
You mean specifically in the tooling? Is that the question?
Yeah, or just in data science workflows, I guess, that you see.
Yeah, let’s see here; in data science workflows… I think a lot of table-stakes things since 2012 have been resolved; we have much more capable tooling, just in terms of input handling, and a lot of the basic day-to-day quality-of-life stuff for data scientists has improved. People have settled on some sorts of tools as standard, and they’re good. So like using Jupyter Notebooks, which – there’s a mixed feeling about notebooks, and I can go on at length about them. But I think in general–
We had Joel Grus on the show, so…
Joel – oh, yeah, yeah…
We’ve had all perspectives.
All perspectives. I like to talk about three’s, and I think that the notebook rolls up three different things into one, and in doing so, unfortunately confuses the crap out of everyone… Because everyone thinks it’s something else. But basically, I think that at least with notebooks, the idea that people can do somewhat literate programming is way better than if it was a bunch of opaque code and then a PowerPoint.
So for all of the hate and all of the “Oh, line numbers out of sequence. You just gave me a notebook and I don’t know what to do with it” - all of that hate aside, I think notebooks are a net good, because they show people… Ultimately, again, back to that data literacy thing, it gets people excited and interested. And here’s another thing about Python - it’s good that it’s Python that is in these notebooks. Python is very accessible and readable, even if you don’t know how to write it.
I totally agree, yeah. There’s actually - I’ve mentioned it a couple of times on this show - this group called Masakhane, who is promoting new baselines for machine translations for African languages, and they’re involving local communities in that… And they have a whole paper about how they develop the community, and all of that…
Oh, it’s wonderful.
But really, a central piece of that was Jupyter Notebooks, because they wanted to involve local communities in the work, and of course, like you say, it’s not like you’re just gonna go up to a new group of people and say “Hey, clone this GitHub repo, run this Bash script and all this stuff…” So they were able to utilize notebooks, and specifically hosted notebooks, or like Colab, and all of these things to really get people going, and like “Hey, you just open the notebook, there’s notes in there, there’s the explanation, and you can just go.”
And of course, you want people to advance from there and to really dig into things when there’s weird behavior, or something; maybe you learn some new things… But yeah, I was really impressed with their usage of that, and I think that resonates with what you’re saying.
[00:52:00.28] Well, the web has become unwriteable. Let’s just be very clear - I don’t know the last time you guys set up a website from scratch… I mean, you have a website for the podcast, obviously; maybe you had a web dev for it, I don’t know. But to set up a website from scratch - forget putting widgets in there, forget embedding interactive graphics; just a website from scratch - most data scientists won’t be able to do that.
I mean, just configuring NGINX, or setting up Apache, getting an SSL cert in there, doing all this other [unintelligible 00:52:24.23] that’s impossible. And even for a dev like myself, it’s really just annoying. So what the Jupyter Notebook did - believe it or not, a lot of the value is simply making the web writeable. Making a writeable web technology accessible for people who were not even programmers, so were not familiar or comfortable with the shell. That’s the other thing that’s true about data scientists - a lot of them are not comfortable with the terminal at all. A lot of them are on Windows, and now they can build websites on Windows, with interactive widgets, running massive-scale computation. Holy crap. To complain about the Jupyter Notebook at that point is like complaining about the cup holders in the Starship Enterprise, or something. It’s like, stop it. You’re moving at warp 9, just shut up. [laughter]
So I think that’s the thing that devs like myself, who have a dev background - we look at the space of technology and it’s a relatively flat landscape. “Yeah, I can go learn this language. I can do that thing. I can go grab a cert, spin up that AWS credential, no problem.” For the average person, every single one of these things is a cliff that’s insurmountable. And this is kind of to the point of Anaconda, as well - actually, a lot of advanced users do use Miniconda as their deployment technology; like I said, within Dockers, and whatnot. But there’s also others who are like “Well, I can do my own thing. I don’t really wanna use this package manager thing.” But there’s so many people out there learning how to do this stuff - they’re on Windows, they have no idea what a compiler is. They just wanna do their jobs, [unintelligible 00:53:44.18] So that is where accessibility, again, building on the Python motif of being accessible as a language; you want the tooling around this to be accessible…
And this is now - to answer the second part of your question - where we’ve fallen short. I think that as the technical space of data science and machine learning has grown up, put on a suit, got a real job… We’ve got software developers coming in and saying “Hey, I’m gonna retool as an ML engineer; I can learn the stats”, and they can, they’re smart people; they can do all these things.
Part of what’s being lost nowadays that I see in the modern tooling is kind of the taste is lost. The taste of making this accessible for people who are not ops geeks. I think some of that drive has gone away. So when I go and look at the documentation for any kind of ML framework - and this is not to put a ding on them; you’re doing complex orchestration, there’s gonna be some work involved. But just in general, that sense of like “How do we make this dirt simple for that poor atmospheric scientist trying to model hurricanes, trying to understand climate change? How do we make it simple for these guys trying to build better machine translation for African languages?” and they’re out in an African village trying to get the locals involved. How do we make stuff really easy to use for them? That kind of thing is, I think, being crowded out a little bit. That kind of sensibility and taste is gone, I think. And it’s unfortunate.
There’s also willingness to embrace really big, corporate open source, which - again, I salute the companies for open-sourcing their technologies, but I think the open source ecosystem around SciPy and whatnot never really had some of the corporate open source hegemony kind of thing happen to them that happened in the Linux space. So they’ve never seen weaponized open source… And I’m one of the few people trying to go in there and raise the banner and being like “Hey guys, recognize there’s community grassroots, open source community innovation”, and then there’s big companies saying “Here’s our big ol’ toolkit, a million man-hours, and it’s all yours to use, please… And by the way, it runs best on our cloud framework.” That’s – meh, I don’t know how to feel about that exactly.
It’s more than a license, right?
It’s more than the license, it’s about community innovation and open standards. Yes, absolutely.
[00:55:48.15] So as we finish up, I just wanted to get your sense of what does the future of Anaconda look like? What’s in your mind that you’d like to do, that you haven’t gotten to yet? What should your users be looking forward to at this point?
Yeah, great question, thank you. So we’ve been working on a lot of infrastructure technologies for the last several years, trying to help get the commercial adoption of data science and open data science successful… Which is why we sell a package server, and things like that. We don’t want IT raising exceptions, saying “Thou shalt not use Python, because we’re a Java shop.” I wanted to put that argument to rest. So I think we’ve been pretty successful with some of the things we’ve done to help smooth those things out.
Going forward though, I want us to lean more into the practitioner community, help the community and thought leaders and practitioners - diverse voices across culture, across background and whatnot - to surface organically in the community and really drive a conversation about the practice of data science and quantifying our world, modeling our world, predicting our world, but doing it in an open, ethical way, and in an intentional way. I don’t want data science to end up where social media has, where it’s like “Oh, we accidentally destroyed democracy. Crap.” You want data science and predictive analytics to be sort of like “Hey, we know exactly – we know we’re going into it eyes wide open.”
So I wanna create community tools, so tools for ethical practice of data science, as well as then [unintelligible 00:57:14.15] some of the next-generation capabilities that are practitioner-facing. So we look forward to a lot more of that, where we’re trying to engage with the user community a lot more; we’ll be revving our product offerings there, as well as some of the things that we’ll be standing up on the website itself, on Anaconda Cloud… So I’m really excited about those kinds of things. It’s beyond tools at this point, it’s about people. It’s always been about people, but now the emphasis is kind of coming back around to being about the practitioners and the community.
Awesome. Well, thank you so much. I think that’s a great way to end. We’ll of course link to a bunch of Anaconda things in the show notes, so make sure and check those out. Connect with us on Slack and LinkedIn and Twitter, and let us know how much you appreciate Conda over the years and what they’re doing. Thank you so much, Peter, for joining us. It’s been a pleasure.
Thanks, Daniel. Thanks, Chris. This has been a lot of fun. Thanks for listening to my rants. [laughs]
It was fun, thanks.
Our transcripts are open source on GitHub. Improvements are welcome. 💚