Joe Doliner (JD) joined the show to talk about productionizing ML/AI with Pachyderm, an open source data science platform built on Kubernetes (k8s). We talked through the origins of Pachyderm, challenges associated with creating infrastructure for machine learning, and data and model versioning/provenance. He also walked us through a process for going from a Jupyter notebook to a production data pipeline.
Welcome to Practical AI. Hey, Chris. How’s it going, man?
Pretty good. How are you doing, Daniel?
Doing really good. I’m really happy today with the conversation that we’re gonna have, because we’re gonna be talking to my old colleague and still great friend, Joe Doliner, or as I call him, J.D. Welcome, Joe!
Hey, Dan. It’s great to be here. Hey, Chris. It’s great to meet you on your show.
Great to meet you, too.
Yeah, thank you so much for joining us.
Thank you for having me.
Why don’t you give us a little bit of background about what you’re currently involved with and how you got there?
Yeah, absolutely. As you said, I’m Joe Doliner; everyone calls me J.D. I am the CEO and founder of Pachyderm, which is a company that builds the data science tools that we’ll be talking about today. Before that, I’ve worked at a number of startups. Probably the most relevant one to this conversation is that I also worked at Airbnb as a data infrastructure engineer, basically just managing the AI and data infrastructure for the company. So I have a lot of experience on the infrastructure side of data science, less as an actual practitioner, and that’s most of what we’re going to be talking about today.
Awesome. That’s a perfect setup. I think that we’ve done a lot of talking about AI, but we really haven’t got into a ton of infrastructure stuff yet, I don’t think… Have we, Chris?
Not really, and I think this is an episode long overdue. And just to note to the listeners - I know you said that you had previously worked with J.D. at Pachyderm… I have not. I’m familiar with Pachyderm as a newbie, so it will be an interesting conversation for me having a couple of experts on here. I’m gonna ask all the stupid questions, okay?
Well, and you know he’s not my inside man. [laughter] Dan might be, but Chris definitely isn’t.
Yeah, yeah. Full disclosure, I might be a little bit biased, but I don’t officially work for Pachyderm anymore, although I am a huge fan; I’m actually using Pachyderm on my current project, so I’m a huge fan and have that bias, but I’m excited to dive into the details and have you learn a little bit more too, Chris.
Yeah, absolutely. Over the time that we’ve known each other, since we’ve first met and you’ve been talking about it, I’ve adopted it; I have a long way to go to catch up to where you guys are in terms of using this tool… But as a beginner, it’s definitely something I’m interested in, so I can’t wait to hear more from J.D.
[03:55] Definitely, yeah. So with that, J.D., why don’t you give us just kind of a high-level overview of what Pachyderm is and the needs that it’s fulfilling, or what it’s trying to do for data scientists and people working in machine learning and AI?
Yeah, absolutely. Pachyderm is basically designed to be everything that you need to do high-level production data infrastructure in a box. What that means is, if you’re used to doing AI workloads in Jupyter Notebooks, on your laptop, or maybe just in Python directly, using something like TensorFlow, Pachyderm is not in any way saying that you should stop doing that. Pachyderm is just giving you a way to take that code and deploy it on the cloud in a distributed fashion, so that you know it’s going to run every single night. Or hook it up with other processing steps, so that you have everything going in a pipeline end-to-end. This is what companies turn to when they need to make that leap from a model that’s on somebody’s laptop to something that’s a core part of their business, that’s going to run every single night.
This all came out of my experiences at Airbnb, where I was basically trying to make a platform that did that for our data scientists. While I was working there, I had a couple novel ideas for what I thought that the world of data infrastructure was missing, and what I wanted to bring to it.
The first really unique thing that we did with Pachyderm is we needed a way to store data, so we have a distributed file system that’s called the Pachyderm File System. If you’re familiar with the Hadoop ecosystem, this is probably something pretty similar to HDFS, or Tachyon, or something like that. What’s different about our file system is that it’s capable of version-controlling large datasets, in addition to storing them. So you can have your training dataset - it can be terabytes of data, and this data is constantly coming in from your users on a website, from satellite imagery, or something like that - and the Pachyderm File System will actually give you discrete commits, like in Git, where you can see “Okay, this is what my training dataset looked like a week ago, this is what it looked like a month ago”, and things like that.
What’s really important for AI is that not only do we keep these different versions, but we actually link them to their outputs using a system that we call Provenance. So at any time when you’ve trained a model in Pachyderm, you can ask the system “What is the provenance for this model?” and it’ll trace you back to all of the different pieces of training data that went into it, and all of the different pieces of code that went into training this model, so that you can basically see where it came from, and then you can reproduce your results. Does that make sense to you guys?
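(To make the provenance idea concrete, here is a toy sketch in Python - not the Pachyderm API, and the repo and commit names are made up - of how an output commit can be linked back to the input commits it was derived from, and how you trace that chain back:)

```python
# Toy model of commit provenance: each output commit records which
# input commits (data and code) it was derived from.
class ProvenanceStore:
    def __init__(self):
        self.provenance = {}  # commit id -> list of parent commit ids

    def commit(self, commit_id, inputs=()):
        """Record a commit and the input commits it was derived from."""
        self.provenance[commit_id] = list(inputs)

    def trace(self, commit_id):
        """Walk back through provenance to every upstream commit."""
        seen, stack = set(), [commit_id]
        while stack:
            for parent in self.provenance.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

store = ProvenanceStore()
store.commit("training-data@a1")  # raw data lands in a repo
store.commit("code@b7")           # the code that trains the model
store.commit("model@c3", inputs=["training-data@a1", "code@b7"])
```

Asking `store.trace("model@c3")` returns both upstream commits - the question “what is the provenance for this model?” answered in miniature.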
It does. I’m gonna dive in, since I’m the newbie on this, and ask…
I’m asking this on behalf of the listeners and partly for myself - first of all, quick question, is it a proprietary system, or is it open source?
This is all open source. We do have an enterprise system that goes on top of it, and I’ll talk to you later about what features are limited to the enterprise system, but nothing that I’ve talked about up until this point is in that. This is all open source, so you can download it yourself.
Okay, and to kind of wrap our heads around it a little bit, you kind of mentioned the File System, and versioning, and this – what sounds like a feature called Provenance, where you can go back and do that… Can you describe for someone who has never heard of Pachyderm what the feature set is, and what a typical use case might be? So that in their own shop where they’re doing data science they can figure out how it fits in with what they’re already doing.
Yeah, absolutely. I think it’s easiest to focus in on a use case here. One that I can talk about very publicly, because it was a public competition, was the Department of Defense was until recently running a competition where they were basically having people write image detection algorithms for satellite imagery that they had. So they had a bunch of satellite images that they had taken, and they wanted people to write models that would detect “This is a hospital right here, this is a school, this is a bus”, things like that. Interesting AI problem… And also an interesting architecture problem for them, because they have people just basically throwing code at them through this web interface, and they need to take that and run it through their pipeline and get results out the other end and give those to the users.
[08:16] The way that they set that up in Pachyderm is first they spun up an instance of it. They deployed it on AWS, and as the backing store they used S3, so ultimately all of this was stored in object storage, which made it very easy for them to manage. Then they loaded all of the satellite images into the Pachyderm File System; you can get stuff in there in a number of ways. You can get it in there directly from object storage, you can push it over HTTP… I’m not sure exactly which one they used.
From there, they now had a system where all of the data was just sitting there, in different versions. They could update it and have a new version, and then any time that a user’s code came in, they just deployed a new pipeline on Pachyderm, and that would then slurp up all of those images and process them in parallel, and out the other end, after some processing, would come just a score report that they could report back to the user. That might include “Your code failed on these five images, so you don’t get a score”, or it might be “Your code succeeded on these five images, and here’s how accurate you were”, and it would get them full reports about “Here’s what you did well on, here’s what you didn’t do well on”, things like that. Does that answer your question, or do you wanna know more about specific features within Pachyderm?
No, that does help a little bit. As a follow-up - you talked about the File System and its ability to version… Are there any other high-level key things that you wanna name, that you really can’t use Pachyderm without considering those features?
In terms of the File System, that really basically covers it. It does all the standard things that you’d expect from a distributed file system, plus the versioning and provenance component, and that’s really the only quirk to it.
Now, on the processing side, things also start to get interesting, and here is where we need to start introducing maybe a few jargony words that I’ll explain. One of the key things that we use in Pachyderm is containers, and I’m sure most listeners at this point have heard of the company Docker, which has been a very successful Silicon Valley company, and they make this thing called a container… Which is basically just a standard way to ship around code.
Think of the problem that you’ve had where you write some script in Python that trains a model, then you send it over to your friend and they’ve got the wrong version of Python, or they’ve got the wrong version of TensorFlow installed, or something like that, and it’s all incompatible. A Docker container is a way to ship code that’s gonna work anywhere, regardless of what the user has got installed on their machine, or regardless of whether they’re in the cluster.
Pachyderm’s processing is all built on Docker containers. What that means is that you as a data scientist, when you wanna productionize your code and take it off of your laptop and into the cluster, all you need to do is package it up into a Docker container, which means that there’s a little bit of a learning curve there to understand the tooling of Docker. But once you’ve got that, you as a data scientist are now completely in control of the environment that your code runs in, and all of the dependencies and everything like that.
Once people grok this, it’s actually very liberating. The reason that I wanted to build this on top of containers was because when I was at Airbnb, we would have these problems all the time, where a data scientist would come to me and they’d written some new piece of processing that they wanted to be in the company’s pipeline… It could be a machine learning model or it could just be something as simple as data cleaning, or something like that… And they would send me the Python script, and then I would realize, “Oh, this isn’t quite compatible with what I’ve got on the cluster.” And we didn’t have Docker containers there; we just had one big monolithic cluster. So if we didn’t have the right versions of Python installed, I actually would have to either redeploy the entire cluster just to run that one user’s code, which was very untenable, or I would have to have them change their code, to use different versions, and things like that.
[11:56] So it was this constant back and forth, where the data scientist couldn’t quite use the tools they wanted, our infrastructure people couldn’t quite maintain a cluster with a consistent set of tools… So I had this a-ha moment when I realized if these guys could just use Docker containers, then this impedance mismatch would totally go away, and we could both do our jobs a lot more easily. Does that make sense, Chris?
Following up on that, I was just gonna say - it’s kind of like, whether you’re using Python or R or Java, or whatever the different language you’re using is, essentially these containers unify the way that you treat each processing step. Would that be an accurate way to say it?
Absolutely, yeah. It allows us to basically handle the infrastructure the same way, no matter what code it’s written in. We have a lot of companies where one of the things that’s really appealing about Pachyderm is that all of their data scientists just know different languages, and they’re looking for some sane way to have everybody writing code in their own language, and tie it all together into a system that they can understand. Pachyderm allows them to do that.
Now, the key thing about this, of course, is that because we have the provenance tracking, you can still see the fact that “Oh, this data followed through all of these steps and came out the other end, even though one step was Python, one step was Ruby, one step was Java, one step was C++”, and you didn’t have to write any special tooling within those languages to track the data.
Yeah, that’s awesome. I’m gonna pose a problem, and I wanna see if you would go about things the same way as I would, J.D. Let’s say that we have a Jupyter Notebook - and I like how you brought that up before, because that’s where a lot of data scientists start out… So let’s say that Chris and I have been working on this Jupyter Notebook that has some pre-processing for images, and then we train a particular model, let’s say in TensorFlow, and then we output results, and then maybe do some post-processing. And to test it out, we’ve just kind of downloaded a sample dataset of images locally, and then we’ve kind of proven that “Yeah, this is a good way that we think we should do this, in this Jupyter Notebook.”
In order for us to get that scenario off of our laptops and into Pachyderm, what would be the steps that we should do, both on the data and the processing side?
That is a great question, and I think it will be a really illustrative answer. I’m gonna try to answer this with – rather than jumping straight to the “So here’s the end state of this, where you’re using all of the Pachyderm features”, I’m sort of gonna build it up piece by piece, which is how we recommend data scientists do it.
So the first kind of problem that you need to solve when you wanna put a Jupyter Notebook into Pachyderm is the fact that Jupyter Notebooks are meant to be interactive; they’re meant to have a user opening up the browser and actually clicking the Run button, and stuff like that… So the first thing that you can do is you can actually run Jupyter inside of a Pachyderm service and you can just run Jupyter Notebooks all by themselves, but they can’t just turn into a pipeline that runs without any human intervention, because Jupyter isn’t designed that way.
Like in an automated and triggered sort of way…
Right, right. So the first step to do is just to extract the code from Jupyter. I’m pretty sure Jupyter makes it very easy to export a Python script at this point. So you would do that, and then you would put that in a Python container with whatever dependencies you need, and to start, I wouldn’t even tease apart these different steps - the pre-processing, the model training and the post-processing. You could just do all of those in one container, and you wouldn’t even necessarily need to parallelize the data, because if it was running on your laptop, it could probably run on a beefy EC2 node as well.
[15:45] That process would take you – if you had Pachyderm set up to begin with, you could probably do that in 20 minutes. Then you would have gone from a system that you can run manually on your laptop and edit, to a system that now runs every single time a new image comes into the repository, or you change the code, or something like that. And also, of course, now it’s deployed on the cloud, so you can easily throw a GPU in there if you want, you can easily throw more memory at it, and stuff like that. So now you have the first step of a productionized pipeline.
Now, the next step is figuring out which of these steps does it make sense to tease apart, so that maybe their outputs can be used by other steps? In the future, you might wanna do the same pre-processing and then train multiple different models, and then do the same post-processing on them, or something like that.
So I would separate out the pre-processing step, the training step and the post-processing step into their own individual pipelines. So now I’ve got a chain of three steps, and each of these is doing something different. Now I get the opportunity to optimize each of these steps individually.
For the most part, the pre-processing steps that I’ve seen can be done completely in parallel. You’re doing things like cleaning up the images, you don’t need to see all of the other images to clean up one image.
Parallel as far as like in the sense of distributed processing, like processing things in isolation…
Exactly, exactly. That’s another of the important things that we get from a container - it’s very easy for us to scale that up. So you can say “I need to process all of these images. Here’s a container that does it, but don’t just spin up one copy of this container; give me a thousand.” So you’re now cranking through a thousand images at the same time, rather than one, so you’ll get done much faster, and you can handle much bigger loads. So I would do that with that step.
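(As an illustration of splitting the work into pipelines with different parallelism, here is a rough sketch of that three-step chain, written as Python dicts in the shape of Pachyderm-style pipeline specs. The repo names, images, and commands are all made up, and the field set is only approximate, so treat this as a sketch to check against the real pipeline spec reference, not a working config:)

```python
# Sketch of a three-step chain as Pachyderm-style pipeline specs.
# Names and images are illustrative; field names approximate the JSON spec.
preprocess = {
    "pipeline": {"name": "preprocess"},
    "input": {"pfs": {"repo": "images", "glob": "/*"}},  # one datum per image
    "transform": {"image": "me/preprocess:1.0", "cmd": ["python3", "clean.py"]},
    "parallelism_spec": {"constant": 1000},  # a thousand workers at once
}
train = {
    "pipeline": {"name": "train"},
    "input": {"pfs": {"repo": "preprocess", "glob": "/"}},  # one datum: all data
    "transform": {"image": "me/train:1.0", "cmd": ["python3", "train.py"]},
    "parallelism_spec": {"constant": 1},  # training sees the whole dataset
}
postprocess = {
    "pipeline": {"name": "postprocess"},
    "input": {"pfs": {"repo": "train", "glob": "/"}},  # reads train's output
    "transform": {"image": "me/post:1.0", "cmd": ["python3", "report.py"]},
}
```

The glob pattern is what expresses the parallelism J.D. describes: `/*` splits the input into one datum per top-level file, so images can be processed in isolation, while `/` treats the whole repo as a single datum for the training step.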
The training step - making training happen in parallel is definitely a much more complicated question than making something like pre-processing happen in parallel. Normally, we would still keep that as a non-parallel thing, because your code needs to see all the data to train on it. If that is not true, if you really want to start parallelizing that, that is when you wanna start looking at things like Kubeflow, which we integrate with, as you know, Dan… Although we’re still working on making that integration better.
Then the last step, the post-processing step - that one could sort of stay as is, unless you were anticipating having a lot of things that you wanted to post-process in parallel. For example, when the DOD did their pipelines, theirs is all designed around the fact of “We have one dataset, but we have thousands of different people submitting models that they want to get tested”, so actually the post-processing step could be pretty expensive, because they were just doing it for so many different entries, and so that was happening in parallel, as well.
From an infrastructure perspective, that’s basically the idea of these pipelines. When you segment these steps off into little pipelines, you then get complete control over the infrastructure on a pipeline-by-pipeline basis. So you get the ability to say “This one needs to run in parallel with 1,000 copies of the container up, and each of those containers needs to have a GPU accessible to it, and this much memory, and stuff like that… And this one over here is not really doing much at all, so it just needs one container and we’ll fit that in somewhere”, and the system sort of automagically figures out how to make all of this work with the resources that it has.
Okay, J.D, that was a great explanation. As a beginner, I have a few questions I’d like to follow up with. First of all, you mentioned Kubeflow, so I take it that Kubernetes is part of the architecture that you’re deploying onto.
Yes. I guess I jumped the gun a little bit on that one, mentioning Kubeflow before Kubernetes, but yes… I think this is when we need to bring in one more jargony word, and this will probably be our last infrastructure jargony word, which is Kubernetes. If you’ve heard of Docker, you’ve probably heard of Kubernetes as well. Actually, at this point, I think if you install Docker, it just has Kubernetes built into it. You should think of Kubernetes as the puppet master for your containers.
A container is a really good way to deploy a single piece of code, like a program. It’s literally just a process inside of a box. To deploy complicated distributed applications, you need to deploy a bunch of programs on different machines, and make sure that they can all talk to each other, and that they have the right resources, and everything like that. That’s the piece that Kubernetes handles.
[20:09] Kubernetes allows you to speak in very high-level terms - a lot of the same terms I was using when talking about Pachyderm - basically being able to say “I want you to make sure that there’s a copy of this container running somewhere. You have 1,000 machines, you have the code to run… Just make sure that this is always up somewhere and I can talk to it consistently when I hit this IP address, or something like that.” Kubernetes will figure all of that out in the background for you, and instead of one copy, it can be 1,000 copies, and they can have specific infrastructure requirements, like GPUs and stuff like that… And Kubernetes just solves all of that and deploys all these containers. That’s how we accomplish that with Pachyderm - we basically just take these Kubernetes semantics and then augment them with knowledge of the data that needs to be processed, and capture how that data gets processed and where it goes.
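(The “keep N copies of this container running” request described here corresponds to a Kubernetes Deployment. A minimal sketch of such a manifest, expressed as a Python dict - the name, image, and GPU resource line are illustrative, not from the show:)

```python
# Minimal Kubernetes Deployment: "keep `replicas` copies of this container
# running somewhere on the cluster, and replace them if a node dies."
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "worker"},
    "spec": {
        "replicas": 1000,  # Kubernetes keeps this many pods alive
        "selector": {"matchLabels": {"app": "worker"}},
        "template": {
            "metadata": {"labels": {"app": "worker"}},
            "spec": {
                "containers": [{
                    "name": "worker",
                    "image": "me/worker:1.0",
                    # Per-container infrastructure requirements, e.g. a GPU:
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]
            },
        },
    },
}
```

Changing `replicas` from 1 to 1,000 is the whole difference between “one copy somewhere” and “a thousand copies”, which is the high-level semantics being described.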
Gotcha. So just to catch up a little bit and make sure I’m on the right track - you have Kubernetes deployed for infrastructure, and you’re deploying Pachyderm on top of that, and you have the file system that it brings, with the versioning, and your capability for provenance tracking, and you’ve talked about the pipelines, and stuff… Just to ensure that I’m on the right track - I assume that the data is in the containers that you’re deploying, specifically?
Yeah, so that’s where it starts to get interesting. The data is in the containers, but it’s kind of ephemerally in the containers, because the containers themselves are kind of ephemeral. Part of the point of a system like Kubernetes and the reason that you give it 1,000 nodes to operate on is that any of those nodes could die at any time. And this is the sort of thing where this is technically always true, even when you’re just running your code on your laptop; your laptop can die at any time, it’s a physical machine… This isn’t such a concern when you have one computer, but when you’re running on 1,000, it’s almost guaranteed to happen once a day, just because you’ve got so many machines there.
So we put the data into your container for you to process, and then when you finish processing it, we write it back out to object storage. Once it’s in object storage, that’s when it’s actually persisted within our architecture… Because anything that’s stored on a disk in a container could disappear at any moment - that’s basically how we operate.
This is also a great opportunity for me to talk to you about what the actual interface that your code gets to the Pachyderm data is. We really wanted to build a system that was going to be language-agnostic. One of the things that really bugged me about the Hadoop ecosystem was that you sort of had to write in Java to really get the most comfortable semantics. You could kind of use Python, but it was always a little bit kludgy. So when the code that you’ve put in the container boots up, because Pachyderm wants it to process some data, you will just find your data sitting on the local file system, under a directory called PFS. These are just totally normal files; you can open them with an ordinary open system call, and you can read from them, and write to them, and stuff like that…
This, we thought, was just the most natural interface that your code could possibly have, and users often have the experience when they’ve just written a Jupyter Notebook to process some stuff on their laptop, normally they’re just getting that data from a local disk, too… So they have the experience when they’re getting onto Pachyderm, they’re like “Okay, I’m gonna need to learn the Pachyderm API, I’m gonna need to import Pachyderm into my Python code, or something like that…” No, you can just use your normal OS system calls to open data and write data out, and that’s the entire system; that’s all you need to do.
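(A short sketch of what that looks like in practice. Pachyderm mounts input repos under a `/pfs/<repo>` directory and collects whatever the code writes to `/pfs/out`; the exact mount paths depend on your pipeline, so the directories are parameters here and the same function runs on a laptop too:)

```python
import os

def process(input_dir="/pfs/images", output_dir="/pfs/out"):
    """Read every input file with ordinary OS calls and write results out.

    Inside a Pachyderm worker the defaults would point at the mounted
    input repo and output directory; locally, pass any two paths.
    """
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:  # a plain open() - no special API
            data = f.read()
        # Stand-in for real pre-processing: just record the byte count.
        with open(os.path.join(output_dir, name + ".size"), "w") as f:
            f.write(str(len(data)))
```

Nothing here imports a Pachyderm client library; the file system itself is the interface, which is exactly the point being made above.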
I have a follow-up there, and maybe there have been some updates that I’m not aware of, but I think one of the common struggles that I’ve seen people ask about is - this is definitely fundamentally different than something like Hadoop or Spark, where you have some concept of data locality here. You’re kind of like putting data into the container and then taking it out, but it actually lives somewhere else. Are there concerns with that, are there tradeoffs? What are the tradeoffs that you’re playing with there, especially as you get into larger datasets, and that sort of thing?
[24:16] Yeah, so there are absolutely trade-offs, because each time that means the data needs to be downloaded from S3 and written to a local disk - which is generally faster than S3, so that doesn’t really incur a penalty - and then it needs to be pushed back into S3. Basically, what you’re trading off here is that this system could be more performant if it was entirely using hard drives, but it would be harder for admins to maintain… Because the thing that people like about object storage is that it’s just really dumb and simple; you’ve just got a bucket sitting there with all of the data in it. There’s no “Which hard drive is this on? Do we have all of the hard drives? Are they linked up to the right things?” and stuff like that.
The reason that we chose this architecture as our initial architecture is that this was a lot of the direction that we saw. We saw people basically making the same trade-off in Hadoop, even though they didn’t have to… So by far, the most common Hadoop cluster that we see today - and this applies to Spark, as well - is basically everything stored in object storage, almost always S3, and then MapReduce on top of that. And a lot of people are just by-passing actual HDFS at this point.
We have been making over the last release - and we’re gonna do a lot more of this in the upcoming 1.9 release - a lot of progress toward using hard drives to cache stuff. So we’re sort of going the other way that Hadoop went, where they were first a hard-drive-only solution, and then they started having S3 as a way to checkpoint stuff out to long-term storage, and then eventually that started becoming the only way that people ran stuff. We’re always gonna have object storage as the long-term place that we checkpoint stuff out to, and then we’re gonna use hard drives on top as a cache; that will also allow us to use boatloads of memory as a cache, too - similar to Tachyon - if people want really, really low-latency stuff.
Cool. Yeah, the times that I’ve interacted with Spark, I always defaulted to that S3 option anyway, because it was hard for me to figure out other things. I don’t know if that’s just my own ignorance, or whatever it is… But I definitely hear you on that front. There’s always trade-offs, right? You don’t get anything for free, but it’s really what you wanna optimize for.
Yeah. It’s always trade-offs, and actually, one of the things that we do a lot of is trying to counsel people to not worry as much about performance on the margins in the early days, because we’ve seen a lot of infrastructure deployments and data science projects that just get really bogged down, and think “Well, there’s gonna be this extra cost of data getting copied from S3, and getting back, and stuff like that”, and we always try to tell people “Worry about these things if it’s truly gonna make it impossible for you to accomplish your goals, if this absolutely needs to be a low latency system because you’re doing algorithmic trading, or something like that.” But in a lot of cases, we feel like people get better results by just focusing on getting something that works, and that’s exactly the trade-off that you were making when you were setting up Spark - yeah, if you really bang your head against the wall, you can figure out how to set up S3 on solid state drives on AWS, and it’s gonna be faster than what you’re doing with S3… But if you consider the amount of time that you spend setting that up as performance time until you actually get your results, you might actually get them much slower. So there’s a huge amount of value in just having infrastructure that you understand top to bottom, and that is simple.
I wanted to ask about that… We’ve talked about a lot of different technologies in these potential use cases, and I know that, getting back to teams and individual skills, there are a lot of teams where the skills vary fairly widely; some people like myself came from software engineering into the AI and machine learning world, and others came straight out of school with Data Science degrees and had not done some of those things… Do you ever find that there is any challenge or intimidation where people come out and they may know their data science, but they may not have even heard of Kubernetes, or may not be familiar with containerization?
I wanted to call that out, because me and you and Daniel are all incredibly familiar with containerization and Kubernetes and such, but not everybody is. How do you speak to that? Do you recommend a data engineer, or an infrastructure engineer get involved? What have you run into in real life?
That’s definitely a challenge for us… We really see the full gamut, and it’s just very interesting. You see some people who bill themselves as like “Look, I’m a data science person. I’ve never really done any serious software engineering. I don’t really keep up on this stuff.” Then you just sort of sit them down and explain like “Alright, here’s what Docker is, here’s how you install it”, and they’re like “Oh, this basically seems to make sense. I can get by here.” And then there’s some people for whom we do education sessions, and just try to teach people the basics of containers, so that they can work with it.
I would say that actually when we really have challenges, it’s less about software engineering expertise and probably more about DevOps expertise, to be honest. A lot of the types of issues that we hit are just like the permissioning on the Kubernetes cluster is wrong, and so when you go to deploy your code, everything works until it starts trying to talk to S3, and then the network just doesn’t work, or something, because the bucket is rejecting it, or something like that.
There’s just a lot of DevOps complication in there. We always try to keep our feet on the ground a little bit on this stuff, because our whole goal with Pachyderm was – when I was at Airbnb, it was like “Well, this data infrastructure is really hard, and my team is 25 people just keeping this darn thing running… So what are all the teams that don’t have a team of 25 people to keep their data infrastructure running doing?” So we wanted to make something where you didn’t need that team, where a data scientist could just do it by themselves… And I think we’re closer, but then when we go into companies and talk to them, they’re like “Well, we’ve got one person working on this full-time, and they’re feeling like they have to do a lot of DevOps to keep the Pachyderm cluster up and running”, I sort of realize, like, okay, we’ve made an improvement here; we haven’t just magically eliminated this. We haven’t gone from “You need 25 DevOps people to keep big infrastructure running” to “You need zero DevOps people to do it.”
We’re trying to make that better in every release, we’re trying to make that as easy as possible, and one of the big steps on that front will be having our own hosted solution, so people don’t have to deploy everything on their cloud just to try it out. The short answer is that’s definitely a challenge; there’s a bit of an infrastructure leap that needs to be made, which can be uncomfortable for a lot of people that I think could ultimately benefit from the feature set of Pachyderm - it’s just that they can’t quite muster the activation energy.
I was wondering about something else – another situation you commonly find is that people have existing infrastructure in place; they might be a Hadoop shop, a Spark shop, or use one of several other technologies, and they might have big databases like Cassandra… What are you trying to replace, and how are you trying to fit in? I know we talked about the data locality issue, but are there any other big considerations you’d point to for why someone should go with Pachyderm versus what they already have in-house?
I would say the things we’re trying to replace are sort of HDFS, and then the computation layers on top of that; MapReduce is a common one, but Hive, and Spark, and stuff like that, we’re also trying to speak to. Those are the main things that we’re trying to replace. We constantly have the challenge with people who have existing data infrastructure and want us to fit into that well. That’s always a bit of a back-and-forth, because some things can work really well in Pachyderm, because you have the flexibility of a container, so you can put whatever you want in there… [32:01] People will have containers that include code so that they can go and talk to HBase somewhere else in the cluster. So then you have a natural shim to put between your existing infrastructure and Pachyderm, which is the container code, which is totally flexible. It doesn’t work beautifully for everything. What you wind up doing with Spark is you wind up having like “Here’s your data, it’s stored in Pachyderm. Now you boot up a job and you wanna talk to Spark, so now I need to push all this data into Spark, or somewhere where it can access it”, or something like that…
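To make the "container code as shim" idea concrete: Pachyderm's contract with user code is just files in, files out - by convention it mounts each input repo under `/pfs/<repo>` inside the container and versions whatever the code writes to `/pfs/out`. A minimal sketch of what such user code might look like (the word-count logic and the `docs` repo name are illustrative, not from the episode; paths are parameterized so the sketch runs outside a cluster too):

```python
import os

def run_pipeline(in_dir="/pfs/docs", out_dir="/pfs/out"):
    """Toy Pachyderm-style transform: count words in every input file.

    Inside a real pipeline, Pachyderm mounts the input repo's data at
    /pfs/<repo> and commits whatever this code writes under /pfs/out.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(in_dir):
        with open(os.path.join(in_dir, name)) as f:
            n_words = len(f.read().split())
        # One output file per input file; Pachyderm versions these results.
        with open(os.path.join(out_dir, name + ".count"), "w") as out:
            out.write(str(n_words))
```

Because the contract is only the filesystem, the same container is free to also open a client to HBase, S3, or anything else reachable from the cluster - which is the flexibility J.D. describes.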
We’re constantly trying to figure out how to make these integrations better, but the users that always excite us the most are the people who basically come in and say “We don’t want to go down the Hadoop route. We know that there is a lot of pain required to get a working Hadoop cluster and to get stuff functional on it, so we wanna try something different and just build on Pachyderm from scratch.”
Long-term for our company, we’re focused on “How can we make things really good for people who just see the Pachyderm vision and commit to it from scratch?” Because if we’re successful in ten years, then those are gonna be the people that have really made the company successful. The integrations will help us along the way to onboard more people, but it’s really gonna depend on that core use case.
The team that I’m working on now - the organization is pretty big, but this project that I’m working on, it’s myself, who has some type of data science background, and then another guy who is somewhat technical, but he’s a linguist… So the probability of us spinning up a working Hadoop infrastructure ourselves is basically zero. I mean, if there’s one thing I could say to listeners - even if you just get to where you can use containers themselves, that’s also a huge benefit for reproducibility in machine learning and AI, which is awesome.
I wanted to follow up – you’ve already mentioned, J.D, that Pachyderm, at least what we’ve talked about up to this point, is free… But you’re a company, and I should give you some congratulations, because you’ve just hit a big accomplishment, isn’t that right?
Yeah, and thank you for the congratulations. We’ve just raised a series A, which means that we have a ton more funding to basically pursue our vision for data science infrastructure. It also means that you can commit to Pachyderm as your infrastructure with a lot more peace of mind now, because you know the company is gonna be around for quite a while to come.
That also sort of leads to – as you said, we are a company, which means we need a way to make money, and for that we have an enterprise product. Let me tell you what’s in that that you won’t find in the open source product. We try to really make it so that our open source product contains everything that’s going to be really useful to individuals and people who just wanna get some data science done, and the enterprise product adds the things you need when you’re running within a gigantic organization where you have all of those organizational concerns.
The types of things that go into that enterprise product are the permissioning system - that’s the ability to say “This data right here is owned by Dan, this data right here is owned by J.D., this data right here is owned by Steve”, things like that, and make sure that nobody is getting data that they don’t have access to. What’s cool, and what we think is a very crucial feature for this type of system, is that it’s informed by our provenance model.
This is a big problem that you’ll run into in big data organizations - it’s very easy to have some data that nobody’s allowed to see, that then gets turned into a model or some sort of an aggregation or something like that that everyone’s allowed to see, that is accidentally leaking the data that went into it. So we have our provenance tracking system inform the permissioning system, so if you don’t have access to the provenance of data, then by default you don’t have access to the data itself, because it might contain that information that you’re not allowed to see.
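The rule J.D. describes - you can't read a piece of data unless you can also read everything in its provenance - can be sketched as a small recursive check over a provenance graph. This is a toy illustration of the idea, not Pachyderm's actual API; the repo names and users are made up:

```python
def can_read(user, repo, acls, provenance):
    """Return True only if `user` may read `repo` AND, recursively,
    every upstream repo that `repo` was derived from.

    acls:       dict mapping repo -> set of users allowed to read it
    provenance: dict mapping repo -> list of repos it was derived from
    """
    if user not in acls.get(repo, set()):
        return False
    # A model or aggregation is readable only if all of its inputs are;
    # otherwise it could leak data the user isn't allowed to see.
    return all(can_read(user, parent, acls, provenance)
               for parent in provenance.get(repo, []))

# Example graph: "model" was trained on a restricted "raw_patients" repo.
acls = {"raw_patients": {"dan"}, "model": {"dan", "steve"}}
provenance = {"model": ["raw_patients"]}
```

Here Steve is on the `model` ACL, but because he can't see `raw_patients` in the model's provenance, `can_read("steve", "model", ...)` is denied by default - exactly the accidental-leak case described above.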
[36:01] Other things that go into the enterprise product are a wizard UI builder for building new pipelines, and visualizing how they’re working, and the ability to track and really optimize your pipelines, see where they’re spending all of their time and squeeze every last little bit of performance out of your hardware.
The other main thing that we sell is basically just support, and our time… The ability to talk to us and have us prioritize features, and stuff like that, which is – you know, every open source project does that.
Yeah, it’s really interesting. I always love to hear different people’s perspectives on their open source models, as well. I was just talking to someone the other day, a friend who is starting a new business, and considering how they should approach open source but yet also be a company and survive, so I think there’s definitely people out there who are interested in that question, so I appreciate you sharing that.
Yeah, absolutely. It’s tricky and it’s very imperfect, because I really think that this is a system that really should exist. There’s a lot of need for a system like this. It basically has to be open source for it to actually fill that need. In my mind I just couldn’t see a proprietary system becoming the standard data infrastructure layer… But it’s very hard to get the funding to work when you’re open source. It’s this huge asset, because people can so easily try your product and you get so much adoption, and stuff like that, but it really anchors people in an unwillingness to pay for software when it’s open source… So you always need to cross that threshold.
One of the things that we’re looking to do in the future, now that we’ve raised more money, is basically build the hosted version of our software, because that totally changes the value proposition, but it also has some sort of psychological effects on people, wherein like nobody would ever pay for Git, but the idea that you’re gonna pay seven bucks a month to have private repos on GitHub or something like that is just totally palatable to people.
I think that’s a fantastic idea. I love the hosted idea. I know that when Daniel first introduced me to Pachyderm a while back, coming from the software engineering world, the fact that it was built on containerization and Kubernetes was a huge plus for me. If I recall correctly, a lot of it is written in Go, which I thought was pretty amazing… As are Docker and Kubernetes.
I guess if you’re just hearing about it and you’ve come away from this episode today and you wanna learn more about it and maybe wanna dive in, get your hands dirty and figure out if it’s right for your organization - how do people get started with that?
We’ve got a bunch of tutorials and quick start guides online. If you wanna just sit down with a guide and start hacking away, then that’s the way to do it. We also have a very active users Slack channel, where all of our engineers and everyone on the team is just always hanging out and ready to answer questions… And you know, those questions range from “I hit this error. What do I do?” - where we just give you a simple response, if it’s simple; hopefully it’s simple - to people asking us “I’m looking at Pachyderm for a new project. Talk to me about the feature set, talk to me about how you think this could be helpful here.” I think that’s really the best way if you want someone to talk to about stuff - just stop by the Slack channel.
Awesome. Thank you so much for taking time to talk with us, J.D. Of course, we’ll put the links to the tutorials and the docs and the Slack channel and all of that in our show notes, so go check those out. It’s been awesome to hear from you, and really excited to hear about the progress with Pachyderm and all the good things you’re doing.
Yeah, thanks so much for having me, man. I love appearing on podcasts.
Alright. Well, I look forward to seeing great things from Pachyderm. Thanks again.
Thanks for coming on the show.
Bye-bye. Thanks, guys.
Our transcripts are open source on GitHub. Improvements are welcome. 💚