Since this is *Practical* AI, we decided it would be good to take some time to discuss various aspects of AI infrastructure. In this Fully Connected episode, we discuss our personal/local infrastructure along with trends in AI, including infra for training, serving, and data management.
DataEngPodcast – A podcast about data engineering and modern data infrastructure.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.
Our locally installed stuff:
- MacBook with external monitor and keyboard
- Brew-installed Python (no Anaconda)
- Jupyter, Docker, Vim, Postman, Go
Where we see AI workflows running:
Experimentation / model development:
- Google Colaboratory
- AWS SageMaker
- Data Science platforms: Domino, DataRobot, Databricks, H2O
Pipelining and automation: Pachyderm, Luigi, Airflow, Kubeflow
Welcome to another Fully Connected episode of Practical AI, where we keep you fully connected with everything that’s happening in the AI community. We’re gonna take some time to discuss some recent topics in AI news, and we’ll dig into a few related learning resources to help you level up your machine learning game.
I’m Daniel Whitenack, data scientist with SIL International, and I’m joined by my co-host, Chris Benson, who is a chief AI strategist with Lockheed Martin RMS APA Innovations. How are you doing, Chris?
Doing great. How’s it going, Daniel?
It’s going really well. I’m sitting in a newly remodeled home office, so I’m pretty happy. We got some final painting done, and set up my monitor, and new desk, and everything… So I’m feeling pretty good. What about you?
I’m relieved to be home… I’ve been traveling for the last couple of weeks – I hit Washington DC and New York, and I was just in Silicon Valley as we record this, for the NVIDIA GPU Technology Conference. I’m back; I recorded a couple of things there… Last week we had a guest from there, and there’s gonna be some more down the road, so… I’m really looking forward to today.
Yeah, me too. I think it’s ideal that I just went through all of my personal setup here in my home office this week, because you had suggested that we talk about a certain topic that I know is really on a lot of people’s minds as they get into this field – as they try to figure out what to focus on as they’re learning things, and how to build a team… Do you wanna intro what we’re gonna be talking about today?
Sure. Today we’re gonna be talking about a fairly broad topic that we’re labeling “AI infrastructure”, which encompasses a whole lot of stuff. The reason that I had suggested it was I have so many conversations with people who are trying to kind of get their own AI operation set up, both at a personal level, just like you and me as data scientists working on stuff, but also at an organizational level, trying to figure out how their company needs to get everything stood up that they need there to do what they’re doing… So we’re gonna talk about a lot of the ideas. It’s a huge topic, so there’s only so much we’ll be able to cover, but hopefully we can kind of dive into some of that stuff today and have fun with it.
[00:04:09.07] Yeah, for sure. I know when I do trainings and other things I always get a lot of questions about “Oh, how should I do my personal setup? What do I need to buy to actually be an AI practitioner?” …on that side of things, so the personal infrastructure side. But then also, there’s so many choices of things out there as far as how you set up your workflow, and all of that.
Just a disclaimer as we go into this conversation – we’ll probably primarily be focusing on a lot of the things that we have personally interacted with, but we would love to hear about the infrastructure, or the frameworks, or the setups that our listeners have, or things that we’re missing, or definitely if we misrepresent anything… Join our Slack community at changelog.com/community, or find us on our LinkedIn page, and let us know what we’re missing or what your personal setup looks like.
We go to conferences, and like I said, as we do trainings and other things like that - I think we’ve seen a lot of what people are doing out there, so hopefully we can convey some of that today and give kind of a landscape of infrastructure.
Absolutely. There’s so many different ways to put together infrastructure, there’s so much choice… This field has just absolutely exploded in the last couple of years. When we were first talking about machine learning back when we first got to know each other, there just weren’t the plethora of options that we have at this point, so we’ll try to sort through some of that today.
Yeah. Let’s kind of jump in with a general question: as AI practitioners, how much time do we spend doing development on our local setup – our local machine, our laptop – versus in a cloud or hosted environment, or on specialized on-prem hardware? What’s your experience with that, Chris?
I know different people do it different ways, but I really focus on using cloud or hosted environments most of the time. I have friends and colleagues that have bought their own home equipment – the different types of GPUs that are available, plugging in graphics cards and that kind of thing… But from where I’m coming from, I don’t tend to be the guy who’s always out buying the latest new thing. This field is moving so fast that I’ve chosen to opt out of buying my own equipment, since I would constantly be replacing it because the new shiny thing would be out there.
If it’s a trivial toy little thing, like for demos or for teaching people, then I might do something on my MacBook, just using the CPU… But it has to be truly a tiny thing for that to be the case. Almost any other time I’m either going to a hosted environment or a cloud environment.
I think how that splits up for me, or at least has in the past couple of years is I do a lot of the initial work to test my code and ensure that it actually runs… You know, deal with a lot of those issues, maybe deal with some data formatting or data pre-processing, or kind of looking at example data, making some example API calls, figuring out how to deal with that data in a Jupyter notebook… All those sorts of initial things - a lot of those I still do locally.
Actually, I do as well. I was only thinking in terms of training when I said that, so I should have been more clear.
Yeah. And then, of course, like you said, at a certain point you’re limited locally, but also you need to scale things up. We’ve said a lot of times here that AI doesn’t really do you any good if it just stays on your laptop. It has to get out there and be practical. I like to make that jump – I think maybe a good way to put it is to make that jump to a production-like environment, whatever that’s gonna be… Whether that’s gonna be somewhere in the cloud, or on-premise hardware, whatever… Make that jump as soon as you reasonably can, without wasting much time. That’s kind of my viewpoint on that.
[00:08:11.14] Yeah, I would agree with that. As you pointed out, in my brain, as I answered that last question, I was kind of jumping straight into training on a GPU or TPU, and in that case I move it off my Mac pretty quick… But for the vast majority of the data prep, which is most of the work, getting everything ready for training, pulling data in, and massaging it, and doing all the things you have to do so that it is ready for that, most of that I do on my Mac, unless we’re talking about (in some cases) the datasets are simply too big and they’re offloaded to a server, not necessarily a GPU or a specialty (something like a DGX), but to some other server, just to crank away, while I do other stuff.
Yeah, and if you’re gonna be running a training for a model for five hours or 12 hours or whatever it ends up being, it’s just simply not practical to do that locally… But like you said, that’s actually probably, proportionality-wise, the smaller amount of what an AI practitioner does. The majority of the work is figuring out what data to use, figuring out what format it’s in, engineering some features, trying out certain things, and making sure your code runs, before you pay for GPU time in the cloud, or something like that.
Agree. It’s funny – we like to think of ourselves as AI practitioners, and yet the training piece – even though a training run itself may last quite a while – is a very small amount of the time you spend, in the scheme of a project.
Yup. So with that in mind, I guess, one of the questions that I get a lot when I’m going around doing trainings and other things is “Hey, do I need to invest in some sort of GPU workstation for my home office, or a really expensive laptop with a GPU, or something that can be there in my office?” Obviously, I think neither one of us has that situation, so maybe that’s the answer, but I think if there’s people out there that are wondering, that’s not really necessary at this point. I think you can use a little bit cheaper hardware to do your local development, as long as you’re able to connect to the APIs and the UIs, and open up a terminal and connect to the instances that you’re running elsewhere.
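[As a minimal sketch of what “connecting to the instances that you’re running elsewhere” often looks like in practice – the user, hostname, and port below are hypothetical – an SSH tunnel is enough to let a modest laptop drive a remote GPU box:]

```shell
# Forward local port 8888 to a Jupyter server running on a remote
# GPU instance (user and hostname are placeholders).
ssh -N -L 8888:localhost:8888 ubuntu@gpu-instance.example.com

# On the remote box, Jupyter was started without a browser:
#   jupyter notebook --no-browser --port 8888
# Now http://localhost:8888 on the laptop talks to the remote kernel.
```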
Yeah, I have a rule of thumb on that… The way I do it - and we’ve talked a little bit about this in previous episodes… Things that we do at a personal level, that we’re interested in personally, separate from work, and then I have the work things. And because I work for Lockheed, and I’ve previously worked for other large companies since I’ve been in the AI space, I have resources there, where they are dedicated equipment.
In general, people say “Hey, should I be buying a DGX workstation, or should I buy some graphics cards?” I really say there’s a crossover point, where it depends how much you’re training. If you have enough going in your operation to where you are really needing training cycles kind of around the clock - and that’s more than just personal project, obviously; that’s at work, or a team of people…
Yeah, that’s gonna be in a corporate setting.
Right. Then it can make sense to buy your own equipment, because there’s a big investment that you’re making, but then you’re utilizing that equipment constantly… So it makes sense. But for most of us who are not doing that kind of round-the-clock operation, I think going with cloud providers is probably the way to go, because you can just use what you need, pay for that bit, and then move on, without a continuing commitment.
But if you were training around the clock, then there is a cross-over point where cloud providers can become more expensive than actually making that investment yourself.
[00:11:51.26] Yeah… Especially if you’re looking for a job in AI, or if you’re getting into AI, or even once you have a job in AI, most of the time any of that specialized hardware would be purchased by your company, to enable things for your team.
You’re never gonna have to invest personally… At least you don’t have to; you could if you really want to work on some crazy personal projects, but you don’t have to do that. So just to be transparent, as a data scientist, my personal infrastructure basically looks like - and by personal infrastructure I mean my local setup - a MacBook, without the goofy touch bar thing, because that’s weird… I have to have an Escape key, I’m sorry.
[laughs] There you go.
An external monitor and a nice mechanical keyboard – that keyboard makes typing a joy. But then I don’t really have a ton of stuff installed locally. I have the native Python (or a Brew-installed Python); I don’t use Anaconda, or any of these loaded package managers. Those can be really nice for a lot of people, I just don’t find them as nice for me personally. I use a really simple setup – I just use Vim as my IDE, and I have Jupyter and Docker installed locally. I use things like Postman for testing API calls, and Go occasionally, and Slack and Zoom and all the web conference stuff, because I work remotely. So that’s what I prefer. Again, all of those are personal preferences. I know a lot of people find a lot of value in environment managers like Anaconda, or other ways of doing things… That’s just not what I do.
What does your personal setup look like, Chris?
Sadly, it’s much like yours, so I won’t go through everything. I’m on a MacBook. Like you, I also like Ubuntu separately, if I’m on a server. Standard MacBook setup, with an external monitor, keyboard, trackpad, nothing fancy for me. I also Brew-install Python as well. I’ve had trouble with Anaconda when I tried to change use cases around; it somehow would start throwing errors, so I found that to be the simplest. For deep learning I’m always starting in a Jupyter notebook, and hoping that it’s successful enough to migrate later down the road out of that Jupyter notebook into code, as a library.
I use Docker a lot… I’m so glad Docker came along before the AI explosion happened, because utilizing containers has made the world of AI training and deployment so much easier. And like you – anyone who’s listened to us knows we both love Go. I use Go as my default go-to language, and Python for the data science things that tend to be Python-specific… And I don’t have a GPU. Actually, I have a TX2 that I play around with, and I’m about to get a Nano from NVIDIA, but those are mainly for my toy projects. For any training, I’m typically going to use either whatever my company has to offer – we have stuff within Lockheed Martin that I can use for work – or, if it’s on my own, AWS SageMaker and Google Colab… So that gives you a sense. We can talk about some of those in more detail as we go forward here.
Yeah. I think the moral of the story - you don’t need a fancy computer, even a MacBook; we have those, but if you just have a cheaper notebook, that’s fine as well. A lot of the things that you’ll probably be doing are hosted and there’s no need for that specialized hardware.
There probably is a lower limit to that. One time my wife borrowed my MacBook for a week and I was on a Chromebook, and there was a lot of pain there… Although I think that’s getting better, too. People like Kelsey Hightower and others develop on a Chromebook.
He loves it, I know from his Twitter feed. But I remember when you did that, because - as an aside, we tried to record an episode at that point; I know you were having some technical struggles with that. I’ve also struggled, but I remember very distinctly you trying to work through that.
[00:15:56.28] Yeah. And again, these are personal preferences, so we’d love to hear what your guys’ setup is and what value you find from things. But moving on from our personal local setup and talking about, okay, from our local setup that’s maybe pretty light - at least a lot of people’s local setup is kind of light - what are the things that we’re connecting to that are hosted, or in the cloud… What enables our AI workflows, the things that we use that don’t run on our setup?
To switch to that topic, I think probably there’s something we need to talk about first, which is there’s an infinite number of ways that we could enable AI workflows in the cloud or on-prem, and a lot of that is gonna be driven by the organization that you’re working for and what their concerns are.
The first of those may be governance issues. I know, with you working for Lockheed, Chris, there’s probably a lot of things related to that, but I think anywhere now there’s gonna be a lot of governance-related issues.
Yeah. You know, if it’s just you - as we’re talking about ourselves individually - you just kind of plug in whatever tools you like, and you create your workflow out of that which is available, and there’s different ways of doing that… And it’s pretty simple, because it’s really based on your personal preference. But as soon as you get to even a small team level, and certainly as you get to multiple teams across an organization, it gets pretty complicated pretty quick, and you have to start thinking about all the issues that go into making something that will work, not only technically for you as a data scientist doing the work, but also accommodate the various laws, regulations, things like GDPR in Europe, that have to be accounted for, and how you do different types of workflows, and testing… Some of those topics just off the cuff, and we can dive into a few of them. Things like data discovery - how are you gonna know what’s available to you, beyond just things that your own team might be producing? What are some of the trust and certification issues out there, provenance and lineage…?
I know that when you were in your previous position at Pachyderm, provenance was one of the features you had there; maybe talk a little bit toward that. But data management – who owns the data, what’s the sovereignty of the data, how do you access the data in the different tools in the workflow, and what kinds of data science processes do you build around that? That’s a lot to think about. What about you, Daniel? If you wanna dive into a couple of things that are of interest to you there…
I think maybe a big one to emphasize is just – all of the problems in this area in my mind, and all of the major blockers that I’ve hit as a practitioner are mostly having to do with the data side of things, not whether I had the compute power to train a specific model, or something.
For example, if you’re using someone else’s data, there’s privacy issues, depending on if it’s personally identifying data… Of course, if you’re working in healthcare there’s issues about even where you’re allowed to store certain data, and whether that is on-prem, out of facility, in the cloud… There’s a lot of issues there. So I think the main issue is that some of the data sources you might like to use might force you into using certain infrastructure – might force you into staying on-premise, in your own company’s infrastructure – or might allow you to be in the cloud… And even if you are in the cloud, you might still have to maintain certain audit trails, and that sort of thing, for the data that you’re using – especially if you’re using data that’s been generated by EU citizens, with GDPR and all of that. That’s really, in my mind, the major factor.
[00:19:55.10] Yeah, I would agree with that. And that was a great point you made – the laws will affect your organization’s strategy in terms of where you’re housing data. I’ve actually seen over the last couple of years, as people were prepping for GDPR and then it came into being – I’ve had conversations with people where they chose where to keep data from a nationality standpoint. They might literally relocate their operations, in terms of the training and the data storage, into a completely different country to accommodate those laws, and also – with different countries having different laws – to figure out how they’re going to approach that from a strategy standpoint, and where those operations should be.
So it can increase your cost, on having enough equipment to go to different places, and to have to think through that… So there’s multiple rabbit holes here, and I know we’re just kind of skimming over the top of some of these issues on this episode, and we can certainly do episodes - and actually have done in the past - where we do some deep-dives… I’ll kind of leave it there, and let us proceed on down the infrastructure route.
Yeah, I think the main takeaway is that before you decide on a specific thing – “Oh, I’m gonna run in GCP”, or “I’m gonna run in AWS”, or “I’m gonna run on-prem” – you need to have a process in place that addresses the data concerns: where you’re allowed to store data, how much data you can share, even within your own organization.
So far we’ve talked about our personal setup, what an AI practitioner might need locally, and then what organizational concerns might go into choices around whether your AI workflow runs in the cloud, or where you store your data, and that sort of thing… But let’s go ahead and jump into how AI practitioners are running their AI workflows in the cloud or on-prem; what are the sorts of frameworks and infrastructure and tools that they’re using to actually enable those AI workflows. Again, this is from our personal experience and what we’ve seen other people doing, and what we’ve done ourselves, but maybe we can start with what sorts of resources do we need in terms of compute and storage? What sorts of resources do you need to run your AI workflows, Chris?
Well, going back to a little while ago, I’m very focused on Docker, just because it makes it a lot easier. Having said that, I do keep TensorFlow installed locally, but since I’m not running on a local GPU, I’m not sure that I necessarily need to do that. I think for any real workflow I do, I have a Docker container, and I know especially at work we have specific production containers that we use for our workflows there, so we’ll pull down one of those. There are a lot of options on that. I know NVIDIA, for instance, has a whole bunch of production workload containers that you can use as a base for your company - and this is assuming that you are running on NVIDIA equipment in that case - that are already optimized for that.
[00:23:59.25] So for me it’s really easy to grab one of those production containers and then do the customization I need, add my model into it, figure out how I’m gonna get the data into that for training… I’ve done that a couple of different ways over time, depending on what the resources available to me are… But when I’m serious about doing work and I’m not just playing around, I’m starting Docker from the get-go.
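[As a rough sketch of that workflow – the image tag below is illustrative, and NGC tags change between releases, so check ngc.nvidia.com for current ones – pulling one of NVIDIA’s pre-built containers and running a training script in it looks something like:]

```shell
# Pull a TensorFlow container from NVIDIA's NGC registry
# (the exact tag is a placeholder).
docker pull nvcr.io/nvidia/tensorflow:19.03-py3

# Run training inside it, exposing the GPUs and mounting local code/data.
# Older setups used the nvidia-docker wrapper; newer Docker versions
# support the --gpus flag directly.
docker run --gpus all --rm \
  -v $(pwd)/code:/workspace/code \
  -v $(pwd)/data:/workspace/data \
  nvcr.io/nvidia/tensorflow:19.03-py3 \
  python /workspace/code/train.py
```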
I love using Docker as well. How I think about the layers that I need to deploy to enable my AI workflow, a lot of times what I’m running will be in Docker, but then under the hood or a layer down I like to think of two primary types of resources that I need… Those being compute and storage. If you just have compute, you might be able to run your Docker container, but then where are you gonna put your data? It’s not so great to put 200 terabytes of data in a Docker container; I don’t know that anyone’s actually done that… Although some people like to put data in containers, but… So I like to think sort of under the hood, or a layer down, we need two sets of resources - those being compute and storage.
Now, with compute, of course you have some choices as far as whether that’s going to be in the cloud or on-prem, similar to basically any engineering workflow that any company does. And then storage-wise, mostly what I’ve interacted with is pretty agnostic to my AI workflow. A lot of times you don’t have the choice of where your data is stored. You might be working with a production MySQL database, or Postgres, or you might be working with data that’s just dumped to an object store, like S3, or something like that.
Typically, the storage options in my cases are often times driven by things already existing within a company, so you might not have a lot of say in that… But then compute-wise, maybe you do have a little bit more say in that.
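[For the common “data already lives in an object store” case, moving a training set onto whatever compute you’ve provisioned is often just a sync – the bucket name and paths here are made up:]

```shell
# Copy a training dataset from S3 to local disk on the compute node
# (bucket name and prefix are placeholders).
aws s3 sync s3://example-bucket/datasets/images ./data/images

# ...train against ./data/images...

# Push model artifacts back to the object store when done.
aws s3 cp ./output/model.h5 s3://example-bucket/models/model.h5
```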
When you’re running those Docker containers and that sort of thing, you mentioned, Chris, that sometimes you run on NVIDIA hardware. When you’re saying that, I think what you’re meaning is on-premise NVIDIA workstations…
How is that different from running on a GPU in the cloud? Could you go into that a little bit?
Sure. It really comes down to the constraints that you have, as you said. I don’t have a particular preference, but say you’re in AWS and you’re using SageMaker, and you’re pulling your data out of S3 - it is what it is, in terms of that’s the service they’re offering; it’s a great way of doing it. But to contrast against that, we have a lot of DGX equipment at Lockheed, and my previous employer had DGX’s too. That means you’re getting into a data center where you have the DGX set up, and then you have a set of equipment with storage and such around that, to enable your operation.
So it’s great if you have a DGX that you’re operating on, but you’re gonna need the storage around that to pull from and to push out to, and there may be some processing around there… So you end up essentially creating a whole build around your DGX to enable those operations.
It’s not so different really from the cloud environment. Either way, you have storage, you’re pulling data from it, you’re running it through, assuming that it’s been pre-processed and it’s ready for training, and then you’ve got to output somewhere. Then you have to have access to all of that from wherever you’re coming in.
The AWS or Google or Azure world each has their own ways of doing all those pieces. If you’re running your own data center, then it really depends on the company… Whereas I’ve worked for two companies with DGX equipment that I was able to use; that piece of it was the same, but how they built around the DGX’s was different in both companies. So just because you have the luxury of buying an AI super-computer like that doesn’t mean that your setup is gonna be the same. It’s very distinct on how your organization wants to configure it.
[00:27:59.13] Yeah, I think your experiences are almost on a totally different end of the spectrum from mine, probably only because right now I’m working with a non-profit… I think this is a good contrast. Obviously, the companies that you’ve worked for do have embedded AI research teams; maybe they invest in some of this NVIDIA hardware. But for me, doing machine learning and AI data-related work with a non-profit, I would basically be left out of every room I was in if I tried to get anyone to buy a $200,000 NVIDIA box.
I pretty much relied on everything in the cloud when I’ve needed it, and I think that wherever you fall on that spectrum for your AI team, there is a route forward… So it’s great if you’re able to afford that kind of dedicated hardware, and you have that commitment level within your organization, but maybe you’re not at that point yet, or you’re a startup trying to get into this space, or another organization that doesn’t have a huge AI team - you can do very similar things in the cloud.
Every cloud provider has instances that are available with specialized hardware, they also have a lot of services that will allow you to spin up clusters to do distributed computing, like Kubernetes clusters, and there’s frameworks on top of that that we can talk about in a bit… So I think wherever you fall on that spectrum, there is a route forward. But in either case you’re going to have some number of compute nodes on-premise or in the cloud that maybe some have just regular CPU’s, maybe some have GPU’s, and then you’re gonna have some storage that is storing the data that you’re working with for your training datasets, and that sort of thing.
Now let’s say that’s the base - so you’ve figured out whether you’re gonna be in the cloud or on-premise, whether you have dedicated hardware or you’re using the cloud stuff - what do AI people actually run on top of this compute and storage infrastructure?
Maybe let’s first think about “What do people run on this for model development and experimentation?” What’s your experience there, Chris? You mentioned notebook environments like Jupyter - do you run those off of your laptop? Do you have experience hosting those within your infrastructure for model development, or how does that work for you?
Often I will start locally… It kind of depends – at this point, with me being more Docker-focused, I’ve found that it’s easier to go ahead and start there. In the beginning I used to open up a Jupyter notebook locally, but then I had to package it up and put it into Docker, so I’ve gotten to where I just start off with the Docker container these days. There’s slightly more to do than just opening up a notebook, but that way I don’t have to do the “package it all up” step later; it’s easier, because once I get into my workflow I can just start building, and then when the time comes, I can run the container on the infrastructure. So for me personally, that’s an easier way to go.
There’s also the question of how you set up the resources, and what you want to select. I’ve used Domino Data Lab, which gives you a very nice front-end when you have different types of equipment out there. It’s something you obviously don’t need if you’re in a cloud environment – then you have those interfaces that you’re gonna be used to from that provider, whichever one you choose… But Domino kind of gives me that on the front-end when I have our own infrastructure on the back-end. At that point, it’s just scheduling it, moving it over there. And again, there’s a lot of variability in how you wanna do that.
I think you’re right Chris; I think that variability also falls onto a spectrum. We talked about the spectrum of the hardware that you might use, that all this is running on, in terms of specialized hardware, versus things available in the cloud. I think there’s a spectrum here, too.
[00:31:57.07] On one side of the spectrum there’s a lot of open source, free tooling that will allow you to do interactive model development, and run it on pretty much any hardware – in the cloud, on Kubernetes, in Docker, on-prem… Things like JupyterHub – it’s multi-user Jupyter, so you can have multiple Jupyter kernels and users, and run a lot of different notebooks… But there’s also other free options. There’s Google’s Colaboratory (Colab), which gives you free GPU resources and other things in notebooks that you can manage… There’s things like Binder, which will spin up Jupyter notebooks from a GitHub repo… So that’s one side of the spectrum, where you’re using a lot of these free environments.
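[As one minimal sketch of that open-source end of the spectrum – versions are unpinned here for brevity, and a real deployment would run this under a process manager – standing up JupyterLab on a shared box takes little more than pip and a shell:]

```shell
# Install and launch JupyterLab on a shared server, listening on all
# interfaces so teammates (or an SSH tunnel) can reach it.
pip install jupyterlab
jupyter lab --no-browser --ip 0.0.0.0 --port 8888

# For true multi-user setups, JupyterHub sits in front and spawns a
# notebook server per user (it also needs its configurable-http-proxy):
#   pip install jupyterhub
#   jupyterhub --port 8000
```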
On the other side of the spectrum there’s data science platforms, like you were talking about, Chris – things like Domino, and DataRobot, and Databricks, and H2O… Some of these are not free; in fact, some of them are not very cheap… Some are a little more moderately priced, depending on how many users you have and what workloads you’re running. But a lot of these give you, like you were saying, a really nice interface – maybe to track your data, to track different experiments that you’re doing…
My experience is that a lot of them are centered around the idea of experiments and running experiments, and iterating on those experiments… They’re not necessarily meant for running production AI services, but very much for model development and experimentation.
By the way, as an aside, while we’re talking about some of these different providers, I’ve noted over the past year in particular that there’s a real battle between the cloud providers to draw in entry-level students in this AI world, in terms of doing your initial training… Because any course you may select is gonna have a cloud provider that you get used to.
For instance, deeplearning.ai uses Coursera for the classroom stuff, but they’re also using Google Colab in the current set of courses, and that way you get used to that environment, so there’s a little bit of buy-in.
If you’re taking an NVIDIA deep learning institute course, they’re using their own NVIDIA GPU cloud. We tend not to talk about that; we tend to talk about Microsoft, Google and AWS in general, but NVIDIA has theirs, and there are other providers out there as well. So as we’re talking about what you’re buying into, that very well may be impacted by how you were trained and what your comfort level is.
Yeah, for sure. I know that I’m very comfortable with certain things that other people just don’t like, and laugh at me for using. I think that you have to experiment as well, and find where you’re comfortable.
Okay, so we have the base of compute and storage; on top of that compute we are running certain things for experimentation and model development, somewhere on that spectrum of open source/notebooky things like JupyterLab, or maybe less open source or not open source things like Domino for data science platforms… So the next thing that we might wanna run on top of the compute and storage is some way to automate model training, and the pre-processing and post-processing of data. Automatically, when new data is brought in, you might wanna update a training dataset, retrain your model, export a serialized version of that model, and then export that into some serving framework. This is typically called “pipelining and automation.” There’s a whole lot of tools for this.
I’m a Pachyderm user and I worked for them full-time for a while, so I’m definitely biased in that way, and I love Pachyderm… But that’s certainly not the only thing you can use. There’s things like Luigi and Airflow that are commonly used for this.
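The ingest, preprocess, train, export flow described above can be sketched without any particular framework; tools like Airflow or Pachyderm add scheduling, data versioning, and distributed execution on top of the same basic shape. Everything in this sketch is hypothetical (the toy least-squares model, the file names, the JSON-lines input format):

```python
import json
import pickle
from pathlib import Path

def ingest(raw_path: Path) -> list[dict]:
    # Stage 1: pull in newly arrived records (here: one JSON object per line).
    return [json.loads(line) for line in raw_path.read_text().splitlines()]

def preprocess(records: list[dict]) -> list[tuple[float, float]]:
    # Stage 2: turn raw records into (feature, label) pairs.
    return [(r["x"], r["y"]) for r in records]

def train(pairs: list[tuple[float, float]]) -> dict:
    # Stage 3: "train" a toy model -- least-squares slope through the origin.
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return {"slope": num / den}

def export(model: dict, out_path: Path) -> Path:
    # Stage 4: serialize the trained model for a serving framework to pick up.
    out_path.write_bytes(pickle.dumps(model))
    return out_path

def run_pipeline(raw_path: Path, model_path: Path) -> Path:
    # A pipelining tool would run these stages automatically whenever new data
    # lands, caching and parallelizing where it can; the dependency graph
    # (ingest -> preprocess -> train -> export) is the same either way.
    return export(train(preprocess(ingest(raw_path))), model_path)
```

The point of handing this graph to a pipelining tool rather than a script is exactly what the episode describes: the stages re-run automatically when new data arrives, and each stage can get its own resources (distributed processing for the data steps, specialized hardware for training).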
I don’t see quite as much Hadoop and Spark stuff going on these days, but… What is your impression of the landscape and where things are headed with this side of things?
[00:36:03.13] Well, in the enterprise you’re still seeing a lot of Hadoop, and even more so Spark, in enterprise environments. And even though I’m in big companies now, I really come from smaller companies, and so it has been interesting… I almost bypassed those particular technologies along the way, and then I’ve kind of come back. In a large organization you do have to accommodate those in those data flows.
Given the choice – and I’m probably heavily influenced by you in terms of liking Pachyderm for that… Obviously, I think you mentioned Kubeflow is another tool that is used, and that one is good… Kubernetes-everything, from my perspective, because I don’t have to think about that too much anymore, and just say “Let’s go that route.” I know Pachyderm is built on top of Kubernetes as well… How about you?
I think you’re right… Obviously, people have invested a lot in Hadoop over time, so you have to deal with that in certain cases. This is another one of those concerns that we were talking about that might drive your infrastructure choice. If you have to run on top of HDFS and have a ton of stuff written in Hive queries and all of that, you might be stuck with using that, for whatever reasons. Thankfully, I kind of have some flexibility in my projects, so I like to do things a little bit differently… There are a lot of choices out there, but I think in general, circling back to how I have this architecture laid out in my head, you’ve got the compute and storage, you’ve got those experimentation pieces on top of that, maybe like JupyterLab or something; then you have some type of automated, non-interactive tool that will allow you to automate the retraining of your models, or updating of datasets, or updating of databases that drive certain services, or that sort of thing… And if it involves large datasets, it might involve distributed processing; if it involves model training, it might involve specialized hardware… But having a pipelining tool, something like Airflow or Pachyderm, Kubeflow, Spark - these sorts of things will allow you to update those large datasets over time. But that’s definitely not the end of the story.
Let’s say that we’re updating our model over time; how are we going to then serve that model on top of – so maybe we’ve trained it, now we want to use that serialized, trained model to run many inferences… How are people doing that?
It’s kind of funny - leading into that, that’s the part that most people don’t think their way through all the way; how do you get to deployment, how do you make this thing that you’ve created actually work in real life, with the rest of your software and hardware out there? Some of the things that need to be thought about there are what technologies are you gonna use. I use TensorRT. And you have to be thinking about how you’re serving, and there are different approaches to that… As well as something that’s often forgotten in this space - CI/CD (continuous integration and continuous deployment). We’re so used to (in the software world) thinking about that, but the data science world and the thinking that dominates the AI space often forgets that altogether.
[00:39:18.13] Or in some cases you’ll find data scientists who aren’t really familiar with it at all; they’re so used to doing things that are gonna stay on the server, or website, or something that’s internal to an organization, that now that we’re getting to where you have AI models being pushed out for inference on the edge all over the place, and eventually may far outnumber things even in the servers as we move forward over the years, that’s gonna be really critical. Thinking your way through things like TensorFlow Serving, TensorRT, and different things like – MXNet has an approach to doing that as well. What about yourself?
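For concreteness on one of the serving options mentioned here: TensorFlow Serving exposes a REST endpoint of the form `/v1/models/<name>:predict` (on port 8501 by default) that accepts a JSON body like `{"instances": [...]}`. A sketch of building such a request with the standard library; the host, model name, and input shape are placeholder assumptions:

```python
import json
from urllib.request import Request

def build_predict_request(host: str, model_name: str, instances: list) -> Request:
    # TensorFlow Serving's REST API: POST /v1/models/<name>:predict on its
    # REST port (8501 by default), with a JSON body {"instances": [...]}.
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

# Hypothetical usage -- "my_model" and the 2-feature inputs are made up;
# sending this would require a running TensorFlow Serving instance.
req = build_predict_request("localhost", "my_model", [[1.0, 2.0], [3.0, 4.0]])
```

The response (from a live server) would be a JSON body with a `"predictions"` list, one entry per instance.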
I think the more pieces of that that you can automate, the better, in my experience… Assuming that you’re gonna be wanting to update those things fairly regularly.
Yeah, I think to kind of package the last couple of minutes up, the takeaway there is if you’re a data scientist, you may be very comfortably starting off in your Jupyter notebook world, where you’re creating a model that has to go and be trained, but at some point it has to come back to software. At some point it has to be packaged up as deployable code in the programming language that you or your organization is using for that. I know you and I have a bias toward Go; obviously C, C++, Java are all really popular in that space. That model though has to become essentially a microservice for inferencing, to where it’s a library, it’s in a larger architecture, and whether it be a server that is supporting your business operations, or whether it be in the billions of IoT devices that we’re gonna have out there, or whether it be on your phone, or our future mobile devices that we haven’t gotten to yet - that’s where it’s gonna live and that’s where it’s gonna be doing the thing that you’re creating it for. So thinking of that early in the process is really crucial; this thing you’re working on only matters if it’s usable in the real world out there as a piece of software.
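The “model as a microservice for inferencing” idea above can be sketched with nothing but the standard library: a deserialized model artifact becomes a function behind an HTTP endpoint. The “model” here is a stand-in (a fixed slope), and the request and response payload shapes are invented for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a deserialized trained model: predicts y = slope * x.
MODEL = {"slope": 2.0}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body: {"x": <number>}.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        # Run inference and return {"y": <prediction>}.
        body = json.dumps({"y": MODEL["slope"] * payload["x"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def serve(port: int = 0) -> HTTPServer:
    # Port 0 asks the OS for a free port; server.server_address reports
    # which one was chosen. Call server.serve_forever() to start handling.
    return HTTPServer(("127.0.0.1", port), InferenceHandler)
```

In production this is where the CI/CD piece discussed above comes in: when the pipeline exports a new model artifact, an automated deploy replaces the running service (or the artifact it loads) rather than a human copying files around.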
I think that’s a great way to close things out. Obviously, there are a ton of things that we did not have time to talk about, so please join our Slack channel at Changelog.com/community. We’d love to chat with you about infrastructure-related things, and practical things about your setup. Before we leave, I just wanted to share a couple of relevant learning resources. We always like to give some learning resources for people that are wanting to level up in the areas that we’re talking about.
We’ve mentioned Google’s Colab a couple of times… I would highly recommend, if you’re just wanting to experiment with TensorFlow or PyTorch on GPUs and on TPUs, or these sorts of specialized types of hardware - you can do that for absolutely free on Google Colab. There’s a ton of examples, there’s a great intro video to that.
Also, there’s Intel’s AI DevCloud. I’ve played around on there a good bit; that’s a great place to experiment with hardware other than GPUs, like Intel’s Xeon processors. They have optimized really fast versions of TensorFlow and PyTorch that you can use there, and without a lot of commitment. So just jump in, try some examples. A lot of these frameworks and tools that we’ve mentioned have great examples that you can try out with relatively low cost or no cost, in some cloud environment. So try some things out, get your hands dirty, and let us know what you build.
Sounds good. Thanks for hopping on this episode, Daniel. This was a really good conversation, and I think actually we’ll probably have some spin-off episodes of deep diving in some of the topics that we’ve hit today that we just didn’t have time for.
For sure, yeah. See you next week, Chris.
Take care, bye-bye.
Our transcripts are open source on GitHub. Improvements are welcome. 💚