Testing ML systems with Tania Allard, developer advocate at Microsoft (Practical AI #74)

Changelog

Alright, welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How’s it going, Chris?

Chris Benson

Hey! It’s going great, Daniel. How are you?

Daniel Whitenack

Doing pretty good. The winter sickness is still going through our household, so still dealing with that, but otherwise doing pretty good. I think I’ve avoided most of it at this point, or at least got over the worst of it, so… That’s good. How are things on your end?

Chris Benson

They’re going well. I wanted to note that thing from last week about us… Because I posted on some social media, and people were really surprised, so I thought I’d share it with the listeners.

If you were a long-time listener of the podcast - we’ve been doing this for about a year-and-a-half, and Daniel and I actually have never met in person until last week. We were at Project Voice in Chattanooga…

Daniel Whitenack

The wonders of the internet…

Chris Benson

No kidding. I had commented online how incredibly cool it is that you can develop such a great friendship and collaboration, and yet have never met each other in person. Anyway, that’s past now; we’ve now met in person. It was like meeting family, I think, for us…

Daniel Whitenack

Definitely.

Chris Benson

Yeah. So anyway, just a very cool thing; I just thought I’d relate that… So yeah, doing great.

Daniel Whitenack

Yeah, awesome. This is one of our first episodes for the new year. A couple before this, but – we definitely want to start out this year promoting practical uses of AI, and practicality in developing AI and machine learning systems, and I think we’ve got a great guest today to emphasize a lot of those things… So we’re joined by Tania Allard, who’s a developer advocate with Microsoft, a Google Machine Learning GDE, and a Python Software Foundation fellow. Thanks for joining us, Tania.

Tania Allard

Hi! It’s a pleasure joining you guys.

Daniel Whitenack

Awesome. Well, maybe before we jump into some of the things you’ve been talking about and working on recently, if you could just give us a little bit of your background and how you got into machine learning, and ended up at Microsoft… That’d be awesome.

Tania Allard

Yeah. Well, I started doing machine learning during my Ph.D. I was using machine learning applied to materials science. It was basically trying to identify some materials that could be candidates for tissue replacement. It was a lot of automization, a lot of material assessing here and there… And over the course of my Ph.D. I realized that I really enjoy the computation side of things much more than the experimental and research bit.

[04:06] After that I transitioned into research software engineering, which was basically research engineering in research institutions in the U.K. I then migrated into working for Hello Soda, which is a company that has machine learning as a service. I was doing data engineering there, and machine learning engineering, research engineering… So it was pretty much everything, as you normally do in a small company. Then that brought me into Microsoft, because I was already doing a lot of community work, community engagement, doing GDE work with the GDE community in Google… All open source work that I do, and this just seemed like the perfect fit now for me. That’s how I got into Microsoft.

Daniel Whitenack

That’s awesome. You mentioned a few things, like you held different positions like data engineering, and machine learning engineering, and data science, and computational research… Sometimes these sort of titles get a little bit fuzzy. I’m wondering, from your perspective, I guess 1) how you would define what you did differently as a data engineer, versus maybe some of the more sciency things… And then also if that more focused engineering experience influenced how you do machine learning.

Tania Allard

Yeah, right. I can say I’ve been working across the machine learning pipeline in all the different roles… And as you mentioned, a lot of these roles are very [unintelligible 00:05:29.20] or have a lot of things in common. When people talk about data scientist, and data engineering roles in machine learning research, or machine learning engineering rather, they try to use these Venn diagrams… And I’ve found that it is not very descriptive.

For example, if you’re working on the data science side of the pipeline, you’re focusing much more on the statistics, on developing novel algorithms or models that would help your business or your company to get [unintelligible 00:06:03.07] good insights. But then you will probably have/need some software engineering skills as well, to take that into a production format with the rest of your dev environment or your dev team… Whereas when you’re working on the data engineering side of things, you’re focusing much more on all the processes that are [unintelligible 00:06:23.24] but sometimes you still have to know how that data is gonna be integrated in your models, so that the data is actually usable by the rest of the team.

And then the machine learning engineer role is basically the one that binds it all together. It makes sure that everything is robust and testable, that it can be taken into production, that the data being transformed by the data engineer is pulled in a reproducible manner by the folks using it, and that you can always track where the data is coming from and going to at all times, so that everything is tightly integrated within your data science infrastructure.

Chris Benson

I was just looking - and I know Daniel had seen it already - at the talk that you gave called “What is your ML score?” that you went through at All Things Open and AnacondaCon. In that talk you were focused very much on kind of QA and testing of machine learning systems specifically… And I wanted to start off by just asking if, as we’ve talked about the roles there, if you could also talk a little bit about what is a machine learning system. It’s a little bit of an ambiguous term… Could you define what that means, and what types of things are included in a machine learning system or not?

Tania Allard

[07:46] Yeah. Well, it depends on what data problems your company is working on, or the kind of machine learning projects that you’re normally working on. Normally, the machine learning system is gonna be comprised of wherever you’re getting your data, which can be a canonical database… Then how the data is going to be pulled into your machine learning model, or your prediction or classification model, whatever it is you’re trying to do with that data… And then how you take that model into something usable by your customer or your teammates. This can mean “Are they going to access it through an API? Is this going to be a standalone web app? Is it going to be on a mobile device? How are you gonna be accessing this model?”

All of these apparently movable parts that make up your data warehouse or your database, the infrastructure where you’re running your prediction, where you’re training your model, how you’re collecting your data - all of that forms your machine learning system. It’s a lot of data bits, a lot of infrastructure, and it could be things in the cloud, for example, as well, using the public cloud infrastructure.

Daniel Whitenack

And I know when you’re talking about testing or validating the ML system, I was just thinking, if I was to ask a software engineer and ask them about testing, probably one of the things that comes through their mind is unit testing or integration testing… Whereas if I ask a data scientist or maybe a machine learning engineer what they think about testing, probably the first thing that comes to their mind is testing of a specific machine learning model, or the performance of that model in terms of accuracy, or whatever the metric might be.

Now we have this other category of machine learning system, which it sounds like is more broad than a machine learning model… How is the testing or validation of a machine learning or AI system different from the testing of a specific machine learning model?

Tania Allard

Right. I think the machine learning system comprises all of your pipeline. Testing it has to be more holistic, a bit more integral, and cover all of the parts of that. If we go back to traditional software engineering, you’re testing that your piece of software is returning the results that you’re expecting it to… Because you already know what those results or those behaviors are, it’s relatively straightforward to design your test cases, create your unit tests, your regression and your smoke tests.

Daniel Whitenack

And they should always return the same thing, basically.

Tania Allard

Exactly. It should be deterministic, in that sense. When you’re testing a machine learning system or a machine learning model, in many cases you don’t need to know what that end result is, because you have your data, you have your labels, if you’re doing for example a classification problem… But you need to make sure that your system is doing what it’s meant to be doing, and that it’s repeatable, that you can repeat all of that.

In that sense, you have to ensure that you’re testing your data, your features, ensure that the data conforms to that distribution that you’re expecting and the behavior that you’re gonna see, and also the cost that adding more features to your predictive models is adding. That’s a major component, especially when you’re doing things in the cloud.
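A minimal sketch of the kind of distribution check described above, using only the Python standard library. The function name, the expected mean and standard deviation, and the 25% tolerance are all illustrative assumptions, not something taken from the talk; a real project might prefer a proper statistical test.

```python
import statistics


def check_feature_distribution(values, expected_mean, expected_stdev, tolerance=0.25):
    """Return a list of problems; an empty list means the feature looks as expected.

    `tolerance` is a hypothetical threshold, expressed as a fraction of the
    expected standard deviation.
    """
    problems = []
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    # Allow the observed mean to drift by a fraction of the expected spread.
    if abs(mean - expected_mean) > tolerance * expected_stdev:
        problems.append(f"mean {mean:.2f} is far from expected {expected_mean:.2f}")
    # Flag a spread that is much narrower or wider than expected.
    if abs(stdev - expected_stdev) > tolerance * expected_stdev:
        problems.append(f"stdev {stdev:.2f} is far from expected {expected_stdev:.2f}")
    return problems


# Made-up sample of an "age" feature.
ages = [31, 35, 29, 40, 33, 37, 30, 36]
print(check_feature_distribution(ages, expected_mean=34.0, expected_stdev=4.0))  # prints []
```

Run manually, a check like this documents the expectation; run in CI, it turns the same expectation into an automated test.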

Sometimes when you add one feature, you marginally increase that accuracy that you were talking about, but then your compute time or the use of resources that you’re using doubles or triples. So you’ll also have to take that into consideration to balance whether that very marginal increase in your accuracy is actually worth all that extra compute cost that you’re incurring.
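One way to make that trade-off explicit is a small decision rule. The sketch below is only an illustration: the function name and the `min_gain_per_cost` threshold are hypothetical, and the right threshold depends entirely on the business case.

```python
def feature_is_worth_it(baseline_acc, new_acc, baseline_cost, new_cost, min_gain_per_cost=0.01):
    """Hypothetical rule: require a minimum accuracy gain per unit of relative extra cost."""
    gain = new_acc - baseline_acc
    # Relative extra cost: 1.0 means the compute cost doubled.
    extra_cost = (new_cost - baseline_cost) / baseline_cost
    if extra_cost <= 0:
        # Cheaper or equal cost: keep the feature if accuracy did not drop.
        return gain >= 0
    return gain / extra_cost >= min_gain_per_cost


# Accuracy creeps up by 0.2 points while compute cost doubles.
print(feature_is_worth_it(0.910, 0.912, baseline_cost=100.0, new_cost=200.0))  # False
# Accuracy rises by 5 points for 20% more compute.
print(feature_is_worth_it(0.90, 0.95, baseline_cost=10.0, new_cost=12.0))  # True
```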

Also, when you’re doing your model development, you have to look at things like your metrics, whether your hyperparameters are also having an impact on your compute resources, testing for implicit bias, testing for the staleness of your model… Because you might then need to retrain your model after a certain period of time if you’re acquiring new data, or during significant changes to the API.

[12:14] And then, again, you need to test your infrastructure, you need to make sure that you’re able to deploy your models, your infrastructure probably using techniques like continuous integration and continuous delivery… Because that’s essential, especially if you release a new version of your model. Although you’ve just tested it before getting it into production, if it turns out that there is something that needs a rollback, being able to know how long that rollback or that release is gonna take you is crucial. In that sense, you have to have integration tests against your entire pipeline, from data acquisition, to data transformation, prediction, and then result serving, whatever that is for your system.

Chris Benson

One of the things I wanted to ask… I know that Daniel and I come from a software development background before we were much into deep learning, so it’s kind of the idea of testing, and why you test, and the importance of testing is kind of second-nature. But for somebody coming into deep learning and trying to do these things, it may not be. So I’d like to ask you, quite simply, why is testing or validation of machine learning systems so important, and what would be the downside of not doing that?

Tania Allard

I think one of the main advantages of you being able to test your machine learning model is explainability. As we’re going into more complex frameworks or more complex deep learning algorithms, it starts becoming increasingly difficult to explain how it reached a certain prediction, and why… Especially when we’re releasing machine learning out into the world and it’s affecting other people, I think it’s crucial for us to know that it’s actually predicting what we want it to predict, that it’s being transparent and clear, and that we can always trace all the predictions that we’re doing.

Also, testing for implicit bias is crucial, especially when we have datasets that are biased toward a certain feature or towards a certain portion of our population; having tests in place throughout all of our pipeline ensures that we can mitigate those biases early on as well… So I think those are some of the most important reasons for us to be testing our algorithms.
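As a rough illustration of one such test, the sketch below compares accuracy across groups of a population and reports the largest gap. The function names and the toy data are assumptions made for this example, and an accuracy gap is only one of many possible bias checks.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Toy labels, predictions, and group membership (all made up).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)                    # {'a': 1.0, 'b': 0.5}
print(max_accuracy_gap(per_group))  # 0.5
```

A test suite could fail the build when the gap exceeds a threshold the team has agreed on.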

Daniel Whitenack

Yeah. And in certain cases – I know in Europe right now there’s certain regulations that have come down the pipeline there, and of course, you’re influencing the rest of the world… So I guess it may be partly your own – you’re trying to create a development environment that is responsible, and you’re able to repeat things, and actually make incremental progress on things… But also, you might be under certain regulations that you actually have to be able to give someone an explanation, to some degree, of what you did with their data… Is that right?

Tania Allard

That’s correct. I think it’s now been over a year since all the GDPR regulations took effect… So one of the most significant things that this brings is that if a customer or someone whose data you’re holding comes and tells you “Hey, I want to see what you’re doing with my data” or “I want to have access to all the data of myself that you’re storing”, you should be able to comply within a standard period of time.

So if you don’t have mechanisms in place for you to trace this data, or trace what you’re doing, or even to delete - because now your customers should be able to ask you to delete all the data that you have in place for them - you should comply. That’s now a regulation.

I think also having a better understanding, as I said before, of what your data sources are, and what you’re doing with the data, and how you’re moving it from one place to another is very important for reproducibility and assurance of our systems.

Daniel Whitenack

[16:10] I know people that might be coming into AI and machine learning maybe from a science background, or even – there’s a lot of different backgrounds, even things like economics, or finance, or that sort of thing… Some of these things around infrastructure, CI/CD, monitoring might be sort of intimidating to them. I was wondering if you had any thoughts as far as who a data scientist needs to work with to make sure that all the right testing is in place for a machine learning system… Because it does impact – like you said, there’s implications for infrastructure, for scaling up, for changes to data… Who does the data scientist need to be talking with to make sure that all the right testing and quality assurance pieces are in place?

Tania Allard

I think a good workflow would be to have the data science team working very closely with the machine learning engineering team, if there is one, or otherwise the software engineering team.

If you have them sitting together or working very closely, it’s easier for both teams (or the three teams) to better understand what the requirements are, how people are bringing things from research and development environments into production. Because something that I’ve noticed in some companies or in some teams is that you have the R&D or the machine learners sitting in a corner, doing all of their things, developing their models, and then they have to throw things over the wall and hope that the software engineer will take that into production. But in most cases, the software engineer doesn’t have an idea on how the model works, the canonical database or the canonical data sources, or the transformations that need to take place for that data to be usable. And that’s when you sometimes see that folks spend weeks or months working on a model, but then they spend another couple of months or weeks sitting on that model, just waiting for it to be taken into production.

Having from day one a collaborative approach where folks define what resources they’re gonna need, how this algorithm is gonna reach their customers, and what sort of data it’s going to be accessing is crucial for both teams to be able to take this from R&D into production in a seamless way.

It doesn’t mean that you as a data scientist need to do everything, or need to be super-good at CI/CD, and testing, and know Kubernetes, and all of these complex things… But if you have those teams together, it’s easier for both to understand the world of the other one.

Chris Benson

Yeah. One of the things I’m really getting from you here is that when you’re actually working on getting a model into your overall production lifecycle, and you’re integrating with the existing software development and deployment lifecycles that you are moving into, it’s really part of a larger effort, which kind of fits into what a lot of organizations are already doing - they kind of add this in.

One of the things I’d like to know is given that kind of larger team that we’ve been talking about, how should different roles within that team think about their responsibility for testing? In other words, if you are an infrastructure engineer, what should you be testing? If you’re a data scientist, what should you be testing? What should the machine learning engineer be testing? How should testing be divided out among those different roles?

Tania Allard

I think, as you mentioned, defining these roles and assigning responsibilities is crucial. You as a data scientist would mainly be in charge of assessing your data and your features. This goes back to - when you have your dataset, make sure that the distribution of each feature matches your expectations. These are the very basics you need to check, but sometimes, because we do it so often, we don’t think about documenting it, or going in-depth into that.

[20:07] Also, making sure that the relationship between your features and your targets and the correlations make sense. That sometimes needs to go beyond creating just a correlation heatmap.

As I mentioned before, testing the cost of your features is also very important. And, aligning very well with GDPR, so is ensuring that your system maintains privacy across the entire pipeline. Sometimes we are very concerned about the privacy of our raw data, because that’s our most valuable asset, but you need to ensure that your system or every transformation or every data manipulation that you’re doing complies with that privacy as well.

Also, make sure that you are aware of how much time it’s taking you to develop a new feature or a new production model, because that’s also gonna help a lot with [unintelligible 00:20:59.04] which also will prevent you from going into half-baked features, or having very tight data jungles as well.

If you are, for example, the machine learning engineer, you’re gonna be testing for model development practices, and monitoring those models. So then making sure that everything is checked into our repository, that there is version control, making sure that there is a peer-review process… It not only has to be the senior data scientist, but everyone in the team has to be responsible for that, making sure (again) that you’re checking your impact metrics, checking the impact of your tunable hyperparameters… And also check against simpler models.

Sometimes, because we are so into deep learning, and deep learning is the most popular framework or the most popular approach at the moment, we want to use very sophisticated models… Sometimes it’s also good to go back to the basics and compare against a much more simple, or a simpler model, just to have a baseline and ensure that we are actually going down the right path, and that it makes sense, the additional cost that we’re going into [unintelligible 00:22:19.21] And again, test for your implicit bias.
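A baseline comparison can be as simple as checking a model against "always predict the most common label". The sketch below assumes a classification task; the helper names, the toy labels, and the 5-point margin are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(1 for y in y_test if y == majority_label) / len(y_test)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def beats_baseline(model_acc, baseline_acc, min_margin=0.05):
    """Hypothetical rule: the complex model must win by a clear margin."""
    return model_acc - baseline_acc >= min_margin


y_train = [0, 0, 0, 1, 1]
y_test = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
model_predictions = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # stand-in for a real model's output

baseline_acc = majority_baseline_accuracy(y_train, y_test)
model_acc = accuracy(y_test, model_predictions)
print(baseline_acc, model_acc, beats_baseline(model_acc, baseline_acc))  # 0.7 0.9 True
```

If the sophisticated model cannot clear the margin, its additional cost is hard to justify.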

And finally, the folk that is in charge of our infrastructure - sometimes it’s a DevOps person - check for the reproducibility of training, making sure that the model specification is up-to-date and correct, and we have properly [unintelligible 00:22:32.07] all of our data, all of our hyperparameters, all of our models, the training and everything, and tested for the integration of the full machine learning pipeline. Making sure that, again, everything is reproducible, whether your infrastructure – sometimes you’ll have infrastructure that is your development, your staging and your production, and making sure that across those three environments you can get reproducible results.

Sometimes you’ll have changes in infrastructure that will imply changes in your predictions, which gives you an indication that your infrastructure is not reproducible… So you’ll have to make sure that that’s not the case as well.
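A reproducibility test for training can be sketched with a toy model: train twice with the same seed and require identical results. Everything here is illustrative (the toy model, the seed, the learning rate); real frameworks have further sources of nondeterminism, such as parallelism or hardware differences, that a seed alone may not control.

```python
import random


def train_toy_model(data, seed, epochs=20, lr=0.05):
    """Fit y = w * x with stochastic gradient descent; all randomness comes from `seed`."""
    rng = random.Random(seed)  # local generator, so global random state cannot leak in
    w = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for x, y in shuffled:
            # Gradient of the squared error (w * x - y) ** 2 with respect to w.
            w -= lr * 2 * (w * x - y) * x
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

first_run = train_toy_model(data, seed=42)
second_run = train_toy_model(data, seed=42)
# Same data and same seed must give exactly the same weight.
assert first_run == second_run
```

Running the same assertion in development, staging, and production is one way to notice when an infrastructure change alters the results.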

And again, as I said before, test that you can do releases and rollbacks in a reproducible, reliable and ruthless manner. Because if it’s only one person, for example, in charge of deploying things into production, what’s gonna happen when that person is on holidays, for example? You’re not gonna be calling them at four o’clock in the morning for them to revert back to a previous version… So make sure that there is a robust way to do that without having a bottleneck. That is also very important, and ensuring that this rollback can be done safely, in a controlled manner, so that anyone can do that… Like, you would have an automated pipeline that would take care of that. Testing all of these little bits will be assigned within the specific roles and responsibilities, and it makes everyone’s lives easier as well.

Daniel Whitenack

[24:12] Yeah, and I’m so glad you went into the details of those different areas; it’s really helpful. And there’s definitely a lot there to work on. There’s a lot probably for every team, whether a small organization or a large organization, that they can always be improving on. I know a lot of the things that you talked about I know I have work to do on my end… But one of the things that I liked about the talk that you gave on this was that you developed a sort of very practical – I guess I would call it a rubric for kind of scoring yourself in these different areas, which is what you meant by giving yourself an ML score… And then kind of helping you focus your effort on where you’re lacking and where you can improve on.

I was wondering if you could go into a little bit of the details of that scoring system and that rubric, to help me develop my own ML score, and make sure I’m putting in effort where I can make the most difference.

Tania Allard

Sure. Throughout this conversation I’ve placed a specific emphasis on data science testing, machine learning engineering, and infrastructure testing. Within these three areas you have the different steps that I’ve already mentioned, the things that you should be testing at a minimum.

For example, if you have no testing in place for any of these steps, that would give you zero points, because you have no testing. If you’re testing, let’s say, from the data science perspective, and you’re manually testing the distributions or checking the distributions of your data, and then ensuring that the training of your model is reproducible, but all of this is done manually - you can assign yourself probably one point.

If you or your company has reached a more mature level, where all of these tests are done in an automated fashion, probably through your favorite CI provider - it can be GitHub Actions, Travis, Azure Pipelines, or GitLab CI - then that would give you two points.

And as you go creating your tests, you can add those points, and then compare across the three areas - data science, infrastructure and machine learning engineering… And more [unintelligible 00:26:17.01] you’re gonna be very good in one of these areas, and not so good (you’ll have lower scores) in one or two areas. There, where you have the lowest score, is the one that you should be paying more immediate attention to. Start trying to level up that score.
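The scoring scheme described above (zero points for no test, one for a manual test, two for an automated test, summed per area) can be sketched in a few lines. The individual check names below are examples drawn from this conversation, and the assigned levels are made up.

```python
# Points per test, as described: none = 0, manual = 1, automated = 2.
POINTS = {"none": 0, "manual": 1, "automated": 2}


def ml_score(checks):
    """`checks` maps area -> {test name: 'none' | 'manual' | 'automated'}."""
    return {
        area: sum(POINTS[level] for level in tests.values())
        for area, tests in checks.items()
    }


def weakest_area(scores):
    """The area with the lowest score is where to focus first."""
    return min(scores, key=scores.get)


# Hypothetical self-assessment for one team.
checks = {
    "data science": {"feature distributions": "automated", "feature cost": "manual"},
    "ml engineering": {"baseline comparison": "manual", "implicit bias": "manual"},
    "infrastructure": {"reproducible training": "manual", "rollback": "none"},
}

scores = ml_score(checks)
print(scores)                # {'data science': 3, 'ml engineering': 2, 'infrastructure': 1}
print(weakest_area(scores))  # infrastructure
```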

Daniel Whitenack

Yeah, that’s super-helpful. I was just trying to think through, while you were talking, what my score would be… I don’t know, Chris, if you were doing the same…

Chris Benson

Me too… [laughs]

Daniel Whitenack

For me I guess there is the manual and the automated tests, and then there’s the three sections - data science, machine learning development and infrastructure. I think probably - at least in my organization, on the stuff I work directly in, I’ve got a fairly good amount of manual tests going, but definitely not everything’s automated. And probably we have been more focused on the data science and machine learning development side than on the infrastructure side, just because our organization being a non-profit, it isn’t already operating a ton of infrastructure, and have some of those things in place… Although they do, for a variety of projects.

So I’m guessing I’ve got like a one going on (maybe a two) in a couple of the first sections, and then maybe where things are needed is more on the infrastructure side. What about you, Chris?

Chris Benson

I think it’s more or less the same as you were saying. I think the things that for me personally - and I’m part of a larger team - the closer we are to what I grew up in in the software development side, I think we’re actually (and this is where I’m a little bit different from you) probably more automated in actually attending the tests on the infrastructure side, and as you move over toward the data science, it’s probably pretty decent manual tests, probably not a lot of automated tests…

[28:07] So I think it varies a lot for us whether it is a team effort, and I think we’d probably get a little bit higher score, because different people are attending to different parts of it, versus when me or somebody else is doing something alone, and I’d say our scores probably fall off… So I think it’s definitely probably a testament to throwing people from different perspectives on the team probably yields us a higher score, since we’re combining that… But boy, I’ll tell you, after talking through this and then hearing Tania talk about her scoring, I’m starting to realize all the places that I need a little bit of work.

Daniel Whitenack

I think that’s cool though, because if you were to just present all this, like what Tania presented, in general, like all the pieces of the machine learning system, it could be overwhelming and a bit crippling… But if you’re able to kind of zero in on where you need to put the most effort, I think it’s really helpful in terms of starting somewhere and at least getting some more testing off the ground.

I’m curious, Tania, as you’ve gone around and presented this to various groups, what’s been the feedback that you’ve got in terms of where people – is there a consistent place that you think data science teams or machine learning teams are maybe not putting a lot of effort, where they need to? Has there been any sort of trends in that sense?

Tania Allard

I think something that has come up a lot is infrastructure, in the sense that they have a DevOps person that normally takes care of the infrastructure, but everything is very flat - they always have the same tests, they always run the same processes, without adapting to the specific cases or the specific situations for machine learning.

Sometimes we would have a bit of different behaviors, or we would need something a bit different if you are serving, for example, a model that is gonna be accessed by a lot of people, over a web app for example, or a more simple e-commerce app, that folks are gonna be accessing to buy products or to search. So I think the understanding in the testing of machine learning infrastructure is very often overlooked by a lot of teams.

Break

[30:18]

Chris Benson

So in a slight change of topic here - Tania, I think we’d be remiss if we ended our discussion about testing and machine learning systems and integrity without mentioning notebooks. I understand you gave a talk that was called “Jupyter Notebooks: Friends or Foes?” recently, and I was wondering what was your conclusion in that talk, especially – you know, given the emphasis in this episode on integrity and reproducibility… Could you share some of your thoughts there?

Tania Allard

Yeah. I’ve given that talk a couple of times, and it’s been very well received… Because Jupyter Notebooks are a tool very commonly used by data scientists. And I’m gonna say I love Jupyter Notebooks, but I always try to use them within reason. Even with the teams that I work, I try to have standards on how we work with them, have processes (again) in place.

There are a lot of very good things in the Notebooks, but there are also a lot of hidden things and caveats. The more aware you are of this, the better use you can make of this tool. And then again, it comes to having processes and workflows in place, for example.

Something that is relatively easy to do, if you have someone to help you or you’ve spent some time working with notebooks and you’re checking them into version control, is having for example a Git hook that will make sure that all the outputs are cleared out, that you are conforming to certain standards, that your paths are not referencing local paths before those are going to be checked into version control… And then, again, testing out your Notebooks, making sure that your environment is reproducible - that makes a very dramatic change in how folks are using it…
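A check of this kind, run before a notebook is committed, could look roughly like the sketch below. It relies on the fact that a notebook file is JSON with a list of cells; the function names and the local-path pattern are assumptions for illustration and would need tuning for a real team.

```python
import json
import re

# Illustrative pattern for paths that only exist on one person's machine.
LOCAL_PATH = re.compile(r"(/home/|/Users/|[A-Za-z]:\\)")


def notebook_problems(notebook):
    """Return a list of problems found in a notebook given as a parsed JSON dict."""
    problems = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs"):
            problems.append(f"cell {index}: outputs not cleared")
        source = cell.get("source", "")
        if isinstance(source, list):  # notebook files store source as a list of lines
            source = "".join(source)
        if LOCAL_PATH.search(source):
            problems.append(f"cell {index}: references a local path")
    return problems


def check_notebook_file(path):
    """Load a .ipynb file from disk and run the same checks."""
    with open(path, encoding="utf-8") as handle:
        return notebook_problems(json.load(handle))


# A tiny in-memory notebook with one problem of each kind.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["df = load('/home/user/data.csv')"],
         "outputs": [], "execution_count": None},
        {"cell_type": "code", "source": ["print(1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
         "execution_count": 1},
    ]
}
print(notebook_problems(notebook))
```

A hook script would call `check_notebook_file` on each staged notebook and refuse the commit if the list is not empty.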

[35:59] Because I know some software engineering folks - they absolutely hate it; they absolutely hate Notebooks, because it also allows for a lot of bad practices in the more traditional sense of software engineering… But I think, again, if you [unintelligible 00:36:12.23] style guides that are enforcing workflows that will allow for this quality assurance - this goes a long way.

Also, being smart about what you’re using the Notebooks for, and when it’s good or more advisable to move from Notebooks into a more traditional development practice, as having your scripts and your tests and importing your modules… And being able to discern between these two use cases - or these two different approaches rather - is very valuable.

Daniel Whitenack

I’m curious – first, I have a couple of follow-up questions, because there’s so much here… And of course, Jupyter Notebooks are everywhere, so it really does influence a lot of people’s workflow… I guess the first thing – so you mentioned a couple of the checks that you might do when you’re checking your notebook into version control, but you also mentioned maybe some caveats where Notebooks kind of can break down… I was wondering if you would go into that a little bit more.

I know, for example, one area that I’ve seen is where – and this is something that Joel Grus mentions; we had him on another episode, and he gave a talk as well, more controversially titled, I guess, “I don’t like notebooks.” But he was saying there’s a lot of hidden state in notebooks, where you might run things out of order, and it’s hard for another person to jump in and actually recreate that state that you might have had some misunderstanding of how you got to a certain place in the notebook… I was wondering if you had a similar experience.

Tania Allard

Yeah, right, I totally agree with him in that sense. There is a lot of hidden state… Jupyter Notebooks gives you so much flexibility when it comes to executing [unintelligible 00:38:01.15] and getting the output straight away, and then going back and forth… But I think especially for people that are getting started or have never had proper software engineering practices, this is not obvious. Sometimes they will work in notebooks, jump from one cell to another, get a result, whatever that result is that seems satisfactory for them, and they’re like “Okay, I’m done.” And just check [unintelligible 00:38:35.00] their notebook into state that it is.

But then if you go a step further and you have this practice of “Okay, I’ve finished this notebook, so I’m done. I’m gonna clear everything, restart my kernel, and run all of the cells to make sure again that my results are reproducible”, then you’re adding this extra quality check.
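As a rough illustration of that check: a notebook file is JSON, and each code cell records the order in which it was last run. The sketch below uses only the standard library, and the function name and the two tiny notebooks are made up for the example. It tests whether the recorded execution counts run 1, 2, 3, … from top to bottom, which is the pattern a fresh "restart and run all" leaves behind.

```python
import json


def executed_in_order(notebook_json: str) -> bool:
    """Return True if non-empty code cells carry execution counts 1, 2, 3, ...

    Gaps or reordering suggest cells were run by hand, out of order.
    """
    notebook = json.loads(notebook_json)
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        if not source.strip():  # empty cells are never executed
            continue
        counts.append(cell.get("execution_count"))
    return counts == list(range(1, len(counts) + 1))


# Hypothetical two-cell notebooks, reduced to the fields the check needs.
clean = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 1, "source": "x = 1"},
    {"cell_type": "code", "execution_count": 2, "source": "print(x)"},
]})
messy = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 7, "source": "x = 1"},
    {"cell_type": "code", "execution_count": 3, "source": "print(x)"},
]})

print(executed_in_order(clean))  # True
print(executed_in_order(messy))  # False
```

A check like this only says the cells were last run in order; it does not re-run anything, so it complements rather than replaces actually restarting the kernel.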

There are tools for example like nbval, that I love and I’ve worked with a lot, that allow you to do this regression test… And they’re very useful, because you already have the state of your notebook saved, so it runs your cells again in the background and checks whether what you’re getting is the same as what you got before. This is very useful for regression tests, for validation of somebody else’s work.
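To make the regression idea concrete, here is a toy version of "re-run the cells and compare with what was saved". It is not how nbval is implemented; it only mimics the concept with the standard library. The `cells` list and its `stored_output` field are invented for the example, and only printed output is compared.

```python
import contextlib
import io


def changed_cells(cells):
    """Re-run cells top to bottom in one fresh namespace and return the
    indices of cells whose printed output differs from the stored output."""
    namespace = {}
    changed = []
    for index, cell in enumerate(cells):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(cell["source"], namespace)  # toy only: never exec untrusted code
        if buffer.getvalue() != cell["stored_output"]:
            changed.append(index)
    return changed


cells = [
    {"source": "values = [1, 2, 3]", "stored_output": ""},
    {"source": "print(sum(values))", "stored_output": "6\n"},
    {"source": "print(len(values))", "stored_output": "4\n"},  # stale output
]
print(changed_cells(cells))  # [2]
```

The third cell is flagged because its saved output no longer matches what the code prints, which is the kind of drift a notebook regression test is meant to surface.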

But then again, something that is very obscure is the dependencies that you’re using. Unless you are actively sharing your environment through [unintelligible 00:39:31.18] or Docker or something of the sort, it’s very obscure. A friend of mine has done a lot of study around how slight changes in package versions can actually change the result that you’re getting in a workflow or in a study…
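One lightweight way to make dependencies less obscure is to record the installed versions of the packages a notebook imports. The sketch below uses `importlib.metadata` from the standard library (Python 3.8+). The helper names and the package list are placeholders, and this is only one possible approach, not a substitute for a shared environment file or a Docker image.

```python
from importlib import metadata


def pinned_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_requirements(versions):
    """Render installed packages as 'name==version' lines, sorted by name."""
    return [
        f"{name}=={version}"
        for name, version in sorted(versions.items())
        if version is not None
    ]


# Placeholder package list; the last name is deliberately not a real package.
snapshot = pinned_versions(["pandas", "numpy", "definitely-not-installed-example-package"])
for line in as_requirements(snapshot):
    print(line)
```

Saving those lines next to the notebook gives a collaborator a starting point for rebuilding the environment that produced the results.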

I think all of the hidden state and all of the weird practices, and then especially when folks only learn to use for example Python, or Jupyter Notebooks, it becomes very problematic… Because then they’re like “Okay, so I normally import a library like (let’s say) Pandas. If I develop something in a Jupyter Notebook, how do I import this notebook into my Notebooks?” So people start misusing the notebooks, if that makes sense…

Chris Benson

[40:24] It does. In my own thought process – I have this bias where I’m coming from the software development background, as I had noted before, and so I do think Jupyter Notebooks are wonderful for experimenting and trying and doing your experimenting with feature selection, stuff like that… But I also know (speaking for myself) at the first point where my model starts to stabilize a bit and I’m doing less experimentation and variance from a minute-to-minute kind of thing, I’m always looking for that first moment where I say “Okay, it’s time to get it out of the notebook at this point and really start wrapping it with software development best practices”, as we’ve discussed.

I’m kind of curious, what is that point in your own workflow, speaking for yourself, and also, what would you definitely (as part of that) not want to see happening in a Jupyter Notebook? At what point would you be saying “I’ve stayed in a Jupyter Notebook too long” in your personal workflow?

Tania Allard

Yeah, I think I very much agree with what you said. Jupyter Notebooks are excellent for prototyping, doing very fast things, and getting the initial part of the R&D process off the ground. They’re amazing. Another use case that I’ve found that is really good is parameterizing Jupyter Notebooks using tools like Papermill.

But then once my model starts [unintelligible 00:41:45.04] and I need to start making more consistent predictions, or validations, or proper training, I try to go into a more traditional software engineering practice. There is this tool by Fast.ai called nbdev, where they try to bring all of this literate programming into Jupyter Notebooks, so you can have your code and your tests, and then develop your library from there. I think it’s good to start bringing the software engineering practices into the workflow of people…

But then again, once I find myself reusing and calling a lot of functions or methods that I’ve declared in a Jupyter Notebook, and I have to reuse them, probably for the same workflow or other workflows - that’s also an indication that “Okay, this isn’t working. This has to become a standalone module, or a standalone package that I can use, and share, and reuse, and maintain separately”, rather than having bits and pieces in multiple codebases or multiple Jupyter Notebooks, and having to keep that updated, if needed.
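A minimal picture of that move, with made-up file and function names: a helper that has been copied between notebooks becomes a plain function in a module, with a test beside it that a runner such as pytest could collect. Both pieces are shown in one block here only so the sketch runs on its own.

```python
# features.py (hypothetical module): logic that used to live in a notebook cell
def min_max_scale(values):
    """Scale numbers to the range [0, 1]; a constant list maps to all zeros.

    Raises ValueError on an empty list, because min() has nothing to compare.
    """
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# test_features.py (hypothetical): a test kept next to the module
def test_min_max_scale():
    assert min_max_scale([2, 4, 6]) == [0.0, 0.5, 1.0]
    assert min_max_scale([5, 5]) == [0.0, 0.0]


test_min_max_scale()  # called directly here so the sketch is self-checking
```

Notebooks can then import the function from the module instead of redefining it, so a fix lands in one place.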

Daniel Whitenack

[43:02] Awesome. Yeah, I would definitely encourage everyone, all the listeners, to go and listen to your talks, both the ML scoring talk, but also this Jupyter talk. This is really useful and a very practical thing, so I would definitely encourage people to go there.

Also, thank you for mentioning Fast.ai. We actually mention them a good bit on the podcast…

Chris Benson

A lot… [laughs]

Daniel Whitenack

…because they contribute so much. It’s a testament to what they’re doing, that they’re mentioned so much.

I was wondering, to close us out here - so we’ve talked a lot about maybe some things that certain listeners especially who maybe are newer to the software engineering side of things, might be a little bit intimidated by, whether that’s Python project structure, tests and automation and all of this stuff… I was wondering, because you’re also in the role as a Python Software Foundation fellow, do you have any good recommendations for maybe people that are wanting to level up their ability to code good Python, that has a lot of integrity, and integrate tests, and Python deployments, and all of those things? Do you have any good suggestions for people in terms of learning resources and ways for them to level up in that sense?

Tania Allard

Yeah, well there are a ton of resources out there. That is the problem, that it’s so easy to fall into this rabbit hole…

Daniel Whitenack

There’s so much.

Tania Allard

There is – for example, if someone wants to get into more DevOps kinds of things, Emily Freeman, who is one of my colleagues on my team, she just released a book called DevOps for Dummies. It’s not focused on machine learning. It’s a very good resource to get you into that dev-opsy mindset and understanding how to integrate continuous integration and delivery into your projects. And then you can interpolate some of those things into your own data science stuff.

Something else that I recommend is – I’ve talked a lot about collaborations across teams and team members… Sometimes just sitting down with a software engineer, and having conversations or pair programming sessions where you both sit and start writing tests, or just discussing continuous integration and continuous testing, continuous deployment, testing of your programs and your models - it goes a long way. Because again, you’re learning from the other one, and getting things off the ground.

Daniel Whitenack

Awesome. Yeah, those are great suggestions. I know for me personally - I was a little bit nervous to sit down and pair program or be next to some software engineers when I was first getting started out of grad school… But that’s probably one of the ways that I learned the most, the fastest. So I can definitely recommend that. And hopefully you’ve got some good engineers on your team that are receptive to that.

Well, Tania, it’s been super-instructive and really great to talk with you today. We’ll have links to the various talks and the other things that we discussed in our show notes. I really appreciate you taking time out of your busy schedule to talk through some of these things with us. I hope we can meet at a conference or somewhere in the near future. Thank you so much.

Tania Allard

Thank you for having me. It’s been a pleasure.
