We’re partnering with the upcoming R Conference, because the R Conference is well… amazing! Tons of great AI content, and they were nice enough to connect us to Daniel Chen for this episode. He discusses data science in Computational Biology and his perspective on data science project organization.
- Twitter: @rstatsdc
- Discount code PRACTICALAI20 is good for 20% off every ticket type, including the conference & all workshops
Links relevant to the show:
- William Stafford Noble 2009 - A Quick Guide to Organizing Computational Biology Projects
- Greg Wilson, et al. 2014: “Best Practices for Scientific Computing”
- Greg Wilson, et al. 2017: “Good enough practices in scientific computing”
- Jenny Bryan’s code smells: link 1 and link 2
- Jenny Bryan on naming things
- JD Long’s talk at rstudio::conf this year about being empathetic
- pyprojroot, a Python version of R’s rprojroot and here packages
- Pandas for everyone book
- “Be kind: all else is details”. – Greg Wilson, Teaching Tech Together – The Rules
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal emerging technology strategist at Lockheed Martin. How are you doing, Chris?
I’m doing very well today, Daniel. How’s it going?
It’s going great. Over the weekend, on Saturday morning my time, I gave a workshop at ODSC Europe (Open Data Science Conference Europe). That was a good time. Virtual conferences are kind of fun, because I get to connect – there were people joining from all over the world, kind of, so that was cool, to get people joining into the workshop from all over, and get to discuss some fun things with them.
We did some transfer learning, and reinforcement learning, and GANs with TensorFlow, which was fun. Transfer learning is very much the bedrock of a lot of industry work…
It sure is.
Reinforcement learning and GANs - for me it’s a really fun topic to play around with and have some fun with. It definitely has some practical application, but it’s just kind of fun to get into… So that was a good time. It was a fun weekend in that sense. What about yourself?
Well, over the weekend I was just enjoying cooler weather. It’s been nice. We’re about to get into the fall here, and that was pretty pleasant, last week. I know that we may talk a little bit about public orientation, government-type stuff potentially today, in terms of conferences and stuff… And I was doing some work with the Air Force - it’s amazing to see government following industry into advanced technologies, things that support AI, doing software development better… That was the last week for me, and it’s just a pleasure.
It’s actually chilly outside right now.
Yeah, it’s awesome.
It’s amazing when you get through a Georgia summer, which is all very humid, and hot… And you get to this first inkling of fall, and everything is like “Aahh…”
Yeah, yeah. Well, as you mentioned in talking about government and that vertical, which I know you’re involved a lot in, I recently got connected to the R Conference, which I’ve spoken at before, the New York City version, which was a lot of fun… I think that was 2-3 years ago I was there. But now they do – of course, they’re doing virtual conferences this year. I attended the one online recently; it was super-high-quality and a lot of fun. They had a great system for it. But they’re doing several of these that are related to different verticals. There’s gonna be one coming up that is related to government and related sectors… It’s what used to be the sort of DC Conference, but is now of course virtual. I got connected to that, so we’re actually as a podcast gonna do a media partnership with that conference, and that’s gonna be really cool.
As part of that, they helped arrange today for us to have a conversation with Daniel Chen, who’s with us, who is a Lander Analytics data scientist, he’s a Ph.D. candidate as well at Virginia Tech, and a former RStudio intern. Welcome, Daniel. It’s great to have you on the show.
Hi. Thanks for having me.
Yeah. We’re gonna have to navigate this two-Daniel situation. Chris, best of luck to you.
If it makes it easier, you can call me Dan… [laughter]
I may still respond to that accidentally, but we’ll navigate this; I think we’ll get through. I don’t know about you guys, but it always bothered me when there’s two people named the same, sometimes people wanna call that Daniel², or something… But that always really bothered me, because it seems to be actually 2xDaniel, not Daniel². There’s two of them; it’s two Daniels. It’s not like I’m multiplied by Daniel Chen.
You’re not multiplied by yourself there.
Right. I don’t know if that’s just my own peculiarity.
You know, my name is Chris, and there were Chrises all over the place growing up… So I understand. You feel like you’re melting into the background when there’s five Chrises in your class. I get it.
Yeah. Well, on that note, Daniel - Daniel Chen - if you want to just let us know a little bit about yourself, about your background, how you got into doing what you’re doing now, the path that you took to there… I’d love to hear a little bit about it.
I grew up in New York City, and my dad is a software engineer; my parents, when they came to America, they both studied computer science in college. My mom doesn’t do computer science-related stuff anymore, but I’ve always had a computer at home, since when I was a kid. So back in the day, these were just old company hand-me-downs. And around high school I went to one of the math and science high schools in New York City; it was interesting, because the sophomore year it was a requirement for all the sophomores to take one semester of computer science and one semester of technical drawing. Then the people who liked computer science at that time could take the [unintelligible 00:08:13.17] class, etc. But it was interesting, the fact that every single sophomore student in high school was exposed to programming some way, shape or form.
Looking back on it, I have no idea how the instructors got all the material in, because we covered in one semester – we first went through NetLogo, which is like drawing turtles, and I made a little BlackJack game for that, a small project… Then we went through Scheme, which is like a Lisp language, to talk about lists and functions. And then towards the end we got introduced to Python, where it was like “Make your own prisoner’s dilemma kind of algorithm and we’ll compete it in the class.”
And that was all in one semester, so as an educator - I started teaching data science now - I’m always baffled when I think back “How did they make that work?” Because there’s no way I would be able to teach all of that stuff, even in a semester or so. That was super-interesting.
Yeah, did it seem overwhelming at the time, or did it just seem like new and exciting stuff, or different stuff?
It was new and exciting, but as I started teaching more – I didn’t realize it at the time, but yes, because it was a math and science high school, clearly there were people who have done this stuff before in the past, and then there were the people who have seen this for the first time… So I was in the camp of like “I’ve actually never programmed before”, but then there’s all these kids who knew the answer as the question was being written on the board, and I’m staring at a blank piece of paper, like “How do I do this?” So that was actually one of the “I don’t think I’m ever gonna do this for a living” moments… [laughter]
Yeah, it pushed you into that place, rather than further inspiration, at least at the time…
Yeah… Well, I’ve always been interested about tech and things, but yeah, programming definitely at the time seemed “This is not for me” kind of ordeal.
And then fast-forward a little bit to my undergraduate years - I ended up getting a computer science minor, just because I was like “You know what, I’m just gonna go do it, just learn how to program formally.” And that’s when I realized, looking back - the whole people who have seen that before, versus not seen it before.
My intro classes - they were relatively easy for me, even though it was like – for example, one C++ class, I’ve never actually programmed in C++ before, but I didn’t have to think about print statement debugging. That was not a brand new concept at the time. Or if statements and loops are no longer something I need to struggle with, because I’ve seen it before in the past… And then I actually felt bad for some of my students… I picked up my computer science minor in the junior or senior year of my undergraduate career… And then I felt bad for the freshmen coming in, who wanted computer science as their degree, but they’ve never seen it before, and they actually struggled really hard. So that’s when I had those feelings back in high school again.
It gives you empathy, doesn’t it?
Yeah, yeah. That’s when I actually started realizing, “Hey, wait, I have actually seen this before, and that’s why it’s easy for me.” And in some way, that carried forward, and so after that I got my masters in public health and epidemiology, which is somewhat relevant these days… And it was a two-year program, and the second year I ended up taking an Intro to Data Science class, with some of my MPH friends… And that’s where I met Jared, so that will eventually tie in somehow… So it was during that Intro to Data Science class where I sort of really understood what data science – what could you actually do.
[12:20] During the time I was doing my masters, we talked a lot about linear regression, logistic regressions, survival analysis and all of the epi-concepts associated with that. But I never knew what random forest was, or clustering, and all of that stuff until I took the data science class. And that’s when it was sort of like “Oh, if you can just think of something, something already exists to make that happen, in some way, shape or form.” So it’s really eye-opening in that sense that whatever you can imagine, you can probably make it happen. So that was great.
And then from my MPH, I entered my current Ph.D. program. Fast-forward till today, since I started, I am now doing my dissertation topic on data science education in the medical and biomedical sciences.
Do you think that those – I mean, it sounds like those experiences in high school, when you were introduced to computer science, and then when you were introduced, your vision was expanded to see all these different methods and the possibilities later in your education - do you think that pushed you to this specific interest in data science education? Or what is it you feel about data science education that – I know there’s a lot of gaps out there and a lot to be addressed, but how did your specific interest in that develop, and what are you hoping to learn and contribute through what you’re doing now in your current Ph.D. work?
Yeah, so I guess in terms of pivotal moments in my life, it would definitely be taking that data science class during my masters program. And part of it was Jared was an inspiring teacher - Jared, Kayur and Rachel, they taught the class. It was actually a very difficult class, but if you struggled through it, there was so much that you learned from it.
And also, during that class - it was the first time I’d attended a Software Carpentry workshop. So those two things put together sort of put me on the road where I am now. During that Software Carpentry workshop – a little background about Software Carpentry, which is now The Carpentries… They are a non-profit organization focused on teaching scientists the computing skills that they sort of were never taught.
So I attended that workshop, I sort of knew a little bit of Python from undergrad and high school years, and had been sort of like playing around in Bash and Git, because for some weird reason I decided to install Linux on a computer where no one that I work with uses Linux… And so the stuff that they taught during that workshop were like all those pieces - it was a little bit of Python, some Bash and some Git. And I thought to myself “Hey, I can actually do this. It’s not that much of a jump from what I currently know.”
So that’s how I got into the education area. The following semester I signed up to be a Carpentries instructor. This was back in 2014 or so. That’s where I met Greg Wilson, who was the instructor/trainer at the time. He currently works at RStudio, but that’s sort of where I picked up all of the fundamental parts of teaching this stuff.
I didn’t know that this would actually turn into a career or a dissertation topic, but that’s sort of when I realized or started thinking about what makes a good teacher, thinking about students, how to convey topics in some coherent way for people who were new to this. I did that over a large enough period of time that I eventually wrote it all down into a book called Pandas For Everyone. That is my attempt of teaching Python from a data science perspective, using Python.
[16:33] That’s awesome. So you’ve mentioned Jared a couple times… It’s Jared Lander. He’s very involved in the R world, so if you’re listening and you’re part of the R community, you probably know that name already… But he’s also involved in the R conferences that I mentioned, like the one that’s coming up later this fall, that the podcast is involved with as well… And he actually was a previous guest on the podcast as well, all the way back on episode number seven, which seems like another age ago…
And I don’t know if as part of that data science class with him, Daniel, if this was part of it, but I remember him just talking, giving a really great overview of the landscape of machine learning or AI techniques, and where certain things fit in, and how to orient yourself in terms of how for example deep learning fits into the spectrum of other techniques… So that was very useful.
I’ve got a question for you, Dan, and it’s something that really caught me when you said it a few minutes ago. You were referring to that first data science class when you were taking your masters as a pivotal moment for you… I’m kind of wondering - we have other students out there listening, and they’re kind of trying to figure out where they wanna go… What was it about that class that you’ve found inspiring? You talked a little bit about the fact that if you’d get through it, and when you could, it would help you, but what was it that really grabbed you about that? What was it that you found beautiful about data science at that particular moment?
Yeah, so there’s two parts to it. One was the people, and then second was the actual data science material. This was a class, so the people that you’re interacting with are probably going to be more important than anything else. And what I’ve also learned – this doesn’t actually apply to Jared’s case, but one of the things I learned over the years is that being a good teacher doesn’t necessarily mean you have to master the material. Being a good teacher is different from knowing the material. But it was the way the whole entire class was taught.
Jared taught the technical lab component, and he was also a carpentries instructor at the time… And so it was sort of that style of actually live-coding in the class to go through the lab material that was really good as a student to see… Because 1) it just slows you down. Instead of flipping through slide decks, it literally will just slow you down. And you see the typo error process, and stuff like that… Which is a lot to take in when you’re a student, seeing it all for the first time… But I wanna believe that subconsciously it does really help a lot, just seeing the error process.
And then Kayur Patel and Rachel Schutt, they taught the general data science landscape portion of it. And that’s where I learned about how does this apply to everything else. There’s so many techniques and methods outside of what I was learning in my epidemiology classes that I just didn’t know existed… And so just learning about those methods, and just understanding – or not really understanding at the time, but just seeing what they are, how they work, just understanding the heuristics of how they function under the hood, I saw so much…
[20:01] It was eye-opening for me just to see how this could be applied in the health space. Granted, I was doing a masters, so a lot of this stuff that we were learning in the data science class, I believe that if I were to do a Ph.D. in epidemiology, I would have seen some of that stuff eventually… But it was more just like I was doing a masters, there was so much new information about a field already coming in, and then you just threw in this analytics component and it was just like “Wow, we can do this for everything.”
So it was sort of like that eye-opening moment for me, where it was just – the teachers were great, so it kept me motivated, and then the material itself, I just was able to make so many more connections to what I was currently learning, so that sort of just kept pushing me forward.
I was really interested to hear that as you were going through that data science class, you saw a new world open to you in terms of how these techniques could be applied specifically in the medical space, or in epidemiology, like you were talking about… Do you feel that those communities now are aware of those methodologies, and data science and AI is really taking a foothold in those industries? Or do you still see it as maybe a bit of an uncomfortable mixing right now, and people still learning where things are being applied? What do you think is the current state of those things, and how do you see it progressing forward?
It’s definitely been more adopted in the medical space, especially with deep learning stuff being so good at image recognition; that’s a prime case for looking at medical imaging. But it’s tricky for other parts of medicine, because a lot of what we learned in epidemiology courses and biostats courses is trying to do inference on our data… So epi as a field - one way you can think about it is it is the field of setting up all of your observational experiment. So when you do the stats, you’re a little bit more comfortable with what is actually like a cause and effect. So if you take that part in mind, it gets a little tricky, because there’s so many machine learning methods that are really just black boxes that really don’t give you any sort of inference; it’s really just made for prediction. And so you have to be careful using these methods in a medical context if they are like these black box methods… Because if it predicts something wrong, it becomes harder to figure out why did the model predict this wrong, and usually at the other end of this is someone’s life on the line.
Yeah, the consequences are high.
[24:01] Yeah. So yes, there is a place for all of the AI/ML stuff in medicine, and you just have to be more careful when you’re trying to put a model into production, I guess, than your regular company - that’s the other way to put it; at the end of the day, in [unintelligible 00:24:19.08] the end of that model is going to affect someone’s life, versus some bottom line, I guess…
Right. And I imagine that that kind of ties into some of your feelings about good code practices, and the Carpentries stuff that you were talking about as well, in terms of understanding the implications of the code you’re writing, and how to test it, and how to deal with debugging models, and all of those things.
Yeah, so the next part/question/problem is not everyone was as fortunate as me. I went into a public health program, a medically related program, and then got thrown into data science, and then went down that track. So a lot of people who are actually practitioners, or physicians on the medical end, when they want to do research, they typically are just doing research from Excel sheets, because that’s what they know, or that’s what they went through school with doing. They weren’t taught all of the techniques and methods and skills from computer science, or data science, or just programming in general…
So yeah, that’s sort of where the Carpentry stuff comes in, where now it’s our time to teach all of the researchers the skills that they haven’t actually formally learned, and they just went through their life patching stuff together because programming was the means to get their work done, and they just had to program something or do some kind of analysis just to get the result that they needed. They just struggled with the tool because they never really had formal training.
So that’s eventually how I came to my dissertation topic, which was I’ve been teaching for so long… You know, I read education books for fun, and I’ve always had this interest in the medical space. So I found an advisor who will let me match those two things together, and I got super-lucky [unintelligible 00:26:28.11] and I got super-lucky just getting to meet her through the Virginia Tech library.
So if you are a student, definitely go befriend a librarian. Because if you think about what the people in the libraries do - they’ve been doing data science since libraries were a thing.
That’s a great point right there. So one thing I’m curious about - you have a lot of experience in both Python and R… On the Python side you wrote Pandas for Everyone to share that learning; on the R side you’re giving a talk at the R conference, focusing on the government and public sector… I’m wondering - those are two different tools within the data science toolkit, if you will… How do you see those? At what point do you turn to R and say “That’s the particular problem I’m trying to solve right now”, it lends itself better to R in your view, versus when would you turn to Python? Since you have them both, and often those two communities - people do an either/or, but for the benefit of someone who might wanna consider both, how do you see that? Where is each one for you stronger personally?
[27:43] Currently, today, the way I pick the language is like “Who am I working with?” If I’m working with my advisor, I’m probably working in Python. If I’m working with someone else who does R, I’ll probably use R. That’s today. If you’re currently an R user and you go through my book – there was a tweet a couple of weeks ago that was actually like “This book is great if you’re an R user”, because I make so many references to R things in the Python book. It’s not super-explicit, but it’s one of those “If you know, you know” kind of moments…
And what’s actually interesting these days, or now, is it really doesn’t matter which language – if it’s your first language, it doesn’t matter. Eventually, you’re gonna end up learning both. I almost feel like it’s the nature of just doing data science.
It’s the nature of programming.
It’s the nature of programming.
Lots of languages for different things.
Yeah, yeah… So as far as the first language goes, it doesn’t really matter. If you’re coming in from a data science point of view – I always make the distinctions between data science and computer science. But if you’re coming from a data science point of view, the most important thing is when you see a dataset that is “messy”, can you in your head write the general sequence of steps to make it clean again? I borrow all the terminology from the R world, which is like the concept of tidy data… So if you can see a dataset and know the steps on making it tidy, then at that point it really doesn’t matter what language you use, because you can literally just look up like – in the R world now, in Tidyverse, it’s like pivot longer or wider. So you would just google “pivot longer wider tidy R”.
And then on the Python side it’ll be like “pivot longer wider python”. But in Python it’s melt and pivot. So one of those words will show up in some search result. And I think that’s probably the more important thing - knowing the steps on processing data, and then just treating programming as like the thing to get you there… Because if you’re just starting off, you don’t know the steps, you don’t know the terminology or how to clean data, and then you’re also trying to learn a brand new language… So when something goes wrong, you don’t know if your overall sequence was wrong, or was it an actual programming type of mistake. That’s what you wanna separate as much as possible. So just pick one, learn how to manipulate data, and once you’re comfortable with that, it becomes super-easy to transition to another language.
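The wide-to-long and long-to-wide steps mentioned here can be sketched in plain Python, independent of any library. This is a minimal illustration only: the patient table, the column names and the helper functions `to_long` and `to_wide` are hypothetical, and in practice pandas’ `melt` and `pivot` (or tidyr’s `pivot_longer` and `pivot_wider`) would do this work.

```python
# Hypothetical wide table: one row per patient, one column per visit.
wide = [
    {"patient": "a", "visit_1": 120, "visit_2": 115},
    {"patient": "b", "visit_1": 140, "visit_2": 138},
]


def to_long(rows, id_key, var_name, value_name):
    """Wide -> long: emit one output row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the id column is repeated on every output row instead
            long_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return long_rows


def to_wide(rows, id_key, var_name, value_name):
    """Long -> wide: collect each id's values back into a single row."""
    by_id = {}
    for row in rows:
        wide_row = by_id.setdefault(row[id_key], {id_key: row[id_key]})
        wide_row[row[var_name]] = row[value_name]
    return list(by_id.values())


tidy = to_long(wide, "patient", "visit", "systolic")
round_trip = to_wide(tidy, "patient", "visit", "systolic")
```

Knowing that the dataset needs this reshaping step is the language-independent part; the function names are what you look up.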
When I did my data science course in my masters program, it was actually all done in R. It was actually all done before the Tidyverse was formalized as a thing… But I worked with processing data for a good year or two, and then that’s when I actually understood what tidying data meant. That transition into Python was super-easy, and that’s why the ordering of the book that I’ve put together – there was a lot of stuff in the book that was like… I learned all this from my transition from R, and that’s why there’s so many random R things in the Python book.
Yeah, that’s awesome. It sounds like – we were talking a little bit before the show about your personal data processing pipeline. When any of us go into a project, somehow we have to set up a set of scripts, programs, folders, files, config, data sources, whatever that is, to define our project and the structure of the pipeline that we’re using. It sounded like that’s something that you think about quite a bit. As you’re kind of now also thinking about data science education a lot, what are your thoughts as far as when you’re talking to students, when you’re thinking about how to educate them around your project structure, what are some of the main things that really can benefit you as you set up a new project, whether that be just something that’s analytics, or whether that be a machine learning project… What are some ways that you can help yourself down the line when you start out a project?
[32:09] Yeah, so I am in Academia, so there’s three papers that sort of talk through this entire process. The first one that I read that sort of introduced me to all of this is by William Noble, and the title is “A quick guide to organizing computational biology projects”. That was probably the first time I’ve seen in academic writing literally how do you set up the folders in a project. So you have an output folder, you have a scripts folder, you have a docs folder, you have a readme file on the top level, stuff like that.
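A folder layout like the one described can be scaffolded with a few lines of Python. This is a rough sketch and not the exact layout from the Noble paper: the `create_project` helper and the project name are hypothetical, and the `data` folder is an assumption added here alongside the output, scripts and docs folders and the top-level readme that were mentioned.

```python
from pathlib import Path
import tempfile


def create_project(root):
    """Create a small, conventional analysis-project skeleton under `root`."""
    root = Path(root)
    # "data" is an assumption; output, scripts and docs come from the discussion.
    for folder in ("data", "scripts", "output", "docs"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(
            "# Project title\n\nWhat this project does and how to run it.\n"
        )
    return root


# Demo in a throwaway temporary directory so nothing real is touched.
demo_root = create_project(Path(tempfile.mkdtemp()) / "my_analysis")
created = sorted(p.name for p in demo_root.iterdir())
```

Running the helper twice is harmless, since existing folders and an existing readme are left alone.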
And then the two other papers that sort of expand on this from the Carpentries folks - there’s a paper called “Best practices for scientific computing”, that was written in 2014. And then in 2017 there was another one that said “Good enough practices in scientific computing.” So you can see how doing good or best practices is actually pretty difficult… But it really does all stem from – one of the core pieces is having a folder structure, so that your scripts can find the data that you’re working with… And it’s focused around the idea of “Yes, it works on my machine, but it needs to work on someone else’s machine”, or another one of your machines, or the cloud as well, without having to change a whole bunch of file paths. In an ideal world, it runs on your computer with a command, and it will run on another computer with the same exact command, without you having to change anything.
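One way to avoid hard-coded file paths is to locate the project root at run time and build every path from it. The sketch below uses only the standard library and assumes the root is marked by a `README.md` file; `find_project_root` and the file name `patients.csv` are hypothetical, and packages such as pyprojroot offer a more complete version of the same idea.

```python
from pathlib import Path
import tempfile


def find_project_root(start, marker="README.md"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


# Demo: build a throwaway project, then look up its root from a subfolder.
project = Path(tempfile.mkdtemp()).resolve() / "demo_project"
(project / "scripts").mkdir(parents=True)
(project / "data").mkdir()
(project / "README.md").write_text("demo\n")

root = find_project_root(project / "scripts")
# Paths are built relative to the root, so they hold on any machine.
data_file = root / "data" / "patients.csv"
```

In a real script the starting point would typically be the script’s own location, so the same command works wherever the project folder is copied.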
So that’s sort of the overall overview of what I focus a lot on. And then there’s the super-technical parts of like – yes, Git is a thing, version control is a thing that you have to know when you’re trying to collaborate. That’s just sort of the nature of the beast. The good thing is The Carpentries has a Git lesson, so if you want to learn it on your own, it is written down somewhere… And I’ve this summer put down a few workshops that are on The Carpentries YouTube page, on like the actual super-complicated collaboration aspects of using Git and GitHub.
So most of my stuff really does focus around – project organization is the actual cornerstone or centerpiece to managing a project.
Yeah, it’s interesting – you talked about the one paper being Best Practices, and then they went to like Good Enough Practices… That concept definitely resonates with me. I was wondering – because there is software engineering best practices in industry where… I think now if you’re working on a project, you have a GitHub repo; if it’s not connected to some sort of CI/CD and you don’t have some sort of portable way of deploying this thing, maybe with Docker or something like that - that’s kind of like what people are doing a lot. But that’s a lot of things for someone in Academia or a new data scientist to learn; it can be rather burdensome. Yeah, I guess “daunting” is a good word.
In terms of people that are starting out as data scientists, do you think that’s something – as they’re embedded in an organization, should they strive for eventually learning all of those software engineering best practices, and adding that to their workflow? Or do you think there is a sort of in-between, where the workflow of the data scientist – it is different, right? There’s different data concerns in all of those things… So “How much of a software engineer does a data scientist have to be?” I guess is the end question that I’m going for.
[36:23] Yeah, so that’s the other big dilemma - a lot of workflows from data science are actually anti-patterns from software engineering. As data scientists, we primarily work in scripts that execute from top to bottom. Very rarely do we end up writing classes or things, using those software engineering tools in a data science analysis. We will write functions, that’s good, but we don’t necessarily create packages. That is considered maybe a best practice, but it’s a lot more stuff. So just writing a function is good enough, but then what happens when you have 50 functions?
There’s this tension between – well, not tension, but… The way you program things from a data science perspective is going to be different from software engineering. That’s just going to happen. It’s kind of interesting when you hear stories about data scientists working with engineers, and then when their codebases need to mesh, and that becomes a different question and problem on its own… But at least from what I’m working on now, which is the data science perspective, but it’s catered towards the biomedical sciences and those people, we even need to go an even step further back from that, thinking about best practices in that stuff… Because these are the people who are so new to this field that if you talk about Docker and CI/CD integrations, those are letters that they’ve never seen put together before in that order.
So one of the by-products of my dissertation is this – I guess you can call it a book/lesson plan that’s called DS 4 Biomeds, so data science for biomedical sciences… Literally, the first thing I talk about is “We’re just gonna talk about spreadsheets for now”, because it is probably something that they’re most familiar with in terms of a data perspective. One way you can think about spreadsheets - it is a GUI for your dataset. People like looking at things and being able to click on things… So how do we go from spreadsheets to a data science pipeline is sort of where I’m focusing more of my time these days.
I’ve just finished the first spreadsheet module, so I can actually talk about this… And putting that part together, I sort of realized that yes, we can actually introduce those tidy data concepts in the spreadsheet section… Which is like - if you’ve ever loaded up an Excel sheet… First of all, as a data scientist, when I see an Excel sheet, I’m already preparing myself over [unintelligible 00:39:11.09] a csv file.
So why do I cringe when I see an Excel file? Well, it’s because sometimes you have multiple tables in the same sheet, and from A to M is one table, and from P to Z is another table; and you have to load those tables separately. Those are data issues that happen when you’re loading in data into R, Python or whatever language… But from a lot of people who don’t actually work with programming languages - that’s great; they get to see everything at the same time. So it’s sort of like identifying those bad habits and trying to show them why they’re not conducive if you want to load them into a programming language; that’s where I’m at right now.
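The side-by-side tables problem can be sketched in a few lines. This is only an illustration, assuming the sheet has already been exported as a grid of text cells and that the tables are separated by at least one completely empty column; `split_side_by_side` and the sample values are hypothetical, not something from the episode.

```python
def split_side_by_side(rows):
    """Split a grid into tables wherever a column is blank in every row."""
    width = max(len(r) for r in rows)
    # Pad short rows so every row has the same number of cells.
    grid = [list(r) + [""] * (width - len(r)) for r in rows]
    blank = [all(row[c] == "" for row in grid) for c in range(width)]

    tables, start = [], None
    for c in range(width + 1):
        at_gap = c == width or blank[c]
        if not at_gap and start is None:
            start = c  # a new table begins at this column
        elif at_gap and start is not None:
            tables.append([row[start:c] for row in grid])
            start = None
    return tables


# Made-up sheet: two tables sharing the same rows, with an empty column between.
sheet = [
    ["site", "cases", "", "drug", "dose_mg"],
    ["A", "12", "", "x", "50"],
    ["B", "7", "", "y", "20"],
]
left, right = split_side_by_side(sheet)
```

Every loader has to do some version of this guesswork, which is why one table per sheet is so much easier to work with.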
It also comes down to the whole mantra of “You want to have empathy” for the people who are learning this stuff… And if all they get away from that first workshop is structuring their spreadsheets better, I’ll be okay with that; I’ll be happy with that.
[40:18] So it’s always about making these small incremental improvements every time you start a new project. And that happens if you’re a full-blown data scientist as well; like, yeah, maybe you have the whole project structure thing working for you, and you can have all your code work on whatever machine that your codebase is deployed on… What would be the next step for you? That might be trying to learn one of the continuous integration services, or using Docker, or something.
So there’s always something that you can do to improve your workflow, and I guess that does take a lot of effort on one’s end, because you do have to do a lot of introspection of what can be improved. The way I’ve always seen it for me - it was easier for me to do it, because it’s always like “Oh, what is the best practice?” And then I read about it, and it’s like “That is way too complicated.” And then six months later, it’s like “That seems doable now”, because I’ve learned all the other stuff in the middle that gets me there.
So there’s different entry points towards picking up practices from software engineering, but at the end of the day, data science pipelines or workflows really don’t mesh with software engineering stuff. In software engineering your end product is probably a library, or this big program thing, versus in data science it’s really this pipeline of scripts that create this model, and then this model gets handed off to the software engineers to implement somewhere else.
So those things are just going to be different, and it makes sense that the best practices on both sides aren’t going to be the same… But if you make incremental progress, you’ll eventually get to a good spot.
So you’re really thinking a lot about trying to get people working in these areas - bio-informatics and other related areas - to think about using data science techniques. If you were to look into the future, and let’s say that you’ve accomplished your goals of getting these people to use these sorts of techniques in their workflows, what are some of the example things that you envision them being able to do with data science techniques that maybe they wouldn’t have been able to do if they followed the same workflows that they have been using for quite some time?
One of the main takeaways would be just working with multiple sources of data at the same time. We have a system where every local department, organization at the government level etc. they’re doing reports of case counts, for example, on a daily basis, and they don’t necessarily all come in as one. They’re not all combined together for you. In this current pandemic - yes, you can find data sources that are doing the aggregation for you. Back in 2014 during the Ebola outbreak that wasn’t necessarily the case. We were getting daily reports from different countries as PDF files, for example.
So being able to work with multiple data sources is going to be one of those skills that are going to be super-important. And how it all ties back into why use a data science approach - and when I say that, like, why use a programming language to do that kind of analysis over something like spreadsheets… That goes into one of the most important things when you’re working with data - you always wanna keep your raw data completely intact. This way, if there is an improvement or something in your actual data science code, in an ideal world you just rerun your code over a new set of data, and then you get your updated results right away.
[44:24] That’s probably the most important idea that we have, even in today’s Covid world; that’s sort of the reason why you’ll hear recommendations changing over the past couple of months… Because in the beginning, the data itself didn’t show a conclusion. But as more data came in, if you were to rerun your analysis over and over again over new sources of data, you might actually find a new outcome.
So currently, we are in real-time living the scientific process, and part of that process is making sure that if you do have new data sources coming in, you can still rerun your analysis, and that part is reproducible. And then as more data comes in, your conclusions may change. So that’s the story of how that all ties into current times… But it really is something that is really just fundamental to data science as a whole, since we’re always querying data from the world, and we want our pipeline to be there, so that as new data comes in, we can have an updated model at the other end.
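The "keep raw data intact and rerun the code as new sources arrive" idea can be sketched as follows. This is a minimal illustration, not the speaker's actual pipeline: the directory layout (`raw/`, `output/`), the column names `date` and `cases`, the function `run_pipeline`, and all numbers are assumptions made up for the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_pipeline(raw_dir, out_path):
    """Read every CSV in raw_dir (never writing to it) and write daily totals."""
    totals = defaultdict(int)
    for raw_file in sorted(Path(raw_dir).glob("*.csv")):
        with raw_file.open(newline="") as fh:  # raw files are only ever read
            for row in csv.DictReader(fh):
                totals[row["date"]] += int(row["cases"])
    with Path(out_path).open("w", newline="") as fh:  # results go elsewhere
        writer = csv.writer(fh)
        writer.writerow(["date", "cases"])
        writer.writerows(sorted(totals.items()))
    return dict(totals)


# Demo with made-up numbers in a temporary directory.
workdir = Path(tempfile.mkdtemp())
raw, out = workdir / "raw", workdir / "output"
raw.mkdir()
out.mkdir()

(raw / "county_a.csv").write_text("date,cases\n2020-03-01,3\n2020-03-02,5\n")
first = run_pipeline(raw, out / "totals.csv")

# A new source arrives: drop it into raw/ and rerun the same code unchanged.
(raw / "county_b.csv").write_text("date,cases\n2020-03-02,2\n")
second = run_pipeline(raw, out / "totals.csv")
```

Because the raw files are never edited, the second run differs from the first only because more data came in, which is the reproducibility property described above.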
That makes perfect sense. So we talked a little bit a while ago about the fact that you’re doing a talk at the R conference, and I was wondering if you could share a little bit about what you’re talking about and what is your message to the R community. Give us a little insight into that; we’d love to hear what’s of interest to you.
My previous R conference talks have always been around the topic of this data pipelining part of data science. I think the last talk I gave last year was something around “I’m gonna teach you how to make a makefile, so you can make your reports.”
Yeah. Last year, 2019, I was one of the interns at RStudio, and I worked on a package called gradethis, which is the autograder system for R code. It can tell you, if you’re an instructor, like “Here’s the correct answer” and then the student can type in some R code, and compare the results. That’s the easier way you can grade code.
The more complicated way you can grade code is looking at the code itself. In more technical terms, it’s looking at the abstract syntax tree… So you’re literally comparing, like if the student put in, for example log of 3, and your solution is log of 2, you want a sentence that essentially says “You put in 3, where the answer should have been 2.” So creating that sentence is a lot more complicated than it may or may not seem, if you didn’t think it was a hard problem.
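To make the syntax-tree comparison concrete, here is a rough analogue using Python's built-in `ast` module. It is not how gradethis works (gradethis operates on R expressions); `first_difference` is a hypothetical helper that only handles expressions of the same shape and reports the first literal that differs.

```python
import ast


def first_difference(student, solution):
    """Return a feedback sentence for the first difference found, or None."""
    got_tree = ast.parse(student, mode="eval")
    want_tree = ast.parse(solution, mode="eval")
    # ast.walk visits nodes breadth-first, so same-shaped trees line up pairwise.
    for got, want in zip(ast.walk(got_tree), ast.walk(want_tree)):
        if type(got) is not type(want):
            return "Your code has a different structure from the solution."
        if isinstance(got, ast.Constant) and got.value != want.value:
            return (
                f"You put in {got.value!r}, "
                f"where the answer should have been {want.value!r}."
            )
    # Catch anything else (e.g. a different function name) without detail.
    if ast.dump(got_tree) != ast.dump(want_tree):
        return "Your code differs from the solution."
    return None


message = first_difference("log(3)", "log(2)")
```

The code is only parsed, never run, so `log` does not need to be defined. Even this toy version hints at why producing a helpful sentence is harder than comparing final results.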
During that process I learned a lot about R’s way of handling code expressions, so this year I am trying to teach that to regular people. How that ties into the greater R ecosystem is if you’ve worked with Tidyverse packages, you’ll notice that you are allowed to pass in column names without having them quoted in strings, but they look like regular variable names, which is terrifying from a Python user’s perspective, because lazy eval is not a thing in the Python world, but it’s there in the R world… So trying to introduce those topics is sort of my goal for the next series of talks.
[48:08] Why does this help? This is the transition of if you want to write your own Tidyverse-compatible packages for your own work, this is what you need to know to make that happen. So yes, the talk is more towards the software engineering side of things, but it’s one of those “Hey, if you wanna have your own work plug into this whole ecosystem, how would you go about doing it?” This is my part of trying to make an incremental improvement, for myself and for the greater community.
As you dug into the underlying mechanics of how R processes expressions, do you think that’s influenced how you write your R, in terms of just your general programming, and has it made you more sympathetic in terms of how you write your R with that better underlying understanding?
As far as regular day-to-day, if you were to just tell me to run some type of analysis on a dataset now, it doesn’t affect that part of it. If anything, I have a lot more sympathy for people who develop these packages… Last year I literally read The Advanced R Book like three times, over and over again, without understanding what I was doing during my internship… And then after seeing it and having it mesh in my brain for like a year - I’m reading it again, re-reading it again, and it makes total sense now…
So as far as like a day-to-day thing, it doesn’t affect it that much. But when I am writing functions and things that might need to end up – like a collection of functions; if I start writing a collection of functions, whether they make it into a package or not, I am more mindful of certain things, mainly around dependencies is what I’m really mindful about. One of the things that sort of surprised me last year was, you know, if I wanted to do some kind of grep search for a string, in Tidyverse world I would just instinctively use something from the stringr package, or something like that… But you don’t need the entire stringr dependency; if all you’re doing is a simple grep call, just use the regular built-in grepl, it’s fine.
[50:34] I sort of realized that yes, when you are a package developer, all of the engineering hurdles are now your problem. Your job is to make the end user’s life easy, and then you deal with all of the engineering burden on your end. So I definitely appreciate that a lot more.
As far as my day-to-day, it’s mainly like “Just write more functions, and try to keep working on the best practices”, and stuff like that. And then when I have a new student that I’m working with, that sort of like grounds me back, like “Okay, this is where I once was… So how do I get them to some other point? The next level in their life and programming - how do I make that transition less violent for them?”
I really appreciate what you just said actually about dependencies… You can reduce - I don’t know what the right word is - I guess your liability or your potential debugging issues in the future… I’ve done this sometimes where it’s like “Oh, I need a sigmoid function, or something.” Well, I could import any number of packages where I could call [unintelligible 00:51:48.17] but I could also just write that in a couple lines of code and just embed the function in my own code, so that it’s super-clear what’s going on. I don’t have an external dependency. I think that’s something that is underrated a lot, so I really appreciate you bringing that point to the surface.
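As an illustration of that point, a sigmoid really is only a few lines with nothing but the standard library. This is a sketch of the general idea, not code from either speaker; it branches on the sign of the input so that `math.exp` is never called with a large positive argument.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is below 1 and cannot overflow
    return z / (1.0 + z)


midpoint = sigmoid(0)
```

The whole definition stays visible in the project, and there is no external package to install or keep compatible.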
I’m super-interested to hear the other insights you have from your talk at the R conference; we’ll definitely look forward to that. And we’ll also link in our show notes to a bunch of the things you’ve mentioned: The Carpentries courses, your book, the various packages you’ve mentioned, and all of that. We’ll also link to Jared Lander’s episode, if people wanna go back and listen to that… But yeah, we really appreciate you joining us.
I think, if I’m right, that people can find out more about the R conference at RStats.ai. We’ll of course link to that as well… But yeah, it’s been a huge pleasure to get to chat with you, Daniel. I’m looking forward to spending some more time together at the conference.
Yeah, thanks for having me, it’s been really fun.