Practical AI – Episode #80

What exactly is "data science" these days?

with Matt Brems, Lead Data Science Instructor at General Assembly


Matt Brems from General Assembly joins us to explain what “data science” actually means these days and how that has changed over time. He also gives us some insight into how people are going about data science education, how AI fits into the data science workflow, and how to differentiate yourself career-wise.


Sponsors

DigitalOcean – DigitalOcean’s developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99% uptime SLA, and 24/7/365 world-class support to back that up. Get your $100 credit at do.co/changelog.

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.

Rollbar – We move fast and fix things because of Rollbar. Resolve errors in minutes. Deploy with confidence. Learn more at rollbar.com/changelog.


Transcript


Welcome to another episode of the Practical AI podcast. My name is Chris Benson, I am a principal AI strategist at Lockheed Martin, and with me as always is Daniel Whitenack, a data scientist with SIL International. Hey, how’s it going today, Daniel?

It’s going great, you know? I don’t have the Coronavirus yet, so that’s always a good thing… [laughter] And working on some interesting things, so it’s a good week. What about yourself?

About the same. I know both of us travel a lot, so I too have my eye on that very carefully; trying to listen to good, science-based recommendations on the kinds of things we should be doing. Other than that, all is going well down in Atlanta, where it’s not too cold today, although raining just a bit.

Yeah, nice. I’m excited about today. I know one thing that we’ve interacted with for a while is the “data science” industry, which – the title of the podcast is AI, but of course, AI and data science are all mixed up in an interesting way… So who do we have on the show today?

Today we have a guest that can tell us all about how data science is changing, and how people are interacting and learning from it. We have Matt Brems, who is the global lead data science instructor for General Assembly, and he is also a managing partner for a data science consultancy called Betavector. Welcome to the show, Matt!

Hey. Thank you so much for having me.

If you could just give us a little bit of background on how you got to where you got, and a little bit both from the General Assembly perspective, and tell us a little bit about what you do at Betavector as well.

Yeah, so I guess the best way to describe it is everything that I do has to do with data science. Part of my day is focused on doing data science by teaching it to other people, helping to train the next generation of data scientists. That’s a really cool thing, that’s what I’m doing with General Assembly. But also, I get to do a little bit more hands-on work by actually implementing some of the data science solutions by advising and building for a bunch of different clients through my consultancy, Betavector… So just about everything I do is very innately connected to data science.

How I got into that is sort of an interesting path… I’ve been with Betavector in some form or another for about a year now. It’s been something long in the making; we formally incorporated back in September, but I’ve been doing some data science work for kind of a long time. I’ve also been working for General Assembly for about 3.5 years now. Prior to that though, I was doing data science for a political consulting firm. I was based in the Washington DC area and I was helping to use data science to advance issues, and get people elected, and that sort of thing. Building models, for example, to forecast who’s likely to show up in the next election, or who is likely to care about issue X, or candidate Y. So I did some hands-on data science with that consulting firm. Prior to that, I was in grad school at Ohio State, where I did a master’s in statistics.

[04:32] One of the things that I felt was really cool about that was when I was in grad school I got to teach a lot. I taught a little bit, I tutored, I was a TA for a couple of classes in undergrad at Franklin College in Indiana… But when I went to grad school, that’s where I really started to ramp up that teaching. I got to teach for a bunch of very different classes. A lot of it was aimed at freshmen, but working with people who are going into the business school, versus people who are in social science majors, working with small groups of about 15-20 students, all the way up to lecture halls of 315 students. So I got a really good understanding of how to try and make statistics and data science etc. exciting to people, in many cases people who weren’t as excited to be there as I was… They actually frequently gave me the 8 AM slot to teach, so it’s a hard sell trying to get a bunch of college freshmen…

The dreaded 8 AM… [laughter]

Right, right.

Yeah, so I’m kind of curious from that perspective, since you have been speaking about this so long and kind of explaining it to students also - something I get frequently asked is, you know, given that we talk about AI a lot… You know, I’m involved in some sort of analytics things, and then my job title is data scientist, actually… But I get asked a lot “What is data science and how is it different from some of these other things?”, maybe analytics or business intelligence things that people talk about… And I realized that’s a hard question, because there’s so many different answers to that, and it really depends, and there’s a full spectrum. I was wondering, from your perspective, as you talk to students, how do you explain what data science is, how do you view its main components?

There’s two different ways I like to think about it. On one hand, if you’re familiar with Drew Conway’s Venn diagram of data science, it’s where one of the circles in the Venn diagram is math and statistics, one of the circles is computer science or programming skills, or hacking skills, and then the third circle is subject matter expertise… And so oftentimes data science is represented as the intersection of those three things - the math stats, the computer programming and the subject matter expertise… And I like that, but I do think that we need to get beyond that a little bit and think of it more as the union of these three things. And the reason that I say that is I think that when it comes to data science, depending on why you’re doing it or what your purpose is, you may not need to be an expert in programming, or you may not need to be an expert in statistics, or you may be able to not know innately or as intimately the subject matter expertise that you’re working with.

As an example, there are plenty of people doing data science-related things in Excel. I would call that data science. There’s a lot of people who may turn up their noses at it, and I think it’s important too if you want to get into data science as a professional field, that you should be able to have tools in your toolkit that are more sophisticated than Excel… But it’s certainly possible to do data science in Excel, if that’s where you or your organization are kind of at.

[07:53] That gets me into the second definition or description of data science, that I personally use… And it’s just that data science is using data to make a more informed decision than if you were to not use that data. And I like that, because I think it’s simple, I think it’s easy to understand; we’re just trying to use data to better inform the decisions and the choices and everything that we make…

And I think that that’s a very flexible definition that applies to a lot of the things; kind of like you mentioned - in analytics, and business intelligence, and all of that. I think the distinction between those or among those is fairly arbitrary, and it depends on different companies and what they want. There’s all sorts of incentives there, largely centered around how much they pay people. You might be able to pay people less if they are analysts or BI analysts than if they were data scientists, for example… But I think that at the end of the day I just think of data science as “Let’s use data to make a more informed decision than we otherwise would be making if we didn’t have that data, or if we didn’t use it.”

You raise a really great point there when you’re talking about the different levels that people are engaging in data science from, starting maybe at that Excel level, and moving up to very sophisticated applications. As we have seen in the last few years this field just absolutely explode, not only in terms of applications, but in terms of the number of roles, and people are kind of engaging in data science, or finding it at different levels… How has that fragmentation of a rapidly-expanding field resulted in - looking at it from different skills, in terms of different practitioners needing different levels of expertise and different skillsets to fulfill their own roles, with so many roles out there now?

I think that a lot of this is driven by just – I mean, you’re right, because this field has exploded, because there’s such a demand for people who can make sense of numbers and of data, I think that there are so many people kind of aiming to get into this market… I know that for example colleges and universities are doing their best to better prep people to leave and go out into the workforce. There are organizations like my company, General Assembly, where our mission is to empower people to pursue the work they love.

We wanna focus on 21st century skills, and we recognize that there’s a skills gap there that exists, and there’s a lot of people who want to be able to fill in that skills gap, or be able to close that gap between the skills they have and the skills that employers want… And at the same time there are also employers themselves who say “Let’s try and figure out how we can get the right people in these roles.” Do we hire directly out of a college or a university? Do we hire directly out of a General Assembly? Do we try and train people internally, just train a coach and train people to get into those roles? Do we, for example, reach out to General Assembly and say “Hey, can you train our team specifically to level them up?”

So there’s a lot of different avenues there, and based on the background of these different organizations and what people are working with etc. you’ll notice that fragmentation exists. Perfect example - people who are trained at colleges and universities in statistics and data science will more frequently come from R backgrounds or maybe Stata than using Python. And I think that there are a number of reasons for that, but it largely gets back to the types of problems that people in Academia tend to be solving, and how those tend to be more formal and statistical in nature than a lot of the other work people are doing in data science… Whereas I think that coming from industry, if people jump into data science from industry, you’re often working in Excel, perhaps you’re working in Tableau, maybe you learn Python… And I think Python is commonly a language of choice for a lot of them because those people spend a lot of their day cleaning and munging data, and so I think that in my opinion Python is pretty good at that.

You notice this kind of – everybody comes in with all of these different perspectives, and these different incentives, and goals… So everybody is in this big universe of data scientists… However, everybody got there through a very different path, and has very different skills as a result.

[12:09] There’s so much there that I want to follow up and ask about, but before we get too far into the conversation about the skills that people are acquiring, how they’re acquiring them and all of that, I was wondering if you could kind of give us a glimpse as to some of the main tasks that data scientists do and how AI fits into that. Because we are Practical AI, I wanna make sure and make that connection, because it is confusing, like you said. I know some data scientists who are creating their own neural network architectures, and publishing academic research papers from their company… And then others, like you say, who maybe they’re not running TensorFlow, they’re running Excel, or something like that… How does AI fit into the task that data scientists are typically approaching?

To address the first question, when you say “What are the types of tasks and things that data scientists are doing in their jobs?”, those vary quite wildly. I think that there are a lot of – I mean, again, it depends on the needs of your company or the needs of your organization, and the backgrounds that people have etc. but one thing that people are fairly surprised to hear is that as a data scientist, especially at the entry level, a significant portion of your time is spent cleaning and gathering and exploring your data. An often thrown out figure - I don’t know how rooted this is in actual evidence, but a lot of people ballpark the amount of their time in a project spent on gathering and cleaning and exploring their data at about 80%.

Yeah, I would second that. I know Chris and I have talked a couple times on here and I actually enjoy that part of my job. A lot of people seem to think it’s really unenjoyable, but –

He’s kind of sick in that way…

[laughs] It’s funny that you bring that up, because a lot of people don’t get into data science to do that. Everybody gets excited by - and connecting with the other part of your question - the artificial intelligence part. People get really turned on to the idea of neural networks, and saying “I want to learn how to build a neural network.” And that’s great. I think that neural networks can - and it’s not just me thinking this; it’s true, neural networks can be used to solve a great many problems. I also think that in order to be able to build those, you’ve gotta start out with the basics and understanding all of those inputs… Because as you and I know as practitioners of this, if your data isn’t good, it doesn’t matter if you crafted or selected the world’s greatest neural network; it’s not gonna do what you want it to do.

So I think that it is surprising to many people getting into the field how much time they spend on that exploratory data analysis, but it’s so important to understand how critical that is to everything that comes afterward.

That’s an interesting misconception that I think people have when they’re thinking of the sexy thing, like “Hey, I’m gonna go in and do neural networks, and AI” and all that. We know that so much of it is in data cleaning and other mundane tasks. Can you talk about some of the other common misconceptions that we tend to experience as we become practitioners in this, versus what maybe our expectations would have been upfront?

Sure. I think that probably the other biggest misconception, at least what I see from students that have come in on day one and who know the broad idea of data science, but not the inner workings… That’s why they’re coming to General Assembly, to learn a lot of that. And this is certainly not a – not everybody comes in with this misconception, but something that people are often surprised to learn is how much data science you can do without getting into that artificial intelligence or that neural network kind of level of doing data science.

[15:55] When I use the term “artificial intelligence”, what I mean is we’re trying to get computers to mimic or to simulate human intelligence. I think that people are surprised when we teach models like a linear regression model. For listeners who may not have heard of a linear regression model, you may have heard of the term “line of best fit” in the past, where you’ve just got a scatter plot of data points and we need to put a line through that data. Or thinking about a slightly more complicated model, but a related one - a logistic regression model, where you’re using a curve of best fit to try and predict a one-zero outcome. For example, from my political experience, will this person vote or won’t they?

I think that the misconception for a lot of people - or a misconception - is that people often think that neural networks will be the solution to all of their problems, when in reality linear regression, logistic regression, much more basic techniques (for a lack of a better word) are really helpful and are often the solution to the problem that you’re trying to solve.
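As a rough illustration of the two models mentioned above, here is a minimal sketch in plain Python. The data points, the function names (`fit_line`, `vote_probability`) and the logistic coefficients are all invented for illustration; nothing here comes from the episode or from any real election data.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def vote_probability(x, b0=-3.0, b1=1.0):
    """Logistic curve: maps a score b0 + b1*x to a probability between 0 and 1.

    b0 and b1 are hypothetical coefficients, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# With the assumed coefficients, x = 3 sits at the midpoint of the curve.
print(vote_probability(3.0))
```

The linear model predicts a number directly; the logistic model squeezes the same kind of linear score into a probability, which is then thresholded to get the one-zero answer ("will vote" or "won't vote").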

You make a great point there, because I’m always telling people don’t start with a neural network, because there’s quite a bit of expense in a variety of ways to doing that. Start with the thing that will solve it that is the cheapest mechanism.

And I think that that gets to the heart of data science. Our goal is to be able to solve problems. Very few of us are doing data science just for the sake of doing data science. We’re not building neural networks just because we want to build neural networks. Now, I think building neural networks is cool, don’t get me wrong… But at the same time, when we are paid by an organization, we’re paid to solve problems, not to do data science just for the fun of it… So it’s really important to always keep that in mind, and I think that that’s probably the last misconception that I’ll bring up - and maybe it’s not a misconception… People often lose the forest for the trees, and they focus so much on the modeling technique that they’re using, and then forget that the reason that they’re using that modeling technique is to try and solve a problem and get a more complete picture of the world around us.

So Matt, as we’ve kind of ended up talking about the AI side of things and how that’s integrating with data science, and also how there’s been this sort of explosion of roles and diversity in data science, I was wondering, as you’ve taught data science over the years, how has the toolkit that you’re teaching and that a lot of people are using for data science - how has that shifted or changed over time? And I mean in the sense of have things become more standardized, less standardized, as opposed to maybe a few years ago? Are you having to teach TensorFlow and GPUs now, whereas maybe before it was Pandas and scikit-learn? Or is it both? And also, how has the quality of that toolkit changed as you’ve been teaching it, in terms of its robustness and integrity and all that?

Yeah, all really good questions. I think that in terms of the – the toolkit continues to evolve. Let me lay the groundwork first; the program that I teach is a 12-week Monday to Friday, 9 AM to 5 PM class. So it’s a full-time immersive program, designed to take people from (I’ll call it) approximately square one, where on the first day we’re talking about things like data types, we’re talking about control flow, and what’s the difference between a for loop and a while loop, and all of that stuff in Python, to - at the end of week six students are presenting a project that I like to call the Reddit Project, where they choose two subreddits of their choice; they use the Reddit API, they will scrape thousands of postings from two different subreddits, and then they will use NLP (natural language processing) techniques to parse those out. Then they will train a classification model to get the computer to understand, and if you gave it a new post, it would be able to tell you “Does that come from subreddit A or subreddit B?”

It’s really cool to see how quickly people grow with that. That’s the halfway point in our class, and then people just kind of skyrocket from there. So that’s a little bit of background in terms of the program that we teach.
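To make the shape of that subreddit-classification project concrete, here is a small sketch of the final step only: training a text classifier and asking it which of two sources a new post came from. It makes several assumptions that are not from the episode: the posts are made up rather than pulled from the Reddit API, the labels "baking" and "cycling" are hypothetical stand-ins for two subreddits, and the model is a hand-rolled naive Bayes classifier with add-one smoothing, chosen because it fits in a few lines. The episode does not say which model students use.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train(posts):
    """posts is a list of (text, label) pairs.

    Returns per-label word counts and per-label post counts.
    """
    word_counts = {}
    label_counts = Counter()
    for text, label in posts:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_posts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(label_counts[label] / total_posts)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented stand-ins for posts scraped from two subreddits.
posts = [
    ("my sourdough starter finally rose", "baking"),
    ("best flour for crusty bread", "baking"),
    ("oven temperature for bread loaf", "baking"),
    ("new bike chain keeps slipping", "cycling"),
    ("best tires for gravel bike rides", "cycling"),
    ("commute by bike in the rain", "cycling"),
]
word_counts, label_counts = train(posts)
prediction = predict("which flour for bread", word_counts, label_counts)
print(prediction)
```

A real version of the project would work with thousands of posts, hold out a test set, and most likely lean on a library rather than hand-written counting, but the train-then-predict structure would be the same.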

In terms of the toolkit that we’re using, given the Python stack, and focused on Pandas and scikit-learn and StatsModels and all of that, those continue to update. I think Pandas is at 1.0.0, which was just released recently. Those continue to evolve, so we’re staying on top of those changes as they go. We have also expanded the amount of content that we’ve had. For example, deep learning - we used to have one two-hour lesson on neural networks, just because 3.5 years ago that was something that was not expected to be seen in an intro-level data science role. It was good to show people neural networks, but it was not reasonable that people were going to be fitting neural networks in that entry-level role… I will say in most cases. Certainly not all, but in most cases… To now we’ve extended that to a full week of the class.

So we are continually changing that in response to a couple of things. One - what we see as the instructors and what we see as our product team as the industry changes and develops, and two, what we’re also seeing from our alumni people graduating, they say “Hey, this is what I’m focused on in my job. These are the things that were most helpful. If you were to change things again in the future, these are the things that you should perhaps include more of moving forward.”

Matt, a moment ago you talked about the course that you teach, the 12-week completely full-time immersive course, which raises the question - there’s so many different ways these days of engaging in education to fit people’s needs, to fit their lifestyles, and such as that… So how are you seeing people engage in those different ways? How does General Assembly fit into that? Could you describe as you see it, being in that space, how you see the options that are available for people out there?

[23:48] Yeah, of course. We were talking before the break about people who want to get into the field, and the options they have. Some people may go to a college or a university, some people may go to a General Assembly, some people may be trained by their company in some way, or they may self-study… There’s a lot of different options out there, and I think that it all comes down to a couple of dimensions, for most people. I think that people think about the time investment, people think about the monetary investment, I think people also think about - related to that monetary investment - the opportunity cost… For example, if they were to leave work and go to General Assembly, or if they left work to go to grad school, or if they went to grad school at night. And then also, the practicality of the skills that they’re doing, the hands-on nature of the program.

Where I personally think that General Assembly sets itself apart is it provides, like I said, this three-month program to take people from approximately square one to be well-qualified for entry-level data science roles, and people are often able to get more senior roles, depending on their backgrounds coming in.

I think that a great deal of that has to do with the applied nature of our program. I think when you ask where GA fits into everything, I think General Assembly fits into it just because of its applied nature. Like I shared earlier, I did a master’s degree in statistics, and I learned a ton of really good and important things in my master’s program. At the same time, there were a lot of things that were not as applied as I would have liked them to be. As an example, I can’t recall working with any data that was missing in my grad program. I’m sure that you both recognize how much missing data we deal with on literally a daily basis in the practice of data science, and how we choose to deal with that is a problem and is a challenge for us.

I think that with minimal exception, the largest dataset that we worked with in grad school was probably about 200 rows, whereas that’s not the case in most data science roles. Everything was already sanitized, and we just focused on building a model instead of thinking about everything else that goes into it.
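The missing-data point above is easy to see in a tiny example. The sketch below, in plain Python, contrasts two common and very simple choices: dropping incomplete rows versus filling gaps with the column mean. The rows, column names and helper names (`drop_missing`, `impute_mean`) are hypothetical, and neither strategy is presented as the right answer; which one is appropriate depends on why the data is missing.

```python
# Made-up records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def drop_missing(rows):
    """Keep only the rows where every field is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, column):
    """Return new rows with missing values in `column` replaced by the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

complete = drop_missing(rows)
filled = impute_mean(impute_mean(rows, "age"), "income")

print(len(complete), "of", len(rows), "rows survive dropping")
print(filled)
```

Dropping rows here throws away half of an already small dataset, while mean imputation keeps every row at the cost of inventing values, which is the kind of trade-off the clean classroom datasets described above never force anyone to make.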

So all of that is to say I think where GA fits into that is for people who say “Look, I want to commit to learning skills”, perhaps they need more of that personal – “oversight” isn’t the right word, but maybe somebody who says “Look, I don’t have the responsibility to do that myself. I don’t know if I could sit down and learn data science on my own, start to finish, so I need some support in getting there.” And for people who are looking for specifically practically-applied skills, I think that’s where General Assembly fits in the broader landscape of things.

As a follow-up to that, I’m kind of curious, and I’m thinking a little bit selfishly myself now… You have people obviously who are working full-time and they’re just getting into data science, or they’re fairly early on, and you have people like Daniel and me, who do this for a living, but we’re in a fast-moving field that’s constantly evolving, and we’re constantly leveling ourselves up as this continues on… What are some of the options for people who are working full-time, have families to support, that they can do? What do you recommend and what are those options for us to do?

I think that if for example leaving a job is a non-starter, which I think it is for many people, I think that the best options that are out there might be either looking at a graduate degree part-time in the evening or on weekends or something like that, or self-study.

The biggest challenges that I’ve identified with those - for example if somebody goes to grad school at night, it just takes way longer. Again, it’s a trade-off - that time and that money. We see at GA people will be willing to say “Look, I’ll step back from work for three months and get a job toward the goal of being able to shorten that amount of time between now and when I would be able to get those new skills.”

[27:51] And I think for people on the other end, who say “Look, I’m just gonna study on my own, I’m not gonna commit to a graduate program that might cost however much money. I’m going to study on my own” - it can just be difficult. There are so many data science resources out there that understanding what the right thing to try and learn is, and when, and how it fits into the broader picture - it can be quite challenging to do that. Certainly not impossible, but I think that those are probably the two easiest options, or the best options for people who are currently working full-time.

One caveat or one other thing that I will share - and I promise I’m not trying to turn this into a General Assembly ad, or anything like that… But there are also part-time classes available in the evenings. For example, there’s a part-time Python class, there’s a part-time data science class… Those certainly don’t go in as much depth as you would see in the full-time class, but if you say “I’m not able to leave a job, but I want to get that baseline set of skills that will give me a good enough starting point where I can jump off and then start learning more on my own” - that’s something that I think is available to you as well.

I’m curious, after doing some teaching myself in industry, and in university a bit as well, I know one of the challenges that I’ve faced in the past is standardizing a data science or AI-related curriculum for people with a varied number of backgrounds… So I was curious, how do you approach that in the work that you’ve done over the years, and how have you seen – some people, like you said, might come into the beginning of a program and they’re already Python experts, right? So learning about a for loop or a while loop - they’re not gonna struggle with that; they may struggle with other bits, that are on the other hand easy for people that already know maybe a bunch of math, or something like that… So how do you go about standardizing that sort of curriculum when there’s so many people coming from so many different backgrounds into data science?

Yeah, so one of the things that we do is there’s a certain level of pre-work required in order to get everyone to a similar starting point on day one. For those people who are Python experts, I will broadly say that the programming piece for them is going to be quite easy there, and they can kind of blow through it and not need to worry about their ability on day one.

We do things quite quickly, so by the end of, for example, day two, people are – I’m trying to remember the exact flow of things… By the end of day two, people are writing functions, and doing list comprehensions in Python… Which are still going to be very basic for people who know that coming in, but we condense it down… Because it is an immersive program. We’ve got 12 weeks, and we say “Look, let’s make the absolute best of those 12 weeks.”
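For readers who have not seen these early-course Python ideas, here is a short sketch of what "writing functions and doing list comprehensions" can look like, alongside the for loop and while loop mentioned earlier in the conversation. The function names are hypothetical and the example is not taken from the General Assembly curriculum; it simply computes the same list three ways.

```python
def squares_with_for(n):
    """Build the squares of 0..n-1 with a for loop over a known range."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_with_while(n):
    """Same result with a while loop, which repeats until its condition is false."""
    result = []
    i = 0
    while i < n:
        result.append(i * i)
        i += 1  # Without this update the loop would never end.
    return result

def squares_with_comprehension(n):
    """Same result again as a list comprehension, in a single expression."""
    return [i * i for i in range(n)]

print(squares_with_for(5))
print(squares_with_while(5))
print(squares_with_comprehension(5))
```

A for loop suits cases where the items to walk over are known up front, a while loop suits cases where only the stopping condition is known, and the comprehension is a compact form of the for-loop version.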

So our pre-work is an attempt to get folks who may not be at that level yet to prepare before the program, so that they’re at that level. Then, on day one, we say “Look, we’re gonna talk about data types. It’s gonna be quick. Make sure to follow along. We’re gonna work with you.” We give people support along the way, of course. However, we move at a pace such that if you do come in with that advanced Python background, we’re getting into statistics and distributions and all of that by day three or day four, so it’s not a very long time that people tend to - if you come in with that great Python background - that people say “Oh, okay, I’m bored. This is not as challenging as I thought it would be.” We get through that pretty quickly.

I’m wondering – one of the questions that I get asked a good deal from companies that I’m advising with is “What sort of person should I hire for a data science position that maybe isn’t a data scientist now, but they could grow into one?” This is similar to what you’re talking about; there’s people that come from a lot of different backgrounds, and maybe they can grow into a data science position, or maybe they can – there’s a lot of different backgrounds that can go into your General Assembly program, and that sort of thing…

[31:51] I was wondering, from your perspective, to kind of help my understanding of that, are there certain backgrounds that you feel like lend themselves very well to quickly adapting to the data science world? I guess, from my perspective, what I’ve told people in the past is “If you’re hiring someone in your company, maybe it would be best to train up the engineers in your company to grow into data scientists”, because they’re used to building things and thinking about product, thinking about testing, thinking about robustness… And a lot of times I see people struggling with that bit in the data science world, even after becoming a data scientist. So I see those people as having an advantage from that perspective. But I know that I’ve also known people with a philosophy background that do amazing things and still have a level of practicality… So do you have any insight on that?

Yeah, I think you’re absolutely right. Certainly, if somebody has a computer science background, I think all else held equal, they will tend to be better at adapting if they need to learn a new language, or change the way that they’re using Python to focus more on a data science thing. I think that if somebody comes from a math background, they will have an easier time with understanding some of those statistical and probabilistic concepts that laid a foundation for data science and are very important.

At the same time, when you talk about the person who has a philosophy background who does well, we see that. Some of the stronger students we’ve had have come from very different backgrounds - chemistry, journalism, English, law… It’s certainly not limited to folks who just have a math background, or just have a CS background, or something like that. When it comes to thinking about within your company, I would agree with you that it can be better - and obviously, your circumstances are particular to you, but in many cases it can be better to train someone up internally and upskill them, or reskill them, depending on what it is you want to do.

For example, you might take a programmer and you may say “Look, you know Python. Well, we want to extend your knowledge of Python”, so you might upskill them and give them that additional skill. Or you might take someone who knows the business really well, maybe somebody who’s a financial analyst within your company, works in Excel most days, doesn’t have a programming background, doesn’t really know the math or the stats, but you say “You know what - you know the business or the organization well enough that you are well-suited to shift into this. So let’s reskill you and give you a whole new set of skills that are fairly foreign to you, but given your knowledge of the business and given your knowledge for those problems that we’re solving.” Again, it comes back to the fact that data science is really just us using data to solve problems. It can be easier to, perhaps, upskill or reskill those individuals.

From a strictly financial point of view, it tends to be more economical to do that. I don’t know the exact statistics offhand; I’ve seen numbers on this, and I’m sure that we could find them and share them out afterward. But it is generally much more expensive, as well as riskier, to hire someone new from the outside for a role than to train someone up internally. Your mileage is going to vary there, but a lot of it is just that you take a risk when you hire someone from the outside, where they don’t know the business, they don’t know the problems you’re trying to solve.

So bringing this back around to your original question - when it comes to what are the backgrounds that tend to suit people the best for this, I think that the biggest traits that lend themselves to people learning data science, if they do need to upskill or reskill or change direction, tend to be grit and logic.

When you start out programming - I know this was the case for me, and I imagine this may be the case for you as well… Programming for me was really hard. It was something where it took like three times of me learning to program for that to really click. And I’m not sure what it was, I’m not sure what that mental block was for me, but it was something that was really challenging.

[35:57] Whenever you start programming, especially in a new language, you’re gonna get a bunch of unfamiliar errors and warnings and exceptions that tell you “Hey, you’re doing these things wrong.” And it can be very easy to just throw your hands up and say “Screw it, I’m done. This is not for me.” But the more you realize that that kind of is a common experience for everybody, that even experienced programmers are looking things up on Google, checking out Stack Overflow to see other people who have had these problems before, you realize that you are meant to be there. So just that willingness and that grit to try and try and try it again, even when it’s initially tough, that logic - I mean, logic is an integral part of the stats and the programming needed in order to be a data scientist.

Matt, we’ve focused very heavily on the experience of practitioners as they’re learning and going through education and seeking out what fits them… I’ve got another question for you that is kind of adjacent to that, and that is - we have lots of managers and executives out there in organizations that are not practitioners themselves, they’re not gonna be doing the data science themselves… But they’re in a position where they have to make lots of decisions, and they have to decide things for budgeting, and for a strategic path forward for the organization… For those managers and executives that are in that position of having to make those choices without the fundamentals that the practitioner has, what are the tools and the skills that those individuals need to be able to do their job effectively in this day and age?

I think that it’s a really good point, and something that, again, GA has been working on… Because if you think about the skills gap, the gap between what skills people have and the skills that their organization requires of them - that skills gap is quite large in some cases, and it’s gonna differ from person to person, what are the skills that somebody needs in this role versus where they are now. And that’s true of executives as well.

So when it comes to minimizing or erasing that skills gap and figuring out what are the skills that executives need, in my opinion so much of it is rooted in understanding the – you certainly don’t need to be an expert programmer, you certainly don’t need to be a master statistician or something like that. Instead, what I think is important is people need to understand the provenance or the source of the data, they need to understand the biases that may go into that data, and then they need to understand how that data is being used to solve that problem.

A perfect example - let’s say that we wanted to understand what our customers thought of our product; there are a number of different sources that we could check out to gather data, but let’s say a data scientist under you does an analysis. And they do this analysis, they share their results with you, and it looks like people are really dissatisfied with you and with your company, and the executive knows to ask “Okay, where did we get this data from?” and the person responds “We used the data from Yelp that’s available about our business.” Well, as a person, we recognize that people who go and they post on Yelp, that is not gonna be a random set of people. People generally post on Yelp in one of two situations - one, they’re just kind of a constant Yelper, so to speak; they’re just gonna put ratings up everywhere they go. Or the other time, people are generally only gonna be driven to Yelp when they have a really bad experience and they kind of need the world to know about it.

I’m painting with a very, very broad brush here, so all sorts of approximations are in there… But recognizing that potential for bias is huge in understanding how you can make decisions. So when it comes to the skills that an executive can have, I think that understanding what are the right questions to ask, try and poke holes in the analysis… I think that, certainly, if an executive knows about something called overfitting versus underfitting, and knows how to assess for that, that is great. However, I think that in many cases being able to understand the decision that’s being made, and just ask some questions about “Okay, what’s the source of the data? What are the steps you took here? When you dropped data/observations, why did you choose this? What if you had done it differently? How does that change your results?”

[40:19] I think just asking good questions and being generally data-literate and being aware of the biases that are around us - I think that probably puts them in a position to be successful, as opposed to saying “As an executive you should know Python”, because in reality the chief data officer or the chief financial officer making decisions based on data probably doesn’t have to spend a ton of their day in Python, depending on the size of your organization.
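The Yelp example from this answer can be sketched as a small simulation. This is a minimal illustration, not a model of real reviewer behavior: the function name `simulate_reviews`, the 1-to-5 satisfaction scale, and the posting probabilities are all assumptions made up for the sketch. It shows how an average computed only from people who chose to post can sit well below the average across all customers.

```python
import random
import statistics


def simulate_reviews(n_customers=10_000, seed=0):
    """Simulate true satisfaction scores and which customers post a review.

    Every number here is an assumption chosen for illustration only.
    """
    rng = random.Random(seed)
    all_scores = []
    posted_scores = []
    for _ in range(n_customers):
        # True satisfaction, from 1 (very unhappy) to 5 (very happy).
        score = rng.randint(1, 5)
        all_scores.append(score)
        # Assumed posting behavior: unhappy customers are far more
        # likely to post than everyone else.
        post_probability = 0.30 if score <= 2 else 0.03
        if rng.random() < post_probability:
            posted_scores.append(score)
    return all_scores, posted_scores


all_scores, posted_scores = simulate_reviews()
true_mean = statistics.mean(all_scores)
review_mean = statistics.mean(posted_scores)

print(f"customers: {len(all_scores)}, reviews posted: {len(posted_scores)}")
print(f"average satisfaction, all customers:  {true_mean:.2f}")
print(f"average satisfaction, reviewers only: {review_mean:.2f}")
```

The gap between the two averages is the kind of thing the question “Where did we get this data from?” is meant to surface. Changing the assumed posting probabilities changes the size of the gap, which is itself a useful "what if you had done it differently?" check.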

I’m curious - this is related to management in the sense of hiring, but we’ve talked about how there’s all of these different backgrounds that people can come from, and all of these specializations in data science, and there certainly are many and varied specializations… But I was wondering, based on your recent experience with General Assembly, and maybe things that you’re just monitoring in the industry, what are the top specializations or skills that people are hiring for immediately and very rapidly, for example? Is that NLP, or is that quantitative finance? Maybe it’s an industry, or maybe it’s a skill. Are there any standouts that you’ve seen?

Honestly, the most common thing that I have observed in terms of data scientist roles is SQL.

Interesting.

Knowing how to get data out of databases I think is something that is highly sought after in data science roles… And it’s interesting, because we were talking earlier about the fractured nature of the data science industry, and all of the different things people are looking for in data science. You’re looking for a unicorn. You want someone with a bunch of interpersonal skills, and programming skills, and stats, and subject matter expertise, and all of this stuff.

The closely-related title “data engineer”, dealing with more the back-end and figuring out “Okay, how do we take information, how do we structure databases” and all of that. There’s a bit more of a melding between that, because now people want data scientists to also be at least moderately well-versed in SQL. Different companies and different organizations will have different perspectives, but I think that SQL is an important thing for everybody to learn.

I also think that in addition to that, thinking about SQL, for my money it’s probably an easier skill to learn than Python and some of the other stuff that’s out there. But that almost feels more like table stakes than a differentiator, where - to put a bow on this - people knowing SQL is kind of a… People need that in order to be considered for the role; it’s not like knowing SQL would immediately qualify someone for that role or set them apart, if that makes sense.

Table stakes is a good – when you said that, I would agree. It’s kind of a universal thing that almost everyone needs to know.

I don’t know, I would see it as a differentiator if I’m thinking about people, too. There’s certainly a lot of people that can know SQL, and there’s a lot of people that can learn and know the data science tooling in Python, whether it be TensorFlow, or just Pandas or scikit-learn, or whatever… I think it’s a very unique differentiator if you’re able to connect the SQL world and the database world with the other world. If you’re able to drive models out of SQL queries in a reasonable, efficient way, and you’re able to connect Python services that are running and doing something fancy with the database… I think that sort of thing is a real standout, from my perspective.
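As a rough sketch of what connecting the SQL world to the modeling world can look like, the example below lets the database do the aggregation and then fits a simple model on the query result in Python. Everything specific is hypothetical: the `orders` table, its columns, the toy rows, and the helper `fit_line` are invented for illustration, and an in-memory SQLite database stands in for whatever database an organization actually runs.

```python
import sqlite3

# Hypothetical table and data, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, ad_spend REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, 25.0),
        (1, 20.0, 45.0),
        (2, 30.0, 65.0),
        (2, 40.0, 85.0),
        (3, 50.0, 105.0),
    ],
)

# Let the database do the grouping and averaging, and pull back only
# the columns the model needs.
rows = conn.execute(
    """
    SELECT customer_id, AVG(ad_spend) AS avg_spend, AVG(revenue) AS avg_revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()
conn.close()

xs = [row[1] for row in rows]
ys = [row[2] for row in rows]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
print(f"avg_revenue ~ {slope:.2f} * avg_spend + {intercept:.2f}")
```

In practice the fitting step would more likely use a library such as scikit-learn, and the query would run against a production database. The design point is only that the filtering and aggregation happen in SQL, so the Python side receives a small, model-ready result.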

[43:59] As we’re getting towards the end, I’m curious - what are the things that you’re excited about, Matt, in terms of the topics and problems that we’re finding in data science now? What are the things that are really making you look at it on a day-to-day basis and go “Hm, this is really exciting me”? Where do you think this is gonna go over the next few years; what kinds of interesting topics and exciting problems do you think we’ll be contending with over the next few years?

I think that the biggest problem that we’ll be grappling with as a society (for lack of a better word) is the idea of deepfakes. I think that the methods that are out there have become very powerful, and I think that what we will see probably in the next five years or so - maybe sooner, just given the political landscape - is understanding how we are able to create images that obviously are fake, they’re not real, but they look really, really real.

We can do this even with video, we can do similar things with audio; it’s not just still images. We’re able to create audio and video and all of that that literally doesn’t exist, but could be used to make a point. I don’t mean to sound the alarm, but I do think that that’s gonna be a really important thing that we need to reckon with as we see how the tools of our field can be used in nefarious ways. And there are a lot of ways that we’re more aware of that, but I think that’s only going to become more and more of our conversation.

It is not enough to do a drive-by nod to ethical behavior, and say “Hey, in order to be an ethical data scientist or an ethical consumer of data, this is something that you should be aware of” and then move on and forget it. Instead, when we’re teaching these methods and when we’re learning about these things and when we’re putting them into practice, I think we’re gonna have to apply an ethical lens to everything… And I think the best data scientists already are. But when it comes to the developments in the field, what I’m most (I think) excited about, but also what I think is more important, is a better understanding of what are our ethical obligations as data scientists. In an organizational role, if I’m a data science professional, what are those obligations that I have?

As somebody who knows data science, even if I’m not working at the moment, but I’m at home with family and stuff comes up, or you see something on TV, what are those ethical obligations to maybe call out those biases that we’re talking about, that executives should be aware of, or to recognize things that aren’t exactly right? So that’s probably, for my money, the most important thing moving forward: understanding how we can be good stewards of the data that we collect, how we can be good and ethical practitioners of data science, and making sure that we are developing things - I will broadly wave my hands and say - for the good of all of us, as opposed to the bad.

I think that’s a really good place to end the conversation. We have another episode that covered deepfakes in some detail, and we’ll for sure link that in our show notes, so check that out… But definitely check out what General Assembly is doing in terms of their data science education, and we really appreciate you joining us on the podcast, Matt, and sharing your perspective after so much experience in data science and in data science education. I really appreciate it.

Yeah, absolutely. Thank you so much for having me.

