We went back into the archives to conversations we had around data science at OSCON 2017. We talked with Vida Williams (Data Scientist) and Michelle Casbon (Director of Data Science at Qordoba) about the social impact of open data, personal data and transparency, privacy, the big data problem of public surveillance, electronic fingerprinting, the rift between data scientists and computer scientists, natural language processing, machine learning, and more.
Notes & Links
- Citizen Data by Vida Williams at TEDxRVA
- Qordoba - Localization Software for Global Brands
Transcript
Unless you’re a data practitioner in the world of open source developers, it’s not really on the core of everything.
True.
I have to make a compelling case to be interesting.
I see data science and I get excited. And I’m an open source developer. So maybe I’m the outlier.
Well, it was interesting, because one of the things I talk about is open data; that’s specifically what I’m interested in, the social impact of open data, like how do we come together –
That’s what we wanna talk to you about.
[04:01] So that’s my thing, and there’s just now burgeoning conversation around it. I think we tried to have it (interestingly enough) twenty years ago, but there wasn’t an infrastructure for open data at the time.
Who’s “we”?
Data practitioners. I mean, my first big project was a DPA data project, so that was big data before big data was big. We were doing something stupid that 15 years later we knew not to do, and that’s move from mainframe into relational. You probably don’t wanna do that to that volume of that.
That being said, at the time there were discussions around transparency and open data and who should have access to it, but there were no standardizations, there were no protocols, there were no accesses, there were no platforms. Now we’re finally in a place where we can have this discussion, because especially in the open source sphere, all that stuff exists. So now it’s regathering the vendors, if you will, all the data superheroes and going “Hey, we can now hold everybody accountable for privacy, for standardization, for protocols on access, in order to actually make a difference, so why don’t we do that?” So anyway, that’s what the talk was about.
Cool.
Interesting. We’ve actually had some shows – we’ve been around for a while; in 2009 we started this show, and we’ve talked about open data, mostly in the government space a couple times… I’m looking for some older shows… It’s been a while. This is like the first one - Civic Hacking, with Luigi Montanez and Jeremy Carbaugh; that was when they were both working with…
Sunlight Labs?
Yeah, Sunlight Labs…
Sunlight Foundation?
Yeah.
Well, now you have the Presidential Innovation Fellows (the PIFs) who are in that whole White House-sponsored open data platform… But an interesting question came up in my session about if this conversation was before, and what do we do about the question of privacy? It was really like, “Okay, so if everybody is supposed to have this personal data, then how do we accomplish this around privacy?”
My response was we as data practitioners need to challenge the hypocrisy of privacy. We want to put a camera everywhere and be able to develop in reality TV, and there’s no privacy communication there, but all of a sudden you’re a data point, and there’s all of a sudden a need for privacy. So we as practitioners need to actually challenge the definition of data as though image is somehow not data, and thus exempted from privacy, but if you’re a number or some type of codified information, then all of a sudden there’s privacy rules.
That’s interesting, I’ve never really considered the idea of cameras being somewhere, and considering that, I hate that, too. I may be somewhat of a devil’s advocate, but I’m not sure of your perspective… It kind of bugs me that you can take six data points and figure out exactly who I am - male, color, where I originated from, how much money I probably make, if I had kids… You could take six data points and pretty much figure out roughly everything about me besides my name. That’s the world we live in, but should we accept that? Is it okay to have all that – and I’m born in ‘79, so I’m 38 years old. People born in today’s age, they’re like – it’s second nature.
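The "six data points" worry can be made concrete with a small sketch. It counts how many records in a dataset become unique once several non-name attributes are combined. The field names and the four records below are invented for illustration; they are not from the conversation or from any real dataset.

```python
from collections import Counter

# Hypothetical attributes: none is a name, yet together they can single a person out.
QUASI_IDENTIFIERS = ("sex", "birth_year", "zip_code", "group", "income_band", "has_kids")

# Made-up records, for illustration only.
records = [
    {"sex": "M", "birth_year": 1979, "zip_code": "12345", "group": "A", "income_band": "high", "has_kids": True},
    {"sex": "M", "birth_year": 1979, "zip_code": "12345", "group": "A", "income_band": "high", "has_kids": False},
    {"sex": "F", "birth_year": 1985, "zip_code": "54321", "group": "B", "income_band": "mid", "has_kids": True},
    {"sex": "F", "birth_year": 1985, "zip_code": "54321", "group": "B", "income_band": "mid", "has_kids": True},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose combination of `fields` appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


# With two attributes nobody stands out; with all six, some records do.
few = unique_fraction(records, ("sex", "birth_year"))
many = unique_fraction(records, QUASI_IDENTIFIERS)
```

In this toy data no record is unique on sex and birth year alone, while half of the records become unique once all six attributes are combined. A unique record is one that could be linked back to a single person without ever storing a name.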
They have no expectation of privacy.
Right.
Okay, so where I sit on it - I’m an introvert data geek, so I don’t want anybody to know anything. [laughter]
Okay, so maybe I’m not devil’s advocate.
No, no, no. I don’t want anybody – I’m one of the first ones to say “I’m falling off the grid for a set period of time and you can’t get me.” But I also, having been in technology for so long, strike a cool balance between the fact that in order for us to have this technological infrastructure and the innovation revolution that we’re currently in, we have already as a country, at minimum - world, a little bit less, but equally made a decision to forego privacy.
So now when we discuss privacy, we’re only talking about it really in the realm of making you feel comfortable at having you as a citizen, for having given it up.
[08:11] So it’s already out there. It’s reversing it.
Right, it’s already gone. Now, the problem that I have from a data sciences perspective is the definition of data. We will refuse to call image information data, and it is equally data.
Who’s “we”?
We is when we start talking about privacy laws, we do not consider image, video etc. with the same standard as we do your credit card number, your social security number… Except for now we have technology where if I put your picture up, I can equally find everything about you on the internet that’s associated with that image, right?
You’re scaring me, Vida… Come on now.
I’m just saying…
It’s true.
It’s like catfish - you just throw that image in Google or whatever, this magic machine, and…
Look, if you’re trying to prevent catfish from happening, you might wanna put the image out. I’m just saying.
Okay. Yeah, that’s true. [laughter]
But we don’t have the same protocols and expectation around privacy, and I’m saying there’s a bit of hypocrisy there. In my space, when we’re talking about making an actual difference in the world, we will not at all disclose the information of a youth who is in trouble at all. But as soon as he’s in a fight or as soon as he’s in some police exchange or as soon as he’s in whatever, all privacy goes out the window, because there’s an image, there’s a video, and now we know everything, right?
Yeah.
But if we could have just – and this is my… So one of my core bases is child welfare; I work a lot in education, I’m planning a lot of impact investing and a lot of those things where I feel like we make community safer. How about if we just identified at the point in time that he became a foster youth, and all of a sudden his environment is unstable? Why couldn’t we de-privacy, de-new some of that data events, so that we could provide services that could have helped him? But now that is a privacy issue.
I don’t know where the lines are, I just know that we don’t – I don’t know where the lines are, but I know that we do not have a rational way of discussing privacy via data in a way that is actually gonna be beneficial for a community. That’s what I know. So my thing is issuing a call to action to those who deal with data to begin the process of discussing “How do we templatize it? How do we standardize it? What protocols do we put into place in order to make data more available and more consumable for impact?” That’s my goal, and I don’t know if you’re recording any of this…
We’ve recorded all of it.
Did you really?
Yeah, we’ve already started, basically… It was like a soft opening here… [laughter] Unless you wanna resume differently. I was about to say that “By the way, we’ve been recording this whole thing and this is a good riff, so let’s keep going…” [laughter]
Well, speaking to your privacy here… You know, we’ve been recording everything you just said… [laughter] It’s funny, because we normally will do like an intro thing and then we’ll start, but like –
She was glad we already had it going, and I was like, “We’ll just keep talking.”
I was over here thinking “This is better than the show is gonna be…”
This is the show, y’all…
This is the show, yeah. So, Vida Williams…
Vida Williams…
Lots to say… From my perspective, I didn’t realize this. I’ve always considered it – but because I’m just like a nerdy developer person, images are data, the video is data, my phone number is data… I always saw it the same; I didn’t realize that the classification from the data practitioners or from the governmental bodies or people making the decisions - they see imagery and video as completely distinct things.
Well, think about it this way - when you had the huge push for police to wear cams, right? That was the answer to the interactions between police and youth, right? The answer was “Everybody wear a cam.”
Body cam, yeah.
[11:56] So my response was “Who is managing all that data? How are you exactly organizing the fact that, well, we need to pick up this cam, from this person, at this time…? And who has the space? Who’s managing the space constraints for calling all of that data at once?”
Is it archived? Is it archived well? Could it be used in the court?
Absolutely.
All these things, I’ve never even thought about that. Nobody does.
Nobody did.
We do! We should!
Right, and that is where the data people come in, and we were nowhere in that conversation, so yes, it’s a social justice question, because the legislators wanna say “Yes, wear a body cam”, and the data people are like “Wait a minute, that’s like a yes-no”, because that’s a “Yes, we should do it”, but a “No, we can’t.”
Right.
And then how do you play that out later in the courts? And then where is the question of privacy then? The people in the video are under 18; how much can you show? You can’t even tell a child’s name if there’s been any type of sexual violence in the newspaper, and yet you can show an entire video of a young person in some type of exchange with the police? Talk to me about privacy again. But because the data people are missing from those types of conversations, those points are only discussions in our rooms, behind our little screens, because we don’t really like talking to people.
So what are they doing then with these cameras? How are they dealing with the data, do you know?
I have no idea. I honestly have no idea. I have talked to a couple…
What’s your best guess?
My best guess is they’re not.
They just lose it.
So maybe it’s around for a week, until the SD card is formatted?
What will happen is we’ll have some case that will challenge it, where the data will need to be there - the data they filmed, the metadata and the images will all need to be there, and the (we’ll just call them the) legislators of the day will come up and say “You know what, our policy at that point in time was to archive it seven days because of the volume of the data, and unfortunately that was cut before we could get there.”
It will be some answer like that, because then that enables the legislators to vote yes, and then the execution of it to fall defunct, and it’ll be nobody’s fault.
Yeah… I’m starting to think of chain of custody and issues like that as well…
Exactly.
Because who’s the one who’s maintaining the data? Is it the same people who are called in or questioned by [unintelligible 00:14:06.22]
That’s why I said the metadata becomes very important, like “Who picked it up? Who cataloged it? Where did they move it? When did they move it?”
We have electronic fingerprints - that’s all a data issue, that’s a development issue, that’s an infrastructure issue, but we don’t have the practices in place, nor do we have the protocols in place to deal with issues such as privacy. So now, if you had a routine traffic stop, I was stopped, he’s got a camera on, he’s taking a picture of me. But later I go running for office, what if I cursed him out during that traffic stop? Well, that video can resurface; where’s the privacy of that? That was a state-sanctioned video.
So there’s all kinds of questions of privacy that never come up when you’re dealing with data from an image perspective.
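The metadata questions raised here (who picked it up, who cataloged it, where and when it moved) map naturally onto an append-only custody log. The sketch below is one possible shape for such a log, not a description of how any police department or vendor handles footage. The `CustodyEvent` fields, the helper names, and the sample entries are all assumptions made for illustration. Each entry stores the SHA-256 hash of the previous entry, so editing an earlier record breaks the chain.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # placeholder hash for the first entry in a log


@dataclass(frozen=True)
class CustodyEvent:
    """One hypothetical custody record: who did what, where, and when."""
    actor: str
    action: str
    location: str
    timestamp: str  # ISO 8601 string
    prev_hash: str  # digest of the previous event, or GENESIS

    def digest(self) -> str:
        # Serialize with sorted keys so the hash is stable for equal content.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_event(log, actor, action, location, timestamp):
    """Return a new log with one more event linked to the current last event."""
    prev = log[-1].digest() if log else GENESIS
    return log + [CustodyEvent(actor, action, location, timestamp, prev)]


def verify(log):
    """Check that every event points at the digest of the event before it."""
    expected = GENESIS
    for event in log:
        if event.prev_hash != expected:
            return False
        expected = event.digest()
    return True


# Made-up entries, for illustration only.
log = []
log = append_event(log, "Officer A", "recorded", "patrol car", "2017-05-10T14:02:00Z")
log = append_event(log, "Clerk B", "cataloged", "evidence room", "2017-05-10T18:30:00Z")
log = append_event(log, "Analyst C", "moved", "archive server", "2017-05-12T09:15:00Z")
```

Calling `verify(log)` returns `True` for the untouched log. Rewriting the first entry changes its digest, so the second entry no longer matches and `verify` returns `False`. A chain like this only detects edits to entries that have a successor; protecting the latest entry would need something extra, such as publishing or countersigning the most recent digest.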
They always say you never have something to hide, until you have something to hide, right? [laughter] That’s the truth, though.
It is! But in the era of data, you have everything to hide, or nothing to hide. That’s where we are now. You don’t even know what’s out there to hide.
I’m going off grid, I’m out.
I’m out. We’re done here. [laughter]
We want privacy back.
Oh, boy…
Do you kind of feel like that? Do you throw your hands up and you’re like “What are we gonna do?”
I did that years ago when I knew that we gave up privacy. It was just one of those things where I literally would fall off the grid… For a moment, because I know I’m never really off the grid, right? I just don’t wanna talk to anybody.
Right.
I think we’re in the era of transparency. I think the best opportunity we have is citizenry, and on our side of the house as developers, as infrastructure planners, as data, is to begin to influence the legislation around it, is to begin to have some expectation that we be at the table as they’re defining what are the rights and the wrongs of people, as it has to do with information that we’re calling.
I think that’s where we need to be, and I don’t think that we’re in the conversation at all. I don’t think that people are thinking about “Let’s bring the geeks to the table to discuss how this can happen.”
Right, I agree with that.
They want us there last.
When it’s too late…
[16:00] “We’ve made the solution, go make it. We’ve designed how it should be…” Yeah, exactly. “All the decisions are made, here’s the spec. Can you do this in two weeks?” or “We’re gonna need this tomorrow.”
Exactly. “Really, we needed this last week, so we’re gonna pay you a hell of a lot of money to maybe get it wrong, but we’re gonna roll it out anyway, and then [unintelligible 00:16:19.07]”
Oh, man… That’s how it’s gonna go down. That’s how it goes down.
That’s how it goes down… But we can change that. That’s why you’re doing this podcast; we’re calling awareness to it, a call-to-action… Bring the geek avengers out, we can change this.
What’s your biggest call-to-action for developers, data scientists, geeks out there? What’s your biggest call to action?
Yeah, actionable steps, what can we do?
My biggest call-to-action is really get engaged with social justice issues. There are not enough of us that apply our talents into spaces where our impact can be readily felt. Three years ago I went from working high corp, enterprise architecture and data, to deciding that if I was so good at what I do, that I can drive corporate missions forward, Department of Defense missions forward, that if I used that same talent and applied it to child welfare and applied it into these other places, that I can drive those missions forward just as fast. And I would think that that would be true for all of us, that if we reapply all of our skillsets in these areas and look at that as a donation as much as we look at dollar donations, then maybe we can start affecting change in our communities.
Any low-hanging fruit in particular that you could mention?
Absolutely. Probably education is the biggest one right now - like, how do we standardize education data so that we can actually show where our students are successful, where they’re struggling, which communities can benefit from what types of actions.
We just need data, we need platforms to be able to nationalize some of the results that we’re getting from the education systems. If there’s already a mandate to produce education data, why isn’t it standardized across the nation, and who’s holding them accountable for doing that, and then who’s doing that type of reporting that is accessible to educational practitioners, whether that’s pre-school programs or extra-curricular education programs, or social workers or counselors?
That’s low-hanging fruit that’s really easy, but has the biggest impact for our next decade.
We always have to take care of our future generation, right?
It would seem to be.
It’s the best place to invest.
They don’t even know that they’re not supposed to tell you this information, so… [laughter]
Yeah, really…
So that’s probably my biggest call-to-action in the first industry that I would say we could be the most impactful.
So if people are listening to this and they’re like “I love Vida, she’s awesome” and they wanna learn more about you - where do they go to find out more about you and what you’re doing?
Well, the first thing I would have to do is tell you my name is not Vida [veeda], but Vida [vyda]…
Oh, my goodness…
…which is fine!
Come on now, you let me say it 15 times and I messed it up?
You waited this long… [laughter]
I even said, “Are you Vida Williams?” and she said “Yes, I am!”
I’m not even embarrassed, I’m just mad now…
Well, she CAN be… [laughter]
Oh, man… The audience knows that I mess a lot of names up.
And I’ll just say it’s not a big deal, because in Europe they told me I say my name wrong anyway.
Okay, what is it then?
It is Vida Williams…
Okay. I was thinking Vida like life in Spanish…
Yeah, me too…
Livin’ la vida loca was what I said to Adam, and he rolled his eyes at me…
That’s it, that’s the thing… [laughter]
Livin’ la Vyda loca…
Yes, that’s it, and I am @vidachristy everywhere - on Twitter, on Google, via email on Gmail… You can always get me at @vidachristy.
We’ll put the links in the show notes to you, and make sure everybody knows about you.
Awesome.
Any closing thoughts?
I’d just thank you for the opportunity to ramble for about 15 minutes… I mean, I don’t get that too often, so it’s pretty awesome.
Awesome. We’re happy to…
Happy to talk to you, very much.
Thank you!
We’re here with Michelle Casbon, Director of Data Science at Qordoba. Michelle, you as well as Vida Williams and other data scientists that we spoke to at this show, and I guess maybe other – we’re sensing a thing which I didn’t know existed… We were talking about it before we started recording, but I wanted to get your explanation, because this is a social construct that I’ve never experienced, which is there seems to be a bit of a divide between data scientists, maybe with quotes around that, and computer scientists with quotes around that (or programmer)… What’s up with that?
Yeah, that’s a great question. I think it stems from a lot of – so data science didn’t really exist until 5-10 years ago; it’s a new thing, and I think when companies started to bring data scientists on, they sort of created these organizational structures that put a wall in between them, and they have different skill sets for the most part. So there’s definitely some overlap. Engineering - you need a really strong programming background… But data science - you need strong engineering, and strong math… All of these other things in addition. So I feel like engineering kind of thought “Well, their programming skills aren’t as strong, because they’re really good at math”, and then the data scientists are like “Well, they don’t know anything about modeling, so they’re no good.”
I think it really boils down to organizational structures and having that wall in between, because a lot of times data science will do some really amazing things with math, and then they’ll sort of like “Hey, go implement that, go put it into production”, and an engineer is like “This library - it doesn’t exist in Java. I don’t know what kind of magic you expect me to do…”, but that sort of throwing things over the fence, that kind of tension I think has caused a lot of problems.
I see. And that seems to have moved beyond the walls of the corporations to even events like this, where I think yourself as well as Vida, both responded to us in different terms… Like, “Are you sure you wanna talk to me? I’m a data scientist” or “I’m not a developer.” [laughter] And our response to that was like “Yes, we do wanna talk to you!”
Yes, of course!
I have never been aware…
What was my response to that question…? “That’s okay.”
That’s okay. [laughter] A little pat on the head, “That’s okay…”
It wasn’t that kind of “That’s okay.”
I didn’t say I’m not a developer, because data scientists are definitely developers.
Right, you didn’t say you’re not a – well, Vida said she wasn’t a developer… You just said “What’s your audience? Because I’m a …”
Do you think it’s just like the community hasn’t gotten to know you well enough? Like, maybe not hanging out…? Since it’s newish, so to speak, maybe you all haven’t gotten that time to congeal or hang out in the same rooms and realize that you’re all human beings and you all have smarts and can bring something to a changing landscape of things?
Yeah, I mean logically that makes sense…
It’s making a lot of logical sense… Humans aren’t logical.
Right.
That’s true.
We’re emotional.
Very judgmental, very picky…
I don’t know, I guess it seems like there are these two focuses. One is just on sort of production code, writing things that don’t break, and then there’s the “No, but machine learning…” and “The math is the most important part…” I just think that like with any two organizations, just like between engineering and DevOps, there’s a lot of tension because the goals are a bit different.
Right, and in a certain sense because there’s overlapping skillsets, but not identical skillsets, both sides feel threatened by the other one…
That’s a strong word, but…
Was that too strong?
I mean, “threatened” is like… That’s just a strong word.
Okay. I’m gonna back it off…
I’m not saying it’s wrong…
How do you mean threatened? Just curious…
Well, I said it…
No, but she thinks it’s strong - why is it strong?
Yeah, because I thought it was an apropos…
I feel like it’s right on, too…
Yeah, but different reaction here, so please, tell us.
[27:01] I think because we understand enough of what the other side does… It’s easy to be critical of how other people are doing things. I think the best way to – what I’ve seen to make the problem go away the best is really just to take down those walls… Organizationally, you’re not two different people…
As you’re saying, just sit together, work together… There’s even job descriptions–
Yes, and sharing titles. I consider myself a data science engineer, because I feel like that better describes what I do. Because I do have a background in engineering, and now I do a lot of machine learning, and my official title is Director of Data Science, but I don’t feel like that’s distinct from engineering anymore.
NLP is what I focus on, and in order to do that, I have to be able to understand distributed computing. That didn’t necessarily exist in traditional NLP, and so now to be able to do machine learning, I really have to understand so much of it… And vice-versa, if anyone wants to implement any of these models, any of this NLP stuff, they really kind of have to understand what the libraries are doing…
I guess what I’m saying is just that the more you can merge the roles and the everyday tasks, whether that starts with calling people data science engineers, or merging titles somehow, or giving people the same sort of social status in the engineering hierarchy - either way, I think the more those can merge and the more you can align those goals…
The better off they will all be.
Yeah, then the better will people work together.
It’s a form of segregation, right? Titles… Wouldn’t you say?
Well, you’re literally segregating. You’re actually drawing lines.
It’s not a racial segregation; maybe that term is normally associated with… But it’s a segregation; you’re separating by roles and distinctions, when you should be melding more and considering yourselves more of a cohesive unit. That’s what you learn in the military, that’s what you learn working with teams, and the more you operate as a team, a fluid team, the better you are in the end result.
Well, in the military you have titles; you have the medic, you have the engineer, you have the…
Well, I didn’t say that the authority and structure isn’t required, because you have to respect those above you who’ve had the experience a bit down the road… So that’s still there, I think… I mean, military is maybe a little different to compare perfectly, it’s not a one to one, but you still have structure, you still have hierarchy, but that doesn’t mean that you can’t be on the same team.
I agree. And that also helps with the whole common goal thing. We’re all working towards the same thing.
Right.
You don’t have to be nailed down to a certain thing.
We’ve just gotta quit putting each other in boxes, man…
That’s right, man. No boxes, okay?
Don’t put me in a box, alright?
Box, not boxes.
I’m really encouraged by the fact that you guys didn’t even know that there was this tension… That is definitely a good sign for the future.
I’m starting to get a hint of it, though. I’ve been working with…
Daniel Whitenack?
No, Pete Soderling from DataEngConf.
Oh, he’s great!
Yeah, Pete’s great. So I’ve kind of caught some edge that there’s this divide, because like, okay, why is it DataEngConf and not DataScienceConf…? Why are there those nuances? So I didn’t know the animosity or the divide, but I could sense that something was not perfect, not a cohesive world. There was a distinction between the different roles.
Yeah. And his conference is I think part of the solution, because he addresses it, and it’s all about working together as data science engineers and not as engineering and data science.
Those individuals, yeah. That’s cool.
Let’s talk about your talk, what you’re here to talk about. You said your focus is on natural language processing, speech recognition, stuff like that. Is that what your talk was about?
[30:55] So it was about how we use NLP at Qordoba. We have a platform that helps people localize their products… It doesn’t really matter what the product is, but most everyone has a website or a mobile app, anything like that… We have a platform that helps people release that product in different markets. So not just English-speaking ones, but really across the globe. My role within the engineering team is to work on the machine learning.
My talk really set the stage for “Okay, why is localization important? Why should you even care about it? Because these are the disasters that happen when you don’t care about it.” I went down into a few of the details about which tools we’re using…
We’ve built a lot of this on open source software. I really couldn’t imagine building it on anything else. Open source really did enable us to even create this platform.
Because of the costs, or why?
No, capabilities.
It’s just better software…?
Well, there’s so many different components… I don’t think any one vendor provides that entire stack, and even if I wanted to cobble all that together, it would be extremely difficult. It’s much, much easier using open source tools, and they have gotten better so much faster.
What are some of the tools that you’re using?
The heart of our machine learning - we’re using Spark’s MLlib; we use their LogisticRegression, random forests libraries, stuff like that. And PredictionIO is what does a lot of the NLP stuff.
We’re running that in Docker containers, on Kubernetes… It’s all in Scala. Our storage layer is, we’re using MariaDB and Cassandra.
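For readers who have not met logistic regression, the sketch below shows what such a classifier does. It is not Spark MLlib code and not Qordoba's pipeline: it is a plain-Python stand-in trained by batch gradient descent, and the two-feature data, function names, learning rate, and epoch count are all made up for illustration. In the stack described here, a library such as Spark's MLlib would do this work at scale.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Predicted probability that example x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, learning_rate=0.5, epochs=1000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = score(weights, bias, x) - y  # gradient of log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if score(weights, bias, x) >= 0.5 else 0


# Toy data: the label simply follows the first feature.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
weights, bias = train(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

After training, the model reproduces the four toy labels. The first feature ends up with a positive weight, which is the kind of signal a real model would learn from text-derived features rather than from hand-written numbers.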
Lots of things.
There’s a lot of stuff, yeah. So I talked a little bit about that.
That’s interesting.
A laundry list of…
Yeah, and it’s all open source.
It’s basically a dream.
That’s good.
Almost all open source.
It’s basically a dream?
Yeah, like as an engineer, to be able to work with such amazing tools, it’s really, really fun.
That’s cool.
They didn’t have to work too hard to recruit me, because… The mission - changing the world, being able to give people products that feel native to them, even if they don’t speak English, can really do so much good in the world by building that kind of platform. And then using the best tools out there to do it, the tools that engineers really want to use… That’s a big plus.
Yeah. I love the branding.
Yeah?
The branding is phenomenal, Qordoba. Have you seen the site?
No, I haven’t.
It’s beautiful.
We have a great designer.
Yeah. I mean, I love the direction it’s – it looks extremely trustworthy.
That’s actually our brand new, newly unveiled site, because we’ve just announced our funding; we just closed our series A funding round, and part of that was unveiling the new website… So I’m glad you like it.
Congratulations on all that.
Why is it the first time we’re hearing of Qordoba? Why do you think?
[34:00] I’ve asked myself that question a lot. When I first met the co-founders and I first heard about what they were building, it was one of those times where I was just like “Lightbulb! How have I not thought of using machine learning for that purpose?” It’s so well-suited, it just makes sense. But I think a lot of good ideas in the past are like that. They seem obvious once you’ve thought of them.
Right. “A wheel!” [laughter]
Exactly!
“This circle is better than the square I was using…” [laughter]
The thing about the localization field is that it just really hasn’t changed much in 30-35 years, and we’re really here to take a lot of the tools that work so well in other areas and apply it to this older, more traditional one. Why hasn’t anyone done it before? I have no idea, because it makes so much sense, and it’s really exciting to be a part of that so early in the game, at such an early stage of the startup. It’s a fantastic experience.
Cool.
Well, Michelle, thanks so much for sitting down with us!
Of course!
Any closing thoughts to share? Any words of wisdom to part on? For the data scientists out there, the data engineers out there, and the mathematicians not knowing you well enough, what’s going on.
Feel the love…! [laughs] I guess I feel very personally invested in that whole data science versus engineering thing because I have one foot in both sides.
Both worlds, yeah.
You’re the hybrid.
I am definitely a hybrid, and that’s been a fantastic experience. I haven’t encountered any animosity in my personal teams, and so I guess I just wanna see more of that… Just everyone be nice.
Everybody be nice.
Be nice! Please!