In this enlightening episode, we delve deeper than the usual buzz surrounding AI’s perils, focusing instead on the tangible problems emerging from the use of machine learning algorithms across Europe. We explore “suspicion machines” — systems that assign scores to welfare program participants, estimating their likelihood of committing fraud. Join us as Justin and Gabriel share insights from their thorough investigation, which involved gaining access to one of these models and meticulously analyzing its behavior.
Chapters:
Welcome to Practical AI
Signs of bias
How to start investigating
Predictive analytic trends in gov.
Can you justify using these systems?
Requesting Rotterdam's data
The model's capabilities
Understanding the labels
Sponsor: Changelog News
How effective is the model?
Public perception and reactions
Hope for the future
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the founder and CEO at Prediction Guard. We have some really exciting stuff to discuss today. So thankful to have my guests with us today, because there’s a lot of talk about the dangers of AI, or potential risks associated with AI, which we’ve talked about on the show… But I think maybe that kind of misses some of the actual real world problems that are happening with deployed machine learning systems, that maybe have been going on for longer than some people might think… And maybe we can learn some things from those deployed machine learning systems that would help us create better and more trustworthy AI systems moving towards the future. So I’m really pleased to have with me today, Justin Braun, who is a data journalist at Lighthouse Reports, and Gabriel Geiger, who is an investigative journalist at Lighthouse. Thank you both so much for joining me.
Thanks for having me.
Thanks so much for having us.
Yeah, yeah. Well, like I mentioned, I kind of teed us up to talk a little bit about maybe risks, or kind of downsides of deployed machine learning systems… And you both have done amazing journalism related to what you’ve kind of titled here “Suspicion Machines.” And I think it would be worth, before we kind of jump into all of the details of that, which is just incredibly fascinating, if you could give us a little bit of context for both what you mean by Suspicion Machines, and how this topic came across your desks and you started getting interested in it.
Sure, I can start with that. I mean, the reason that we chose the Suspicion Machine as a title for our series is it’s kind of a driving metaphor for what these specific machine learning models are doing within the welfare context. So a while ago we wanted to investigate the deployment of machine learning in one specific area, but weren’t sure which one yet. So in the US, there’s been a lot of reporting about the use of machine learning or predictive risk assessments within the criminal justice system, also in facial recognition; and us over in Europe looked at that reporting and noticed that there’s a big lack of it over here in Europe. And so we were exploring different realms and settled on looking at welfare systems, which is a sort of quintessentially European issue, if you want to say, and in the last decade welfare systems have become this sort of polarizing political battleground within Europe - how much welfare should we be giving out, are people defrauding the state, how much money… And so we wanted to focus and hone in on this one area to make this sort of manageable, and we decided to investigate the deployment of predictive risk assessments across European welfare systems. And basically, what these systems do - I mean, they vary in sort of size and color, but the basic sort of mechanics remain the same - is that they assign a risk score between zero and one to individual welfare recipients, and rank them by their alleged risk of committing welfare fraud.
And the people with the highest scores are then flagged for investigations, which can be quite punitive, and where their benefits can be stopped. So we landed on this metaphor of the Suspicion Machine, because we felt that these systems were oftentimes essentially laundering or generating suspicion of different groups who were trying to receive welfare benefits that they needed to pay rent every month.
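Editor’s note: the mechanics described above (a score between zero and one per recipient, a ranking, and a flag for the highest scores) can be sketched in a few lines of Python. The recipient IDs, the scores, and the number of investigations are invented for illustration; none of them come from any real welfare system.

```python
# Hypothetical risk scores in [0, 1]; the IDs and values are invented.
scores = {"A": 0.12, "B": 0.71, "C": 0.45, "D": 0.68, "E": 0.30}


def flag_top_k(scores, k):
    """Rank recipients by score (highest first) and return the top k IDs."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Assume the agency has capacity for two investigations.
flagged = flag_top_k(scores, k=2)
print(flagged)
```

Note that in a setup like this the cutoff follows from investigative capacity rather than from any absolute notion of "high risk": someone is flagged because they rank above others, whatever their score is.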
And when you all started thinking about these Suspicion Machines, these deployed machine learning systems, were there existing examples, concrete examples of how these Suspicion Machines were being punitive, maybe in either biased ways, or just like in the kind of false positive error sort of way that’s creating problems for people that it doesn’t need to create? Was there actual evidence at the time, or was it just a big question because there wasn’t any sort of quantitative measurement?
So there were signs of it. So in the Netherlands specifically there was a case where 30,000 families were wrongly accused of welfare fraud, and it turned into this huge scandal called the Childcare Benefits scandal, and it eventually led to the fall of the government. And it turned out that the way that these parents were wrongly flagged for investigation was because of a machine learning model the agency had deployed. But there was no sort of quantitative measure of what that model was actually doing. Nobody took it apart and actually looked inside and saw “Okay, well, why was it making these decisions?” which is a huge reason why we as Lighthouse decided and were so adamant about the idea of “We’re not just going to investigate these systems with the classical journalistic methods, calling people up, finding sources, getting contracts; we actually want to take one of these systems apart.” And that was sort of the big challenge or hurdle in our reporting. And I think Justin can maybe speak to the existing literature on these predictive risk assessments.
[05:51] Yeah, I think my interest in the topic comes kind of from the broader discussions around AI fairness that really started after ProPublica published its machine bias piece six or seven years ago. And in the aftermath of that, there were a bunch of systems that worked in a similar way that were kind of discovered in various contexts. I myself worked a little bit on predictive grading systems… So during the COVID pandemic some school systems replaced their previous handwritten exams with an algorithm that tried to predict, based on previous exams, how well somebody would score in their final exam.
And with each of these systems, the issue that emerges is essentially similar. Once you try to classify people according to risk, and you have a training set that’s not a perfect representation of the true population, you’ll start running into issues like disparate impact for different groups, which is kind of the most hot button issue. But you’ll also start running into how representative is your fairness data; in general, you’ll start running into issues with where do you set the threshold, and what values are you trading off when you set the threshold higher or lower.
And so I was generally interested in that, and then kind of joined Lighthouse at a point when Gabriel and some others had done a lot of the groundwork already to see whether there was something there in welfare risk assessments, and then kind of took on the technical work on there.
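Editor’s note: the threshold trade-off mentioned above can be made concrete with a toy example. The scores and outcomes below are invented; the sketch only shows that lowering the cutoff catches more true cases at the cost of investigating more people who did nothing wrong.

```python
# Invented (score, did_wrongdoing) pairs; this is not real data.
cases = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
         (0.5, False), (0.4, True), (0.3, False), (0.2, False)]


def confusion_at(threshold, cases):
    """Count outcomes when everyone scoring >= threshold is flagged."""
    true_pos = sum(1 for s, y in cases if s >= threshold and y)
    false_pos = sum(1 for s, y in cases if s >= threshold and not y)
    missed = sum(1 for s, y in cases if s < threshold and y)
    return {"flagged": true_pos + false_pos, "true_pos": true_pos,
            "false_pos": false_pos, "missed": missed}


for t in (0.75, 0.35):
    print(t, confusion_at(t, cases))
```

With the higher cutoff, two of the three true cases are missed; with the lower one, none are missed but three people are investigated without cause. Which of those errors matters more is a value judgment, not a modeling detail.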
One question I have just even as a data scientist, just thinking “Okay, where do I start with this?” This model is deployed by some entity; in theory, it’s been developed by some group of technical either engineers or data scientists, or whoever it is… Where do you go about actually starting to find out – like, where does this model exist? Who has the serialized version of this model sitting on some disk somewhere, in some cloud, or…? Yeah, where do you even start with something like that?
So now there’s a sort of trend of having algorithm registers where public agencies across Europe publish what different types of algorithms or models they’re using. But that didn’t exist when we started this reporting. So what we did was we made use of freedom of information laws; in the US, I think they’re called Sunshine Laws. And we started sending in these requests, trying to figure out at least where are you using predictive modeling within the sort of welfare system. Because you could be using it to look for fraud, but you could be using it for other things as well. And we started sort of slowly building out this picture of which countries were using predictive modeling at different places in their welfare system, and then sort of started slowly building a document base.
So maybe we’d ask them for – we didn’t start by asking a lot of times for like source code, or final model files, or training data; we’d start by asking for “Can you give me the manual for your data scientists, for retraining the model every year?” And that allowed us to ask for more specific documents, and more specific questions, like “Okay, we know that there’s a document called Performance Report 2023 dot HTML, because we see it referenced in your manual for your data scientists, so we can request that.”
And then sort of built up to this place of - okay, now let’s request the final model file, the source code, train it, ask for the training data, which we can get into, because there’s some prickly things there around data protection laws in Europe… So we kind of tried to do this tiered approach to sort of build up to that final ask, asking for the model once we could make sure that our request was specific… Because oftentimes, agencies would try to resist our requests, saying they were too broad, or we were not being specific enough, or trying to argue that disclosing certain documents could allow potential fraudsters to game the system.
I’ve got to ask, as you kind of did this sweeping look at how predictive analytics was actually deployed across Europe, even before we get into the specific case that you studied, are there any kind of takeaways or trends that you saw in terms of how machine learning is actively being deployed by government entities, or by welfare entities across Europe?
Yeah, so I think it started essentially a bit later than in the United States. You kind of have this trend in policing, in kind of risk analysis… I would say that it begins in the early 2000s, where you kind of have semi-governmental organizations doing credit risk scoring, the first kind of instances of predictive policing, and also more serious thinking around big data mining for some risk analytics in the welfare context.
[10:23] And then I would say there’s a bit of a bifurcation. So you kind of see some instances where big industry players, Accenture, the Palantirs of the world, these big, big companies hype up the case for big data analytics to be deployed across different sectors… And at the same time, you have a lot of failures when those tools are deployed; they often don’t work very well, or people who have to use them in the agencies don’t know how to use them. You see some agencies that drop those systems, and at the same time you see other agencies that kind of build up internal capacity, and build those tools themselves, sometimes in collaboration with universities or smaller startups… But you kind of have these two pathways that continue to coexist at the same time.
I would say in terms of the systems that we looked at, most of them were developed kind of from the early 2010s onwards. It’s definitely gotten a lot more in the last five or six years, and across the eight or maybe nine countries now - I’m not quite sure how many we’ve looked at - but I think we’ve only seen a single country where we did not see evidence of predictive analytics being used to assess risk in welfare.
Interesting. So I guess on the other side – I asked the question about evidence of these systems prior to your reporting evidence, or cases where these systems maybe behaved in ways that caused harm, or issues… On the other side, you mentioned this kind of hyped perception, potentially hyped perception of what these systems could do in a positive way. I mean, the main case for using these systems, as you mentioned, is to kind of catch fraudsters, from my understanding. On that side of things, is there evidence that hey, yes, this type of fraud is a huge problem that we need to invest kind of advanced technology in solving? Or is that also kind of up in the air in terms of the – I guess I’m getting at the justification for using these types of systems on this sort of scale.
This is one of the questions we try to address in our reporting a little bit. First of all, distinguishing between deliberate fraud and unintentional error is really messy and difficult. I mean, how do you prove intent? How do you prove that someone intentionally didn’t report something? I mean, there’s clear-cut cases where it’s like criminal enterprises defrauding the welfare state, using identity fraud; that’s pretty cut and clear. But when it’s individuals or family, and they didn’t report 200 euros - is that intentional, or unintentional? How do you prove it? So that’s already a challenge.
What we did see is evidence of a lot of the larger consultancies tending to overhype the scale of welfare fraud, and these estimations being criticized by let’s say academic studies. And when national auditors like the National Audit Office of France, for example, actually did random surveying to try to estimate the true scale of welfare fraud, they estimated it at about 0.2% of all benefits paid. Whereas consultancies will estimate it at about 5% to 6% of all benefits paid. So there’s a little bit of this situation where they’re hyping up estimates to sort of sell the solution.
At the same time, fraud does happen within the system, and our reporting isn’t meant to suggest that fraud doesn’t exist, but I think it’s definitely still unsettled science what the actual scale of welfare fraud is, and whether these systems that are being deployed in places like the case study we’ve looked at are actually catching fraud, or are just catching people who have made unintentional mistakes, with these unintentional mistakes being treated as fraud.
[14:04] To add on to that a little bit, I think the added justification that is often being used is that actually these systems are more fair than analog equivalents; that by using a machine, you get rid of biases, and that they’re better at detecting fraud than people are. And I think, as we’ll probably get into later, there’s good reasons to doubt both of those propositions.
All of that was a really good setup for this particular case study that I think you’ve highlighted in some of your recent work… I’m wondering if you could kind of set the context for the particular case study that you’ve focused on, the particular model that you’ve focused on, in light of what you were just talking about, about kind of scanning the environment, I guess, through information requests, and freedom of information requests, to understand where things were deployed, all the way down to like getting your hands on a model. So how did that transition happen, and tell us a little bit about the use case that you studied more deeply.
As I mentioned earlier, we started by sending these freedom of information requests across Europe, eight or nine countries, and we started receiving a patchwork of responses back. So some places just said “No, we’re not going to give you anything at all.” Some places would be like “Okay, we’ll give you the manual”, but then when you tried to ask for anything technical, like code, or a list of variables, they shut it down. But there was this one exception in all of this, and that was the Dutch city of Rotterdam. And Rotterdam had deployed one of these predictive models to try to flag people as potential fraudsters and investigate them. And right off the bat, Rotterdam sent us the source code, or the training process for their model.
And we got really excited at first. We were like “Wow, this is great.” And we started looking through the code and we noticed that the scoring function in the code goes to load something called the final model, the RDS file. And we go looking through the directory, and we notice “Huh, wait a second… This final model, that RDS file, the actual model file that can be imported to score people, isn’t in the directory.” So we emailed them back, we say “Hey, guys, I think you made a mistake. There’s this final model, the RDS file, missing in the code directory, so we can’t actually run anything.” And they go “Oh, well, yeah [unintelligible 00:16:20.06] but you’re not getting that one.”
And their justification for this was that if this was made public, potential fraudsters would be able to game the system. So long story short, we went on this year-long battle with them to attempt to get this model file… And eventually, the city, to their credit, decided to disclose this model file to us, so we could actually run it. And what does this model do? I think Justin can do a good explanation of what this model actually does and how it works.
Yeah, so it’s a gradient boosting machine model. It’s a pretty standard machine learning model. It ingests 314 variables, and it outputs a score. The issue that we ran into very quickly once we had access to this model is “Well, what does this actually tell us?” Okay, we can make up a bunch of people now, and score them… But how do we then know what that means for those people? And so there were kind of two things that became important to figure out at that point. One was what do realistic people look like? And the second was, what is the boundary at which a person is considered high risk?
The second one was relatively easy to figure out; we kind of had some broad estimations of how many people are flagged each year, we could run some simulations and kind of see the distribution of risk scores, and at that point we could take a good guess for what the threshold would be. Getting access to realistic testing data was a lot more challenging. And for a while, we thought we would have to just simulate a bunch of people, take guesses. But actually, Gabriel had requested some basic stats about the training data at an earlier stage; he essentially asked “Can you tell us – give us like a histogram for each of the variables, so we can see what the broad distribution in the training data for age is, for instance, or for gender”, and so on.
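Editor’s note: one way to read the threshold estimate described above is as a quantile problem: if roughly a known share of recipients is flagged each year, the cutoff is the score at that position in the score distribution. The sketch below assumes that reading; the simulated scores, the recipient count, and the number flagged per year are all invented.

```python
def estimate_threshold(scores, flagged_per_year, recipients):
    """Return the lowest score that still falls in the flagged share.

    Integer arithmetic is used to find how many of the simulated scores
    correspond to the (assumed) yearly share of flagged recipients.
    """
    k = max(1, (len(scores) * flagged_per_year) // recipients)
    return sorted(scores, reverse=True)[k - 1]


# Invented stand-in for simulated model outputs: 0.01, 0.02, ..., 1.00.
simulated_scores = [i / 100 for i in range(1, 101)]

# Assume 1,000 of 10,000 recipients are flagged each year (hypothetical).
threshold = estimate_threshold(simulated_scores, flagged_per_year=1000, recipients=10000)
print(threshold)
```

The estimate is only as good as the two inputs: the flagged-per-year figure and how closely the simulated population resembles the real one, which is why realistic test data mattered so much.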
[18:19] And our idea was to use those basic distributions to sample new people. But when I was meant to type all of this stuff down into like a file, so we could then run those simulations, I got lazy and I wanted to just scrape the document. It was an HTML file, so I opened it up and inspected it, and it turns out that the entire training data was contained in this file… Which happens when you create plots with Plotly, quite often. So if you want to leak something to journalists, that’s a good way to do it.
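Editor’s note: interactive charts exported to HTML usually ship the data they draw inside the page, and a histogram trace typically carries the raw values rather than the bin counts. The sketch below uses a toy HTML string whose markup is an assumption modeled on htmlwidgets-style exports; real files differ by tool and version, and this is not the document the reporters received.

```python
import json
import re

# Toy stand-in for an exported interactive chart (markup is an assumption).
html = """
<div id="chart"></div>
<script type="application/json" data-for="chart">
{"x": {"data": [{"type": "histogram", "name": "age", "x": [23, 41, 41, 57]}]}}
</script>
"""


def embedded_payloads(html_text):
    """Yield each JSON payload found in <script type="application/json"> tags."""
    pattern = r'<script[^>]*type="application/json"[^>]*>(.*?)</script>'
    for match in re.finditer(pattern, html_text, flags=re.DOTALL):
        yield json.loads(match.group(1))


for payload in embedded_payloads(html):
    for trace in payload["x"]["data"]:
        # The trace holds the underlying values, not just aggregated bins.
        print(trace["name"], trace["x"])
```

For anyone publishing such charts, the practical lesson is to aggregate the data before plotting, or to export a static image, when the underlying records are sensitive.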
There you go. [laughs]
Yeah, so we asked for and got access to the entire training data. At that point the question became “Okay, now we know what realistic people look like. What tests can we actually run in terms of figuring out who does this model flag at higher rates? Does it have justification to do so?” and so on. And the one thing that was missing from the training data was that we didn’t have access to the labels themselves. So we knew your age, your family background, your job history, that kind of stuff, but we did not know if you had actually committed fraud or not. And that meant that – and this is the big limitation of our story. But that meant that we could essentially only understand which characteristics lead to higher or lower scores, but we wouldn’t know if those scores are [unintelligible 00:19:27.25] at higher rates for one group rather than another.
So I just want to be very open about that that is a limitation of the design… But having access to the training data, having access to the source code, being able to see how the training data was constructed, having access to the final model file, all of that allowed us to investigate a bunch of aspects with a system which I think still made for a very valuable story, both in terms of explaining how this stuff works, but then also in terms of showing that there are likely consequences which seemed to be discriminatory against certain groups.
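Editor’s note: without labels, one test that remains possible is a what-if comparison: change a single characteristic of a profile, hold everything else fixed, and see how the score moves. The sketch below illustrates that idea only; `toy_score` is a hypothetical stand-in with invented weights and feature names, not the Rotterdam model.

```python
def toy_score(person):
    """Hypothetical stand-in for a trained model; the weights are invented."""
    score = 0.20
    score += 0.15 if person["is_single_parent"] else 0.0
    score += 0.10 if not person["passed_language_test"] else 0.0
    score -= 0.002 * (person["age"] - 18)
    return max(0.0, min(1.0, score))


def score_change(person, feature, new_value, score_fn=toy_score):
    """Score difference after changing one feature and holding the rest fixed."""
    variant = {**person, feature: new_value}
    return score_fn(variant) - score_fn(person)


baseline = {"age": 30, "is_single_parent": False, "passed_language_test": True}
delta = score_change(baseline, "is_single_parent", True)
print(round(delta, 3))
```

A test like this shows which characteristics push scores up or down. It cannot show whether the higher scores are wrong more often for one group than another, which is exactly the limitation described above.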
Probably a lot of our listeners will be familiar with what a gradient boosting machine is, and sort of – like, maybe this is one of the tutorials that you ran on a Jupyter Notebook when you were first taking your kind of data science 101… So the model, I think, is very familiar. I think a lot of the interesting things here are related probably to the model features, and that sort of thing. Did anything jump out to you, maybe even before you kind of ran a larger-scale analysis, in terms of like the features that were included in the dataset, and how those may or may not intuitively be connected to this sort of welfare fraud situation? Did anything jump out when you were kind of doing your initial discovery, and kind of exploratory data analysis on this data?
Yeah, for sure… Though I think it’s maybe important to preface this with saying that including features that seem discriminatory does not automatically lead to discriminatory outcomes. And I think that is sometimes being confused. You can get discriminatory outcomes without features that look bad, like, I don’t know, racial background, or gender, or something like that. But it also works the other way - you can include a bunch of these features and not get any discriminatory outcomes. Both of these things are possible.
That being said, there were a bunch of features that seemed perfectly reasonable; you know, contact with the welfare agency, how often have you been there, have you missed any of your appointments, that kind of stuff. There were a lot of demographic features, and I think those get into trickier territory. Some like age are maybe justifiable on some level; gender gets a bit harder… And then there were a lot of features measuring ethnic background by proxy, through language skills. I think there were 10 or 12… Gabriel, correct me if I’m wrong, but definitely a lot of variables on language skills.
No, I think 30, or something.
Oh, 30. Yeah.
[21:55] Yeah. Because it measured everything from like your spoken Dutch fluency, to your writing Dutch fluency, the actual language you spoke… So there was like a categorical variable with 200 values, or something. So it got as granular as the specific language you spoke, whether you speak more than one language… But anyways, continue, Justin.
Yeah. And then I think in some way the weirdest set of variables were essentially behavioral assessments by the caseworkers. So we actually got access to some of the variable code books, and in there it said that – there was a variable essentially where people were meant to judge how somebody was wearing makeup, especially for women… So stuff that just seems really sexist. So those variables were included, which is problematic in and of itself. But then the way they were transformed in the pre-processing steps was that essentially this textual data was just transformed into a 0 or 1 variable, depending on whether there was anything in this field or not, which is also – I mean, you just lose a bunch of maybe the more interesting information if you do that… But I think that set of variables, because it’s just based on individual case worker assessments, if your claim is that a system should lead to a reduction in bias, and then you include these variables that are so obviously subjective, I think that kind of undermines your claim right away.
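Editor’s note: the pre-processing step described above, where a free-text caseworker note becomes a 0 or 1 depending on whether anything was written, might look roughly like this. The example notes are invented, and treating whitespace-only entries as empty is an assumption, not something the source states.

```python
def has_note(text):
    """Collapse a free-text field to 1 if anything was written, else 0.

    Assumption: missing values and whitespace-only entries count as empty.
    """
    return int(bool(text and text.strip()))


# Invented caseworker notes with very different content.
notes = ["Arrived late, seemed nervous", "Very cooperative and well prepared", "", None]
print([has_note(n) for n in notes])
```

The first two notes say opposite things about the person, yet both become the same value after the transformation, which is the information loss being described.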
And in the dataset, in terms of the label and the output, were you able to understand at all “Oh, these are investigations that happened, that actually were verified to be fraud or not”? Essentially like a one or zero type of label. Or how was that set up?
Yeah, so we did not have access to the label, which is, again, the big drawback. So we could only score people who we know had labels, but we didn’t have those labels ourselves.
But we did a bunch of ground reporting to essentially work around that, and maybe Gabriel can speak a bit to that.
Yeah. I mean, two things first. Just to talk about how the training data gets constructed first… It’s over 12,000 past investigations that the city has carried out. And these past investigations are not a random sample. So there’s some subset within there that’s random; I think about 1,000. But all the rest of the cases are just ones that investigators have looked at in the past, either through anonymous tips, or through these kinds of theme studies that they do, where they say “This year we’re gonna check every man living in this neighborhood.” So it’s not a random subset of people that they’re training this model on, which is problematic in the first place.
The second thing is that this label, yes fraud/no fraud, doesn’t distinguish between intentional fraud and unintentional mistakes. So these are flattened into the same thing when labeling the training dataset. So those are, I think, two problematic things right off the bat. I think a third, even more complicated thing is that the law for what is considered fraud has actually changed over time, and this training data spans back 10 years.
But all that aside, one of the things that we wanted to do with this reporting was to look at the impacts of being flagged for investigation… You know, what does that mean for a person, and how are they treated by the system? So we did a bunch of ground reporting in Rotterdam and we sort of used the results from our experiment to build profiles; who would be considered some of the most high-risk people. And we saw that it was – one of them at least would be like single mothers of a migration background, who don’t have a lot of money, financially struggling, living in certain majority ethnic neighborhoods.
So we did a bunch of ground reporting in those places and found people, and it was quite challenging. People were afraid to talk, people who had been investigated in the timespan the model was active. And what we found was that they were treated incredibly punitively by these investigations from the city, where fraud controllers are empowered to raid your house at 5am unannounced, count your toothbrushes, sift through your laundry, go through all your bank statements. And that even the smallest mistakes, like forgetting to report 100 euros, could leave you [unintelligible 00:25:58.22] landed as an alleged fraudster.
[26:02] So I think based on reporting [unintelligible 00:26:05.13] didn’t even question the validity of the label and the consistency of the label. But beyond that, I think what we established with our reporting was that the consequences of being flagged, even if in the end you’re found to be completely innocent, just having people raiding your house at 5am, asking you questions about your romantic life in front of your children… I mean, that’s a negative consequence in and of itself, even if you’re found to have done nothing wrong.
All of this is very interesting to me from a data science perspective, because a lot of these things are kind of - yeah, things that I know we’ve talked about on this podcast, but also in my day to day work, things that have come up, that you sort of establish as best practices around how you construct your label, how you construct your features in a responsible way to do well at your data science problem.
I do want to get to the actual model performance here in a second, which - one question is, well, we see all of these flaws in the data… Does the model actually work, or have all of those kind of underlying problems poisoned the output? But I think before then, I’m just wondering, as a person who provides occasionally consulting services to other people in data science, did you get a sense at all for like the city of Rotterdam hired X consultancy to give them a model that they deployed and are using, was it just sort of like that consultancy threw the model over the fence, and like “Here, use this”? Or how much interaction was there with actual Rotterdam employees, and how deep was the understanding of how this model was built and deployed? Or was it just sort of a contract, “Here’s money. Here’s the model. Alright, let’s put it into production”? What was the interaction like there? Were you able to discern any of that?
[30:14] Not super-deeply. But from what we do know, the city put out a tender asking for someone to come in and build a predictive model for this purpose. Accenture won that tender, put someone on it, and there was a Rotterdam data scientist involved, but who presumably, or from what I can tell, didn’t have any sort of machine learning background. [unintelligible 00:30:33.25] at the city. Accenture set up the whole codebase, developed all the code for the pre-processing, trained the model, handed it over to the city, and kind of went, like, “We’re gone now.”
And from that point on, Rotterdam took full control of the model. They would retrain it every year, they made adjustments to features, and also decided to exclude some features, like nationality… But I do think that during that time Rotterdam upgraded its own data science capacity. So by the time we got there, they did have like two people who were specialized in machine learning that were looking over the model. That’s my understanding of the basic setup.
Yeah, super-interesting. I do want to get to the kind of model performance, I guess, because I know this is something that I get asked when I do workshops and talk about either like fairness or bias in models; there’s always someone that kind of comes up with the question of “Well, if the data is biased, but the model is accurate, and I’m predicting accurate results, is that a problem?” I think there’s problematic things about how you might answer that question in and of itself… But in your case, was the model actually helping in any way? Or were the problems kind of so deep in the data and the way that the labels were generated, such that the majority of what it was producing was maybe more chaos or issues?
So in the test set that the city used - and we have kind of their documentation of that, even though we don’t have the labels ourselves - we see that in the set there is a 21% baseline rate of fraud or some kind of wrongdoing… And the model, kind of depending on where you set the threshold, the model essentially has a hit rate of 30%. So out of the people selected, around 30% of them are labeled within the positive class. So it’s roughly a 10-point improvement above random. Is that good? Is that bad? The ROC curve looks absolutely terrible. Margaret Mitchell, who many listeners probably know, called it essentially random guessing. I’m not quite sure if I would go that far, but it’s certainly not anything to write home about. And we see that there’s huge disparities in who’s getting flagged and their characteristics. Does the label data show that there’s a reason for that? Maybe. But because we have some idea about how the training data was constructed, specifically through these theme investigations, there’s a very strong probability that a lot of these patterns that we see in terms of who’s getting flagged are a function of the selection process that leads to somebody being included in the training data, rather than of actual fraud being committed. I can give an example of how that might work.
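Editor’s note: the two figures quoted above can be put side by side with some simple arithmetic. Only the 21% base rate and the 30% hit rate come from the conversation; the derived quantities are the editor’s calculation.

```python
base_rate = 0.21  # share of positives in the city's test set, as quoted
hit_rate = 0.30   # share of positives among the people the model selects, as quoted

point_gain = (hit_rate - base_rate) * 100  # improvement in percentage points
lift = hit_rate / base_rate                # relative improvement over random selection
not_positive_share = 1 - hit_rate          # selected people not labeled positive

print(f"gain: {point_gain:.0f} points, lift: {lift:.2f}x, "
      f"not positive: {not_positive_share:.0%}")
```

Put differently, under these figures about seven in ten of the people selected for investigation are not labeled as having done anything wrong, compared with about eight in ten under random selection.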
Most of the men in the training data very likely were selected through one of these investigations where all men in a certain neighborhood were investigated, which have a pretty low likelihood of actually finding fraud. That implies that most women were selected by anonymous tips or random sampling, and those things have somewhat higher probabilities of detecting fraud. And so if your method of selection impacts how likely it is that the person who you investigate has actually done something wrong, then the training set that you train your model on will contain patterns that are a function of your selection method, rather than of the real world and how fraud patterns [unintelligible 00:33:57.02] in the real world.
[34:00] And so we couldn’t conclusively prove this, because we didn’t have access to who was labeled, or who was selected – within the training set we couldn’t say who came from which source, but we know that these different sources fed into the training set… And it seems very probable that this type of selection method would lead to these kinds of disparate outcomes.
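The selection effect described above can be sketched with a few lines of arithmetic. Every number below is hypothetical and chosen only for illustration; none of it comes from the Rotterdam data. The sketch assumes that each selection channel finds fraud at the same rate for men and women, so any gap in the labeled fraud rate comes purely from which channel people were selected through.

```python
# Hypothetical channels: (hit rate, men investigated, women investigated).
# The hit rate of a channel is assumed identical for men and women.
CHANNELS = {
    "theme_investigation": (0.10, 900, 100),  # broad sweep, low hit rate
    "anonymous_tip": (0.40, 100, 500),        # targeted, higher hit rate
    "random_sample": (0.20, 200, 200),
}

MEN, WOMEN = 0, 1


def labeled_fraud_rate(group_index):
    """Expected share of a group's training rows that carry a fraud label."""
    investigated = 0
    labeled = 0.0
    for hit_rate, *counts in CHANNELS.values():
        n = counts[group_index]
        investigated += n
        labeled += hit_rate * n
    return labeled / investigated


men_rate = labeled_fraud_rate(MEN)
women_rate = labeled_fraud_rate(WOMEN)

print(f"labeled fraud rate, men:   {men_rate:.3f}")
print(f"labeled fraud rate, women: {women_rate:.3f}")
```

Under these made-up numbers the training set shows women with a clearly higher labeled fraud rate than men, even though no channel treats the two groups differently. A model trained on such labels would pick up the selection mix as if it were a real pattern.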
I think there’s all sorts of things to learn in this story, as even just a data scientist setting up datasets and trying to train models… I come, of course, from a certain perspective, and kind of what touches me about this story - and I’m so glad that it’s out there and there’s some transparency around this… I’m wondering, could you speak a little bit to the reception of this story maybe more widely by non-technical audiences, in terms of realizations that people were coming to or responses that came out of people realizing how these systems were constructed, and how they perform in reality, versus maybe what their perception was prior?
Kind of two answers to that question. I think, first of all, one of the big goals of this project and the piece that we published with Wired, where we kind of take viewers through the model and how it works was to have it be an education piece of journalism, too; like, you’ve been hearing about machine learning, and the sort of impact it has on your lives, but very few stories actually take you through like a full lifecycle of that model - what does it look like “inside the machine”. So we really wanted to make an educational piece, and also talk about what Justin has covered, the different sorts of problems or flaws in the system, and what are the consequences of those flaws.
And I think normal people, of course, found the sort of discriminatory angle, or the fact that, for example, single mothers are penalized more, or – you know, I think that was something that they took away from it. But one area that surprised me a little bit, that people seemed quite fixated on or curious about, was the decision trees portion. So what we tried to do in that portion of the piece, for people who haven’t read it yet, is we take some decision trees from the model, from this gradient boosting model, and we show how this creates nonlinear interactions. So features affect each other differently relationally. So in decision tree x, if you’re a man, you might go down the right side of the tree, and if you’re a woman, you might go down the left side, and you will be evaluated by different characteristics. So that one seemed to be something that really seemed to resonate with readers, like questioning like “Okay, well, this is how it works. Is that fair to me, or it makes it difficult for me to understand how these interactions work?”
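The interaction described above can be illustrated with a hand-written toy ensemble. This is a minimal sketch, not the Rotterdam model: the feature names (`years_on_benefits`, `num_children`), the split points and the leaf values are all hypothetical. It only shows the mechanism, where the first split of a tree decides which feature is examined next, and a boosted model adds up the leaf values of its trees.

```python
def tree_x(person):
    """Hypothetical tree whose first split is on gender."""
    if person["gender"] == "man":
        # Right branch: only time on benefits is examined.
        return 0.03 if person["years_on_benefits"] > 5 else -0.01
    # Left branch: only the number of children is examined.
    return 0.04 if person["num_children"] >= 2 else -0.02


def tree_y(person):
    """Second hypothetical tree, which ignores gender entirely."""
    return 0.02 if person["years_on_benefits"] > 5 else 0.0


def risk_score(person, trees=(tree_x, tree_y)):
    """A boosted ensemble sums the leaf values of all its trees."""
    return sum(tree(person) for tree in trees)


base = {"years_on_benefits": 6, "num_children": 0}
man = {**base, "gender": "man"}
woman = {**base, "gender": "woman"}

# Same change (two more children) applied to both people.
man_more_kids = {**man, "num_children": 2}
woman_more_kids = {**woman, "num_children": 2}

print(risk_score(man), risk_score(man_more_kids))
print(risk_score(woman), risk_score(woman_more_kids))
```

In this toy setup, adding children changes the woman's score but leaves the man's score untouched, because `tree_x` never looks at `num_children` on the branch the man goes down. That is the sense in which two people are evaluated by different characteristics.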
On a political level, Rotterdam, to their credit, was quite graceful when we presented them with the results. And they sent back a statement saying, essentially – they called our results informative, educational, which in the field of investigative journalism never happens. [laughs] Like, the subject of your investigation saying it’s informative and educational I think never happened.
When they’re the subject… Yeah. [laughs]
Yeah. And called on other cities to do what they had done, to be transparent. And I found that an incredibly brave and elegant response to what we’d done. And they were sort of debating whether to continue the use of this model, and then decided that they weren’t going to use it anymore. That the sort of ethical risks were too high. And then I think – I mean, elsewhere… I don’t know, Justin, if you have any reactions that stuck out to you…
Yeah, maybe the one thing that I would add is that I think this field of algorithmic accountability reporting, but even the academic discussions around it, has - I don’t want to say suffered, but has been kind of constrained a little bit by a streetlight effect following Machine Bias. You had this big story coming out, and then afterwards, for years, everybody was talking about these various outcome fairness definitions. And I think that’s a very valuable debate. I myself almost enjoy it. I think some of it is just mathematically very interesting. It’s really difficult ethical questions that it brings up. But I think a bunch of the other dimensions of fairness in the lifecycle of the system have been neglected.
[38:15] Gabriel and I, in the past year, have kind of been making the rounds and making the case to people that we should be looking at algorithmic fairness more holistically. We should look at the training data, we should look at the input features, we should look at the type of model that is being used and how that maps onto our understanding of the process… And then we should also look, of course, at the outcome fairness stuff.
But I actually think - and your reaction kind of spoke to that… I think this training data bit is probably the most interesting one, and one that I - I have both academic training as a computer scientist, and also as a political scientist. And when I took my computer science classes, nobody ever talked about how do you set up a representative sample. It was kind of like “We take whatever data we have and then we try to run as many models over it.”
Use it all, and all features.
Right. And - well, that might kind of up your performance along certain metrics. On some level, if the data doesn’t contain the functional relationship that you’re trying to model, you can’t get there. And I think that’s a lesson that I hope some of the – yeah, maybe practitioners who read our piece also take away from it.
Yeah, that’s super-helpful. I think you got to where I wanted to ask anyway, because I know we have listeners that are practitioners, and are probably thinking to themselves, “What is a kind of takeaway that I can take away from this?” Because I would say, from my experience at least, most data practitioners are not intentionally trying to create harmful outcomes from their systems. They do actually want to be responsible, it’s just sometimes they might be somewhat confused or constrained in certain ways that don’t allow them to spend time thinking about those things. But yeah, I really appreciate you bringing us around to that.
As we kind of close out here and we look maybe to the future… When we started out this conversation I kind of mentioned there’s all of this talk, of course, constantly swirling around us about the dangers of AI and all that stuff, which is operating on multiple levels, some of which are useful and some of which aren’t, probably… But I want to ask both of you maybe as you look towards the future, post this project, what you’ve done here, what’s on your mind as you look towards the future of how this technology is ever-expanding? What gives you pause, what gives you hope? What do you hope people are thinking about as we kind of look to the future in how this technology is developing?
So there’s two things I would respond to that. One is that I hope we’ll have more discussions around transparency around these systems. I think that’s a precondition for anything else. And for that to happen, there is an argument that needs to be dispelled. And that argument is that making these systems public allows people to game them. One, I think it’s really, really hard, and there’s some very good academic research that shows how hard it would be. And two - well, these systems operate essentially like bylaws, right? They’re essentially administrative guidelines encoded in a model file for how a decision was being made in some bureaucracy. And I think it’s really hard to make the case that such guidelines should be secret.
And so yeah, I think we need to have a discussion and make the case proactively that transparency in this space, and encouraging people to learn how they work is a good thing, and encouraging people to game those systems is probably a good thing, because that means you’re probably closer to abiding by the law. And if you can game the systems, then maybe they aren’t very good. That’s the first thing I want to say.
[41:45] The second one is that most of the systems we’ve looked at are pretty terrible, in most ways. I think they don’t work very well. They either use features that are absolutely terrible, or have training data construction that is really problematic, or have disparate impacts on various groups. Almost every single system we’ve looked at so far has one or multiple of these features. But there are some systems that maybe are better. And it’s possible, I think, if you think very seriously about how you do each of these steps - the feature selection, the training data, and then constructing the model, and then evaluate for bias, and then potentially retrain/reweigh your training data, and so on, maybe it’s possible to get to a better place. Technically, it certainly is. And I think then you get to a different set of questions. And I hope that the conversation at some point can move beyond kind of the gross incompetence which we’re showcasing across the board, and can move to a place where we can discuss “Okay, let’s take this best-case scenario. We have a system that doesn’t have obvious bias and so on, that was constructed carefully. Should we do this? Is it a good idea? Is a machine making the decision removing something inherently kind of valuable from the type of interaction? Is the machine actually more explainable than a human is? And is that a good thing? Is it equal treatment, because everybody’s being scored by the exact same system and not by individual caseworkers? Or is it not equal treatment, because the tool contains a decision tree based model, and so different people are scored based on different characteristics?”
How do we think about systems that include some level of probabilistic assessment? Is that something that we think an administrative position should do? And then of course, we can also have the maybe fun for some people discussions around like which fairness definition is the best, whether we should seek to minimize or equalize false positive rates across different groups, and so on.
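To make the false positive rate comparison concrete, here is a minimal sketch of how it could be computed per group. The records are a made-up toy dataset and the group names `A` and `B` are placeholders; the only thing taken from the discussion is the metric itself, the share of people who did nothing wrong but were flagged anyway.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per-group false positive rate.

    records: iterable of (group, flagged, actually_fraud) tuples.
    Returns {group: flagged innocents / all innocents}.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, flagged, actually_fraud in records:
        if not actually_fraud:
            negatives[group] += 1
            if flagged:
                false_positives[group] += 1
    return {group: false_positives[group] / n for group, n in negatives.items()}


# Toy data: each group has 10 people without fraud and 2 with fraud.
records = (
    [("A", True, False)] * 2 + [("A", False, False)] * 8 + [("A", True, True)] * 2
    + [("B", True, False)] * 5 + [("B", False, False)] * 5 + [("B", True, True)] * 2
)

rates = false_positive_rates(records)
gap = abs(rates["A"] - rates["B"])
print(rates, gap)
```

Equalizing false positive rates would mean driving `gap` toward zero, while minimizing them would mean lowering both entries of `rates`. The two goals can pull in different directions, which is part of why the choice of fairness definition is a debate rather than a calculation.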
I think there’s a bunch of really important questions that society has to grapple with here, but I don’t think we’re there quite yet in most cases. And so long as we aren’t, I think Gabriel and I will have plenty of work showcasing incompetence, and all that stuff. But I hope that at some point we can move beyond that.
Yeah. Anything to add, Gabriel?
No, I think Justin summed it up really well. I’ll just kind of tease that we do have some reporting that’s coming up in the coming year, that will grapple with some of these thornier ethical issues, ask questions like when and if ever is it okay to use these systems…
I think maybe one thing that I will add though is I think it is important for people like the practitioners in your audience who are listening to also take a step back and to maybe not see always the deployment of these systems or this sort of thorny fairness question as like a math problem, but it can also be a sort of wider societal problem as well. So for example, in the European welfare context we’ve seen everywhere we’re looking models that attempt to detect fraud. But what we don’t see is models that try to find people who are eligible for welfare benefits, who aren’t using them because they’re afraid of the system. And we know this is a huge problem. In places like France 30% of people eligible for welfare don’t use it, because they’re scared of the system. This has consequences for people not using welfare, but it also has consequences downstream for society. So imagine families that aren’t able to feed their kids, developmental issues that come from that… So I think it’s always important and something we tried to raise in our reporting, to kind of take a step back and ask “Should we be doing this?” To think about the premise of why are we actually deploying this model, and to rethink that at some points and think about, “Is there a better way to use this technology? Or are we only kind of narrowly zeroing in on one piece of this picture?”
Yeah, that’s great. I think that’s a really wonderful encouragement to end things with. We will certainly be on the edge of our seats looking for your future work, and I encourage everyone - we’ll include the links to Gabriel and Justin’s work in our show notes, so I encourage you to go and explore it. There’s lots of great graphs and references, and even more technical description of the methodology than we had time to go into here… So dig in and learn about what they’re doing; it’s really wonderful. And yeah, thank you for your work, Justin and Gabriel, and thank you for taking time to join us.
Thanks so much.
Thanks so much for having us.