At this year’s Government & Public Sector R Conference (or R|Gov) our very own Daniel Whitenack moderated a panel on how AI practitioners can engage with governments on AI for good projects. That discussion is being republished in this episode for all our listeners to enjoy!
The panelists were Danya Murali from Arcadia Power and Emily Martinez from the NYC Department of Health and Mental Hygiene. Danya and Emily gave some great perspectives on sources of government data, ethical uses of data, and privacy.
It is great to be back at R Conference. I spoke at R Conference New York when it was in-person, and I remember a heated discussion, since I was living near Chicago at the time, a heated discussion about various types of pizzas… And there were things said, and all was in good fun, but it was a good time. So I miss that, but this has been a great experience; it’s great to be here.
We’re gonna discuss today how data professionals can engage with governments on AI for good projects. This panel will actually be recorded and released after the conference as well on the Practical AI podcast. So if you’re a podcast listener, there’s a link there, you can follow along and listen again, because it’s going to be such a good time.
Today we have with us a couple of great panelists who have agreed to share their vast expertise with us. We have Danya Murali, a quantitative analyst at Arcadia Power. She’s a data scientist with a passion for energy, environment and climate, and currently, she’s working on the data team at Arcadia Power, a fast-growing startup in the Washington DC area, seeking to create a 100% renewable energy future while saving customers’ money on their power bill.
We also have with us Emily Martinez, who is an interoperability unit chief at the NYC Department of Health and Mental Hygiene. She has her master’s degree in public health from Columbia’s Mailman School of Public Health, and she currently spends her time connecting healthcare providers to the CIR in anticipation of the Covid-19 vaccine, which is of course very timely, and also very much fitting within the AI for good space.
[00:04:15.21] So I’d like to maybe just start out this panel by asking – maybe we can start with Danya, just to get her sense of… When we say “AI for good”, or “data for good”, or “data practitioners working for good”, what does that mean to you in terms of your day-to-day, and things you’ve seen in your career?
Yeah. Thanks Daniel, and thanks Jerod for having me back. I am so excited to be able to be at this conference, even if it’s virtual this year… So yeah, AI for good - that’s something that I care about immensely. What I think about when I hear the term AI for good - it’s the idea that we should be using data and we should be using AI for things that are equitable and helpful to communities across the board; so not just one particular community, or a group of people, but kind of helping all people together, and being very aware and cognizant of how you can use that data and how you can use AI to achieve that goal.
And I guess a follow-up on that - what do you think about the current state of how data and AI or data science is being used? Do you see it being used equitably now, or are there places where that’s not true?
Yeah, that’s a really good question. I think it’s on a path of improvement, but still in a place where we should be all more aware of what impact our use of data and our use of AI can have. I’ve heard some scary cases of using things like facial recognition for police-related things, which could disproportionately affect communities of color, and hurt various communities… So there’s scary moments like that, but then there’s also really great things, like what we heard today with the panel with tracing Covid, and using data and using AI to try and help the general health of our population.
At Arcadia, we’re focused on trying to make renewable energy something that is just accessible to all people, no matter what community you’re from… And also, to do it in a way where we are making sure that we’re not removing, but enhancing energy justice. So that’s something that we really care about; as data scientists at Arcadia we’re constantly thinking about how our products affect different people and how we can build for all.
Awesome. That’s exciting. I wanna pose the same question to Emily, to see your perspective on what triggers in your mind when you’re thinking of using data for good, or being an AI practitioner working for good?
Yeah. First, thanks Jerod and the team for having me tonight. I think from my perspective coming from a public health and local government bubble, I think we’re always working for good, and we have at our fingertips access to incredible data, data that we can take action on pretty quickly and it impacts communities very quickly, so I think we’re on the receiving end too of outside private companies that work with government companies, so that we’re able to create and use newer technologies, or methods… I view things a little different. I’ve always been using data for good.
Yeah, that’s great. And you mentioned this interplay between governments, whether they be federal or local governments, and private entities… I was wondering - of course, now that we’re in this whole time of Covid and pandemic, and we’re getting to hopefully on the horizon some distribution of vaccines, like was mentioned in your bio, that you’re specifically involved in that work - that necessarily involves a number of private companies, it involves logistics companies, and all of those sorts of things… So I was wondering if from that perspective you could give us a little bit of a sense of, at least from your own experience, how governments and private entities/commercial entities can work effectively together.
[00:08:20.21] Yeah, I think from my perspective, and with the Covid examples, I think the collaboration has been great. It’s been nice to be able to contract with outside companies. I know particularly we were interviewing different vendors that we could partner with to establish a new point of dispensing system.
Prior to working with these vendors, everything was on paper. Point of dispensing means we’re set up to pop up, for example, a mobile clinic at a school where staff will be able to vaccinate people, and people can just approach the school to get vaccinated… And all that needs to be checked somehow. Previously, this was all [unintelligible 00:09:00.02] So that can still be happening in 2020…
So it’s nice that we’re able, because of the severity and the impact that Covid has had worldwide, that we can move forward with using better and efficient data products.
Yeah, great. And of course, one of the underpinnings of this - like you mentioned, governments, whether they be national or local governments, have a lot of data that can be immediately put to use for certain AI for good purposes, whether that be as related to health, or I think as related to energy and other things.
Danya, I wanna kind of kick it back to you and hear from your perspective what the role of data from governments is in your own work in the energy industry.
Yeah, we use a lot of government data. Before I worked at Arcadia, I used to work at the Energy Information Administration, which is the statistical hub of the Department of Energy. So it’s been a cool transition to go from the government agency that collects the data, to being a private entity that uses the data…
But one example I can give about how we use EIA data for good is we try and give our customers an understanding of what their CO2 impact is of their monthly residential electricity use. It’s sort of like being able to keep track of your carbon footprint, but in a very precise way, that is very much related to your individual usage. We do that using EIA data. So the power of government is you have the ability to go and collect data, and sort of mandate that the data that you’re surveying is getting responded to… Which is great.
One example is EIA has a survey of all the combustion power plants across the U.S. - so all the coal power plants, natural gas, oil… All of those power plants, and how much of each fossil fuel is being used in a year for each of those, and also how much CO2 is being emitted. So using that dataset, and using our individual customer data, we’re able to create individualized forecasts and calculations of how much CO2 is released by using a certain amount of electricity, and then also how much CO2 is averted when you’re able to source your power from places like wind and solar. So there’s a very direct relationship between how we use government data and our data to achieve that.
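The calculation described here can be sketched roughly as follows. This is a minimal illustration, not Arcadia's actual method: the emission factors are made-up placeholder numbers rather than EIA figures, and the function names and the simple weighted-average approach are assumptions introduced for the example.

```python
# Hypothetical emission factors in kg CO2 per kWh generated.
# Placeholder values for illustration only; they are NOT taken from EIA data.
EMISSION_FACTORS_KG_PER_KWH = {
    "coal": 1.0,
    "natural_gas": 0.4,
    "wind": 0.0,
    "solar": 0.0,
}


def grid_emission_factor(generation_mix):
    """Weighted-average kg CO2 per kWh for a mix given as fuel -> share of generation."""
    total_share = sum(generation_mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("generation shares must sum to 1")
    return sum(
        share * EMISSION_FACTORS_KG_PER_KWH[fuel]
        for fuel, share in generation_mix.items()
    )


def monthly_co2_kg(usage_kwh, generation_mix):
    """Estimated CO2 (kg) released by a customer's monthly electricity use."""
    return usage_kwh * grid_emission_factor(generation_mix)


def co2_averted_kg(usage_kwh, generation_mix, renewable_share):
    """CO2 (kg) avoided when a share of usage is sourced from zero-emission power."""
    if not 0.0 <= renewable_share <= 1.0:
        raise ValueError("renewable_share must be between 0 and 1")
    return usage_kwh * renewable_share * grid_emission_factor(generation_mix)


if __name__ == "__main__":
    mix = {"coal": 0.5, "natural_gas": 0.5}  # hypothetical regional grid mix
    print(monthly_co2_kg(1000, mix))
    print(co2_averted_kg(1000, mix, renewable_share=0.5))
```

In practice the factors would be derived from plant-level fuel and emissions data such as the survey mentioned above, and the customer's usage would come from their utility bill.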
Yeah, that’s awesome. It’s cool to see how granular you can be with some of that stuff, and see how it affects individual people’s energy usage, and maybe their own mindset around that, which is really cool.
We had a question from our audience which I’d like to propose here; I think it’s a good one. So we’re on the topic of datasets and how some of these government datasets can be used with great success for AI for good projects… I was wondering - maybe we can start with Emily in the healthcare space… What are some go-to resources out there for people that are maybe wanting to either contribute in the healthcare space, or look at data from the healthcare space? Of course, some of that has some privacy concerns and all of that with that… So what’s the situation there in terms of data that people can access, and data that needs to be protected in certain ways?
[00:12:27.29] Yeah, that’s a great question. I think right now there’s a lot of data that’s been public and easily accessible. I know in New York City there’s the Open Data Platform, where different datasets from actually different agencies, including from the Health Department… So there’s a range of things, and I think they’re pretty much up to date. There might be a lag in how recent the data is, but there’s actually quite a variety of data that can be used by anyone who just wants to play around with the data. I think there’s also New York State’s public datasets, and other states as well… I think right now data is very much accessible to anyone who wants to take a look.
In terms of security - yes, there’s a lot of security, especially patient demographic information, and all of that. Most of the data that’s cleared is probably aggregate or has no way that it can be connected to a particular patient. So that is very much kept very tightly within the Health Department.
What about in the realm of energy, Danya? What’s the situation in terms of – it sounds like you have this tool where people are able to utilize your analysis to understand their own energy use… What about maybe for participants in this conference who are interested in maybe coming up with their own analyses, or doing a little bit deeper study as related to energy? What is available out there for them to potentially utilize?
Yeah, there are a lot of great government and otherwise datasets surrounding energy. Two things I highly suggest – one (I sound like a broken record), I would highly suggest you go to eia.gov, where you can see… They just collect a ton of data about all energy across the board… But also going to the EPA website. They have a really cool green energy calculator, where you could put in how many miles you drove, or how many airplane rides you took - which I’m sure is not very many this year - and how that led to CO2 emissions and other environmental factors.
Another place I would also suggest is just going to Kaggle. I don’t know if you guys have used Kaggle before, but that is a really great place to get a lot of different types of data - both healthcare data, and energy data, and pretty much any type of data you want. Also, one of the things that’s really nice about Kaggle is that oftentimes people have already done analyses, and it’s a very open source space, where you can contribute to other people’s things, or pull from other people’s things. So that’s a [unintelligible 00:14:52.20]
Yeah, those are great. I’ll kind of throw in my own contribution here - so I work mostly in the language space for an NGO, and there’s a lot of great language and speech data out there. In particular, Mozilla has done an amazing job with Common Voice, which is a large dataset of transcribed speech which is out there, in all sorts of languages… And there’s a whole bunch of open data that you can use for machine translation projects if you search for Opus, which is an Open Parallel Corpus. You can download a bunch of that… And I would encourage people to dig into that.
It’s pretty interesting when you start doing some natural language processing on languages other than English, especially because, for one, it can help benefit tools and support for languages, but also, you run into all sorts of things fairly quickly with scripts other than Latin script, and really long words, or languages that don’t use spaces… So you have to think about all sorts of interesting problems, and so it’s also really interesting from that perspective.
[00:16:03.04] While we’re still on the topic of data and using data for good, I know that one element of this is also making sure that we’re responsible; no matter what project we’re working on, making sure that we are using data in a responsible way, and not either showing bias against certain populations, or possibly even unintentionally doing things that might harm people. Danya, do you have any thoughts on that front, in terms of in your own work, or just what you’ve seen in practice across industry, some good practices or things that people should keep in mind with respect to that?
Yeah. So one thing that I strongly advocate for is to be doing race-conscious data analysis, instead of race-blind analysis. Using things like ZIP codes can very easily be a proxy for race and socio-economic status… But that doesn’t necessarily mean that you should not look at those things, like just covering your eyes and being like “I’m not gonna look at race, I’m not gonna touch race.” That’s not the way to do it. To make sure that you are being equitable and you are really thinking critically about how your data analysis affects different people, you need to really be thinking about who those people are, where they come from, what their needs are… And to do that, you should be looking at things like race, and ZIP codes, and the relationship between those things when you’re doing your analysis.
So when is the time to do that? Is that when you’re sort of doing your initial pre-processing and setting up your project? Is that while you’re doing your analysis? Is that afterwards, and monitoring? Where can that fit in?
I think it fits in in many places. I would say that it should be in every step of the analysis. So when you’re collecting data, making sure that you’re collecting a representative sample, for example. When you’re doing the analysis, making sure that you’re not making decisions or taking averages or doing things that are biased… So in every step of the data analysis lifecycle, being cognizant of that.
I can give a more particular example… At Arcadia, we recently started a chapter, a Diversity, Equity and Inclusion chapter. That contains people from all across the organization. So we have engineers, we have data scientists, we have people that work on the member experience team… And we all come together and we talk about how our different products relate to diversity and inclusion, and how the data that we collect and the data that we use to create products affect that.
So by having outside of your day job, an outside chapter that looks into these things and considers these things is also good. Having checks in place to make sure that we are looking at different types of diversity data.
Another thing that we’ve been doing recently is just looking at the demographics of our customers to make sure that we’re not over-indexing in a certain population, and if we are, applying the corrective features.
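One way to picture the over-indexing check just mentioned is to compare each group's share of the customer base with its share of a reference population. The sketch below is only an illustration under assumptions: the function names, the ratio-based index, and the 1.25 threshold are hypothetical choices, not a description of Arcadia's process.

```python
def representation_index(customer_counts, population_shares):
    """Ratio of each group's customer share to its reference population share.

    A value above 1 means the group is over-represented among customers,
    below 1 means under-represented.
    """
    total_customers = sum(customer_counts.values())
    if total_customers == 0:
        raise ValueError("customer_counts must contain at least one customer")
    index = {}
    for group, population_share in population_shares.items():
        if population_share <= 0:
            raise ValueError(f"population share for {group!r} must be positive")
        customer_share = customer_counts.get(group, 0) / total_customers
        index[group] = customer_share / population_share
    return index


def flag_over_indexed(index, threshold=1.25):
    """Groups whose index exceeds the (hypothetical) threshold, sorted by name."""
    return sorted(group for group, value in index.items() if value > threshold)


if __name__ == "__main__":
    # Made-up counts and shares, for illustration only.
    customers = {"group_a": 60, "group_b": 40}
    population = {"group_a": 0.4, "group_b": 0.6}
    idx = representation_index(customers, population)
    print(idx)
    print(flag_over_indexed(idx))
```

A flagged group is a prompt to look at outreach, pricing, or product design, not a conclusion in itself; the choice of reference population matters as much as the arithmetic.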
So it’s definitely one of those things where you can mess up, and that’s okay; and you can fix it. And if you’re aware of it through all the different steps of your data analysis process and [unintelligible 00:19:18.00] public development process, I think you can really do some good.
Yeah, I appreciate that perspective. I wanna return over to Emily to the healthcare space… I know one of the things that you mentioned when we were talking prior to the panel is your work connecting patients to services during this pandemic. This maybe connects to one of the questions we got as well from the audience, which is – you know, I guess in that scenario there’s also this element where certain populations, certain demographic factors have been shown to have higher risk and higher concentrations of Covid, higher death rates, all of that sort of thing… How do you balance collecting and using sensitive, maybe racial information or demographic information when maybe some people might not want to give that information, but you might want to utilize it particularly in healthcare for certain purposes? What questions go through your mind, and how do you handle some of those sensitive pieces of data, making sure that you’re not exploiting those, or gathering them when you shouldn’t, but also gathering them when you should, for good reasons?
[00:20:32.15] Yeah, I think the city already has a good picture of where those areas are… So reaching out to the community with low resources, socio-economic status… A lot of programs are developed to help those particular communities. So I think we kind of already know where that is, and a lot of the data reflects – we use a lot of the data to find where these disparities are, and they always align. Wherever there’s a high index, it also matches poverty numbers. So there’s a good correlation on that.
In terms of sensitivity, I think the data has – we always use the data carefully, and we already have these ties to the community, so I don’t think there’s a problem that we might backfire in our communication. I think there’s a lot of trust in that sense.
Yeah. I think that’s a really great point, in terms of the connection to the community, and not creating projects without any communication and trust between you and the community, and then sort of forcing a solution in and saying “Hey, this is gonna fix all of your problems.” There needs to be an open line of communication there. I think that’s a really good point.
So I think we’re pretty much out of time, but I do wanna just give one more question to both of you. We can start with Danya. I would just be curious to know what excites you about the future of using data for good? What excites you about the potential there and what impact it could have?
Yeah, I love data… I also especially love R. One of the things that I’m really excited about that has grown in probably the last decade is the use of open source data, and contributing to software, like R… And then also conferences like this, and different organizations, like R-Ladies, that really try and get more people of various types of communities to come in and be data analysts, and make it so it is accessible across the board. Things like General Assembly, and these other types – or like Codecademy; these are online classes where you don’t need to get a degree to build a new data analysis and have an impact, and bring your own personal bit to it; you can go off and do that online.
So I feel like a lot of the barriers to being data-literate and being able to make really smart decisions and choices have lessened, and I think they’re gonna continue to lessen. So it’s very exciting; I think it’s an exciting time to be in this data space.
Awesome. What about you, Emily? What are you excited about?
Yeah, I agree and I echo everything that Danya said. I think also particularly in local government there has been an increase in using open source tools, using R particularly in the Health Department… I think it was a couple years ago where we got an R server, and we were all excited about that… And that has really pushed a lot of new R users. We’ve been mostly SAS users. So I think we’re going in a really good direction, and there’s a really big interest within local government about data science. It’s just been very accessible.
I think the biggest part is a sense of community. In every sector, data analysts have some forum where they can reach out and ask others like them if they have any questions… I think that’s been the biggest thing, that sense of community.
Awesome. Well, thank you both for such a great discussion. I really enjoyed it. As a reminder to everyone, this will be published again on the Practical AI podcast. Check that out if you’re into podcasting.
Thank you again to the R Conference for making this happen. I’m really glad that this conversation happened, and this content will be accessible. Enjoy the rest of your conference.