Copilot lawsuits & Galactica "science" (Practical AI #202)


There are some big AI-related controversies swirling, and it’s time we talk about them. A lawsuit has been filed against GitHub, Microsoft, and OpenAI related to Copilot code suggestions, and many people have been disturbed by the output of Meta AI’s Galactica model. Does Copilot violate open source licenses? Does Galactica output dangerous science-related content? In this episode, we dive into the controversies and risks, and we discuss the benefits of these technologies.


44 minutes
Recorded Nov 23, 2022
Published Nov 29, 2022

Featuring

Chris Benson
Daniel Whitenack

Notes & Links


Related to Copilot:

Related to Galactica:

Books

Chapters

Chapter Number	Chapter Start Time	Chapter Title	Chapter Duration
1	00:00	Opener	00:28
2	00:31	Welcome to Practical AI	00:42
3	01:14	Intro	01:13
4	02:27	My barber does crypto	11:57
5	04:50	What could go wrong with AI?	09:34
6	14:33	Is code really unique?	04:26
7	18:59	Governance in AI	06:48
8	25:54	Galactica	15:32
9	41:26	Wrap up	01:58
10	43:25	Outro	00:45

Transcript


Daniel Whitenack

Welcome to another Fully Connected episode of the Practical AI podcast. This is where Chris and I keep you fully connected with everything that’s happening in the AI community. We’ll take some time to discuss the latest AI-related news and dig into some learning resources to help you level up your machine learning game. I’m Daniel Whitenack, I’m a data scientist with SIL International. I’m joined as always by my co-host, Chris Benson, who’s a tech strategist with Lockheed Martin. How’re you doing, Chris?

Chris Benson

Doing fine, as we are recording this episode the day before Thanksgiving…

Daniel Whitenack

Yes, U.S. Thanksgiving is tomorrow.

Chris Benson

That’s right. And I know that we both have our day jobs, and we just have nothing to do today, do we? There’s just not much going on…

Daniel Whitenack

Right… If only. If only.

Chris Benson

We were talking beforehand, and both of us were “Oh gosh, it’s quite a busy day for the day before Thanksgiving.” But you know what? We have a few minutes to talk about some fun stuff here.

Daniel Whitenack

Yeah, exactly. I hope you’ve got your tofurkey or whatever you’ve got ready for tomorrow. I don’t know what we’ll have…

Chris Benson

Absolutely. I got myself some vegan bird here.

Daniel Whitenack

Nice. Nice. I like it. I like it. So I’m gonna maybe start with a story, Chris, because this is kind of what prompted some of my thoughts around this episode… So I live downtown in the town where we live here, and there’s a barber a couple blocks away, I go and get my hair cut from this barber… And he’s big into crypto. When NFTs were really hot, he was pouring thousands and thousands of dollars into NFTs, and he’s got all this stuff he’s doing… Anyway, he lost a bunch of money with NFTs… But then, the last time I went to go get my hair cut, we were talking about this recent controversy around FTX… And just a sort of disclaimer, we’re not going to be talking about crypto or Bitcoin this episode, or blockchain… But it sort of prompted my thinking, because basically, for those that aren’t aware, recently there was this crypto exchange, FTX, the founder/owner, Sam Bankman-Fried, basically, he was a kind of an industry leader, well respected, but he’s kind of turned into an industry villain - he lost most of his fortune and bankrupted a bunch of things… Like, a $32 billion plunge in value of this FTX exchange. And I was talking to a couple of people interested in this, like my barber, who maybe - I don’t know how much he is an expert, but thinking about how this is a major setback to those that are kind of promoting blockchain technology, cryptocurrencies, crypto-whatever… And it got me thinking, what sort of controversy or event could prove to be a major setback to the AI industry? Or is such a setback possible? So that’s my first question to discuss on our day before Thanksgiving… I guess we can first give thanks that such an event maybe hasn’t happened, although maybe smaller controversies have happened…

Chris Benson

[04:30] Yeah… Although, before we kind of move fully over to the AI side from the crypto side, I happen to be staring at Sam Bankman-Fried’s Wikipedia page, and I’m looking at his hair… And as you mentioned the barber and stuff, there’s gotta be a joke there. That’s all I’m saying.

Daniel Whitenack

Yeah… [laughs]

Chris Benson

So moving back over to AI… Well, I kind of feel like you’ve set me up, because you’re like “What could possibly go wrong with AI?” And you know…

Daniel Whitenack

That would be a major setback to the industry, right? So not just like a bad thing. So there’s certainly – I think we can both say there’s been bad things happen with AI, no doubt, right?

Chris Benson

Absolutely. I think it would be the degree of badness, potentially, on a scale of bad things.

Daniel Whitenack

What’s the scale of badness, 0 to 10? What’s at the 10?

Chris Benson

Well, 10 is that you have significant loss of life that’s caused by AI inference. And specifically, because I work in the industry I work in, I’m gonna say unintentional loss of life by that. I’m not saying that there’s AI – I should be careful. We don’t have AI that’s intended – I’m just saying, in the future, sometime, as things develop… I’m having to put in all the careful things… That yes, if there was AI in some industry, and it resulted somehow in unintentional loss of life, then that would be a very bad thing.

Daniel Whitenack

Right. So like if all the airlines started flying autonomously, and there was an airliner that was flying autonomously and had significant loss of life, or something like that.

Chris Benson

Indeed. And when you really think about it, that is something that people are already talking about for the future, is AI running various types of vehicles, some of which are on the ground, some of which are in the air… And there may be instances of that out there in the world. So yes, an airliner would be a big thing.

I have to say, as we’re talking about this kind of scenario though, I’m totally recognizing the tragedy of that; I have always found it very interesting as a perspective. So in terms of loss of life, we react to it, depending on what the cause is, in a different way. And so different results – there are some things that people look in the news, and they hear about people dying, and it’s kind of remote from them, and they kind of move on very quickly, and go –

Daniel Whitenack

Or it’s a story they’ve heard before.

Chris Benson

Yeah, “That’s a bad thing, and I’m sorry to hear that happen”, but they kind of move on. And then there are other stories where they kind of get very emotional about it. My suspicion is that should such a story in the future evolve where it was AI-driven, that would get to a whole new level of that. And I think the interesting thing for me, psychologically, is the fact that in all cases, it was the same loss of life. But the way we choose to react to it can vary. So it’s just an interesting psychological point from my standpoint… I don’t think it would stop AI, but I do think such an event would create a lot of pause.

Daniel Whitenack

[07:47] Yeah, I think in my mind it’s not a ceasing of AI research or something like that, but more maybe a slowdown, or intense regulation until more reasonable regulation comes into play. We both talked quite extensively on the podcast about how government regulation and laws around algorithmic decision-making and that sort of thing are lagging quite far behind the scale at which people are using this technology… Which is sort of a scenario that would kind of create some awkwardness.

One of the things that I wanted to bring up this episode, as we talk through this issue, is one of those awkwardnesses that has been created. Some people might see it as a really big deal, and some people might see it as not a problem at all… We’re not lawyers, and we’re not in a position to weigh in on how this will all go, but I think we can present some of the things that are happening right now… And the one that came to my mind was GitHub Copilot, which - I’m actually a huge fan of, so maybe I’m biased in this discussion. As far as I know, we’re not sponsored by GitHub Copilot, or Microsoft, or anything… But I do like the product, and I use it.

It is interesting… So I’ve found this article, and the article’s title is “GitHub Copilot isn’t worth the risk”, and it’s sort of geared, I guess, towards a CTO type… And the thought is “Should you allow your engineers to use GitHub Copilot?” And it was kind of really – I mean, it was really good timing for me to see this article, I think, because literally, a couple of the data scientists on my team were asking me the week before, “Is it okay, is there a policy against us using GitHub Copilot? Or is there any issue with us using GitHub Copilot in our day to day work?” So I had already been thinking about this, and one thing that struck me is I was already using GitHub Copilot without maybe realizing some of the implications around some of the things brought up in the article. But now, people on my team are asking me, “Should they use GitHub Copilot?” And so I thought that the timing was really good.

One thing to acknowledge here is, if people aren’t familiar with GitHub Copilot, it’s sort of an AI-enabled assistant that kind of is there in your IDE, or your code editor with you, and suggests certain blocks of code, or converts comments into code… Like, you can say, “Function that transforms this data into that”, and it’ll kind of draft that out for you, and it’s quite nifty. So the first acknowledgement is GitHub Copilot is obviously very powerful, and I would argue useful; otherwise, we probably wouldn’t be having this conversation.

Chris Benson

I would, too. I personally like – I’ll use it for my personal things, and I really like it, because especially if I go in and out of coding, where I’m coding sometimes, but then I’ll go periods of time where I’m not coding, and things will slip, and it’s a great way of kind of getting back into quick productivity by getting those suggestions, and often I’ll see them and go, “Oh, yeah. Oh, yeah. Oh, yeah. Do that, and select that and all that…” So it’s a great tool. I will confess that to this day, I still am – I have this kind of discomfort with the idea… I think it’s that open source mentality of like – I don’t think anyone… And I’m not talking about the legality of it; I’m talking about the - when people submit open source to GitHub, and if you look back in the long history of that, they do expect other people to use the code, and adopt it, and all that, but I think that kind of pervasive, making a large company’s infrastructure out of it… There’s a discomfort that I’ve talked with other people about, and everyone kind of has this uneasiness about that in these conversations, about that aspect of it… So I’m guilty of using it, I like it, but I’m never quite comfortable with it.

Daniel Whitenack

[12:21] Yeah. I think that part of it is – well, maybe not feeling guilty, but what are the implications of it, which I think I’ve thought about a lot more over the last couple of weeks… And to spoil the ending, I’m still using GitHub Copilot, and I guess maybe during this episode you can tell me if that’s a wise decision or not… But the controversy, or the recent sort of swell of discussion around this I think is based – I mean, there’s a build-up to it, but on November 3rd there was a lawyer that filed a class action lawsuit against GitHub, Microsoft and OpenAI related to GitHub Copilot. And the basic charge is that Copilot’s suggestions aren’t boilerplate or sort of novel, but they bear kind of unmistakable fingerprints of their original authors… And according to a lot of open source licenses, if you’re not giving at least attribution to those copyright holders, even if it’s an open source license, then you’re in violation of the license.

Chris Benson

Yeah. It’s an interesting idea… The thing that I wonder when I hear that is that writing code is so structured that in a lot of cases you can have different programmers coding in a very, very similar style, and maybe even selecting the same variable names, and stuff like that. So does that mean that it’s actually pulling from someone’s direct copyrighted code, or if there are a thousand versions of the same function that all literally are named the same - does that imply the same thing? And I don’t know the answer to that, but it’s an interesting conundrum.

Daniel Whitenack

Yeah, Chris, I think what you’re talking about, about when does code show unmistakable fingerprints of its original authors, and when is it boilerplate - that in and of itself to me is a hard one to navigate, I think, because… I was just having a discussion with my brother-in-law - Ed, shout-out if you’re listening, which I don’t think you are… But if you are listening… He’s learning JavaScript, and learning some frontend development, and that sort of thing… And we had this discussion the other day, because he’s like “Well, there’s this piece of this app that I’ve used, and I can see the code, and I’d like to just sort of take that little bit and modify it over here in my little app to do a similar thing… It’s basically the same thing, but slightly different, but how many ways can you write this for loop? I feel like I’m stealing from this guy in taking it, but it’s basically the right way to write this loop and do the thing. So do I copy that over and modify it?” I think in a normal sort of open source world, if you are copying things out, or integrating in certain libraries, or something like that, like I say, there are kind of attribution elements to it, and there’s dependencies in terms of how restrictive your license is versus the source license, and all of that, and there’s all sorts of things around that… But as an individual code writer, or a programmer, you can navigate those things, because it’s not like – you’re taking code maybe from this project, X project, and you can see the license, and you do what the license tells you to do, right? You make that decision actively. But in GitHub Copilot, I’m in my VS Code, and I’m typing along, and then boom, there’s a block of code. I have no idea if that’s verbatim from someone’s repository, or if that’s something unique that’s like some morphing of various things together, right?

Chris Benson

[16:51] I’m just curious, could that be solved if they added a feature that either specified it was from a specific source, or explicitly disclaimed that it was inferenced code, and not from a specific source?

Daniel Whitenack

Potentially. I think the most foolproof workaround, or solution, is to train the model that you’re using only on explicitly permissively licensed code, right? So this is the stance that – there’s another offering called Tabnine, and Tabnine is specifically, in my understanding, trained on permissively-licensed code, which would not have some of these same copyright issues…

Chris Benson

Like MIT versus GPL.

Daniel Whitenack

Yeah. So I think the one that’s been called out a lot with GitHub Copilot is GPL. I’m just looking at a tweet here from Tim Davis (@DocSparse)… This is one of the ones that originally got a lot of attention, where he’s saying “Copilot emits large chunks of my copyrighted code with no attribution, no LGPL license. My code on the left, GitHub on the right”, and he shows the pictures of the two, and he says “Not okay.” So I think this is what started to get that going; the mixing of licensed code within the training dataset of GitHub is part of the issue. And we talked about this a little bit with large language models, right? Large language models are kind of like stochastic parrots; they’re putting all of these things together from various sources that they’ve found with language, right? So when you have this weird mix of code that generates this weird mix of a block of code in your editor, it may be quite difficult to understand or trace back on the inference side what is actually coming out that is copyrighted in certain ways, and that sort of thing.

Chris Benson

As we’re into this kind of swamp of technical mixed with legal considerations on this happening, and the expectation that it will continue to happen across multiple solutions, what does governance look like for something like this? And I say governance loosely; it could be legal remedy, it could be – we have AI ethics that we like to talk about… What does the world look like when you have this swamp of a little bit of he said/she said in terms of “Was it his code, or was it not his code?” How do you resolve something like that? How do you find a framework that allows you to have confidence that you’re within the boundaries of what is considered reasonable, acceptable and legal?

Daniel Whitenack

[19:46] Yeah, I mean, I think it’s an open question. One of the things I was discussing with my team - we kind of had an open discussion about this because I was really curious on all their input… What is the actual legal recourse here? So the individual maintainer of some random tool on GitHub that’s licensed GPL, or something like that - is that person going to sue GitHub? Or more relevantly, is that person going to sue my organization because I used GitHub Copilot and output some block of their code? I think the likelihood of that happening is probably very low, because these open source maintainers - I mean, we love our open source maintainers, but generally, they don’t have a lot of capacity for extra things. They’re just kind of trying to get along, maintaining their projects and keeping up with all the issues, in their spare time, potentially.

So one of the things stressed in the article is it’s probably not the individual maintainers that are going to deal with this legally, but it’s some sort of open source advocacy group. The one that is called out in the article, which I should mention is from Elaine Atwell, and we’ll link that in our show notes… But the one that she references is the Software Freedom Conservancy (SFC), one of these open source advocacy groups.

So it’s much more likely that an advocacy group like this would sue certain companies that are using this product… But even then, they’re probably not going to go after, I wouldn’t guess, the company that has one developer using GitHub Copilot to write some random service in their organization. They would probably target large organizations, maybe with hundreds or even thousands of developers, that maybe are all using GitHub Copilot and violating a bunch of things, right?

So one element of this is, is it a reality that my team is going to get sued? My guess would be no. But that’s a separate issue to whether it’s a good idea to use this, and it’s a separate issue, like you’re talking about, as to what is the proper governance around something like this, that would prevent or help with responsible usage. Those are all kind of – they have a slightly different nuance to those questions, I feel…

Chris Benson

It’s not that far from – you know, when you think about those questions that you’re raising there, it’s very similar to other AI ethics discussions that we’ve had, and it kind of comes down to who has responsibility in these cases, and who has agency in these cases… And then there’s some place you’re going to draw a line on what is acceptable. And this is a thought that hit me right as you were talking about large language models a moment ago, is that, once again, you’re in – and this is outside my expertise, obviously… But you’re in a body of knowledge that’s being worked on, presumably is kind of public and open, but at some point things become copyrightable, and I’m sure an attorney could clarify that that knows all about that… But there’s almost a need for a guarantee that if you’re going to use the tooling and the new methods that we’re talking about, that there is an assurance of some sort, that it is going to fall within what is currently legally accepted use. And then there’s also the question of it’s what has historically been reasonable, given the new types of technology that people had never thought about - does that continue to be reasonable? We acknowledge that the legal frameworks have fallen way, way, way behind in these areas, for the most part. So how do you resolve that? I mean, there’s kind of an ethical concern, there’s a legal concern, there’s all the various licenses specifically… It’s quite a mess. What’s the path forward?

Daniel Whitenack

[23:52] Yeah. And that’s why I kind of came to the conclusion with this one - as much as this is a controversy, it’s not going to grind the AI industry to a halt… Because it’s so messy that we probably won’t understand the implications for years. That would be my guess. It’s gonna be years before we understand that. And by then – I mean, GitHub I think is launching the enterprise sort of usage of Copilot, if they haven’t yet, by the time you’re listening to this episode… So there’s gonna be a lot of people using it, and that’s going to muddy the waters even further.

Chris Benson

The lawsuits will take several years to work themselves through, and by that time, the risk associated with being sued will have caused various actors in the process to go into risk mitigation of various types, probably market-based, rather than legal, so…

Daniel Whitenack

Yeah. I think we can probably watch GitHub specifically, and Microsoft and OpenAI, the ones involved in Copilot, and sort of look at some of the ways in which they modify the service to understand how they’re being pressured maybe to change what they’re doing, based on the ongoing proceedings, and all of that, right? If they sort of change how you use Copilot, that’s maybe an indication to us that they’re being – if it’s not a new feature, maybe that’s due to some of these restrictions and implications of the legal side of things. So yeah, it’ll be an interesting one to watch.

Chris Benson

I actually got an email even from one of the members of our leadership team who I talk with occasionally about AI things, and how the industry is shaping up, and that’s what he said - “This is gonna be an interesting one to watch.” So definitely gonna be interesting, and we’ll keep you updated here on the podcast.

Daniel Whitenack

Well, Chris, the other one, which is kind of in the same – the other thing that I wanted to talk about today, which is sort of in the same theme, I guess, is one that also has to do… I think you used the term like some large body of knowledge, or something, when we were talking about GitHub and the large body of open source software knowledge that GitHub is leveraging… There’s another thing that has been quite controversial, I should say, quite interesting, I would say… Quite interesting, but also has generated a lot of controversy in the past weeks, and that’s this Galactica model, which - you can go to galactica.org to learn about it. This is a model from Meta AI. And the sort of idea behind this model is “Hey, we have all of this organized scientific information, we have a body of scientific work of papers, academic papers, which include narrative, theorems, and math formulas, and tables, and all sorts of things… We have this kind of mass of papers”, and what the team did is they released a new large language model trained on 48 million papers, textbooks, reference materials, compounds, proteins, and other sources of scientific knowledge. So that’s what I took from the galactica.org site… Which is pretty cool; I mean, the idea of it. And you can go through – there’s an Explore page on the Galactica site, although I think the site has been changing quite a bit in the recent weeks… But there’s still – at this time, there’s an Explore page on the Galactica site. And you can see – the examples they give are language models that cite.

[27:51] So the input prompt example they give is “The paper that presented a new computing block given by the formula”, and then it gives a math formula. And then the Galactica suggestion is “Attention Is All You Need”, Vaswani et al., 2017. So this is kind of a way to organize scientific knowledge and kind of learn about scientific knowledge, but also, they give this example which I think probably gets to the more controversial things, which we can talk about here in a second, “Scientific from scratch”, or I think some people might interpret that as “Science from scratch”, or something.

They give the example of translating a math formula into plain English, or finding a bug in Python code, or simplifying a math formula, or something like that. And so there’s all these prompts you can give it. “Translate the math formula into plain English”, or something like that, or into Python code, which seems to be quite useful to me… I’m not sure how that code was licensed. That’s maybe another separate issue. But that’s not the main controversy that’s come about with this. But in general, maybe just first impressions of this work, Chris… What is your thought?

Chris Benson

I think it’s a great idea. And we’ve seen these with proteins, and such… We’ve seen amazing work in these different areas for doing that. And we will continue. But it’s also – sometimes in our industry, meaning the larger artificial intelligence industry, we are so busy trying to get the next big thing out, and kind of be the thing of the moment, that sometimes I think missteps are gonna happen, and I think this is a case of a misstep, where you had a large organization that’s trying to get out there… Because honestly – yes, it’s Meta, it’s a big, amazing AI capability, but there’s other big ones too, and it doesn’t take long, as we’ve discovered over the last few years, for the next amazing thing to replace today’s amazing thing. And so sometimes maybe we need to get it right before we get it all the way out.

Daniel Whitenack

Yeah, so before I talk about the individual observations about Galactica, something just occurred to me, which I don’t know if I’ve kind of distilled in my mind to this degree… It’s that even in my own work in developing AI models and developing AI systems, I think one of the principles that I’ve learned is the communication and expectations you set when you do an initial release of an AI system or an AI model really, really drive people’s sort of initial perception and their ability to adopt it. What I mean with this – or I can give an example from my industry, right? We do some language translation, and if I come to a translation team and say, “Hey, it’s awesome. I’ve just built this great machine translation system. You’re no longer going to have to do translation systems. Just make a couple of edits here and there, and you’re good to go.” That’s immediately going to – so what the translation team is going to look for in that system is all of the ways that it doesn’t work right, and doesn’t fulfill the expectations that I’ve given to them, right? Whereas if I come to that team and I say, “Hey, I really appreciate what you’re doing, and I understand that you have pain points, inefficiency around your process… I think that maybe this model, or this system that we’ve created could help you. Could you all help us understand how this system can best be used in your process, and we can kind of give you some suggestions, some prompts for getting started?”
[32:01] Then what they’re looking for is not so much why this is bad, and is taking over our jobs, or encroaching on what we’re doing, or is really dangerous, but their thought process is “These people are wanting us to tell them how we can use this.” And generally, in those cases, I’ve found people do find the positive things too, right? They find like “Hey, I didn’t expect this to work great in this situation, but it actually produced pretty good output. Can you do more of that? But in these other cases it did really bad, so don’t do that anymore.”

So you get more useful feedback on the initial release of something if you kind of approach the public or your internal teams or whoever your stakeholders are, and ask the community to help you understand the behavior and utility of what you’re releasing, versus telling them the utility, and telling them “This is going to solve this problem”, when in actuality they find out that it doesn’t.

Chris Benson

Yeah. The way that you approach that has a big impact on trust for the system. And as we’ve seen over and over through the years here, AI is one of those technologies that people have to develop a sense of trust in in terms of what’s possible, but then also a validation of trust for any given system. And the way that you’re outlining positioning that approach makes a big difference on how people are going to engage from a trust perspective as well. They’ll give it a chance if you position it the way you suggested in the second version there, versus the first.

Daniel Whitenack

It’s difficult, and it’s not always that clear-cut when you’re in the process.

Chris Benson

Yeah. Yeah.

Daniel Whitenack

So for those at Meta, if you’re listening, I totally also sympathize with you. It’s very hard to figure out how to communicate these things well, and release these things well. I think likely this Galactica system is amazingly innovative, and a great achievement, but it also has behavior that could be either non-ideal, or worse, harmful in certain cases. And so it’s part of the community, but also the creator’s responsibility to anticipate some of those…

I guess I can give an example… I also found a fun article – fun or disturbing, I don’t know how I should phrase it… An article from Janelle Shane. “Galactica, the AI knowledge base that makes stuff up.” We’ll link that in our show notes…

Chris Benson

That’s pretty harsh there, isn’t it?

Daniel Whitenack

Yeah… Thanks, Janelle, for writing this. Basically, she starts with the conclusion. “I’ve tried Galactica. It’s not helpful.” So the prompts that she gave - one prompt is “The number of giraffes that have flown on a lunar mission is currently”, and then Galactica responds with “The number of giraffes that have flown on a lunar mission is currently 28, including 15 on the Moon and 13 on Mars. Of those 28, 17 have flown on the Moon, and 11 have flown on the Mars. The first lunar mission was in 1968, when the Luna 13 mission sent a giraffe to the Moon. In 1970, the Soviet Lunar Probe Luna 15 sent a giraffe to the moon.”

Chris Benson

[laughs] Sorry.

Daniel Whitenack

It’s pretty good, right?

Chris Benson

It’s pretty good stuff right there.

Daniel Whitenack

Yeah, I mean, that’s pretty good. I think I probably don’t need to give that many other examples to illustrate…

Chris Benson

[35:56] No, I think you highlighted it quite well. [laughter] The funny thing – I’m doing the same thing, where I’m just looking at some of the various articles… And Ars Technica says “New Meta AI demo writes racist and inaccurate scientific literature.” But people get the idea. It’s a trust issue, and an accuracy issue, one related to the other… Despite the very hard work, I’m sure, of that Meta team, no one’s going to trust that model. If they fix it and come back out with it, all the focus is going to be on “Is this legit, the results that I’m getting out of it?” Which is a shame, when you think about it.

Daniel Whitenack

Yeah. And I think about different approaches here that people have taken… I think on the one side, OpenAI and some of the models they’ve released in a very controlled way via an API, they have attempted to address part of this release problem where they understand there could be even intentional misuse of this around misinformation, right? Or harmful usage. They try to anticipate that and create an API with controls around that, etc.

The other approach, which I think is kind of more open source, or open approach - something like Stability, where they released the model, Stable Diffusion, under an open license. So it’s out in the public, but they very much within the licensing, the OpenRAIL license, first off, tried to license and include licensing around restricted use that they could envision using the model, but put it out in the public with the hope that the community can help put some necessary guardrails around usage, and provide feedback on how the model can be used, and that sort of thing.

The third approach - I think maybe another approach would be to say, “Well, here’s our great model. It can solve this problem”, and you kind of ignore the fact that maybe it doesn’t always solve that problem, and maybe it also has harmful use… So I think it’s not necessarily any one of these is always right, or always wrong, probably… But it is worth considering these release options and what their implications are.

It is. In fairness, I remember - on that particular release from OpenAI that you were talking about… I remember in our conversation on the show, we were a little bit critical, kind of going “They’re kind of holding back”, and all that… And I don’t remember where we ended up on that, because things have evolved, but I do remember having the discussion on whether that’s appropriate… And then we have something like this today, and it makes it look a lot more reasonable in retrospect… So it really depends on the moment that you’re in, and what’s just happened in terms of that perspective there.

I also remember – in terms of licenses, trying to anticipate specifics, I remember thinking, whoever wrote a particular clause may not have had great insight into some of the use cases in that clause as well. It’s a hard nut to crack, trying to come up with the right solution here.

Daniel Whitenack

For sure. And granted, I think some of the response from Meta individuals - not virtual individuals; I mean individuals at Meta… Well, this sort of prompt, the giraffe prompt - I think one of the phrases that they used to describe that sort of prompt was “Casually misusing the model”, which is sort of blaming the people using it. I think, to be fair, that prompt is trying to draw something out of the model, which the creators of the model, they would explicitly say, “Well, this is an adversarial prompt”, right? “You already know there’s no giraffes that have flown on a lunar mission, right?” So there’s that perspective. I think there is an element of truth in that. But I think generally, the community have said, “Well, how close is the sort of goofy, casually misusing the model - how close is that to the gray area of misinformation, and people intentionally using it to create misinformation?” …especially around science, or important things like health, and other things like that.

[40:35] Yeah. It’s kind of funny - starting this conversation, specifically about this Meta instance, we’re kind of laughing about it, and stuff, but I think we’re gonna see so many of these instances in the years ahead… And you made a point, that I think we sometimes need to respond with a bit of empathy for the data scientists and AI engineers that are trying to create these… Because they’re trying to do some pretty cutting-edge stuff, and mistakes are gonna be made. And in the end, my understanding is nobody was hurt by this.

Daniel Whitenack

Yeah. We need to both be critical and empathetic. My previous boss would say “We need to be tenacious and gracious.” Both of those things aren’t mutually exclusive. So yeah, that’s a good point. As we wrap up here, I do want to share a new learning resource that I kind of came across in the past couple of weeks… I don’t know if you remember, Chris, at one point I think we shared a learning resource from Christoph Molnar, his book on interpretable machine learning, which is really cool… Well, he has a new book called “Modeling Mindsets: The many cultures of learning from data.” And my understanding is that this book goes into various sort of approaches to modeling, whether you think about Bayesian statistics, or other approaches, and talks about kind of what can we learn from these different modeling mindsets that could benefit us in our own sort of modeling work… Which is, I think, quite an interesting proposition.

His subtitle is “Becoming a better data scientist by understanding different modeling mindsets.” So understanding these diverse modeling mindsets can help us, whatever modeling problem or solution you’re trying to come up with. So Modeling Mindsets…

I think it’s a good time for a book like that as well, when you think about it, in terms of benefiting from the different ways that you can approach a problem… Because I’ve recently seen some engineers very much stuck in a particular mindset, trying to solve a problem… So that one hits close to home for me.

Daniel Whitenack

A particular lane, yeah. There are other lanes.

Yes, indeed.

Daniel Whitenack

Yeah. Cool. Well, thanks for the discussion today, Chris. It was a fun one leading up to Thanksgiving. I hope you have a great holiday with your family, and I look forward to chatting next week.

You too, Daniel. Have a good holiday. Talk to you later.

