Legal consequences of generated content with Damien Riehl, VP of Litigation Workflow & Analytics Content at vLex (Practical AI #232)

As a technologist, coder, and lawyer, few people are better equipped to discuss the legal and practical consequences of generative AI than Damien Riehl. He demonstrated this a couple years ago by generating, writing to disk, and then releasing every possible musical melody. Damien joins us to answer our many questions about generated content, copyright, dataset licensing/usage, and the future of knowledge work.

43 minutes
Recorded Jul 12, 2023
Published Jul 18, 2023

Featuring

Damien Riehl – Mastodon, Twitter, LinkedIn
Chris Benson – Twitter, GitHub, LinkedIn, Website
Daniel Whitenack – Twitter, GitHub, Website

Sponsors

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com

Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Typesense – Lightning fast, globally distributed Search-as-a-Service that runs in memory. You literally can’t get any faster!

Chapters

Chapter Number	Chapter Start Time	Chapter Title	Chapter Duration
1	00:07	Welcome to Practical AI	00:37
2	00:43	Damien Riehl	00:53
3	01:36	Regulations on AI	02:16
4	03:52	All the music	03:28
5	07:21	Human vs AI output	04:09
6	11:30	Working alongside AI	03:21
7	14:51	Finding the gray area	05:56
8	20:47	AI is not copyrightable	03:33
9	24:19	Where is the line drawn?	03:22
10	27:42	Snake eating its own tail	02:26
11	30:08	Can I use these models?	01:08
12	31:15	Lessening value of IP	02:50
13	34:05	No more patents	01:49
14	35:55	4 Worlds	02:33
15	38:27	Scarcity or abundance?	01:09
16	39:36	Outrunning the tsunami	01:49
17	41:26	Goodbye	00:31
18	42:05	Outro	00:32

Transcript

Daniel Whitenack

Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist and founder of Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

Chris Benson

Doing well, doing well. I’m really excited about today, because there’s so many questions you and I have brought up in the show without the ability to answer, and I know we might get some answers today.

Daniel Whitenack

Yes. And actually this one - not only will it be super-practical and interesting, but it’s also a tip from one of our listeners who suggested this guest. So we’re really excited to have with us Damien Riehl, who is a lawyer and technologist with experience in litigation and digital forensics and software development. So welcome, Damien.

Damien Riehl

Thank you so much for having me. I’m thrilled to be here.

Daniel Whitenack

Yeah, I feel very selfish this episode, because I just have like a million sort of like legal implications, copyright questions related to generative AI, large language models, all sorts of things… But before we get into some of those specifics, I know over the course of this show we have commented on various things that have come about, like GDPR, and then California data privacy stuff, and now we have like the EU AI Act, and all of this sort of regulation stuff… And then you’ve got other things, on the other side, on the litigation front, where companies are getting sued for code generation based on maybe questionable training of models and other things… So maybe before we get in, from someone who is an expert in this area, and thinking about it all the time, how do you view where we’re at in relation to AI technology, and regulation, and kind of the legal side of things? How are those things catching up to one another, or outpacing one another, and where are we at now, as opposed to like maybe a year ago? What’s changed?

Damien Riehl

Sure. And maybe before I answer that, briefly - I’ve litigated for about 20 years, so I was a litigator for about 20 years. I did tech litigation, so I’m coming at this from a perspective as a lawyer, but I’ve also been a coder since 1985. So I have the law plus tech background. So for your listeners’ benefit, I’m not just a stuffed shirt that doesn’t know what he’s talking about. I can walk the tech walk, and talk the legal talk, if you will.

So really, as far as regulation, having litigated since 2002, I’ve seen ways that the EU and the United States have tried to regulate technology. And of course, they’ve had various degrees of failure, I would say, largely because the three of us know exactly how lots of technology works, but sadly, the Congress people and the regulators do not. And so it’s really – the law is by nature slow, and trying to get up to speed on a fast-moving area such as AI is very difficult. So I would say that if past is prologue, I don’t anticipate much good things coming out of regulation of AI in the near future.

Daniel Whitenack

Some of these things are related to like this generative wave of AI, where people are generating a lot of content with AI… I know that also you have a background - you mentioned your sort of coding background, but you also have a background with generative technologies, maybe some with large language models and other things… But I know you have a very interesting story of some generative things that you did with music. Could you describe some of that?

Damien Riehl

Yeah, absolutely. So I have both my current state, with my job that pays me money, that is with vLex, where I’m doing lots with large language models right now. We have a billion legal documents that we’re running large language models across, and doing embeddings to be able to do outputs, for example, a legal research memorandum, and eventually be able to provide motions, briefs, pleadings, that sort of thing. So that’s my job that pays me money. The job that doesn’t pay me money at all, which you referenced, is my “All the Music” project. This is a project that I started with my friend, Noah Rubin, who - one thing your readers and listeners might find interesting is that around 2018 I did cybersecurity. The biggest thing I did was that Facebook hired me and my company to investigate Cambridge Analytica. So I spent a year of my life on Facebook’s campus, with Facebook’s data scientists and my former FBI, CIA, NSA people that worked with me, to figure out how bad guys use Facebook data. I did that for about 50-some weeks in a row. The stuff we would do on Monday would make the New York Times and The Times of London by Friday.

So at the end of a 14-hour day on the Facebook campus, I retreated to my hotel with Noah Rubin, my friend; we were in the lounge, and over a beer I said, “You know, Noah, how we can brute-force passwords by going aaa, aab, aac?” I said, “What if we could do that with music, where we would go do-do-do, do-do-re, do-do-mi, do-do-fa, until we mathematically exhausted every melody that’s ever been and every melody that ever can be?” He said, “F-yeah”, but he didn’t say “F-yeah.” So within a few hours, he had a prototype where he cranked out 3,000 melodies. To date, we’ve now cranked out 471 billion melodies (with a b), mathematically exhausting every melody that’s ever been, and ever can be. We’ve written all those to disk. Once they’re written to disk, they’re copyrighted, automatically. So we’ve copyrighted 471 billion melodies, and then we placed everything in the public domain, to be able to protect “You stole my melody” lawsuit defendants.
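The brute-force idea described here, counting through melodies the way a password cracker counts through strings, can be sketched in a few lines of Python. This is only an illustration: the seven pitch names and the three-note melody length are assumptions made for the example, not the parameters of the actual “All the Music” project, which are not stated in the episode.

```python
from itertools import islice, product

# Assumption: a seven-syllable solfege alphabet and very short melodies.
# The real project's pitch set and melody length are not given in the episode.
PITCHES = ("do", "re", "mi", "fa", "sol", "la", "ti")


def all_melodies(pitches, length):
    """Yield every melody of the given length, in odometer order
    (the last note changes fastest, like aaa, aab, aac)."""
    return product(pitches, repeat=length)


def melody_count(pitches, length):
    """Size of the search space: one independent choice per note."""
    return len(pitches) ** length


if __name__ == "__main__":
    # Show the first few melodies and the total number of them.
    for melody in islice(all_melodies(PITCHES, 3), 4):
        print("-".join(melody))
    print(melody_count(PITCHES, 3), "melodies of length 3")
```

The generator never holds the whole set in memory, so the same loop works for longer melodies; the count grows exponentially with the melody length, which is why the totals reach the billions so quickly.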

[06:06] And so the idea is that before my talk in 2019, every defendant in one of those lawsuits has lost. After my talk, which has been seen 2 million times, every defendant has used largely my arguments, and has won. And my arguments are largely that maybe if a machine cranks this thing out at 300,000 melodies per second, maybe we shouldn’t give one person a monopoly, which is what a copyright is - a monopoly of life of the author plus 70 years for what the machine cranked out in a millisecond. So that’s largely my “All the Music” project… Which in a sense is generative AI, it’s a brute-force generative AI, but it’s generative AI. But the real question is, if the output of that is copyrightable, I essentially carpet-bombed the entirety of every melody that’s ever been, and if I were a megalomaniac, I would sue everybody, right? But I’m not. I put them all in the public domain. But if machine-generated works are copyrightable, these are the bad things that can happen.

Chris Benson

I love that story, I just want to say that.

Damien Riehl

It’s pretty fun. And it’s because it’s been seen so many times now, 2 million, I’ve been able to meet with some good friends now… For example, the former chief economist of Spotify, I’m now friends with. And the guy who was responsible for the first commercial mp3 to be downloaded, Jim Griffin - I’m friends with him. So anyway, so it’s opened a lot of doors for me.

Daniel Whitenack

That’s awesome. And one thing that comes to my mind, as you’re talking about that is – because I’m also a musician, and I think a lot of people would think of “Oh, these sort of melodies, or chord progressions”, or however you want to frame it - there’s a very human element to that, that involves creativity. And I think this would maybe be extended to - if we think about knowledge work more generally, whether that’s like you being a lawyer and writing briefs, or us being programmers and writing code, or marketers being marketers and writing copy - now these generative models can generate a lot of those things in a very compelling, and even I think people would perceive it as a very creative way, however you think about that creativity and coherence.

So in your own work, maybe it’s on the lawyer side, or the coding side, how is that project, and maybe some of the work you do day to day with large language models - how is that shaping how you think about this sort of knowledge work and the output of humans versus the output of models?

Damien Riehl

I would say two aspects to this. And I think that the word creative is ambiguous, as all words are; most words are. When I generated 471 billion melodies, I was creating those melodies, but it was by no means creative. So in that way, do-do-do, do-do-re, do-do-mi, do-do-fa - those are just mathematically exhausting everything that’s ever been. So that is a very simple version of what large language models do. Just saying “What is the statistically next sentence?”

And so this really gets to the heart of what we think human creativity is. Is it creative as in brute-forcing, or is it truly lightning in a bottle creativity that we want to protect with intellectual property laws and other things like that? I think what we’re learning from my project, and from the large language models, is that maybe human creativity ain’t as special as we think it is.

As an example, on a Tuesday, a jury found that Katy Perry had violated copyright of a melody that sounded like this: [09:29] That was on a Tuesday. On my talk on a Saturday, I said that that particular melody shows up in my dataset 8128 times. So Katy Perry got dinged for $2.8 million over something that I had thousands of times just through brute force. So was that melody creative? No, it was just a brute force. After my talk was made public, the judge actually went back and reversed the jury verdict, saying that melody was so unoriginal as to be uncopyrightable. Essentially what I was arguing in my TED talk.

[10:02] I think going to the heart of what is human creativity, it’s good that we have large language models and projects like mine to be able to go to the heart of “There are some things that we should protect, and there are some things that are just unprotectable, because they’re unoriginal.”

Chris Benson

You’re putting an obstacle against what was otherwise maybe the weaponization of IP. Is that a fair way of putting that?

Damien Riehl

100%. We’re using it as a shield, not a sword.

Chris Benson

Right, which I like. Because having seen a lot of – for a non-attorney having seen a lot of IP concerns out there in business, sometimes you’re just kind of like “There’s not much there.” So…

Damien Riehl

Absolutely. I did that with the copyright side. My friend, Mike Bommarito, who was one of the guys who, you may have heard, beat the bar exam; they used GPT-4 to beat 90% of the humans on the bar exam. One of those was Mike Bommarito.

Mike approached me and said, “I love what you did with copyrights. Wouldn’t it be great to do that for patents?” And so right now we’re doing – I was doing “All the Music” project, we’re now doing “All the Patents” project. What that project is - it’s going to be taking all the patents that have ever been filed, taking each of the claims for each of those patents, putting those claims together in vector space, and clustering them that way, and then generating every possible combination of all those claims, in all those existing patents. So if anyone in the future tries to recombine an existing claim into a new thing, they can point to our thing as prior art, to be able to say “No, no, no. Bommarito, [unintelligible 00:11:21.13] and maybe Rubin, they already did that in 2023. You can’t do that again, because they did that as prior art.”
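The “generate every possible combination of claims” step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the claim IDs and claim texts below are made up, and the embedding and clustering in vector space that the project is described as using are skipped entirely; only the combinatorial enumeration is shown.

```python
from itertools import combinations

# Hypothetical claim snippets, invented purely for illustration.
CLAIMS = {
    "A1": "a sensor mounted on a housing",
    "B7": "a wireless transmitter",
    "C3": "a rechargeable battery",
}


def claim_combinations(claims, max_size):
    """Yield every combination of 2..max_size claims as (ids, combined text)."""
    ids = sorted(claims)
    for size in range(2, max_size + 1):
        for combo in combinations(ids, size):
            yield combo, "; ".join(claims[c] for c in combo)


if __name__ == "__main__":
    for ids, text in claim_combinations(CLAIMS, 3):
        print("+".join(ids), "->", text)
```

The number of combinations grows very quickly with the number of claims, which is presumably one reason to cluster similar claims first rather than combining everything with everything.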

Daniel Whitenack

As both a coder and a lawyer, who I’m assuming some of these things you’re generating are generated… But I’m also assuming in your day to day work and in your day to day coding, there are portions of what you’re doing still that are not completely generated, or at least that you’re editing heavily… How has this type of work influenced how you think about your own job moving into the future, working alongside these models, or at least in an environment where these models exist?

Damien Riehl

So I will say - and anyone who works with me will agree - that I am a coder, but I’m a crappy coder. I’m probably one of the worst coders you’re ever going to meet. So a lot of my work with large language models is in the textual area, rather than the code area. But in the textual area - I’ll give you an anecdote that answers your question. I was reached out to by the editor of a large legal magazine to say, “Hey, Damien, I want you to write the cover story on GPT.” I said, “How long do you want it to be?” He said about 17 double-spaced pages. And I was like “Man, I don’t have time”, because my rule of thumb is one hour per double spaced page, so that’s 17 hours that I just don’t have time to do.

But then I realized, “Oh, wait, the topic is GPT.” So what I did is I created an outline of headings and subheadings, that’s about two pages worth, and I said to GPT “For each of the bullet points, give me four sentences.” Essentially, a paragraph for each of the bullet points. And it sh*t out – that was my one for the day – crapped out the 15 pages’ worth. And then I spent the next three hours editing, moving, adding, working with the text; not accepting the 15 pages outright, but working with the text and regenerating, and then I got it out the door three hours later, and the editor is like “Oh, this is perfect. I don’t need any edits, let’s get it out the door.” So that took a 17-hour project down to three hours.

That’s thing number one, is that this is not just accepting the machine output as is, but it’s really us using the output as an assistant, much like Copilot on GitHub is using it as an assistant. These are essentially pair-coding, if you will, co-authoring with the machine.

I was doing a talk with the US Copyright Office Assistant General Counsel, and he was talking about the regulations that they’re putting out, saying that if machine-generated, therefore uncopyrightable; if human-generated, therefore copyrightable. And if machine-generated, you have to be able to disclose what aspects of the thing is machine-generated.

[13:48] So if you think about if I were to file a copyright registration with my article that I just drafted, what extent was that machine-generated, and what extent was that human-generated? Because I spent three hours adding, editing; if the machine-generated in a sentence three of the words that were unmolested by me, the other 20 words in the sentence were actually mine. Do I have to disclose what three words were machine-generated, versus the ones that I edited? And that’s with text. How would I do that with music? If I said to the machine, “Hey, generate a melody, and generate a chord structure, and generate lyrics.” And then I spent from 1 AM to 3 AM rearranging all those things, and then getting it out the door, if the copyright office said “What aspects of that was human-generated and what’s machine-generated?” I would honestly say, “I have no friggin’ idea, because there is no track changes with my DAW that I make my music on. There’s no track changes, I didn’t track changes on my lyrics that I messed around.”

So I think this idea of trying to bifurcate what is machine-created and what is human-created is a fool’s errand, and we’re really going to have to reckon with that.

Daniel Whitenack

Well, Damien, I have some very selfish questions that I’ve been pondering over in my own life as I’ve encountered them… And hopefully this won’t seem like popcorn questions, because I think it’s very related to what you’re talking about… But I think practical developers are hitting these snags as they’re developing apps with this technology, that are entering a sort of gray zone, so I’d love to get your thoughts on a few of these.

One example that I can think of is, you know, a lot of people are building chat interfaces; very popular now to say, “Oh, build a chat interface over a website”, or “Build a chat interface over documents”, or “Build a chat interface over data”, something like that. People are doing this very frequently, and it’s very useful. My question is - that chat interface, or those messages, are generated content, and that’s what the user is seeing. But I see this huge gray area where - let’s say that I want to chat with Harry Potter, right? I take the book of Harry Potter and I put it in my vector database, and someone asks a question of Harry Potter, and I go retrieve the content… I’m just injecting that into a prompt, and I’m sending the prompt with I guess the book content into a model, the model’s outputting some generated answer, and I’m sending that to the user.

Now, I’m assuming - I’m no lawyer, but I’m assuming I can’t sell a new copy of Harry Potter unless I have certain rights and agreements in place… But what if I put this interface up on the internet, and I start selling access to it? So I guess, with that kind of very real-world scenario, what sorts of elements do I need to consider there, and what’s known to have a good answer, what’s the gray area, what’s kind of maybe being litigated right now?
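The pipeline described in this question (store passages, retrieve the ones closest to the user’s question, inject them into a prompt) can be sketched roughly as follows. Everything here is an assumption for illustration: the passages are made-up placeholders rather than text from any real book, the “embedding” is a toy bag-of-words vector rather than a learned model, and the call to an actual language model is left out.

```python
import math
import re
from collections import Counter

# Made-up placeholder passages; a real app would load chunks of its own documents.
PASSAGES = [
    "The castle library closes at midnight.",
    "A train to school leaves from a hidden platform.",
    "A potion must simmer for an hour before stirring.",
]


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = embed(question)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(question, passages):
    """Inject the retrieved text into the prompt that would be sent to a model."""
    context = "\n".join(retrieve(question, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("When does the library close?", PASSAGES))
```

Note that the retrieved passage is copied verbatim into the prompt, which is exactly the step the question is asking about: the source text travels through the system even though the user only sees generated output.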

Damien Riehl

I’m going to answer your question, and it’s going to be a fun walk. So take a walk with me.

Daniel Whitenack

Okay, perfect.

Damien Riehl

So this walk - it’s going to begin with the Google Books project. So you might remember that Google Books ingested every book that ever existed, perhaps also including the Harry Potter book. A bunch of publishers said, “Hey, no fair, because you can’t ingest all these things, because these things are copyrighted. Every one of these books is copyrighted.” They sued, and then the district court and the appellate court, the Second Circuit Court of Appeals, said “Yes, all those things are copyrighted, but Google’s use of that is actually fair use.” And the particular type of fair use is called transformative use. That the use that Google was using was transformative to what the original purpose of the book was. The purpose of a book is to read it, enjoy it etc. Google’s purpose was to index it, to be able to create a word index, to be able to then search all of the books, and to be able to provide the end user with a snippet, say maybe a page or two of that.

So because I as a user couldn’t use Google Books to be able to essentially replicate the book process, but instead I’m using it to search, that is a transformative use, therefore a fair use, that is not an infringement of copyright. So that was back in the day.

[17:55] Now think about how large language models work. So a large language model, if you have the input, it is ingesting, say, the entirety of Harry Potter, but really what it’s doing is placing those in vector space, right? It’s saying that these words are similar to those words in the vector space. And once that happens, largely it jettisons the thing.

In copyright law, there is the “idea-expression dichotomy.” Ideas are uncopyrightable. So if I have the idea of a man in a black hat, fighting a man with a white hat, over a woman who is tied to a railroad track, those are ideas that are uncopyrightable. You’ve seen lots of movies like that. But the expression of the idea, any particular movie that has that in there, that is copyrightable. So ideas - uncopyrightable, expressions of the ideas are copyrightable. So if you apply that to what’s happening when the large language model ingests all the books, it’s essentially putting all the words into vector space. So it’s saying, “This is a Bob Dylan-ism”, or “This is an Ernest Hemingway-ism”, or “This is a Harry Potter-ism.” Each of those are ideas, not expressions of ideas.

And so really, it’s taking the expressions and effectively jettisoning those in favor of the ideas. So that’s on the input side. And then on the output side, one could imagine that it’s taking those ideas - Bob Dylan-ism, Ernest Hemingway-ism - and then it’s outputting them in a new expression. And if you believe the copyright office, machine-generated output is similarly uncopyrightable. So we’re kind of faced with an idea that inputs, ideas - uncopyrightable; outputs, the expressions of ideas created by machines - similarly, uncopyrightable. To your particular point, this has not been tested in court. So a judge who, by the way, may not know what he/she is talking about might rule against what I’m about to say right now, but at least I would make a really good argument, that the ingestion of the thing is extracting the ideas from the book. And that is a transformative use. Because if you think about Google Books, they were printing three pages or so verbatim, of these books. That’s way more bulk than just – think about the vector space. It’s not reproducing any expression. It’s really taking the ideas. So if Google Books is permissible, almost certainly, the large language models should also be. And that’s really what is being argued right now in the cases that are happening with the GitHub Copilot case happening in the West Coast, and then the Stable Diffusion case in Delaware, and there’s others like it, where if I were the lawyers, I would be arguing exactly what I just argued right now.

So to your specific use case, you are kind of interrogating this copyrighted work, but I would make the argument if I were representing you that this would be a transformative use. And just like you as a human would have read the book, and you as a human could provide output based on that book, in the same way a machine should be able to read that book and provide output that is just taking the ideas of the thing, not necessarily the expressions of the ideas.

Chris Benson

So with that new expression coming out that you’re describing, and assuming that the copyright office view stands, it’s not copyrightable - that’s a massively different way of producing content from before till now, and in the future. What does that mean for business and in the world at large, considering – I mean, that’s a major change in how everything works. How do you see the future rolling out if that were to stand?

Damien Riehl

First, I think it should stand, because otherwise my “All the Music” project would essentially carpet bomb all the music, and we would just have machines creating new expressions that would essentially make human expression obsolete. So number one, I think it should stand, because if not, we are in a world of hurt with the copyright.

Number two - you’re right, never before in human history have we had a machine that creates new things. We’ve had the printing press, where we take my ideas and then I can replicate it a whole bunch of times. We had the digital revolution in the ’70s, ’80s, ’90s, 2000s, where now we can replicate human stuff. But never before have we had a way that the machine itself is making new expressions of ideas. And so as a result of that, some of the smartest people I know that are thinking about this said that the web as it stands is probably – large language models are gonna stop right around November of 2022, because anything after that, you’re going to have a whole bunch of machine-generated content that is going to be essentially large language model created things. If you know about the tech, and I assume your audience does, because it’s statistically likely, it is smooth; humans are jagged in the way that they write text, and that’s what – ChatGPT and others see the jaggedness of humanness. Machine-generated content is smooth, not jagged. So that’s what ChatGPT says.

Chris Benson

[22:29] Can you define that real quick, what jagged versus smooth means in this context?

Damien Riehl

Sure, yeah. So jagged means random; smooth means statistically almost deterministic. So the idea is that we as humans say random things, and we put things in a way that maybe hasn’t been said before, whereas a machine, at least an LLM machine, is going to be able to say “What’s the most statistically likely word?” and therefore that is smoother than our jagged randomness.
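The smooth-versus-jagged distinction can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not how any production model works: it uses a tiny made-up corpus and simple word-pair (bigram) counts, then contrasts always picking the most likely next word with sampling a next word at random in proportion to how often it was seen.

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus; real models learn from vastly more text.
CORPUS = "the cat sat on the mat . the cat ran . the dog sat on the rug .".split()


def bigram_counts(tokens):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def smooth_next(counts, word):
    """Always pick the single most frequent follower (deterministic)."""
    return counts[word].most_common(1)[0][0]


def jagged_next(counts, word, rng):
    """Sample a follower in proportion to its frequency (varies between calls)."""
    followers = counts[word]
    return rng.choices(list(followers), weights=list(followers.values()))[0]


if __name__ == "__main__":
    counts = bigram_counts(CORPUS)
    rng = random.Random()
    print("smooth:", [smooth_next(counts, "the") for _ in range(5)])
    print("jagged:", [jagged_next(counts, "the", rng) for _ in range(5)])
```

The “smooth” picker returns the same word every time, while the “jagged” picker occasionally surfaces the rarer continuations; text produced only by the first strategy would never contain those rarer choices at all.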

The idea is that as the machines are creating the smooth, deterministic, statistically likely next word, that is essentially - as new large language models ingest that smooth text, it’s going to further smooth the corpus. And we’re going to miss all of the human-created jaggedness that is going in there. So some of the smartest people I know are saying that maybe the web as it stood in November of ‘22 - maybe that’s the last time we’re going to have a lot of human-created stuff that is truly jagged, because here on out, we’re just going to have machine-created stuff that is smooth.

And one last thing I would add is that one of the last bastions of human-created jaggedness that we have is the courts. Because it turns out that – you know, people have talked about us being in a post-truth era, and post-fact era. There is a person that’s literally called a Fact Finder, and his name is a judge. They spend years trying to find facts. And then they write things called judicial opinions, that have found facts that have been battled over years in the courts. So one of the last places where we can find this jagged, almost certain to be human written thing, that is actually based in fact, in our post fact world, maybe is judicial opinions. And my employer, vLex, has about a billion of those across the world. So maybe as we think about what are new corpuses that the large language models can train on, that are truly jagged, that are full of factual things and not bull**** that’s on the internet, that is unvalidated… This is truly validated, human-created content that is high-quality, that might be a source to be able to ingest.

Daniel Whitenack

Maybe this gets a little bit back to your article example, where you interacted with the GPT output to write an article, and at a certain point it kind of morphs into its own thing, what portion of it is machine-generated, what portion isn’t. I know this is also happening just from seeing things like people are generating, for example, adult coloring books using AI models, and posting those in an almost automated way to Amazon, and then like someone can literally order a book. I think of other examples maybe where - hey, this book was written a long time ago, and so the wording is really difficult. What if I used a large language model to rephrase it in modern English, and then I just post that and start selling it? So how long will this be debated in terms of like the copyright around this, and what should be on people’s minds as they’re creating this kind of content that they actually want to commercialize? Maybe that’s a more practical question.

Damien Riehl

Sure. If I were to create this machine-created coloring book, for example, under the US Copyright Office’s position today, that entirely machine-created thing is uncopyrightable. This really goes to the heart of what is copyright in the first place. And all copyright is is a monopoly. It is a government-sanctioned monopoly, giving you the author a monopoly of life of the author plus 70 years on the thing you created. But as an exchange for that monopoly, the government says “This has to be original. It has to be your creative work that does this. And if it is truly original, and it is truly creative, we will give you that monopoly of 70 years; life of the author plus 70 years.”

[25:59] So really, the question is, is there anything copyrightable in the machine-generated work? Well, probably not, because there was no human creativity in that thing. So that’s thing number one. But then let’s look at another scenario - what if somebody else did a human-created coloring book that was identical to what the machine had done? Does that turn it from unoriginal, therefore uncopyrightable, with a machine-created one [unintelligible 00:26:21.02] if a human does it, it is copyrightable, even though they’re identical?

Chris Benson

Yeah, essentially, because the machine wasn’t copyrightable, and you’re recreating it, even though there might be no creativity on the human’s part, because they’re literally looking at the output of the uncopyrightable machine output, but they can steal the idea without the creativity involved, and then copyright it. Am I understanding correctly?

Damien Riehl

That’s right. And we’ve dealt with this, the courts have dealt with this for a few hundred years. And you can imagine, Shakespeare is in the public domain; it’s not been in copyright for hundreds of years. You can build atop Shakespeare, say, with West Side Story, that was based on Romeo and Juliet. The writers of West Side Story don’t get copyright in the underlying story of Romeo and Juliet, but they do get copyright in whatever they put atop Romeo and Juliet. So the fact that it’s New York City, and all these things. So they just get what’s called thin copyright on top of the public domain thing.

So you can imagine, going back to our coloring book example, if someone then copies that and then adds a little human touch on that, they don’t get the underlying copyright thing, because that’s public domain; they only get what they’ve added atop the machine-created thing. And really, the question is how much can you really add atop that is really creative enough to make it worthwhile? And I would say, if it’s just another line here or there, that’s not sufficiently original or creative to add copyrightability.

Daniel Whitenack

I do want to get back to another couple themes, but maybe one more selfish question, which is less related to, I guess, the inputs and outputs; it would be more related to the models themselves, what they’re trained on, and how they’re released. Of course, we’re seeing a lot of different approaches to how models are being released, in the sense that “Well, is a model code? Is it data? Do I use Creative Commons, or do I use Apache 2?” Also, the data that was used in the training - maybe that’s a mix of copyrightable material, or maybe it’s not even known. Maybe a model shows up on Hugging Face, and I don’t know what the mix of the dataset was that was used in training. As you’ve been working with these large language models and advising around this, and thinking about these concepts, how do you see that side of, I guess, training data, fine-tuning data, model release? What’s on your mind as you look forward to this next season, which I assume will continue? We just had an episode with – I think it was titled “The Cambrian explosion of models.” There’s so many being released. This will continue. How do you see that side of things developing over the next season that we’re entering?

I think that what you’re asking really about is the provenance of everything that comes downstream. That is, what is the provenance of the input, and what is the provenance of essentially being able to manipulate that input to create a thing that’s called a model. Of course, the output of the model could train new inputs to be able to go into new models, right?

Daniel Whitenack

Yes, a cyclical sort of thing.

That’s right. It’s like a snake eating its own tail. And so within the law and criminal law, there’s a thing called the fruit of the poisonous tree. The first act might be innocuous, but then it leads to a chain reaction, a bunch of dominoes that leads to the end. So this is – the fruit of the poisonous tree is a legal concept, that you can imagine is similar for the questions you asked. If there is input data that maybe has questionable licensing… So for example, LLaMA, right? It was released as open source, but only for academic purposes. So you can imagine if someone were to create a model for commercial purposes that is based on one ostensibly licensed only for academic purposes; that is maybe a fruit of the poisonous tree question, to be able to say, “Is that model now tainted, because it was ingested in opposition to the license?”
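The provenance chain described here can be pictured as a lineage graph, where each dataset or model points back to whatever it was derived from. The following is only a minimal sketch: the `Artifact` class, the license labels, and the example names are all hypothetical, and finding a restricted ancestor only flags something for review; it does not settle the legal question raised above.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A dataset or model in a lineage graph (hypothetical structure)."""
    name: str
    license_terms: str  # free-text label chosen for this sketch
    parents: list = field(default_factory=list)  # upstream Artifact objects


def restricted_ancestors(artifact, restricted=("research-only",)):
    """Return the names of the artifact and any ancestors whose
    license label appears in `restricted`, walking the whole lineage."""
    found, seen, stack = [], set(), [artifact]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # already visited via another path
        seen.add(id(node))
        if node.license_terms in restricted:
            found.append(node.name)
        stack.extend(node.parents)
    return sorted(found)


# Hypothetical chain: a research-only base model, a fine-tune of it,
# synthetic data generated by the fine-tune, and a commercial model
# trained on that synthetic data.
base = Artifact("base-model", "research-only")
tuned = Artifact("fine-tuned-model", "unspecified", parents=[base])
synthetic = Artifact("synthetic-dataset", "unspecified", parents=[tuned])
product = Artifact("commercial-model", "proprietary", parents=[synthetic])

print(restricted_ancestors(product))
```

The walk only works if every step of the lineage was recorded in the first place; when a model's training mix is unknown, there is no graph to traverse.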

Daniel Whitenack

[30:07] Yeah. I’ve wanted to use some of these LLaMA-based models quite recently, and I just haven’t, because it makes me ask a lot of questions. So am I right with that hesitation, or is it yet to be determined, but you would make certain assumptions, or what do you think?

Yeah, so I should clarify that I’m a lawyer, but I’m not your lawyer. So nothing I’m saying is going to be legal advice, okay?

Daniel Whitenack

Yes, correct. Correct.

So I would say that yes, anytime that you are ingesting items that are licensed, and then you’re using them in a way that is maybe against that license, or not permitted by that license, I think anyone should be worried when that happens, speaking generally. Also, proving such things is tricky, because there is the law, and then there’s what can be proved by a preponderance of the evidence in a court of law. As the dominoes fall, and as the snake keeps eating its tail, the provenance of what data did you actually use, and where did it come from –

Daniel Whitenack

It gets murky.

It does get really murky. And so that’s something where I imagine litigation is going to happen a lot.

So I’m absolutely fascinated by that, and want to take it farther… So I’m thinking back on years of business, and all the IP concerns… I worked for big corporations, lots of other people work for various… If the snake continues to eat its tail, and you’re seeing this happen over and over again, the value of current IP, generally, I would argue, diminishes over time, because its usefulness in business as things are progressing ever faster in the business plus technology world… You know, something that was a great piece of IP a few years ago - it might still be covered legally, but you’re not necessarily going to use what was 20 years ago versus what you did yesterday.

With that kind of utility of current IP diminishing, and with this sequence of snake eating its tail that you’re describing, and fruit of the poisoned tree I believe you called it - that has to have massive, massive repercussions for how business uses IP in the large, in general; your entire strategy about it. Because right now, organizations, they will come up with an idea, they will immediately go copyright it, whatever the appropriate mechanism is, and get that in, they lock it in, it’s part of their business strategy. That seems to me from what you’re saying to fail in the future; it is no longer a good strategy. What does that mean in the large? I mean, that’s a gigantic question, I think.

What I think you’ve described, and what I think we’re seeing in society, is the intellectual property laws that we’ve created since the beginning of our founding history - the Constitution says that we will protect inventions; that’s constitutional. So what I think we’re seeing is our intellectual property regime that has existed since the 1700s creaking under its own weight, with these new large language models generating in a way that’s never been done in human history.

I think you’re right, the value of patents - what is the value of a patent if I can use a large language model to, much like I described with all the patents… You know, we with “All the Patents” start saying “Every patent that’s ever been done, and every idea, every claim and every patent - let’s recombine those.” But you can imagine - and I’ve heard that there are companies out there that are doing not what’s been done before, but new ideas, and then making a ton of new claims and filing those new claims with the US Patent Office. And to be able to say, “Here’s a new idea.” And if you carpet-bomb the US Patent Office with all of these new things that are just machine-generated… There’s currently a case, the Thaler case, in which he had a machine create this patent, and the patent office said “Ah, machine-created patents are not a thing. You can’t do it.” But they only knew that because Thaler told them that it was machine-created. How many of these things are being filed where nobody’s told anybody that it was machine-created? And is that fraud on the patent office? Probably. But the question is, who’s going to find out if it doesn’t go to litigation?

[34:06] For a moment, I want to ask you to stop being an attorney, and be a speculator here. I want you to kind of blue sky it; like, where can this possibly go, or what are the different paths? What’s your gut tell you in terms of how this plays out? Because you’ve described multiple ways in the last few minutes where the whole system can essentially collapse under its own weight. Not just one way; you’ve done it several different ways where that can happen… Which isn’t surprising, because over the episodes of the show we keep talking about the rapid change that all this is bringing in AI; it’s the most fascinating moment in human history, in my view, and you’re describing the weight of all the structure of the past in terms of the legal considerations unable to keep up with what’s going on now, and it’s only accelerating. Where do we go from here? What does that mean?

Sure. If past is prologue to what’s going to happen, I would say that - you know, we’ve seen over the last decade business patents essentially going away; we’ve seen software patents almost pretty much go away with the [unintelligible 00:35:02.03] decision and others… So patents have already been diminishing in value over the last 10 years or so. And I think this is just going to accelerate that diminishment. Because if I’m going to compete in the marketplace, largely, anything I invent today is going to be obsolete in three years anyway. So what’s the good in patenting a thing that is obsolete in three years? I think Elon Musk said, “I’m open sourcing all my patents, because a patent is merely a license to sue.” And that’s true. I spend a million dollars or $2 million to get the patent, and then I have to spend millions on top of it to sue somebody over that patent. That license to sue often just doesn’t make business sense, and I think it makes even less sense as the system is collapsing under its own weight.

So if you’re looking for me to speculate, and you are, I would hope that the patent regime is going to fall away in importance, and people are just going to innovate.

Daniel Whitenack

As all of us on this call are knowledge workers, and program or do legal work, that sort of thing, I think we’re all benefiting in terms of productivity moving into the future. Outside of the IP stuff, the copyright things, all those things, I’m coding much faster now, not because I’m necessarily a better coder, which maybe I’d like to think I am, but I’m probably not… It’s because I’m using generative tools and suggestions in a much more robust way.

I was fascinated, in one of your recent talks when you were talking about kind of the practical consequences of that. If I can work 50% faster, do I still work the same amount, or do I work less? And what are the implications of my employer’s viewpoint on my work, and that sort of thing? Could you talk us through a little bit about your thinking in that regard?

Yeah, so I’m going to talk about four worlds. The first world is the 2022 world, before the large language models. And in that world, I would work 40 hours a week, full-time, and I would give 40 hours a week of 2022 productivity as a result of that. And as a result of that, an employer would hire a workforce like me to do that. So that’s world number one.

In world number two, I know of people anecdotally that are working three full-time jobs, because they’re getting at least 100% or so productivity gains, maybe 10x productivity gains based on the code that you said. So they have three full-time jobs. So they’re essentially working 30% of the time for each, but still providing 100% of the output for that. And their employer is saying, “Wow, that’s great output.” They don’t care. So that’s world number two, I think what we’re in today.

World number three is probably the employer is gonna say, “Hey, don’t give me 30% of your time, give me 100% of your time, and maybe give me 10x output of 2022 level output. I want that productivity gain from you.” So that’s world number three. But I think shortly thereafter is going to come world number four, where the executives are going to say, “Wait, wait, wait, if we lay off two thirds of the workforce, and then still require the rest to work 40 hours a week with their 10x productivity, I can say to my shareholders ‘Look at all the costs that we cut by laying off two thirds of the workforce, and we’re still getting 5x productivity on top of our 2022 productivity. We’ve cut costs, we’ve increased productivity. Aren’t we great?’” I think that that’s probably the world that we’re headed for.
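The arithmetic behind the four worlds can be worked through with a small sketch. The function name and the way output is modeled (fraction of staff kept, times share of a full-time week given to one employer, times a per-person productivity multiplier) are assumptions made for illustration; the 3x multiplier in world two is likewise an assumption, inferred from one person covering three full-time jobs.

```python
from fractions import Fraction


def team_output(headcount_fraction, time_share, multiplier):
    """Team output relative to the 2022 baseline.

    headcount_fraction: share of the 2022 workforce still employed
    time_share: share of a full-time week each person gives this employer
    multiplier: per-person productivity relative to 2022 (assumed)
    """
    return headcount_fraction * time_share * multiplier


worlds = {
    "world 1 (2022 baseline)": team_output(Fraction(1), Fraction(1), 1),
    "world 2 (three jobs, 3x assumed)": team_output(Fraction(1), Fraction(1, 3), 3),
    "world 3 (full time, 10x)": team_output(Fraction(1), Fraction(1), 10),
    "world 4 (one third of staff, 10x)": team_output(Fraction(1, 3), Fraction(1), 10),
}

for name, output in worlds.items():
    print(f"{name}: {float(output):.2f}x of 2022 output")
```

Under these assumptions world four works out to roughly 3.3x rather than the 5x mentioned in conversation, so the spoken figures are best read as round numbers. The direction of the argument is unchanged: payroll falls by two thirds while output still rises well above the 2022 level.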

[38:27] And there’s a world six beyond that, which - that leads, very obviously, to that recognition of cut the workforce, and stuff. I don’t know if we want to go there or not to finish up, but we’ve got some tough social issues to navigate there.

Really, what we’re describing here is there’s a scarcity mindset, and there’s an abundance mindset. The scarcity mindset is that – around 1979 accountants were really worried about this artificial intelligence that’s called the spreadsheet. Because they said, “Wow, all we do all day is use ledgers, and we add and subtract numbers, and machines can do that in seconds. That’s gonna put us all out of work.” But what happened was that when the clients realized, “Oh, it’s not going to take me a week to get that ledger back, but it’s going to take seconds… Let’s do scenario two, and scenario three, and scenario four, and run more scenarios”, and now we have more accountants than ever, because the tools are actually a force multiplier, so that now there’s more accounting work, rather than less. So that is an abundance mindset, that is not a scarcity mindset.

So the real question in my mind, and maybe it should be on all of our minds, is “Is the scarcity mindset that I described with worlds one, two, three and four - is that going to be our future? Or is there an abundance mindset where we just have 10x or 100x productivity and we keep growing and growing?”

Daniel Whitenack

I think that’s a great transition, kind of as we get to the close here. Maybe one question that I’d like to ask you… We’ve talked about various interesting scenarios and maybe things that are honestly kind of uncomfortable for a lot of our technical listeners around legal questions, and lawsuits, and copyright, and that sort of thing. From your perspective, as you look to the future, kind of this next year, what are you encouraged by, and/or how would you encourage our listeners, maybe those practical developers or practitioners out there? How would you encourage them to engage in this conversation and these topics moving into the future? And what are you excited about or encouraged by moving into the future?

I think about AI as largely a tidal wave, or a tsunami, and we are running faster than the tsunami. How do we run faster than the tsunami? You learn how to use Copilot to be able to code faster. You learn how to be able to do things that the machine cannot yet do. That’s running faster than the tsunami.

So really, I’d say to lawyers that are worried about AI, that AI will not take a lawyer’s job, but a lawyer that uses AI will take the job of a lawyer that does not use AI. And so really, I would say the same thing for coders who are listening, that learning to use the tool to run faster than the tsunami… There’s another joke; there was a bear at a campground, and two guys, and the one guy gets out of his tennis shoes, and the other guy says “You can’t outrun a bear.” And he said, “I don’t have to. I just have to outrun you.” So in that sense, learn how to use the large language models to outrun your competition, because as the wave crashes over them, it’s not going to crash over you. I think that we all have to reckon, eventually the wave I think may crash over all of us, but until then, I think we should be running as fast as we can.

Daniel Whitenack

Awesome. Yeah, that’s a great encouragement, and thank you so much for humoring us with all of our random questions, some of which were selfish on my part, but I’ve learned a lot, and I really appreciate your insights, Damien, and the work that you’re doing. I look forward to seeing your future projects, and I’m sure that our listeners will find this super-interesting. Thank you so much.

Thank you. I don’t often get to speak to audiences as sophisticated as yours, so I really enjoyed the really deep and probing questions, and I really am grateful for the opportunity.
