Changelog Interviews – Episode #578
What exactly is Open Source AI?
with Stefano Maffulli, Executive Director of the Open Source Initiative (OSI)
This week we’re joined by Stefano Maffulli, the Executive Director of the Open Source Initiative (OSI). They are responsible for representing the idea and the definition of open source globally. Stefano shares the challenges they face as a US-based non-profit with a global impact. We discuss the work Stefano and the OSI are doing to define Open Source AI, and why we need an accepted and shared definition. Of course we also talk about the potential impact if a poorly defined Open Source AI emerges from all their efforts.
Note: Stefano was under the weather for this conversation, but powered through because of how important this topic is.
Sponsors
Vercel – Zero configuration for over 35 frameworks. Vercel is the Frontend Cloud that makes it easy for any team to deploy their apps. Today, you can get a 14-day free trial of Vercel Pro, or get a customized Enterprise demo from their team. Visit vercel.com/changelogpod to get started.
Synadia – Take NATS to the next level via a global, multi-cloud, multi-geo and extensible service, fully managed by Synadia. They take care of all the infrastructure, management, monitoring, and maintenance for you so you can focus on building exceptional distributed applications.
CIQ / Rocky Linux – CIQ is Rocky Linux’s founding support partner. They support the free, stable, and secure Linux distro called Rocky Linux.
Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | This week on The Changelog | 01:30 |
2 | 01:30 | Sponsor: Vercel | 02:45 |
3 | 04:15 | Start the show! | 01:21 |
4 | 05:36 | The value of open source | 02:37 |
5 | 08:13 | The Open Source Initiative | 02:32 |
6 | 10:45 | Who formed the OSI? | 03:34 |
7 | 14:19 | Defending Open Source licenses | 00:45 |
8 | 15:04 | Operating internationally | 02:31 |
9 | 17:36 | The responsibility of protecting Open Source | 06:15 |
10 | 23:51 | Sponsor: Synadia | 05:10 |
11 | 29:01 | Now we have Open Source AI | 04:06 |
12 | 33:07 | How do you take action? (calling Zuck) | 02:33 |
13 | 35:40 | The 4 principles of Open Source AI | 02:14 |
14 | 37:54 | The preferred form of modification | 04:31 |
15 | 42:25 | The components of an AI system | 03:33 |
16 | 45:58 | Jerod shares a hypothetical example | 05:17 |
17 | 51:14 | Sponsor: CIQ / Rocky Linux | 03:50 |
18 | 55:04 | Open Source AI done right? | 02:58 |
19 | 58:03 | What's the benefit of being Open Source AI? | 02:28 |
20 | 1:00:31 | Is AI being commoditized? | 03:10 |
21 | 1:03:40 | Who's participating in this definition? | 03:43 |
22 | 1:07:23 | But what if there's no benefit? | 02:37 |
23 | 1:10:00 | This is a lot of work | 01:34 |
24 | 1:11:35 | What's at stake? | 02:24 |
25 | 1:13:59 | You should get involved | 00:45 |
26 | 1:14:44 | What's next? | 02:35 |
Transcript
Well, Stefano, it’s been a while… Actually never, which is a good thing, I suppose, but now we’re here. Fantastic. We were at All Things Open recently, and we tried to sync up with you, but we missed the message, and so we were like “We’ve gotta get you on the podcast.” And obviously, this show, the Changelog, was born around open source. And I kind of find it strange and sad that we’ve never had anybody from the Open Source Initiative on this podcast. I’m glad you’re here to change that, so welcome.
Thank you. Thank you for having me. It’s a pleasure. I’m sorry we missed each other in South Carolina. It was a great event.
Oh man, we love All Things Open, we love Todd and their team there. We think All Things Open is the place to be at the end of the year.
Oh, for sure.
If you’re a fan of open source, you’re an advocate of open source, and just the way that it’s permeating all software. It’s won. Open source has won, and now we’re just living in a hopefully mostly open source world, right?
Absolutely, absolutely. I mean, just last week there was an article published that estimated the value of open source software as a whole. The numbers are incredible. These researchers from Harvard business school went and looked at the body of open source as it is, consumed or produced, and they put dollar numbers on it.
I envy those people, because I don’t know how, I’m not an analyst… Jerod, maybe you’re like somewhat of an analyst, right? You have an analytical brain, from how I know of you…
Okay.
I don’t know how you would quantify the value of open – I mean, I know it’s quite valuable… But literally, how do you value, how do you quantify the value of open source? What do they do? What are the metrics they key off of, do you know?
They counted lines of code, they counted the hours, they estimated the hours that it would take to rewrite from scratch all the software that is in use, and they use the datasets that are available already with some of those counts… And using those two datasets, they estimated the value that it would take to replicate all of the open source software that is available, and they put the numbers around $8.8 trillion.
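The replacement-cost reasoning described here (count the code, estimate the hours to rewrite it, price those hours) can be sketched in a few lines. This is only an illustration of the arithmetic, not the researchers' actual model: the function name `replacement_cost` and every input number below are hypothetical placeholders.

```python
def replacement_cost(lines_of_code, lines_per_hour, hourly_wage):
    """Estimate the cost of rewriting a codebase from scratch.

    All three inputs are assumptions supplied by the caller; the study
    discussed in the episode used its own datasets and methodology.
    """
    hours = lines_of_code / lines_per_hour  # effort needed to rewrite
    return hours * hourly_wage              # price of that effort


# Hypothetical inputs, chosen only to show the calculation.
cost = replacement_cost(lines_of_code=1_000_000, lines_per_hour=10, hourly_wage=50)
print(f"Estimated replacement cost: ${cost:,.0f}")
```

Summing an estimate like this over every package in use is the general idea behind a headline figure such as the one quoted in the conversation.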
Wow.
I would actually just say all the dollars really, personally. I would just say all the dollars.
Yeah. Well, I mean, it’s a huge number. All the dollars.
Right. Doesn’t every dollar today like really depend on open source at some layer? So really, couldn’t it be just all the dollars?
Right, it’s an impressive number, and it’s really hard to picture it, how big it is. I had to look it up… So it’s three times as much as Microsoft’s market cap, and it’s larger than the whole of the United States budget. 2023’s budget in the United States - that includes Medicare… 6.3 trillion.
That’s a lot of trillions there. More trillions than I’ve got, Jerod, of anything? I don’t have trillions of anything, really. Maybe – not even in cents. [unintelligible 00:07:35.22]
I don’t think so.
You don’t keep a bucket?
I almost asked Siri to tell me –
You’ve gotta turn those into the bank and see what they’ll give you.
Yeah… That’s fun to think about, really.
Well, I hear a number like 8.8 trillion and I start to think “Why don’t you round that up to nine?” And then I realize, that’s like a fifth of a trillion dollars if you’re gonna round it. That’s a lot of money to round.
[laughs] But it’s a nice rounding error in your favor, if it was your own dollars…
Right?
Oh, yeah.
I wouldn’t mind that, for sure.
Yeah, round it off, hand it out to some folks. Hand it off to some maintainers. That’d be nice.
[00:08:11.16] Yeah. Well, I don’t know if everybody listening to this podcast will be – I think a lot of them will be, but in light of recent feedback, Jerod, I don’t want to assume that our listenership is super-informed of what the Open Source Initiative is. I can kind of read from the About page, Stefano, but I’d prefer that you kind of give us a taste of what the OSI is really about. What is the organization? It’s a 501(C)3, it’s a public benefit corporation in California… But what exactly is the Open Source Initiative, for all that value we’ve just talked about? What is it?
Oh, yeah. In a nutshell, we are the maintainers of the Open Source definition. And the Open Source definition is a 10-point checklist that has been used for 26 years. We celebrated 25 years last year. It’s the checklist that has been used to evaluate licenses, that is, legal documents that come together with software packages, to make sure that the packages, the software, comes with the freedoms that are written down; they can be summarized as four freedoms, that come from the free software definition… That is the freedom to use the software without having to ask for permission, the freedom to study and to make sure that you know and understand what it does, and what it’s supposed to be doing, and nothing else. And for that, you need access to the source code. And then the freedom to modify it, so to fix it and increase its capacity, or help yourself… And the freedom to make copies, that is for yourself or to help others. And those freedoms were written down in the ’80s by the Free Software Foundation, and the Open Source Initiative started a couple of decades after that, picking up the principles and spreading them out in a more practical way… At a time when a lot of software was being deployed, and powering the internet, basically. This definition [unintelligible 00:10:19.28] licenses gives users and developers clarity about the things that they can do. It provides that agency and independence and control, and all of that clarity is what has propelled and generated that huge ecosystem that is worth 8.8 trillion.
So who formed the initiative? And then how did it sustain and continue? It seems like the definition is pretty set… But what is the work that goes on continuingly?
Yeah, well, the work that goes on continuously is, especially now recently, it’s the policy, the monitoring of policy works, and everything that goes around it. The concept of open source seems to be set, but it’s constantly under threat, because evolution of technology, changes of business models, the rise and rise of importance and power of new actors constantly shifts and tend to push the definition itself of open source in different directions, the meaning of open source in different directions. And regulation also tends to introduce hurdles that we need to be aware of.
The organization, what we do - we have three programs. One is called the legal and licenses program. And that’s where we maintain the definition, we review new licenses as they get approved, and we also keep a database of licensing information for packages… Because often, developers don’t use the right words or miss some pieces, a lot of packages don’t have the right data… And we’re maintaining the community that maintains this machine called ClearlyDefined.
[00:12:14.05] On the policy front - that’s another program, the policy and standards front - we monitor the activity of standard-setting organizations, and the activity of regulators in the United States and Europe mostly, to make sure that all the new laws and rules, and the standards, can be implemented with open source code, and that regulation doesn’t stop or block the development and distribution of open source software.
Then the third program is on advocacy and outreach, and that’s the activities that we do with maintaining the blog, having the communication, running events… And in this program, we’re also hosting the conversations around defining open source AI, which is a requirement that came out especially a couple of years ago, [unintelligible 00:13:00.16] So we were basically forced to start this process, because AI is a brand new system, brand new activities, and it forces us to review the principles to see if they still apply, and how they need to be modified so they apply to AI systems as a whole.
And we are a charity organization, you mentioned that… So our sponsors are individuals who donate to become members, and they can donate any amounts, from $50 a year, up to what have you. And we have a few hundreds of those, almost 1,000. And then we have corporate sponsors, who give us money also, donations, to keep this work going. It’s in their interest to have an independent organization that maintains the definition. And having multiple of these donors, corporate donors makes the organization stronger, so we don’t depend on any one thing individually of them. So despite the fact that we get money from Google, or Amazon, or Microsoft and GitHub, we don’t have to swear our allegiances to them.
Do you also defend the license so far as going to court with people who would misuse it, or no?
It hasn’t happened, but we do have – I mean, not under my watch. But we do have experts on our board and in our circle of licensing experts, we do have lawyers who will go to court constantly to defend the license, defend trademark, protect users.
And they are there as like expert witnesses?
Exactly. And we have provided briefs for courts, opinion pieces for regulators, and responses to requests for more information in various legislations.
How challenging is it to be a US-based/founded idea, now organization, that represents and defends this definition that really, going back to the trillions… I mean, all the money, all the dollars. Like, it’s a world problem, it’s not just a United States problem. How does this organization operate internationally with challenges that you face as a US-based nonprofit, but representative of the idea of open source that really impacts everyone globally?
[00:15:40.24] Yeah, that’s a very good question. In fact, it is challenging. So I started at the organization only a little over two years ago… And I’m Italian, and so I do have connections to Europe, and knowledge about Europe. We do have board members that are based in Europe and other board members in the United States, and it is actually quite challenging to be involved into these global conversations, because now, a little bit like maybe in the late ‘90s, open source is becoming increasingly – getting at the center of geopolitical challenges. And not because of open source per se, but because software is so incredibly – existing everywhere, and most of that software that exists is open source. So there have been a lot of challenges as the trade relationship with other actors like Russia, Ukraine, now with the war in Israel and Gaza, and the trade wars with China, between China and the United States… There are a lot of geopolitical issues that we’re at the center of, and we’re finding it really complicated. In fact, we have raised more money to increase our visibility on the policy front. At the moment, we have two people working, one in Europe, and one is more focused in the United States, both of them are part time… But we do have budget to hire at least another one, if not two, policy analysts to help us review the incredible amount of legislation that is coming. We’re just talking about United States and Europe.
I guess even one more layer than that is that – I don’t know if it’s a self profession of the defendership of the term of open source; I understand where it came from, to some degree… And I wonder, how do you all handle the responsibility of not so much owning the trademark term of open source, but defending it? So in a way, you kind of own it by defending it, because you have to defend it. Like, it’s some version of responsibility, which is a maybe a byproduct of ownership, right? There’s a pushback happening out there. There’s even a conversation of recent, where they can’t describe their software as open source, because the term means something. And we all agree on that, right? We understand that. I’m not trying to defend that. But how do you operate as an organization that defends this term?
Yeah, I mean, this is really funny, because we don’t have a trademark on the term open source applied to software. We have a soft power, if you want, that is given to us by all the people who, just like you just said, recognize that the term open source is what we have defined. We maintain the definition, and it’s kind of recursive, if you want, but corporations, individual developers, in all their institutions, like Academia, researchers - they recognize that open source means exactly the list of licenses, those 10 points, if you want, the four freedoms that are listed. And we maintain that. And this has become quite visible also even in courts, where they do understand that if someone is – there was a recent case involving the company neo4j. And during that litigation, that is quite complicated [unintelligible 00:19:13.03] I’m not a lawyer, I’m not going to dive into legal things… But the one key takeaway that is easy for me to grok and communicate is that the judge recognized that the value of open source is in the definition that we maintain… And calling open source something that is not a license that we have [unintelligible 00:19:38.10] approved is false advertising.
And that held up in court.
Oh, yeah.
Interesting.
[00:19:45.16] So is that what you would say to people who are perhaps - maybe nonchalant isn’t the best word, but unimpressed by open source as a definition, and they think it’s stodgy, and tight, and the thing that they’re doing is close enough, and they like the term, they’re going to use the term, and they’ve got open-ish code, or source available, or business source…? Because there’s a lot of people that are kind of pushing not just against the definition itself, but against the idea that we need a definition; or like you guys get to have the definition. What do you say to them?
Yeah, they’re self-serving, they try to be self-serving, and they’re trying to destroy the commons that way, quite visibly. I think that users see through them. And it’s not even in their interests, but you know how it works - sometimes corporations, their greed goes up to… They care only about the next quarter, and who cares about what happens next? Maybe the next CEO will have to take care of it; meanwhile, they’re just going to laugh all the way to the bank. And that is the approach that I’ve seen from many of these people who complain, or who try to redefine open source because it doesn’t serve the purpose - what we maintain, it doesn’t fully serve their purpose. So instead of respecting the commons and sharing the ideas, they act like bullies and find all sorts of excuses to redefine. I’ve seen it happening. I’ve been in free software and open source most of my career, since I was in my 20s, and I’ve seen what was happening in the early days, with the proprietary Unix guys that were going around telling us that “This Linux thing is never going to work. You’re joking.” Then they started to be scared and started saying “Hey, you’re giving away your jewels. Why are you doing this, depriving us of our life support? Our families - we’re gonna be begging on the street.” I remember having this conversation with a sales guy from [unintelligible 00:21:49.20] And Microsoft coming up with a program in the ’90s, early 2000s, the Shared Source Program, because they just could not wrap their head around the fact that you could make money sharing your source code. But they were forced by the market to show at least a little bit of what was happening behind the scenes. They were losing deals.
So we’ve seen it already… They’re gonna keep on going like this, but there is plenty of interest in maintaining – plenty more forces on the other side to maintain, to keep the bar straight, to keep going where we’re going… Because that clarity is – it’s such a powerful instrument to be able to say “I’m open source, therefore I know what I can do, I know what I cannot do”, and have that collaboration straightened up. The legal departments, the compliance departments, the public tenders, they all tend to have a very clear and speedy review of processes… That instead if everyone has a different understanding of what open source means… You know, “Do we go back to the brand”, right? And I’m in Italy now, and I’m surprised to see a lot of Starbucks stores opening. And I’m absolutely baffled. Like “Why is this happening? This country has plenty of bars.” At every quarter there’s a cafe with a decent coffee. Why do you need a brand? It’s because people have been going around, traveling the world, they see the brand, they recognize it, they know what they can do, they know what they’re gonna get, and they go there. And it’s the same with open source.
Break: [00:23:37.14]
So last year on this time Meta released LLaMA, their large language model, and to much fanfare and applause, and they announced it as open source. We know a lot has transpired since then, but at the time, what was your response to that, personally, or as the executive director of the OSI? What were you thinking? What were you doing in the wake of that announcement?
Well, we were already looking at open source AI in general. We were trying to understand what this new world meant, and what the impact was on the principles of open source as they applied to new artifacts being created in AI. We already had come to the conclusion that open source AI is a different animal than open source software. There are many differences. So two years ago, over two years ago, one of the first things that I started was to really push the board and to push the community to think about AI as a new artifact, and that required and deserved also a deep understanding, and a deep analysis to see how we could transport the benefits of open source software into this world. The release of LLaMA 2 kind of cemented that idea. It is a completely new artifact, because - sure, they have released a lot of information, a lot of details, but for example, we don’t know exactly what went into the training data.
And LLaMA 2 also came out with a license that really has a lot of restrictions on use… Having restrictions on use is one of the things that we don’t – I mean, the Open Source definition forbids. You cannot have any restrictions on use. And at surface value, the license for LLaMA 2 seems innocent, right? One of the things says “Well, you cannot use our tools for commercial applications if you have more than a few million”, I don’t remember exactly how many, “a few million monthly active users.” Okay, maybe that’s a fair limitation. And in my mind, I was like “So what does it mean, that the government of India cannot use it? The government of Italy, maybe?” If you want to embed this into… So that’s already an exclusion, and I have to think about it, think about “Yeah, I’m a startup, I’m small thing. But what happens when I get to 6 million users?” All of a sudden you have to lawyer up and change completely your processes?
But then there are a couple of other instructions inside that license that are even more innocent at the surface, but when you start diving deeper… Like, “You cannot do anything illegal with it.” Okay, alright… So let me see. If I help someone decide whether they can or they should have an abortion, or if I want to use this tool in applications to help me, I don’t know, get refugees out of war zones, into another place… And maybe I’m considered a terrorist organization by the government that is using that. So am I doing something illegal? It depends on whose side, who needs to be evaluating that.
It’s these licensing terms that the Open Source Initiative really doesn’t think they’re useful, they’re valuable, and they should not be part of a license. They should not be part of a contract in general, and they need to be dealt with at a separate level.
So that’s what I was looking at, was like “Oh, LLaMA 2. Oh, my God… It’s not open source, because clearly this licensing thing would never pass our approval.” And at the same time, we don’t even know exactly what open source means. Why are you polluting this space? So I was really upset.
Yeah. So then do you spring into action? Like, what does the OSI do? Because you’re the defenders of the definition, and here’s a huge public misuse. Do you write a blog post? Do you send a letter from a lawyer? What do you do?
Do you call Zuck?
[00:33:23.02] Luckily, we were already into this two-year process of defining open source AI. Actually, I was already in conversations with Meta to have them join the process and support the process to find the shared definition of open source AI. And in fact, they’re part of this conversation that I’m having with not just corporations like Google, Microsoft, GitHub, Amazon, etc. but also, we’ve invited researchers in Academia, creators of AI, experts of ethics and philosophy, organizations that deal with open in general, open knowledge and open data, like Wikimedia, Creative Commons, Open Knowledge Foundation, Mozilla Foundation… And we’re talking also with experts in ethics, but also organizations like digital rights groups, like the EFF, and other organizations around the world who help in this debate. Like, we had to first go through an exercise to understand and come to a shared agreement that AI is a different thing than software. Then we went through an exercise to find the shared values that we want to have represented, and why we want to have the same sort of advantages that we have for software also ported over to the AI system.
And then we have identified the freedoms that we want to have exercised, and now we’re at the point where we are trying to name the list of components of AI systems, which is not as simple as binary code, compiler and source code… So it’s not as simple as that. It’s a lot more complicated. So we’re building this list of components for specific systems. And the idea is, by the end of spring, early summer, to have the equivalent of what we have now as a checklist for legal documents, for software, and have the equivalent for AI systems and their components, so that we will know… Basically, we have a release candidate for an open source AI definition.
Yeah, you mentioned that, and there’s – I think you posted this eight days ago, a new draft of the open source AI definition, version 0.0.5. It’s available, I’m gonna read from I think what you might be alluding to, which is this exactly what is open source AI. And it says, linked up to the [unintelligible 00:35:56.06] document, it says “What is Open Source AI? To be open source, an AI system needs to be available under legal terms that grant the freedoms to 1) use the system for any purpose and without having to ask for permission; 2) study how the system works and inspect its components; 3) modify the system for any purpose, including to change its output; 4) share the system for others to use, with or without modifications, for any purpose.” So those seem to be the four hinges that this “What is Open Source AI” is hinging upon, at least in its current draft. Is that pretty accurate, considering it’s recent, eight days ago?
Yeah, those are the four principles that we want to have represented. Now, the very crucial question, what comes next, is if you are familiar with the four freedoms for software, those set by the Free Software Foundation in the late ‘80s, those freedoms have one little sentence attached to it, to the freedom to study and the freedom to modify… They both say “Access to the source code is a precondition for this.” That little addition is meant to clarify the fact that if you want to study a system, if you want to modify it, you need to have a way to make modifications to it that is not just the – it’s preferred form to make modifications from the human perspective. It’s not that you give me a binary and then I have to decompile it, or try to figure out from reverse-engineering how it works. Give me the source code. I need the source code in order to study it.
For the AI systems, we haven’t really found yet a shared understanding or a shared agreement on what it means to have access to the preferred form to make modifications to an AI system. That’s the exercise that we’re running now.
Yeah, that’s interesting. The preferred form of modification is really interesting, because like you said, you don’t want to give a binary and expect reverse engineering, because… That’s possible, and that’s possible maybe to a small subset; it’s not the preferred route to get to Rome. It’s just like “That’s not the road I want to go down. I want a different way.”
[00:38:10.23] Yeah. And you want to have a simple way. Some licenses even have more specific wording around defining what source code actually means. The GNU GPL is one of those. There are very clear descriptions and prescriptions about what needs to be given to users in order to exercise those freedoms, their freedoms as a user.
For AI it’s complicated, because there are a few new things for which we don’t even have – there are no court cases yet… I keep repeating the same story - when software came out for the first time [unintelligible 00:38:47.18] research labs, they started to become a commercial artifact that people could just sell. There was a conscious decision to apply copyrights to it. There was not a given fact that it was going to be using copyright, and copyright law.
That decision was a lucky one, honestly, or it was well thought out - I don’t know which of the two - because copyright as a legal system is very similar across the world. And building the open source definition, the free software definition, the legal documents that go with software for open source software and free software, those legal documents built on top of copyright means that they’re very, very similarly applied pretty much everywhere around the world. The alternative at the time were conversations around treating software as an invention, and therefore covered by patents. Patent law is a whole different mess around the world; the whole different applications they have, the whole different terms… Much more complicated to deal with.
So for AI, we’re pretty much at the same stage where there are some new artifacts. After you train a model, and that produces weights and parameters that go into the model, those models - honestly, it’s not clear what kind of legal frameworks apply to those things… And we might be at the same time in history where we could have to imagine and think, and maybe suggest and recommend what the best course of action will be. Whether it makes sense to treat them as copyrightable entities, artifacts, or nothing at all, or inventions, or some other rights, or exclusive rights.
And the same goes into the other big conversation that is happening already, but for which there is no – I don’t have a clear view of where it’s gonna end, is the conversations around the right to data mining. And if you follow the conversations around ChatGPT being sued by the New York Times and Getty Images, Stability AI [unintelligible 00:41:01.08] and GitHub being sued by anonymous etc, etc. a lot of those lawsuits hinge on what’s happening, why are these powerful corporations going around and crawling the internet, aggregating all of this information and data that we have provided, uploaded… We, society. Some commercial actors, some non-commercial actors. We have created this wealth of data on the internet, and they are going around claiming it, and basically making it proprietary, and building models that they have for themselves. And on top of that, you can already start seeing “Oh my God, they’re gonna be eventually making a lot of money out of the things that we have created.” Or even more scarily - sometimes I think about this myself… I’ve been uploading my pictures for many years without thinking too much… So there is a database out there; I’m sure that someone has built a database out there with my pictures as I was aging… And now these pictures of me can be used, could be used by an evil government, or an evil actor to recognize me around the streets at any time… [unintelligible 00:42:12.24] So is that fair? Is that not fair? Those are big questions, and there is no easy or simple answer.
[00:42:23.18] Yeah. So did you enumerate and I missed it, or can we enumerate the components that you have decided so far are part of an AI system? The code, I heard, the training data etc.
Yeah… There are three main categories, maybe four. One is the category of data, one is the category of code. The other category is models, and there is a fourth category that goes into other things, like documentation, for example. Instructions on how to use, scientific papers.
In the data part, some of the components are the training data, the testing data… In the code part goes the tooling - like for the architecture, the inference code to run the model… Anything that is written by a human in general; you can also have in there the code to filter and set up the datasets and prepare them for the training… And then in the models you have the model architecture, the model parameters, including weights, hyperparameters and things like that. There might be intermediate steps during the training… And the last bit is documentation, sample outputs.
So there is an initial list of all of these components that – the Linux Foundation worked on creating this list specifically for generative AI and large language models. And we’re working with them - I mean, we’re using their list as a backdrop or as a starting point to move forward this conversation.
Now, the question that we need to ask - having this list, and if you go to the draft five, you will see an empty matrix, basically. It’s a list of components, and [unintelligible 00:44:17.20] and then on a row next to them there is a question, “Do I need it to run it? Do I need it to use it? Do I need it to copy it? Do I need it to study it? Do I need this component to modify the system?” And we’re referring to the system. This is one of the important things - the open source definition refers to the program. And the program is never defined, but a program - pretty much we know what it is. AI is – and again, this is a very complicated question. It looks very simple on the surface, but when you start diving a little bit deeper, it becomes complicated, because what is an AI system, right?
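The empty matrix described above can be sketched as a small data structure. This is only an illustration: the component names are the ones mentioned in the conversation rather than the full list in the draft, the five questions paraphrase the ones Stefano lists, and every identifier (`COMPONENTS`, `QUESTIONS`, `empty_matrix`, `unanswered`) is hypothetical and not taken from the OSI draft.

```python
# Components grouped by the four categories mentioned in the conversation.
# This is a partial, illustrative list, not the list used in the draft.
COMPONENTS = {
    "data": ["training data", "testing data"],
    "code": ["inference code", "data preparation code"],
    "model": ["model architecture", "model parameters"],
    "other": ["documentation", "sample outputs"],
}

# "Do I need this component to ... the system?"
QUESTIONS = ["run", "use", "copy", "study", "modify"]


def empty_matrix():
    """Build a component x question matrix with every cell unanswered (None)."""
    return {
        component: {question: None for question in QUESTIONS}
        for names in COMPONENTS.values()
        for component in names
    }


def unanswered(matrix):
    """Return the (component, question) pairs that still have no answer."""
    return [
        (component, question)
        for component, row in matrix.items()
        for question, answer in row.items()
        if answer is None
    ]


matrix = empty_matrix()
# Hypothetical answer for illustration only, not a position taken in the draft.
matrix["inference code"]["run"] = True
print(len(unanswered(matrix)), "cells still open")
```

Keeping unanswered cells as `None` instead of `False` separates "not yet decided" from "decided it is not required", which matches the idea of starting from an empty matrix and filling it in through discussion.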
So we started using the definition that has been – it’s becoming quite popular in every regulation around the world. It’s a work done by the Organization for Economic Cooperation and Development, the OECD… And they have defined an AI system in very broad terms. And this definition is being used in many regulations, like from the United States executive order on AI, NIST also uses it… In Europe the AI Act uses it, although with a slight, very small, minor variation… It seems to be quite popular, but there are detractors. And indeed, it is quite generic, too. Sometimes when you read it carefully, it may even cover a spreadsheet; it’s really bizarre.
[00:45:57.16] So let’s say that hypothetically I’m like a medical company that has been working on a large language model, and I have proprietary data. So I have like readings, and reports and stuff that we’ve accumulated over the years. And I create an LLM based on that data, that ultimately can answer questions about medicine, or whatever. And I want to open-source that. I need to be able to make it so it’s usable, studiable, modifiable and shareable. And it seems like the training data, even though that’s the most proprietary part - and perhaps the most difficult part to actually make available, or sometimes impossible - is necessary not to use, but to study and modify, it seems like. So if I release the model, the code, all the parameters, everything we use to build a model, everything except for like the source original data, under what you guys are currently working on, that would not be open source AI, would it?
Honestly, that is a very good case. An example for why I think we need to carefully reason around “What exactly do I need to study? What kind of access, what sort of access do I need?” Is that the original dataset? Because if it is the original dataset, then we’re never going to have an open source AI.
Right. That’s where I was getting to. This is not going to happen.
It’s not going to happen. So maybe - and this is my working hypothesis that I threw out there… Maybe what we need is a very good description of what that data is. Maybe samples, maybe instructions on how to replicate it… Because for example, that might be data that is copyrighted. You might have the right under fair use or under different exclusions of copyright, you may have the rights to create a copy and create a derivative around the training. But not to redistribute it. Because if you redistribute it, then you start infringing.
So I think we need to be carefully thinking about [unintelligible 00:48:02.22] And the reason why I became more and more convinced that we don’t need the original dataset is because I’ve seen wonderful mixing, wonderful remixing of models, even splitting of models and recombinations of models, creating whole new capabilities, new AI capabilities, without having to retrain a single thing.
So I’m starting to believe, really, that the AI weights in machine learning, the weights in the architecture - it’s not a binary code. It’s not a binary system, binary code that you have to reverse-engineer. If you have sufficiently detailed instructions on how it’s been built, and what went into it, you should be able, you might be able to create new systems and reassemble it, study how it works, and execute again. Modify. So the preferred [unintelligible 00:49:05.14] to make modifications is not necessarily going through the pipeline, or rebuilding the whole system from scratch… Which for many reasons may be impossible.
I do like the idea of a small subset of the dataset, that’s anonymized, or sanitized in some way, shape or form, that’s like “This is the acceptable sample amount required for the study portion, or the modification portion.”
Yeah. It could be the schema, for example.
Right. Provide your own data in here, if you can - which you can obviously find other ways to use artificial intelligence to generate more data… So that’s a whole thing, right? But I feel like that’s acceptable to me, to provide some sort of sampling, or as you said, the schema. I think that makes sense to me.
[00:49:56.23] Yeah. The research is going also in this direction, with data cards and model cards, lots of metadata specifications… I do think that that might be a viable option. I would love to have – I mean, we will see in the next few weeks and months how that conversation goes… But I do believe that that’s one way that we can get out of this process with a definition that is not just theoretical, something beautiful that you put up in a picture in a museum and nobody can do anything with it. It needs to be practical, I keep repeating… The open source definition had success because it enabled something practical. And it had success because other people have written it, other people have decided to use it. If you keep on insisting from your pedestal that “You shall do this and that”, it may not be finding [unintelligible 00:50:49.20] crowds that follow you.
Right. Yeah. And then if no one’s using it, what’s the point, right? You’ve lost the thread.
Break: [00:51:02.21]
Fully acknowledging that it’s a work in progress, and you’re not done… Given your current mental model of the definition as it is working, are there systems out there today that you would rubber-stamp and say “This is open source AI”? I’m thinking of perhaps Mistral has a bunch of stuff going on, and they’re committed to open and transparent, but I don’t know exactly what that means for them… Have you looked at anything? Do you have things you’re comparing against as you build, to make sure that there’s a set of things that exist or could exist, that are practical?
Not yet. We have an affiliate organization called EleutherAI. They are a group of researchers; they recently incorporated as a 501(c)(3) nonprofit in the United States… And from the very beginning, they’ve been doing a lot of research in the open, releasing datasets, instructions, research papers, models and weights and everything like that. So I’m really leaning a lot on them to shine a light on how this can be done… But I don’t want to be too restricted in my mind. They are very open, with an open science and an open research mentality. I think that open source AI may not necessarily be equally as open, but it can still practically have meaningful impact; it can generate that positive reinforcement of innovation, permissionless collaboration etc.
So yes, I lean on EleutherAI, but I’m also very open, and I’m sure that there will be other organizations, other groups as we go and elaborate more on what we actually need to – what is the preferred form to make modifications to an AI system, that we’re going to discover more.
So no open source AI yet… So there’s no rubber-stamp for anything out there currently.
Well, I mean, like I said, I could rubber-stamp Pythia and the EleutherAI, but I don’t want to say that that’s necessarily the only thing…
Right. There may be more stuff.
And again, those are the guys because I know how they work. Yesterday or the other day OLMo was released by the Allen AI Institute… And that seems to be also quite openly available for models, weights, science behind it etc. I haven’t looked at their licenses, and I haven’t looked at it carefully, so I can’t really tell. It might as well be an open source AI system…
I was trying to get to a definitive, really… Is there or is there not a stamped open source AI out there yet?
You know, I can tell you what is not. LLaMA 2 is not. OpenAI is not.
Touché. Alright.
A deny list, more than a permit list…
Yeah, so I suppose one of the questions which maybe is obvious, but I’ve got to ask it, is what is the benefit? If I’m building a model, and I’m releasing a new AI, what is the benefit to it being open source, to meet this open source AI definition? What is the benefit to its originator? And then obviously, to humanity I kind of get that, but… What’s the benefit? It’s pretty easy to kind of clarify that with software, right? We see how that’s working, because we’ve got 30 years of history or more, in a lot of cases. We’ve got a track record there. We don’t have track record here. It’s still early pioneer days… What’s the benefit?
That is a very good question… And I don’t have an answer for it. I mean, I know the benefit for humanity, I know the benefit for the science of it… And those benefits are what triggered the internet. Like, if software started to come out of the labs without the definition of free software, without the GPL license, without the BSD research, I don’t think we would have had such a fast evolution of software, computer science… We would not have the internet that we see today if everyone had to buy a license from Solaris, Sun, from Oracle etc. If a data center would have to – you know, you would have to go and call Sun Microsystems or IBM’s sales team before you could build the data center, instead of using just boxes and [unintelligible 00:59:36.23] and Apache Web Server on it… We would have had a completely different history of the digital world in the past– I mean, completely different.
[00:59:50.02] So I can see the benefit for society and science. For some of these corporations, I’m assuming that they have made some of their calculations on stopping the competition, or creating competitive advantages… Maybe in pure Silicon Valley approach, like, “Get more users. We’ll figure out the business model later.” There is some of that going on, most likely… But I haven’t had that conversation yet with any of the smart people I know, thinking about the business models behind this, or the possible ways of [unintelligible 01:00:20.22] from this open source model.
Do you think that they’re becoming commoditized? If we specifically talk about these large language models, if we call AI that for now, recognizing it’s an umbrella term and there’s other things that also that represents… Do you think that they are becoming commoditized, and will continue to enough so that open source can keep up with proprietary in terms of quality, or even surpass, just because of the number of people releasing things? I don’t know, that’s why I’m asking, honestly; what are your thoughts on it?
Obviously, recently I saw this new system that – it’s a text-to-speech system, and it’s built by this team of developers from a company called Palabra. They built this system by splitting a system from OpenAI, another from either Anthropic, or I don’t remember exactly… But they split an AI system; they took it and they flipped their input for outputs, and they attached another model of their own training, with small datasets, and they built a brand new thing. This is the kind of stuff that is inspiring. At one point there’s going to be – I’m sure that the quick evolution of this discipline will make it so that smaller teams, with smaller amounts of data, will be able to create very powerful machines. And maybe the advantages of these large corporations that are now deploying, delivering and distributing openly-accessible AI models, maybe in their mind having optimized hardware, cloud resources that they can sell - maybe that’s where they’re going. It’s one of their revenue streams they imagine that [unintelligible 01:02:14.27] coming from.
Yeah, that is exciting. I did see – I think it was like CodiumAI just recently announced a model that beats DeepMind on code generation - according to benchmarks that I haven’t looked at - as well as Copilot… And that’s from a smaller player. I’m not sure if that’s open or closed or what, but it is kind of pointing towards “Okay, there’s significant competition”, and like you said, remixing and the ability to combine and change, and even in some cases swap out and take the best results, that we will have a vibrant ecosystem of these things. And I think open source is the best model for vibrant ecosystems. So that rings true with me… It doesn’t mean it’s right, but it sounds right.
Yeah. This is a tough one. This is really a tough nut to crack, really. I mean, even at the forums you have, I believe you’re calling it the DeepDive, right? It’s DeepDive:AI. And this is the place where you’re hoping that many folks can come and organize. You say it’s the global multi-stakeholder effort to define open source AI, and that you’re bringing together various organizations and individuals to collaboratively write a new document, which is what we’ve been talking about, directly and indirectly. Who else is invited in this? How does this get around? How do people know about this? Who is invited to the table to define or help define? Is this an open way to define it? What is happening, or who’s participating?
[01:03:55.13] Yeah. At this point it’s now public, so anyone can really join the forum and can join me in the bi-weekly townhall meetings. So that part is public and everybody’s welcome to join. We’re going to keep on going with public reports and small working groups, with people that we’re picking, but only because of agility in the collaborations. We’re picking people that we know of, or that we have been in touch with, coming from a variety of experiences. We’re talking to creators of AI in Academia, large corporations, small corporations, startups, lawyers, people who work with regulators, think tanks and lobbying organizations… We’re talking to experts in other fields, like ethics and philosophy. We keep on chatting with – we have identified six stakeholder categories, and we’re trying to have our representations also geographically distributed, from North America, South America, Asia-Pacific, Europe and Africa.
Last year we had conversations with about 80 people, representatives of all these categories, in a private group, just to get things kickstarted. And we have had meetings in-person starting in June in San Francisco, in July in Portland, and then other meetings in Bilbao in Europe… We had meetings in-person with some of these people, going at different conferences… But starting this year, this first half of the year, we’re going to be super-public. We’re going to be publishing all the results of the working groups, and we’re going to be taking comments on the forums, and then we’re going to have an in-person meeting - we’re aiming late May, early June - with at least two representatives for each of the stakeholder categories, to get in a room and iron out the last pieces in the definition, removing all the comments, and come out of that meeting with a release candidate. Something that we feel like there is endorsement from a dozen different organizations across the world, and across the experience.
Then we’re going to use - and we’re raising funds for it, to have at least four events in different parts of the world, between June and the end of October. One of these events is definitely going to be at All Things Open. We’re going to gather more potential endorsements, and as soon as we get to five endorsements from each of the different categories, I think we’re going to be able to say this is version one, and we can start working with it and see where we land. And maybe next year we’re going to have - by that time, I mean by October, November, the board will also have a process for the maintenance of this definition… Because most likely, we’re gonna have to think about how to maintain it, how to respond to challenges, whether they’re technological or regulatory challenges, or just we missed the mark, and we realize later, and we’ll have to fix it.
Yeah. I kind of want to backtrack slightly, I guess, as I hear you talk about this, and kind of coming to a version of blessed sometime this year, based upon certain details… When I asked you - and I know this is your response, and not so much a corporate response… In terms of what’s the benefit of being an open source artificial intelligence, what’s the benefit of being open source AI - all this effort to define it, and then what if there’s not that many people who really want to be defined by it? I guess that’s an interesting consideration, is that all this effort to define it, but maybe there is no real benefit… Or the benefit is unclear, and then folks just – it’s almost like saying… It’s definitely a line, right? It’s like, “Okay, everything is basically not, and there’s very few that are”, basically. Or at least initially, and maybe as iteration and progress happens, that more and more will see a benefit, and maybe that benefit permeates more clearly than we can see it now.
[01:08:19.00] Yeah. I don’t want to think about that…
Okay. [laughs] “I don’t think about that.”
No… It’s one of those things… Like, if you start any endeavor thinking about a failure, you’re probably going to fail. So it’s not one of the outcomes that – I see a tremendous amount of pressure. I mean, it’s unlikely that that’s going to happen, that’s what I wanna say. I have had a lot of pressure from corporations, regulators… The AI Act has a provision in there, a text that provides some exclusions to the mandates of the law for open source AI, and there is no definition in there. So regulators need it, large and small corporations need it, researchers need some clarity… I hear a lot of researchers, they want data. And it doesn’t mean that they want necessarily the original data, some of them at least, but they do want to have a good dataset. And that only comes if there is clarity about what are the boundaries of what is allowed for them to accumulate data. Because data becomes very, very messy, very quickly. Privacy law, copyright law, trade secrets, illegal content, content that is illegal in some parts of the country, or in some countries and in other countries it’s not… It becomes really, really messy very quickly, and researchers don’t have a way to deal with it right now. They need help.
I agree that you should keep doing it. I didn’t mean to sound like it should be a failure… Sometimes I think it might be beneficial to think about failure at the beginning, because it’s like, well, you’ve gotta consider your exit before you can go in, in a way. And I’m not saying you should do that, but I’m glad you are defining it. It does need to be defined. I didn’t mean to be necessarily like “What if…?”, but there’s a lot of effort going into this. I can see how a lot of your attention is probably spent simply on defining this, and working with all the folks, all the stakeholders, all the opinion makers etc. that are necessary to define what it is. It’s a lot of work.
It’s all work. And you’re absolutely right, this is taking most of my attention. And yes, I do see a couple of failure options. We can fail if we’re late, and if we get it wrong. But for getting it wrong, the fact that it’s defined with a version number, I think we can fix it over time, and we really shouldn’t be expecting to have it perfect the first time. It’s changing too quickly, the whole landscape.
And the other, getting in late, is also part of the reason why I’m pushing to get something out of the door. Because a lot of pressure exists in the market to have something. Everyone is calling their models open source AI, recognizing that there is value in that term implicitly. But if there is no clarity, it’s going to be diluted very rapidly.
Before Jerod and I got on this call, one thing we had a loose discussion - and I quickly stopped talking, because we have a term… I think it’s pretty well-known in broadcasting and podcasting, is like “Don’t waste tape”, right? And I didn’t want to share my deep sentiment, although I loosely mentioned it to Jerod in our pre-call, just kind of 10 minutes before we met up… It was basically “What is at stake?” I know we talked just loosely here about failure as an option, and what is failure, and is it iterative on the version numbers you just mentioned… But is there a bigger concern at stake if the definition that you come up with collectively is not perfectly suited? Does the term open source in software now - is the term now fractured, because the arbiter of the term open source has not been able to carefully and accurately define open source AI? Is there a bigger loss that could happen? And I’m sorry to have to ask that question, but I have to.
[01:12:33.00] [laughs] Yeah, you don’t want me to [unintelligible 01:12:33.08]
Sorry about that.
[laughs]
I think so far we’ve been able to win, in quotes “win” in the public when we push back on the term of open source because it’s pretty well accepted, right?
Yeah.
And whether - and I’m gonna say this, but… Whether we like it or not, OSI has been the guardian, so to speak, of that term. Some say you’ve taken that right… I think you’ve been given that right over decades of trust… And in some cases there’s some mistrust. And that’s not so much me, it’s just out there in the – not everybody’s been happy with every decision you come up with, and that’s going to be the case, right? If you’re not making some enemies, you’re not doing some things right, I suppose, in the world… Because not everybody’s gonna like your choices. But I wonder that, I personally wonder - if you can’t define this well, does the term open source change, or is becoming open to change?
There is that risk, I’m aware, but that’s one of the reasons why I’m being extra careful to make sure that everyone’s involved, and has a voice, and has a chance to voice their opinion. And all of these opinions are recorded publicly, so we can go back and point at the place where we made a bad choice, and be able to correct, or not.
Stefano, real quick, what’s the number one place people should go if they want to get involved? The URL, “Here’s how you can be part of that discussion.”
Discuss.opensource.org.
Here we go.
It’s where we’re gonna be having all our conversations.
Alright, you heard it. That’ll be in the show notes. So if you are interested in this, even if you just want to listen and be lurking, and watching as it makes progress, definitely hit that up. If you want your voice heard, and you want to help Stefano and his team make this definition awesome and encompassing and successful.
Yes.
I think the more voices, the better, the earlier on, the better, so that we can have a great open source AI definition.
Thank you.
Thanks, Stefano. We appreciate your time. Thank you so much.
Thank you.