Practical AI – Episode #255

Data synthesis for SOTA LLMs

with Karan Malhotra, researcher at Nous Research


Nous Research has been pumping out some of the best open access LLMs using SOTA data synthesis techniques. Their Hermes family of models is incredibly popular! In this episode, Karan from Nous talks about the origins of Nous as a distributed collective of LLM researchers. We also get into fine-tuning strategies and why data synthesis works so well.


Sponsors

Read Write Own – Read, Write, Own: Building the Next Era of the Internet—a new book from entrepreneur and investor Chris Dixon—explores one possible solution to the internet’s authenticity problem: Blockchains. From AI that tracks its source material to generative programs that compensate—rather than cannibalize—creators. It’s a call to action for a more open, transparent, and democratic internet. One that opens the black box of AI, tracks the origins we see online, and much more. Order your copy of Read, Write, Own today at readwriteown.com

Fly.io – The home of Changelog.com — Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Notes & Links


Chapters

1 00:00 Welcome to Practical AI (Dance Party!)
2 00:43 Karan Malhotra
3 01:57 Origins of Nous Research
4 10:24 What is synthetic data
5 16:47 Effects of model licensing
6 22:23 Map of Nous
7 26:45 How is Nous organized?
8 30:41 Sponsor: Read Write Own
9 31:48 Fine Tuning advice
10 35:00 Stuff to look for
11 40:45 What's next?
12 45:03 Thank you!
13 46:00 Outro (Dance Party!)

Transcript



Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the CEO and founder at Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

Doing great today. It was nice seeing you a few days ago in person.

In the flesh.

In the flesh.

Yeah, that was great. I think you posted a picture on LinkedIn, so if anybody doesn’t know what we look like and has some crazy reason to want to know, there’s a smiling mug of us on Daniel’s profile.

Yes, yes. And the reason we met is I was on a client visit on-site, and we were prototyping out some stuff, like chat over your docs, and natural language to SQL stuff, and all sorts of things with Prediction Guard… And one of the models that we were using was from Nous Research, and that works out great, because we have Karan Malhotra here, who is from Nous Research, co-founder and researcher there. So welcome. Glad to have you, Karan.

Hey all. Thanks for having me. I’m extremely excited to chat with you guys.

Yeah. Like I said, I’m a huge – well, this is our first time meeting, but I feel like we’re already friends, because I’ve had so much of my own benefit and interaction in working with models from Nous Research. A lot of amazing models that you’ve posted on Hugging Face, and research that you’re doing.

I’m wondering if you could just give us a little bit of a background about Nous specifically, and kind of how you came together as researchers, and started - to me, from the sidelines, it seemed like “Oh, all of a sudden there’s these amazing models on Hugging Face, and I don’t know who these people are, these Nous Research people, but they’re amazing.” So give us a little bit of the backstory there.

Absolutely. Yeah. So just as a general overview, we are one part like open source research organization; we put these models out for free, we put a lot of research out for free, some datasets, so people can build on top of these open models. On the other hand, we’re very recently a company as well, a C Corp. So we’ve been working pretty hard, after getting some seed funding, on building together some exciting stuff. I won’t go too into it during the overview point, but we’re continuing to do our open source research, and development, and release of models indefinitely.

The way we started is very interesting, and it would be pretty out of nowhere to the outside, for sure. It was extremely fast for us. We’re a collective of people who have been playing around in the open source language model space for a while, ranging from like GPT 2 release, to LLaMA release, to like the first Transformers paper… We’ve got people from various eras of gen AI, of when they came in. And for myself, it was GPT 2. I stumbled upon a colab notebook and started fine-tuning, made some Edgar Allan Poe and Lovecraft tunes…

I’ve done the same. That’s awesome.

And we just got pulled into this world of “Look at these next token predictors that are just managing this [unintelligible 00:03:56.13] together the most wonderful and amazing stories.” That slowly turned into a deeper, and deeper dive of “Well, how can I use this for learning information? How can I learn to use this for production, and automation?” It’s evolved over time.

For us, we started off just working with different open source collectives, actually. Once Open AI kind of released GPT 3, and had closed-sourced it - you know, we were used to open source GPT 2. We were like “Oh man, what are we going to do? How are we going to continue to play with the level of customization and interactivity that we had with GPT 2?” Then Eleuther had released GPT-J 6B. The KoboldAI community, this community of people who tuned and ran inference on models, started to pop up, I think around 2020-2021, in the face of this. So a lot of us started to have places to centralize and play with these models. We got to contribute, and learn how to become better open source AI developers etc.

Eventually, there was a need for more concrete organizations to do this kind of focused work on the creation of these models. We were stuck with like okay architectures for a while, like Pythia, but thanks to Meta - you know, we wouldn’t be here without Meta, I’ll say that. First and foremost.

The great LLaMA.

Yeah. Prior to LLaMA, everyone’s like “Oh, Facebook - evil. My data” etc. And here we are, they are kind of like the shepherds of this new era of the open source AI movement. So when LLaMA came out, there was a paper that came out called Alpaca, by a Stanford lab. And this was about distilling data from bigger models, like GPT 3, ChatGPT, GPT 4, and being able to train smaller models on that distilled, synthetic data; something they called instruction data. So the Alpaca format really opened up the playing field for everybody to start making these instruct-style models, these actual for-prod use style models.

[00:06:00.08] So there was an idea I had in my head of like “Well, the Alpaca guys are using only GPT 3.5 outputs. What if I only generated GPT 4 outputs? It will be a little expensive, but you’ll probably get a better model out of it than Alpaca.” At the same time that I was looking at this, there was a guy on Twitter named Teknium, who had just started putting together his own synthetic dataset based off Alpaca, using GPT 4 only as well. So I was working with a group at the time called Open Assistant, under LAION. They’re a really big nonprofit. And while I was working on that, we had some GPUs they were cool with us using towards the development of new models.

So I reached out to Teknium and I said “Hey, I have a little bit of compute. You have GPT 4 data in the same format, I have GPT 4 data in the same format. Let’s train a model.” So we trained a model called gpt4-vicuna. This model was on the Vicuna fine-tune; we fine-tuned the fine-tune, basically. The Vicuna model was an Alpaca-style fine-tune, and we tried our dataset on top of it. It was good, it was okay… But then we thought “We’ll probably get a better result if we just train on the base LLaMA model.” And the resulting model was the very first Hermes model.

Gotcha. The OG.

The OG. And that’s kind of how it started to come together, was we both had a data thesis on “Use GPT 4 only, and follow Alpaca.” And we trained on LLaMA, and we got Hermes. And we didn’t know what benchmarks were; we didn’t know anything about any of this stuff. We just made a model. And it got a ton of attention. We put it out under this name, Nous Research. Nous comes from the Greek word for intellect. We thought it would be a good name for an AI company. [laughter] But it was just a place for fun projects, and fine-tunes, and stuff. It was just a name we were using for our collaboration. And people started swarming and asking “What’s Nous Research? What’s this sudden, mystical open source organization that put out this best model?” And we’re like “Best model? We just tried something.” It was really organic. And it got to the point that people started telling us “You must have trained on the benchmarks. These are doing too well.” And we were like “What’s benchmarks?” [laugh] We were not really coming from an academic place as much as from like an enthusiast that became so committed that it became our life. It became our day to day.

So from there, people started to ask us “Can I join Nous Research?” Now, there wasn’t a Nous Research to join. It was just two guys, right? What ended up happening was we formed a private Discord server, and we thought “There’s a lot of people, who range from somebody who’s like 16-17 years old, savant on Twitter, hasn’t even been to college yet, insane at transformer stuff, to mid 30s, working a really, really good FAANG-esque job, and just wants to really create and let loose.” That was another class of volunteer. And then you have the older gentleman, who has already exited a company or something, who has just been playing with code for a while and wants to jump in and hang out.

So we ended up being this really eclectic group. We don’t know what your name is, we don’t know what your race is, we don’t know your gender, or anything. It’s just Discord profile picture, Twitter profile picture, right? So we came together, grew to about like 40 people, all working together on various different projects, like Hermes tunes, data synthesis, the Capybara series, context length extension etc. And just from this kind of interaction between Twitter and Discord, and bringing people in that we thought were cool, we ended up becoming what people would call an open source research org. [laughs]

Yeah, you sort of stumbled into creating this amazing research organization which is ruling the world, which is awesome.

It’s what Open AI might have been…

Well, yeah…

[00:10:04.05] That’s really sweet. Thank you guys.

Yeah. And I love it, it’s so cool to hear that story and that background… And I see, in my own sort of little snapshots here and there, connecting that in my mind over the past couple of years, as I’ve seen you all post different models and that sort of thing… This is something we’ve definitely touched on on the show before, but some of our listeners might not kind of fully grasp when you say these sort of like synthetic datasets that you’re focused on, in this Alpaca format. Could you kind of explain a little bit – we’ve talked a lot about fine-tuning, and preference tuning, and RLHF, and different things… But what does it specifically mean, that you would take synthetic data? What does that mean in your case, and like why does that result in something good in fine-tuning an open model? People might think “Oh, this is synthetic data. Why should I expect it to be any good?” So could you kind of help explain that subject a little bit?

Yeah, absolutely. I mean, out of context, synthetic is like as meaningless as artificial, right? Data is data. But in this case, it’s referring to a particular class of data that’s been generated by another language model, or another AI, another diffusion model etc, that can actually be used to further train models. Now, you might say, “Why would you want to do something like that? How is it helpful?” What was important to us is we were all GPU-poor. We were all running on laptops, or maybe a 3090, maybe a 4090. As individuals, we don’t have data centers. So training or even tuning a large model in the early days, like 70 billion parameters, something like that was just unfeasible for us. And knowing that GPT 3 has something like 175 billion parameters, and 3.5 and 4 can only go up from there, the question became “How can we make these small 7-billion parameter models even compete with these massive, massive ones?” These ones that I want to run offline, these ones that I might want to run on an edge device, on a phone, on a drone etc. How can I make them even useful? So there’s two things to talk about here. One is synthetic data, and the other is distillation.

So synthetic data is just referring to any kind of data that’s created by a model, in this case. And the reason that’s useful is in particular distillation. So if I told you to go study comp-sci for 10 years, for example, and put in that massive time investment, and really focus on general programming. And then I told you “Now it’s time for you to learn about AI, and transformers and stuff” and put you through all the math prerequisites etc. you’re gonna come out with like a really strong foundation of how to do the work. But the problem is, you’ve put in a massive time investment.

Now, if I take that guy, who’s spent 10 years doing engineering, then another five years doing AI, and I ask him “Hey, can you teach somebody just the really important, compressed tidbits that will help them just get up and running to do the work?” That’s data distillation. That’s knowledge distillation.

So you look at these big models, like a Claude, or 70B model, or GPT 4, and you can see they’re amazing, they’re brilliant at everything. They have a bunch of high-quality data they’re trained on, and they have a bunch of low-quality data they’re trained on, that they can interact with and express in a high-quality form. So instead of me having to read a massive 10-pager for why some chemical reaction or some tax-based process works, whatever you want it to be - instead of reading a massive document on that, and then feeding that to a language model, we can just have that really smart model that already understands it really well compress that information into an instruction, or into a conversation, into like two sentences, three sentences, five sentences, half a page. And we can just train a much smaller model on that compressed information, and it will learn the compressed information, to the degree that a language model learns something; not perfectly, but…

[00:14:19.21] Because of that, what the Alpaca guys did was they generated a bunch of seed tasks from GPT 3.5 on various different domains and topics, and created these kinds of compressed instructions, with an instruction, an input question from the user, and then an answer. So the instruction could be like “Given the following math equation, explain step by step why this is the answer.” And then the input is the equation, which is your question, and then the output is the compressed answer. So all of that, we can take as one sample in the dataset, and we can make hundreds of thousands or millions of samples like that, of various different domains and various different tasks.
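
To make that concrete, here is a minimal sketch of what one such sample might look like (the content is invented for illustration; only the instruction/input/output fields and the "### Instruction / ### Input / ### Response" prompt layout follow the Alpaca convention):

```python
# One Alpaca-style training sample (illustrative content).
sample = {
    "instruction": "Given the following math equation, explain step by step why this is the answer.",
    "input": "2x + 6 = 10",
    "output": "Subtract 6 from both sides to get 2x = 4, then divide both sides by 2, so x = 2.",
}

# Many fine-tuning pipelines flatten each sample into a single prompt/response string:
prompt = (
    "### Instruction:\n" + sample["instruction"] + "\n\n"
    "### Input:\n" + sample["input"] + "\n\n"
    "### Response:\n" + sample["output"]
)
print(prompt)
```

A dataset is then just hundreds of thousands of records like this, one per task.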

So the Alpaca guys did this with less than 100k examples, I believe, and they trained the LLaMA models on these, and they found massive boosts to performance, that this distilled information, like a human, successfully compresses and transfers over. So when I saw that, and then independently when Teknium saw that, and then independently when many others saw that, we were like “This is so intuitive. This is exactly how I’ve learned anything, by just going on Discord and Twitter and bothering people to give me the compressed bit of how I do something. We should try doing this with even higher-quality models than 3.5.”

So we created - I can’t remember the exact number at the moment, but at least 50,000, maybe 100,000 examples originally, for Hermes 1, like this, just using GPT 4. And then we trained on that, and ended up getting an extremely, extremely massive performance boost compared to the other models that were not trained using this kind of method.
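
As a rough sketch of what that kind of distillation loop looks like in practice, here is one way you might generate a single synthetic sample from a stronger teacher model via the OpenAI API; the model name, seed task, and system prompt are illustrative assumptions, not the actual Hermes recipe:

```python
# Sketch of generating one synthetic instruction/response pair from a teacher
# model. Model name, seed task, and system prompt are illustrative only; a real
# pipeline adds many task domains, deduplication, and quality filtering.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_task = "Explain, step by step, how to solve 2x + 6 = 10."
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer concisely and show your reasoning."},
        {"role": "user", "content": seed_task},
    ],
)

sample = {
    "instruction": seed_task,
    "input": "",
    "output": response.choices[0].message.content,
}
print(sample)
```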

So without these giants that have already established themselves in this space, we wouldn’t be here. Without Open AI, without Meta, we literally wouldn’t have the model and the data to do the kind of work that we did to make Hermes.

What it allowed for us is like for local models to finally be comprehensible, and for us to finally have offline capabilities, to kind of take the good stuff from something like GPT 4 or something else and make it uncensored. So it still has all this understanding of all these topics, but it doesn’t have all that RLHF inside it necessarily, that safety-izes it, so that when people utilize the model, it has all this intelligence, but it has more freedom of thought to kind of converse with you on topics that Open AI may reject.

Gotcha. One of the things I was curious about as you were going through that was a few episodes back Daniel and I were kind of talking about the effect of model licensing on the community, and the different kind of licensing concerns that were coming out from whether it be Meta, Open AI, you name the organization… Is that ever a challenge for you, since you’re kind of using those to get started in terms of the inputs? Has that been a concern, or do you anticipate it being a concern?

I think that, of course, generally, US and international regulation on this stuff is evolving; the conversation is evolving very much. So naturally, you have to keep it top of mind; you have to think about these kinds of things. But thankfully, because all of our model releases are like open source, and we don’t profit from them… Like, if somebody goes off and creates a product using our model, good for them, but we don’t necessarily take on that liability or that worry of saying “Hey, we’re gonna sell you this model that was created with GPT 4 outputs.” We actually actively try to stay away from doing that. But because the data distillation paradigm is so effective… You know, if a model comes out that’s better than GPT 4, and it’s open source, and I can use it locally, and in their TOS it says “You can use this to make a commercial model”, then we can apply the same techniques that we’ve been preparing and researching and understanding from these closed models, and use it there.

[00:18:11.01] So right now, we don’t stand to, or try to, or have any plans to profit from using any of these outputs. We’re not about that, because we want to be careful and respectful of these model creators and these companies. But that being said, we’re learning all these techniques and developing all these techniques that will be useful for when that time comes, and for when that’s available, especially with the advent of something like Mistral. If we do distillation from a Mistral model, like Mistral Medium, or something like that, that’s completely, from my understanding - barring their TOS saying otherwise, but I believe it doesn’t - it’s completely okay in that situation for us to create models like this, that can be used commercially etc. Regarding the TOS stuff though, as much as we err on the side of caution, I’d find it hard to see a company enforce their TOS when these larger models are likely trained on not all copyright-free stuff. I’d be hard-pressed to believe that these closed source companies’ models are totally copyright-free, and totally copyright-clean.

So if some other company that was feeling a little more rambunctious than ourselves was to say “We’re going to commercially release on this”, I imagine it’d be difficult for them to be come after without the other group opening their books. And there’s actually a pretty interesting interaction that happened regarding this between Google and Open AI, if you guys are familiar. [laughs]

Yeah, I saw this interesting picture the other day, it was like “The interesting web of AI”, and it was like how Microsoft, Google, Open AI – it’s like on one side there’s the ones, and it shows how they’re connected to the other ones, this visualization, and how many of them overlap in these strange ways between, whether it’s Together, or Mistral, or Meta, Google, Microsoft, Open AI… This sort of very interesting web of connections, that probably makes some of these things rather difficult.

We’ll leave it for the lawyers to sort out.

Yeah, that’s the thing, we can look at an example. You hear that phrase like “Good artists copy, great artists steal.” So the data distillers - we’re copying. We’re just distilling this information; we’re trying to make our models more like those, and we don’t really plan to commercialize, we’re just doing it for free for everyone. But the great artists are, you know, Google. You look at Bard, and it tells you “I was made by Open AI.” Now, it’s fine for our open source model to say “I was made by Open AI”, because we’re very transparent about this is trained on GPT outputs. But when Bard violates the TOS with a paid product… [laughs]

Yeah, that says “I was trained by Open AI”, right? You’d think that Open AI would come after this multibillion-dollar company immediately. Instead, you see a tweet from – first you see Google deny it, then you see a tweet from Sam Altman, which was something along the lines of (I’m paraphrasing) “I’m not mad that they trained on our outputs, I’m mad that they lied about it.” And I’m sitting there like “Okay, you’re mad about this, but aren’t you gonna pursue the legal action in your terms of service?” No, no. Because everyone would have to open their books up, too.

That being said, I don’t condone the commercial use of that kind of stuff, making a paid model from GPT 4 outputs. I wouldn’t advise anyone to sell a model made with them, just because we want to respect people’s TOS and stuff; they worked hard, and spent billions to make this stuff, or hundreds of millions, however much they spent. But there is certainly room for hypocrisy in that realm of the large corps.

[00:22:04.17] So that’s my thoughts on the licensing stuff, and that’s definitely my own individual thoughts. We’re a pretty decentralized collective at Nous, so you’ll find people with all sorts of opinions, all over the place… And as a company, we don’t hold any view whatsoever on that.

Yeah. I’m wondering - maybe this gets a little bit to the distributed nature of this, but I know that there’s sort of various collections of what the Nous Research Group has done over time. You mentioned Hermes, but then there’s these other kind of categories of things too, like the Yarn models, Capybara, Puffin, Obsidian; just looking over the Hugging Face now… I’m wondering if you could just give us from your perspective a little bit of a map of these different things, and how people might categorize the different collections of what Nous has done. I definitely want to talk about the future things and ongoing things as well, but as it stands now, what are the kind of major categories of what the collective has invested their time in over time?

Certainly, certainly. So within the stuff that’s viewable on Hugging Face at least, we’ve got the Hermes series, of which - I told you guys the initial story of how it went down, but from there, Teknium kept going. I haven’t personally had any interaction with the Hermes model since the initial one. From there, Tech just continued to create more and more synthetic data, collect from more and more sources, use more and more open datasets… And he’s just got the, I guess, award-winning data thesis. The guy really knows how to go about curating and synthesizing good data.

So Teknium - it’s his baby, the Hermes project. So everything you’ve seen since is really his work, and that of anyone who has kind of collaborated with him… Although you can’t call anything a solo project, because open datasets were used, too. Everything is built on the shoulders of giants, and the shoulders of each other as little people… But Tech really has helmed the Hermes initiative so far. I think that’s our most popular model series, and he released the Open Hermes as well, because we had some data in the original Hermes that we never released publicly, and we wanted to make that kind of an option for everybody. So that’s Hermes… It still follows the same kind of philosophy of synthetic data, and it now uses the ChatML format, instead of the Alpaca format. It’s what we kind of upgraded to.
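
For anyone who hasn't seen ChatML, here's a minimal sketch of how a conversation is laid out in that format (the message content is invented; the <|im_start|>/<|im_end|> turn markers are the ChatML convention):

```python
# Render a short conversation in ChatML; content is illustrative.
messages = [
    {"role": "system", "content": "You are Hermes, a helpful assistant."},
    {"role": "user", "content": "Summarize knowledge distillation in one sentence."},
]

def to_chatml(msgs):
    # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>, and the
    # prompt ends with an opened assistant turn for the model to complete.
    rendered = "".join(
        "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>\n" for m in msgs
    )
    return rendered + "<|im_start|>assistant\n"

print(to_chatml(messages))
```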

Then you’ve got Capybara and Puffin, which are both done by a volunteer and OG member, LDJ. You may be familiar with Luigi Danielle Jr. So the Capybara series was using the Amplify-Instruct method, this novel method that LDJ had worked on alongside another one of our researchers, J. So LDJ and J - it can get confusing, but the two of them worked on the Capybara series, created the dataset, trained the models. And then Puffin was the idea of using handpicked, smaller samples from some of our larger datasets to make sleek datasets for an easy tune, and see how that works, kind of in the spirit of the LIMA paper, where they just used a few examples to get really good results.

Those are really the popular tunes using synthetic data for like general use. Yarn is a context length extension method - novel at the time of its creation - by [unintelligible 00:25:28.00] and EleutherAI. So what happened there was these guys were already looking into context extension for a while, and when we kind of came under the Nous banner to do the work, it opened up a little bit of resources from compute sponsorships, it opened up a more centralized place for them to be able to do that collaboration…

[00:26:00.15] I had no hand in the Yarn models whatsoever. And that’s the exciting thing, is everyone really gets to work in their own spheres, in their own kind of autonomous circles, and then we just check in and see “How’s the research going? How’s it coming along?” Because we really work with people that we heavily believe in, and we believe in their idea… So if we don’t already have an idea, we kind of just say “Please freely create, because we brought you in, because what you will freely create will push forth our agenda anyway.”

So I think those are our big model releases and series that we have available. Outside of that, we have a bunch of stuff on our GitHub as well. Stuff that’s being worked on, stuff that hasn’t necessarily come out yet… There’s a lot of that. [laughs]

So I’ve got a question for you as a follow-up. It’s pretty fascinating the story that you’ve been telling us here, because of that kind of organic creation of the organization, or collective… And I’m wondering, as you’ve done that and you kind of went through and talked about the different model groups, and kind of talked about the owners or spiritual owners, if you will, of each of those families, how do the different members of the collective interact to kind of share? How do you each push each other along, or share information, or give ideas, so that cross-family efforts can kind of benefit from the overall collective? …and as you said, now a C Corp, and you guys are more organized at this point. So what kind of culture has developed around those communications and learnings?

Yeah, absolutely. I mean, when it started, it was just like a small Discord. Maybe like 10 people. From there, we kind of created more channels, as people wanted to work on more things… And we had initially split up into three or four different topics or sectors that people could assign themselves to. One being data synthesis, of course, so we can kind of find new, novel methods and formats for distillation, and the creation of synthetic data. One being training, like people who are just like really good at training, hyperparam stuff, and people who will come up with new architectures and new techniques. Another being agents - a group of people who want to actually try to build tools, and do autonomous work with this stuff…

And then we had this one category that - it was a prediction for the future of simulation. So we had people that were very interested in kind of bringing this stuff into simulation, into Unity, into kind of seeing how all these things came together. And it was interesting, because the training built on the data synthesis, the agents built on the training, and then the sim would build on the agents. It was kind of the idea. So everybody needed to work together, because all those things are so intrinsically connected… But people would have specializations on kind of where in that workflow they wanted to work.

We didn’t end up doing a lot on the sim side of things. Now, recently, there’s a lot more interest, because we have a lot more capability, generally, as the AI community does… But as we’ve grown to - we went to 40 people, it was fine. Now we’ve gone to like 5,000 people in the Discord… It’s a little unwieldy there. So what we do is we kind of tier people in. You come into the Discord, you can see maybe two channels. And then we’ll give people a developer role. We don’t really let people select their own roles, because we want to make sure we can kind of sort through people we know, to kind of let them through… And even as we do open source research, a lot of it is unreleased, and we want to make sure that it’s kind of protected before release. So we created this developer role so people can then see like way more channels of just general development, and development conversation.

And from there, as we see contributors who have started to do more work, or show more passion towards contributing to Nous in a particular field, or who have some reputation or some portfolio in a particular field, then we’ll assign them one of those roles. And that will open up the family of channels relating to those roles, and our current projects surrounding that role. So like data synthesis projects, agent projects, training projects etc. So we kind of just tier it out, so people can interact.

And people who have been around for a while, or people we consider fellows, or part of the cohort, they can usually see pretty much everything. So they’re pretty effective in serving as coordinators for the cross-communication between these different channels and groups. And even if someone has a particular role or some channel has a particular role it’s supposed to be a part of, it’s still Discord, and we’re still very chill. So people will still work on like various different overlaps inside of just one channel as well.

I have a selfish question, which now that – this is one of the advantages of doing the podcast, I get to talk to all the amazing people doing amazing things, and learn from them. But I’m wondering, as a person who is also trying to fine-tune some models, either just for my own enjoyment and learning, but also fine-tuning models for specific tasks, and in specific customer use cases and that sort of thing… There’s a lot of people out there, I think many of our listeners, who are thinking like - since you, being part of this collective, have worked since the sort of dawn of this proliferation of fine-tunes, from LLaMA etc. and as you’ve seen all that, as you’re doing more and more fine-tunes, now as you’re looking towards the future, do you have any kind of good advice or things to keep in mind for all those fine-tuners out there that are thinking about grabbing something off of Hugging Face, creating their own versions of these models, maybe they have their own ideas about a specific take on a model? Any general tips that you’ve found to be really useful over time, or like pitfalls that you’d like to highlight?

Yeah, I mean, I can try to think of a few off the top of my head. I’ll say that hyperparameters are really important, and it’s important to try to get that right. It’s going to vary from model to model, but a lot of the time – some people think hyper params don’t really matter as much to obsess over, and some people think it’s like a secret sauce as well. So I’d say try to do a lot of research into good hyperparams, a good learning rate.

I’d also say - I could be totally wrong about this, as I’m not the trainer of Hermes today, or a lot of these models, but something I personally believe in a lot is, like, ignore people telling you to only train for like X amount of time. If you’re not overfitting, just keep going, if you can. If you have the compute, keep training, and keep going. Train for more tokens, more epochs. That’s something I heavily believe in.

In terms of trainers to use, there’s a lot of people who make their own scripts for specialty stuff, and of course you can just use Hugging Face… But the library we use is called Axolotl, like the animal, by Wing Lian (Caseus) of the OpenAccess AI Collective. We think Axolotl is probably the best general-purpose trainer for [unintelligible 00:34:27.04] fine-tunes etc. It, like any open source repository, is buggy, and there’s stuff you’re gonna have to work out… But it’s, in my opinion, probably the easiest and most effective trainer to use for pretty much any model architecture available right now. So I’d definitely point everybody towards Axolotl.
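
Axolotl itself is configured through YAML files rather than code, so as a rough illustration of the hyperparameters being discussed here, the sketch below uses the plain Hugging Face Trainer API instead; the base model, dataset file, and every hyperparameter value are placeholder assumptions to adjust for your own setup, not a recommendation from Nous:

```python
# Minimal supervised fine-tuning sketch with the Hugging Face Trainer API.
# Base model, dataset file, and hyperparameter values are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "mistralai/Mistral-7B-v0.1"  # hypothetical choice of base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Expects a JSONL file with a "text" field holding already-formatted samples.
dataset = load_dataset("json", data_files="instruct_samples.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

args = TrainingArguments(
    output_dir="hermes-style-sft",
    learning_rate=2e-5,              # worth researching per model, as discussed above
    num_train_epochs=3,              # keep going if you are not overfitting
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```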

Awesome. Yeah, that’s super-useful. We’ll share some links in our show notes as well, so people, make sure and check that stuff out. Another kind of interesting question - I think we saw these waves of models that came out maybe around synthetic data fine-tunes, or other types of fine-tunes, I see this interesting sort of thing happening over the past however many months, not that long in the scheme of things, but in the AI world maybe a while, where we’re kind of now – there’s a lot of interesting approaches, more so than just fine-tunes, but like mixture of experts, and merging, and of course, multimodal stuff coming out, now I see Nous kind of dabbling in that… You don’t have to answer for the whole collective, but as there’s so many of these things coming out and different approaches, what are some of the things within that – it doesn’t have to be one of those, but what are some of the things on your mind kind of moving forward? Or on Nous’es mind, kind of more generally.

Sure. I’ll try to go from simple to complex on the kind of stuff.

[00:35:59.18] That sounds great.

I think that definitely just like straight up instruction tuning is great. There’s other ways to tune, like the Evol-Instruct method. I would advise people to try to create new instruction methodologies, that allow us to make even better formatted data. People don’t spend enough time trying to create new instruct formats. And we’ve definitely been too swamped to do that ourselves as well. So I think towards the general community, it’s a really easy place to get started; you don’t need to really know how to code, so much as think about how a human might more effectively phrase something, or format something, and kind of remix from there. I think that’s probably the easiest place to start.

Then there’s model merging. Model merging is great. You can just like take two models and Frankenstein them together, with question-mark results. You’ve got to just try and see what happens, and feel it out. Then from there, I would say there’s stuff like DPO, there’s RLHF… DPO kind of rewards things; that can let you enable rejections, or create censorship, or put some kind of general concept or attitude towards a model. We’ve found that to be pretty effective with the latest Nous Hermes Mistral DPO. It seems like people really like it and prefer it over just SFT. So that’s another thing that I’d heavily recommend.
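
As an example of the simplest kind of merge being described, here is a sketch that linearly averages the weights of two fine-tunes of the same base model; the model names are hypothetical placeholders, and dedicated merging tools support fancier schemes than this equal-weight blend:

```python
# Linear ("Frankenstein") merge of two fine-tunes that share one base
# architecture: average their weights parameter by parameter.
# Model names are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("your-org/finetune-a", torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained("your-org/finetune-b", torch_dtype=torch.float32)

state_b = model_b.state_dict()
merged = {
    name: 0.5 * param + 0.5 * state_b[name]  # equal-weight blend; the ratio is tunable
    for name, param in model_a.state_dict().items()
}

model_a.load_state_dict(merged)
model_a.save_pretrained("merged-model")  # question-mark results: evaluate before trusting it
```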

From there, we get a little more complex. We have some reward model stuff we’re working on that I won’t speak to just yet, outside of saying we’re working on it, that we think is going to be like pretty big for reasoning boosts. Of course, there’s techniques like chain of thought, and tree of thought, for like multi-step prompting. Creating datasets even out of that for any of these purposes I’ve already mentioned is going to be really effective.

Now, to stuff that maybe not everybody can – actually, a lot of people would already be able to do this. There’s something that we like to call over at Nous activations hacking, where you’re kind of messing with the way that a model – I’m trying to think about how to say this in like the most layman’s terms… You’re trying to mess with how a model generally vibes about something. [laughs] So rather than just doing a system prompt or something like that, you can actually change the model vectors to kind of be like more political about something, less political about something, more terse, or more specific… And it has far more effect and control over a model than a system prompt. It’s basically like a system prompt that tells it to embody certain characteristics, but it’s not something you can really jailbreak or get around, as far as my testing has shown. Certainly not as easily as a system prompt. We have no problem jailbreaking even the most censored closed models today. It can be done by anybody, with the right words. But this activation stuff, it really creates a bit more of a robustness and fidelity to the concepts that you’re trying to tell it to embody.
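
Here's a rough sketch of the kind of activation-level steering being described, using a PyTorch forward hook to add a fixed vector into one layer's output; the small GPT-2 model, the layer index, the scale, and the random steering vector are all placeholder assumptions (in practice the vector is usually derived from contrasting prompts), and this is not Nous's actual implementation:

```python
# Sketch: steer generation by adding a vector to one transformer block's output.
# GPT-2 is used only to keep the example small; layer index, scale, and the
# random steering vector are placeholders, not a real recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

layer = model.transformer.h[6]                      # a mid-depth GPT-2 block
steering_vector = torch.randn(model.config.hidden_size)
scale = 4.0

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + scale * steering_vector
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(add_steering)

inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore normal behavior
```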

There’s a few more I’m trying to think of that would be useful for people… One thing is soft prompting. It’s not really around anymore. It used to be pretty big during the GPT-J, like pre-LLaMA days, and the KoboldAI guys really pioneered the use of it in the open source community. But a soft prompt basically takes like a massive prompt and it compresses it down to like way less tokens. So you can give your model like a huge prompt, a huge system prompt, or a huge amount of information, and use like way less tokens. So soft prompting is cool. It’s not going to be too difficult to update it for like LLaMA, Mistral, today’s architectures. Nobody’s really done it that I’ve seen. So to the community, if you guys do that, please share. [laughs] That’s actually much easier than the activation stuff, I think.
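
For the curious, here is a minimal sketch of the soft prompt idea: a handful of trainable embedding vectors prepended to the real token embeddings while the base model stays frozen. The small GPT-2 model and the sizes are placeholder assumptions:

```python
# Sketch of a soft prompt: trainable virtual-token embeddings prepended to the
# input embeddings of a frozen base model. Sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # only the soft prompt would be trained

num_virtual_tokens = 20
soft_prompt = nn.Parameter(
    torch.randn(num_virtual_tokens, model.config.hidden_size) * 0.02
)

def forward_with_soft_prompt(input_ids):
    token_embeds = model.get_input_embeddings()(input_ids)
    prefix = soft_prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
    return model(inputs_embeds=inputs_embeds)

ids = tokenizer("Soft prompts compress long instructions", return_tensors="pt").input_ids
out = forward_with_soft_prompt(ids)
print(out.logits.shape)  # (1, num_virtual_tokens + sequence_length, vocab_size)
```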

And then finally, probably the hardest unsolved is sampling methods. Today we use like Top-K, Top-P, [unintelligible 00:39:52.10] sampling etc, whatever. There’s better ways to pick tokens, for sure. There’s better ways to judge the value of tokens, for sure. Everyone has been too concerned with higher levels to get that low, and do whatever the magic math is that I can’t do, that would enable some steering, and some – even beyond steering, like alternative sampling paradigms. And I think that would probably bring the biggest change and transformation to literally all models, regardless of the tune, regardless of the architecture etc. if it gets pulled off. So I’m really looking forward to something like that happening in the space.
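
Since top-p comes up here, a quick sketch of how nucleus (top-p) sampling picks a token from one step's logits may help ground what today's default methods do; the toy logits are made up:

```python
# Nucleus (top-p) sampling over a single step's logits.
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens that lie entirely past the p mass (the top token is always kept).
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize the kept mass
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_ids[choice])

# Toy example with a five-token vocabulary.
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_top_p(logits, p=0.9))
```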

[00:40:32.03] That was a lot of really good advice that you have there. I was sitting there trying to take notes while you were talking through it, and everything. Going “Wait, but he said that too, and he said that too.” A really good answer there. Thank you for that. As we’re starting to wind up here, I wanted to ask you… As we’re recording this, it looks like it was just over three weeks ago - about four weeks by the time we release this episode - that you guys announced your $5.2 million seed financing round, so congratulations on that. That was pretty amazing.

Thank you.

And I’m kind of wondering… So you’ve kind of started with this kind of fairytale story of kind of organically building from the ground up; you know, yourself, you connected with somebody else, a few other people joined, you get to thousands of people contributing, and really producing amazing work. And then you’re incorporating, and now you got the seed round coming… Where does that lead you? It’s kind of a sky’s the limit kind of scenario, it seems, now that you’re kind of launching as a corporation, as you said. Where can you go from here? What do you anticipate over the next couple of years, or even several years out? What’s the vision? What do you want to achieve? You’ve come a long way so far. What’s next?

AGI. No, I’m just kidding. [laughter]

I’d believe you if you said, actually.

I mean, someone will do it…

And then you’ll distill the knowledge.

Then we’ll distill, and then we’ll run the AGI on your Neuralink, on your contact lens, or something. [laughter] But for us, there’s a huge focus on locality, there’s a huge focus on offline, there’s a huge focus on take the power back, run the model yourself, do everything at home… That’s big for us. And at the same time, of course, we believe in scale. But there’s this idea that there’s so much unsolved at the small model size; why don’t we do that before we go to a trillion params? Because we can scale those realizations.

But for us, there’s certainly a transformation and change in attitude, and in pressures from going from pure open source volunteer, to as well having kind of this more corporate branch get created as well. But that being said, it’s been pretty consistent, our ethos and our motivation for why we do this. And like you said, it really was organic, in the sense that we’re a product of the times, we’re a product of the atmosphere of the AI community. People have said nice things, like “You guys are setting the trend.” And it’s not really true, so much as the truth is we are one of many embodiments of the sentiment that the community has, and that the world has, we think.

[00:43:13.05] There’s more than one Nous Research in this world. There’s Alignment Labs, there’s Pygmalion, there’s Kobold; there’s people who have been around before us, people who will come along the way, people who have already formed since we have… And there’s lots of people who have kind of embodied the Nous Research ethos. And it’s not really just our ethos, as much as the overall community’s ethos. People who have come before us, people who will come along the way, who do very, very similar style of work as us, this kind of open work… And I think that’s got everything to do with the fact that this is what the people want. We’re just the everyman, just like everybody else. We’re not like billionaires, or all super ex-Facebook, or anything like that. We’re just a bunch of people who really, really care about this, who want to see everyone have access to language models, everyone be able to automate their lives, everyone be able to push their understanding of any topic to the next level. And our work, as we become an organization that’s looking to be a company, and create revenue etc. - we won’t let it tamper with or hinder any of the open source work we do. In fact, we want it to empower all of that work, because we believe that the tools and the developments and services that we will be providing as a corporation will only serve to better feed the entire open source community. We’re not really looking to suddenly make like a closed Hermes, or something like that. We’re more looking to create tools, and do research that makes your open Hermes far more effective, far better, and good enough that you may want to pay for that tool. [laughs]

It sounds like something I would pay for, that’s for sure. Yeah, it’s super-inspiring. I really appreciate you taking time, Karan, to talk with us. I thoroughly enjoyed this, because I am such a fan of everything you all are doing, and the community that you’ve built… So thank you for staying true to that culture and what you’re doing, and I’m really looking forward to seeing what happens in the future and where things head. And I hope that we can talk again and have Nous back on the show, and in a year, when of course everything will be different in the AI world, I’m sure you’ll still be doing interesting things. So yeah, you’re always welcome back on the show.

Thank you so much. It’s been a pleasure to chat with you guys. Thanks for being so candid. I’m glad we were able to kind of push our message forth more, and thanks for the validation you and the community have given us to keep doing this great work.

Alright, thanks. We’ll talk soon.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
