Practical AI – Episode #193

Stable Diffusion

get Fully-Connected with Chris & Daniel

The new Stable Diffusion model is everywhere! Of course you can use this model to quickly and easily create amazing, dream-like images to post on Twitter, Reddit, Discord, etc., but this technology is also poised to be used in very pragmatic ways across industry. In this episode, Chris and Daniel take a deep dive into all things Stable Diffusion. They discuss the motivations for the work, the model architecture, and the differences between this model and other related releases (e.g., DALL·E 2).

Welcome to another Fully Connected episode of the Practical AI podcast. In these episodes we keep you up to date with everything that’s happening in the AI community, and take some time to dig into the latest things in the AI news, and we’ll share some learning resources to help you level up your machine learning game. I am Daniel Whitenack, I’m a data scientist with SIL International, and I’m joined as always by my co-host Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

I’m doing very well, Daniel. Having a good day. Gosh, we’ve got cool stuff to talk about today…

Yeah… But the biggest question though, did you watch Rings of Power?

So this is the conflict in my family, because I mentioned in the last episode that I’m waiting. I’m being a good husband and a good dad till they’re ready…

Okay. I won’t give any spoilers, and I probably shouldn’t on the podcast anyway… But Chris and I, for our listeners, are both big Lord of the Rings fans.

Thanks for torturing me here at the beginning of the episode.

Yeah, no worries. Anything I can do. I won’t indicate one way or the other, so… Yeah. I mean, this isn’t revealing anything, but I was really interested in and kind of analyzing a lot of the visuals of Rings of Power as I was looking through it… And of course, Rings of Power, and Lord of the Rings in general, it’s set in a fantasy world of Middle Earth, and so there’s all sorts of interesting visuals and creative elements, a lot of them with a lot of effort put in from designers, and artists, and graphics people… And it got me thinking a lot more about Stable Diffusion, which is what we’re going to talk about today… Because really, this model - and it’s the latest in a series of models, but this kind of stream of models, these diffusion models are really kind of taking over and dominating a lot of the discussion in the AI community. And Chris and I thought it would be good to spend some time chatting about them in a lot more detail than we have in previous episodes.

[04:01] So if you’re wondering more about Stable Diffusion, what it means, what it is, what it can do, that’s what we’re going to dig into. Yeah, how have you been thinking about Stable Diffusion? Where has it been entering into your life, Chris?

So it is one of those – you know, we’ve been talking about kind of these different disciplines within machine learning crossing modalities and joining up, and we’ve had a pretty exciting year in terms of what’s happened already… And I think for me, as I know I’ve expressed to you offline, this is the most exciting thing. And not just for what it is, but for what may be to follow. I hope that listeners are as excited as we are, because this is one of those moments that I think is going to really turn into something quite wonderful, and it already is looking super-cool.

Yeah, for sure. So maybe it would just be good to set the stage for what Stable Diffusion is in terms of like what it can do, and the motivation behind it… Because it wasn’t created in a vacuum, right? This is kind of the latest model in a series of these so-called diffusion models… Which I think primarily are associated with - right now, or how they’ve got the most sort of attention is for text-to-image tasks. So you put in a text prompt, and it will generate an image corresponding to that text prompt. What are some of the interesting ones that you’ve seen, Chris, or these sort of images generated from text prompts that have been interesting for you?

I think some of the things that we’ve shared a little bit back and forth, and that are in some of these articles are pretty cool… Being the geeks that we are, and seeing things like Lord of the Rings showing up, blended with Star Wars characters in one of those… There’s one that has Gandalf and Yoda mixed together… They’re just fun. And so I’m enjoying the creativity out of it. But it’s really like - I can think of so many uses that aren’t necessarily just like cool imagery from a creative standpoint, that are really functional. And we can get to that later on in the conversation. But this is one of those that has popped up from time to time, that has a kind of sense of magic about it. And of course, it’s not, I’m sad to say… But it definitely, it definitely has that surprise/awe factor, and what you’re able to do as you look at how the different parts of the system work together. And I know we’re going to talk about that kind of workflow in terms of how the model works, but…

…the backend, what arises out of that is definitely surprising.

Yeah. And like you said, this sort of text-to-image stuff is maybe the most accessible thing for people to try, and so that’s what you’ve seen most. But I’ve seen really interesting integrations and demos of the model already, because you can do not only sort of just a raw text-to-image, but you could do sort of like inpainting; so you could freeze a part of the image and fill in the rest, or like recover parts of an image… Or if you have an image of a street and you want to take this person out, you could kind of remove them, and then fill in the gap… All sorts of interesting things like that you could do as part of the workflow.

And then there are also these sort of image-to-image tasks, kind of doing some sort of translation of image style, or something like that… But yeah, a lot of things are integrating the Stable Diffusion model… One of the reasons is because it’s open and people can access it.
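The inpainting idea mentioned above (freeze part of an image, fill in the rest) can be illustrated with a small sketch. It shows only the final compositing step, with nested lists standing in for images; the `generated` values are a placeholder for model output, and the function name `composite` is hypothetical. This is a simplification for intuition, not how the Stable Diffusion inpainting pipeline is implemented internally.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0 and take generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


original = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]    # 9 marks the object to remove
generated = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]   # placeholder for model output
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]        # 1 = region to fill in

result = composite(original, generated, mask)
print(result)  # prints [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
```

Only the masked position changes; everything the mask "freezes" is carried over untouched.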

Yeah, it’s fully open source… And I think, going back to what you were just talking about for a second there, I think one of the coolest things about it is you can change the representation that’s fed into the diffusion model. So as you said, from an accessibility standpoint, you kind of start with this, you know, writing the text out, and the trained model, which has been trained on so many things in human culture and civilization, has these great components to pull from within the trained model.

[08:18] But you mentioned the image-to-image, and we’ve seen some interesting things where you can take things out of an image… And I know there are other techniques out there, obviously, for doing this… But the representation can be text, it can be images, it can be lots of different things, which really opens up the possibilities, and I think we’ll kind of span all the disciplines that we commonly talk about in the space.

Yeah. So to give people an idea of the accessibility, even just this morning I had a Google Colab notebook open; it did have a GPU on it, but it was just a Google Colab notebook. I use the Hugging Face Diffusers library, where you can import the Stable Diffusion model. There’s a pipeline built for using the pre-trained Stable Diffusion model… So I’m just counting, after my imports, I have 1, 2, 3, 4, 5, 6, 7, 8 lines of code to go from text-to-image.
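For readers following along, the text-to-image workflow described here might look roughly like the sketch below. It assumes the `diffusers` and `torch` packages are installed, a CUDA GPU is available, and the model license has been accepted on Hugging Face with a valid login. The model ID `CompVis/stable-diffusion-v1-4`, the function name `text_to_image`, and the output filename are illustrative assumptions, not the exact notebook from the episode.

```python
def text_to_image(prompt, model_id="CompVis/stable-diffusion-v1-4", out_path="output.png"):
    """Generate one image from a text prompt and save it to out_path."""
    # Imported lazily so the heavy dependencies load only when the function is called.
    import torch
    from diffusers import StableDiffusionPipeline

    # Download (or load from cache) the pre-trained pipeline in half precision.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU, such as a Colab GPU runtime

    # Calling the pipeline runs text encoding, de-noising, and decoding in one go.
    image = pipe(prompt).images[0]  # a PIL image
    image.save(out_path)
    return image


# Example call (needs a GPU and a model download, so it is left commented out):
# text_to_image("map of the USA in Lord of the Rings style")
```

Half precision is used here only to reduce GPU memory use; dropping `torch_dtype` falls back to the library default.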

So there’s two factors here. One is like there’s great tooling from Hugging Face, which is something we talk about all the time, so continual great work there… But the other side of it is, this is just running in a Google Colab notebook, and I’m able to access it via my browser. I don’t have to spin up an instance in the cloud with a big beefy GPU or set of them; this side of the accessibility, both the open source release of the model and the ability to use the model in a computationally efficient way - those are two of the sort of big motives, in my understanding; and I should be explicit, I didn’t have anything to do with training this model… But in my understanding, from the teams that trained this, which included a sponsor called Stability - that’s where it gets its name, Stable Diffusion… RunwayML was involved, which we’ve I think mentioned on the show here before, that has tools for kind of creative uses of machine learning…

And then academic researchers, from Ludwig Maximilian University in Germany. So this group kind of explicitly set out with motivations around accessibility, and specifically, with accessibility, a more computationally efficient diffusion model, and one that would be explicitly open source. And I think that’s why this has exploded is because if people can access something easily, and they don’t need really fancy compute to run it, then it’s going to kind of spread very quickly.

Yeah, I mean, it’s been noted in multiple places that if you have a computer with a graphics card that’s a GPU, you’re probably good to go; it doesn’t have to be the latest, greatest thing. So it really opens up to people everywhere; they can use this. And probably most people that might be interested in it already have the equipment, even without going to a cloud solution like Colab, or something; you have it in your house probably already, and you can do it.

Yeah, on a laptop with a card, or a desktop, or just a cloud instance that’s less expensive, and then trying to do something. I was reading that for other diffusion models – so we should be explicit too, this isn’t the first of these types of models. We already talked about DALL-E 2, which has a lot of similarities with Stable Diffusion, and we’ll kind of point out the differences as we continue the conversation… But also a model that’s capable of doing this amazing text-to-image generation, and these other applications, like inpainting, and that sort of thing. But it’s fairly computationally expensive, and it’s not as open; you have to kind of sign up on a waitlist, get access, use it via API, that sort of thing.

[12:15] And I think – I was reading, for some other diffusion models, I read one statistic that was like 50,000 samples takes about five days to do inferencing on, on a single A100. So most people don’t have access to an A100, and maybe don’t want to spend five days waiting around for the processing of a bunch of samples. Now, 50k is a lot as well… But yeah, so that’s one just kind of baseline or foundational number that, hey, these things did exist before, but they were extremely computationally expensive.
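To put that statistic in per-image terms, here is a quick back-of-the-envelope calculation. It uses only the two figures quoted above (50,000 samples, five days on one A100) and assumes the five days are continuous wall-clock time.

```python
samples = 50_000   # number of generated samples quoted in the episode
days = 5           # wall-clock time quoted for a single A100

total_seconds = days * 24 * 60 * 60
seconds_per_sample = total_seconds / samples

print(f"{seconds_per_sample:.2f} seconds per sample")  # prints "8.64 seconds per sample"
```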

You know, just as kind of a single point that you mentioned about it being open source - we’ve had, and we’ve talked about this with previous model releases on the show, different approaches to releasing of different types of models. And there have been things where there has been concern about how it will be used, or security and things like that… And incremental, some things stay proprietary, with just kind of a frontend interface to it, other things had been released incrementally, or the big model is withheld, but a smaller, reduced, functional version is offered… And here we are, and we just went through DALL-E, which as you pointed out, it has constraints there… And here we are with this open source release that’s quite powerful and quite amazing, and yet quite accessible to pretty much anybody who would like to start working with it. What are your thoughts around the fact that – I mean, this is feeling a little bit more like that open source software world that you and I have both come from, in the past… How do you think this may change the space going forward if others as well, with both this and other releases going forward, it tends to be more straight out open source with the level of accessibility? How does that change the space we’re in?

Yeah, I think that there’s a few elements of this… I think it has been interesting - the last episode that we had, we talked about these OpenRAIL licenses, and one is utilized by Stable Diffusion. So there are some explicit things you have to agree to when downloading the model. On Hugging Face, for example, you have to click a button that says “I agree to this stuff”, and then you can download it, and you have to use your Hugging Face token to download it… But it is open in that sense, in a sort of unique way.

But I think that if we look at models like this and ones that are released open source, I think you kind of saw in software, over time, as it was open sourced, a lot of software applications or kind of specialized software things going from kind of specialized expert groups using them to a general-purpose technology that was used and integrated into a whole variety of things that the original creators didn’t even have in mind, right? So I think we’re in a similar place here, where we’re going from maybe models that were being experimented with in sort of siloed places… But now, as you were mentioning, there’s all sorts of ways you could imagine using this model. And because I can access it, and because I can run it without expensive hardware, and because there’s good tooling like the Diffusers library, which I can pull in and do this in eight lines of code, then who knows how people will use this and sort of hack it, in a good way. So hacking it for useful, kind of pragmatic purposes.

I agree. I’m actually looking forward to seeing as it really gets out beyond its core community, and reaches all those people, and people become aware of it… Because we’re still very early days; it’ll be interesting to see some of the ideas that come out of it, both the creative art that we’ve seen already, but also some of the kind of innovative, maybe kind of business-oriented, novel ways of using it that we are not likely to think of today.

Okay, Chris, you know I like to get into the weeds sometimes… I say we just dive into this model and see kind of how it works a bit. We’ll kind of take the listeners along with us and go through and figure out how this happens. How do we go from text to image, and also how is this thing trained.

Let’s diffuse the weeds, Daniel. Let’s get into it.

Yes. Diffuse the knowledge, or whatever. Yeah. [laughs] And Chris, I think there’s certain things to listen for as we go through this process. You and I have talked about some of these building blocks that continually show up, one of them being transformers and the attention mechanism that has been applied… Of course, diffusion models have been applied in a variety of ways - encoder, decoder models, word embeddings or text embeddings… All of these things show up as we go through this. So again, this is not kind of popping up out of nowhere. It’s an assembly of things that we’ve talked about before.

Yes. This has been a little bit of a magical past year, as we’ve seen things come about, largely from that cross-pollination of different technologies that have arisen on different paths, but now they’re getting blended, and some pretty cool things are coming out of it.

Yeah. So the Stable Diffusion model, if you kind of have in your mind - and we can’t show you a picture, because this is an audio podcast… But if you have in your mind going from a text input to an image output, the sort of general process is that that text is embedded into some representation; that embedding plus some noise is then de-noised to an image, and that image is then upscaled, or decoded into a larger image that’s not compressed. So those are the general stages of the pipeline. You’ve got text embedded, plus random noise, de-noised, and then decoded or upscaled to an image.

Do you want to take a moment and let’s just kind of talk about – for those who are kind of coming into it, the idea of introducing noise and then de-noising… What do you get out of that productively? What’s the reason for that in the workflow?

Yeah, so it doesn’t have to be a text-to-image model, but this sort of de-noising or diffusion type model is useful, because it can take a sort of noisy input and de-noise it. So the sort of bigger idea here is that I could take a set of images in my training set and then introduce noise into those images via certain steps of noising. And then I could train my model to, in a series of steps, de-noise those images. So this could be used both for fixing corrupted images, or upscaling images, and that sort of thing… So it doesn’t have to be for like text-to-image. But this is the general idea, is that you have an original output, or original set of images, that you can kind of corrupt intentionally, and then train your model to de-corrupt those, or de-noise them. And then that model can be used to perform that sort of de-noising, or upscaling type of action afterwards.
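The "corrupt intentionally" half of that idea can be sketched in a few lines. The example below assumes the commonly used formulation in which a clean sample x0 is mixed with Gaussian noise according to a schedule, x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps; the linear schedule, its parameter values, and the function names are assumptions for illustration and are not taken from the episode.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to noising step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is the target a noise-predicting model would be trained on


rng = random.Random(0)
alpha_bars = make_alpha_bars()
clean = [0.5, -0.25, 1.0, 0.0]  # stand-in for a flattened image
slightly_noisy, _ = add_noise(clean, 10, alpha_bars, rng)
mostly_noise, _ = add_noise(clean, 999, alpha_bars, rng)
print(alpha_bars[10], alpha_bars[999])  # the remaining signal fraction shrinks as t grows
```

Training then amounts to showing the model `xt` and the step `t` and asking it to recover `eps`, which is what lets it later remove noise step by step.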

[20:30] As we talk about the fact that attention is used in here - and I know in some of the discussions around it, it’s referred to as cross-attention… What is cross-attention as a form of attention? Does that just mean different modalities coming in, or how would you define that?

Yeah, so I think it would be good with that to kind of describe maybe the overall components or modules of this system. So there’s three main components of Stable Diffusion to make it what it is. The first is a text encoder, or a language model that takes your text and converts it into an embedded representation, or encodes that text. The next major component is an auto-encoder. We’ll come back to that, because it’s a key piece of what makes Stable Diffusion different, is what they did with the auto-encoder. But the auto-encoder - basically, you can think about it as a way to train something to upscale your image. So to go from a compressed image to a non-compressed image.

And then the third is this diffusion model, which is a UNet model. This is the type of architecture it is, a UNet model. And this is that model that takes a noisy input, and then de-noises it. So again, the text encoder encodes your text to an embedded representation, just a series of numbers, a series of floating point numbers; your auto-encoder is really a way to get to a decoder, which can decompress images, or upscale them. And then your diffusion model, which is based on this UNet architecture, which takes Gaussian noise, or some noise, and de-noises it to get closer to the text representation that you input.

So those are the three main components. And what happens is that - we mentioned this diffusion model that takes noise, and de-noises it to something that’s close to your text representation… Well, somehow you have to combine that noise and your text representation. So if you imagine, text comes into your text encoder or language model, that’s converted to a series of numbers, a learned embedding for that text, and then that learned embedding is combined with this random noise. And that’s where the cross-attention happens. So cross-attention is this way of mapping, mapping your text representation, your encoded text onto this random noise, which - the word that they use for this is condition; it conditions the random noise with your text representation. And that’s how the diffusion model, which de-noises it - that’s how it knows what it’s kind of after. That’s how it gets to a semantically relevant image that’s relevant to your text input, is because it’s been combined with your text embedding in this cross-attention mechanism and the random noise.
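As a rough illustration of that conditioning step, the sketch below implements plain scaled dot-product cross-attention in pure Python: queries come from the noisy image representation, while keys and values come from the text embedding. The learned projection matrices that a real model applies before this step are omitted, and all vectors are made-up toy values, so this is a sketch of the mechanism rather than the model's actual layer.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Queries come from image latents; keys and values come from text embeddings."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this latent position against every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the text values according to those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two noisy latent positions (queries) attending over three text tokens.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, weights = cross_attention(latent_queries, text_keys, text_values)
```

Each output row is a text-informed vector for one latent position, which is how the text "steers" what the de-noiser produces.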

And the diffusion model is a form of convolutional model. Is that accurate?

[23:46] Yeah, the diffusion model that, at least the one that was used in this Stable Diffusion piece, is called UNet. It’s used for other purposes as well, but it sort of has a series of convolutional layers, one that kind of takes your image and shrinks the image down in the convolutions, and one that does the inverse of that. So this is like a down-path/up-path thing. And then there’s combinations between those two things. But yeah, it’s a series of convolutions that are combined in a certain way, which makes it UNet.
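The down-path/up-path shape of a UNet can be sketched with a toy one-dimensional example. Here pair-averaging and value repetition stand in for the learned convolutions, and skip connections are merged by averaging rather than by concatenation followed by a convolution, so this shows only the data flow. The function names are hypothetical, and the input length must be divisible by 2 at every level.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (stand-in for a strided convolution)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the length by repeating each value (stand-in for an up-convolution)."""
    return [v for v in signal for _ in range(2)]


def tiny_unet(signal, depth=2):
    skips = []
    x = signal
    # Down path: shrink the resolution, remembering each level for later.
    for _ in range(depth):
        skips.append(x)
        x = downsample(x)
    bottleneck_len = len(x)
    # Up path: grow the resolution and merge in the matching skip connection.
    for _ in range(depth):
        x = upsample(x)
        skip = skips.pop()
        x = [(a + b) / 2 for a, b in zip(x, skip)]  # real UNets concatenate, then convolve
    return x, bottleneck_len


out, bottleneck_len = tiny_unet([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0])
print(len(out), bottleneck_len)  # prints "8 2"
```

The skip connections are the "combinations between those two things": they let fine detail from the down path reach the up path without passing through the narrow middle.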

It’s interesting, as you have kind of catalogued these different components and their workflow… And we have talked about all of these things in previous episodes; these are all existing technologies, but they found a way to put them together to a remarkable effect. It’s very interesting that we keep returning to that cross-modality being the source of the current wave of creativity in the AI space, and I think this is a great example. Individually, I know what all those things are; I would never have imagined putting them together to achieve this… So it was a pretty, pretty cool way of doing this.

Yeah. And I think that the key piece to emphasize about what was done here is really with the piece that we kind of glossed over quickly, which is this auto-encoder, and particularly how they trained both the diffusion model and the auto-encoder. So it’s not new to use this auto-encoder to compress and decompress images. That’s been done before. If you imagine you have a model that can encode an image, and then decode it, the encoding is sort of like the compressing of that image, the decoding is the decompressing of that image, and so you can train a model, you can train an encoder and a decoder jointly to do that compression, and then do a corresponding decompression, or decoding. And then the diffusion model sort of operates on those compressed images.

This is not new, this sort of combination of auto-encoder and diffusion model, in my understanding. What is new is that the Stable Diffusion team, this team from Stability, and the group in Germany - I’ll mention some of their names, because Robin Rombach and others are on the paper, which we’ll link in the show notes. But the thing that they wanted to do - remember, the motivation that they were after was to make a more computationally efficient diffusion model. That was at least one of the accessibility things they were after.

And so what they did was, instead of jointly training the auto-encoder and the diffusion model, they separately trained the auto-encoder and the diffusion model. And this does two things. It separates out the auto-encoder and lets you train the auto-encoder for what it needs to be good at, which is compressing and decompressing images. But it also means that the diffusion model only operates on these compressed images in the training, and those compressed images require like 64 times less memory for your diffusion model, which is why you can run the Stable Diffusion model on a consumer GPU card, because they’ve strategically separated out the training of this auto-encoder and the diffusion model, which allows the diffusion model to operate on compressed images, but still allows you to get high-quality, upscaled images out, because you’re using the decoder still.
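One way to see where a figure like "64 times" can come from is to count positions before and after compression. The numbers below are assumptions not stated in the episode: a 512x512 RGB image, an auto-encoder that shrinks each side by a factor of 8, and a 4-channel compressed representation.

```python
image_h, image_w, image_c = 512, 512, 3   # assumed pixel-space image
factor = 8                                # assumed per-side downsampling of the auto-encoder
latent_h, latent_w, latent_c = image_h // factor, image_w // factor, 4  # assumed 4 channels

spatial_ratio = (image_h * image_w) / (latent_h * latent_w)
value_ratio = (image_h * image_w * image_c) / (latent_h * latent_w * latent_c)

print(spatial_ratio)  # prints 64.0
print(value_ratio)    # prints 48.0
```

Under these assumptions the diffusion model works on 64 times fewer spatial positions, or 48 times fewer raw values once the channel counts are included.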

And we’ve seen the decoder and encoder being used… I mean, you see that in typical graphics software…

Right. Or machine translation models, all sorts of things.

[27:47] Exactly. It’s used often to clean that up. So the diffusion model is kind of where, if I’m understanding you correctly, is kind of going through that noising and then de-noising, it kind of blends what is available from the trained model together in that compressed format, and then when the decoder takes the result of that and kind of upscales it back to the uncompressed model, it kind of - in a very non-technical phrase - it kind of cleans it up and makes it what it is at that point. Is that close to being how – is that approximately fair?

Yeah. So if you can imagine really small images, which are generated out of random noise based on the diffusion model de-noising that noise, then those really small images are then decoded to a larger image which is inferred, which uses a separately trained decoder, which was trained in this sort of auto-encoder methodology.
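The loop structure of "start from random noise, de-noise to a small image, then decode" can be sketched as below. Two things here are stand-ins and should not be mistaken for the real system: `oracle_noise` cheats by knowing the target and plays the role of the trained, text-conditioned UNet, and `decode` simply repeats values in place of the separately trained decoder. The update rule is a deterministic DDIM-style step with an assumed linear noise schedule.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Signal fraction remaining after each noising step (linear beta schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * t / (num_steps - 1))
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise(xt, ab_t, target):
    """Stand-in for the trained UNet: the noise that separates xt from the target."""
    return [(x - math.sqrt(ab_t) * g) / math.sqrt(1.0 - ab_t) for x, g in zip(xt, target)]


def denoise(target, alpha_bars, rng):
    """Deterministic (DDIM-style) reverse loop from pure noise down to a small latent."""
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from random noise
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0  # no noise left after the last step
        eps = oracle_noise(x, ab_t, target)
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e for p, e in zip(x0_pred, eps)]
    return x


def decode(latent, factor=2):
    """Stand-in for the separately trained decoder: repeat each value to upscale."""
    return [v for v in latent for _ in range(factor)]


rng = random.Random(0)
small_latent = denoise([0.2, -0.4, 0.9, 0.0], make_alpha_bars(), rng)
image = decode(small_latent)
```

Because the de-noising loop only ever touches the small latent, swapping in a better decoder would not require changing the loop at all.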

I have a random question for you… Given that they’re training them kind of as these separate components, does that potentially – if I’m thinking in terms of outside of this space in software, we often mix different components together to achieve new things. Do you think that will help accelerate some of the exploration and experimentation in this by keeping those bits separate, so that you combine them as you want?

Yeah, well I think that there’s the clear computational advantage, but I think as an additional advantage, basically separating out this encoder, or the auto-encoder from the diffusion model makes it to where you can use the same auto-encoder model for all sorts of different downstream diffusion models. So this is another kind of shift that we’ve seen in other areas, where a portion of what you’re doing is general-purpose, and then you’re kind of bolting on what you need for the downstream tasks that you care about, whether that be image-to-image sort of tasks, or text-to-image tasks. Or maybe even another thing that would be like a text-to-audio task, or… There’s all sorts of different things that you could imagine doing downstream. So yeah, I think that this decouples the two. There’s a computational advantage, and there’s also a sort of functional advantage.

Well, Chris, I think one last thing to mention in terms of in the weeds stuff - I think it is really interesting to look at how the model was trained. So it’s probably worth mentioning a couple of those things, where this model, again, was trained in two distinct phases. There’s this universal auto-encoding stage, which is trained once and can be utilized for multiple diffusion model trainings downstream… And then there’s the second phase, which is actually training the diffusion model. And this model was trained on approximately 120 million image-text pairs. Well, there were 120 million image-text pairs from an approximately 6 billion image-text pair dataset; that dataset is freely accessible, you can look at that as well, and we’ll link it in our show notes.

But I think we also talked in our last conversation about how it wasn’t – I mean, it’s expensive, but it wasn’t a crazy number to actually train this model. So it took 256 A100s about 150k hours, which would kind of equal, at least at market price, around 600k. And I’m getting that from one of the team members on Twitter. So yeah, pretty interesting. I mean, I don’t know if you have 600k laying around, Chris, but it’s certainly a more accessible number than like training a model for 500 million, or something…
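Those training figures can be sanity-checked with simple arithmetic. The sketch below assumes that "150k hours" means total A100 GPU-hours spread across the 256 cards and that "600k" is in US dollars; both readings are assumptions, since the episode does not spell them out.

```python
gpu_hours = 150_000   # quoted total, read here as A100 GPU-hours (an assumption)
num_gpus = 256        # quoted cluster size
cost_usd = 600_000    # quoted market-price estimate, assumed to be US dollars

usd_per_gpu_hour = cost_usd / gpu_hours
wall_clock_days = gpu_hours / num_gpus / 24

print(f"${usd_per_gpu_hour:.2f} per GPU-hour")           # prints "$4.00 per GPU-hour"
print(f"{wall_clock_days:.1f} days of wall-clock time")  # prints "24.4 days of wall-clock time"
```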

[32:09] Um, no, I don’t have the pocket change of 600k laying around… But as we’re looking at separating these trainings out, and the fact that, if you kind of think - you know, we talked a little bit about the idea of the magic arising out of this earlier, and the fact that you have so much human semantics captured in the diffusion model in terms of how it was trained… So there are many concepts. We talked earlier about the Gandalf/Yoda imagery that we had seen, and clearly, the training had included the concept of Yoda and the concept of Gandalf that were combined… As we go forward, do you think there is the idea of kind of a diffusion marketplace that arises, both open source and maybe some not open source, where depending on the cost that you want and things like that you can kind of get into the level of sophistication that you can support for your application? Do you think that becomes a reality, as we talk about making these accessible across a wide range of users and use cases?

Yeah… I mean, I think if you draw a parallel with what’s happened with other models that have caught on in similar ways… Like, if you imagine back to BERT, and these large language models, part of the magic of those was that the weights were open source, you could pull down a pre-trained version and then fine-tune it for a particular task, right? So I have no doubt that – and I think people are looking into this, and… There’s explicit notes on the Stable Diffusion page about limitations, and bias, and all that. So you can read that there.

But certainly, there’s bias in the dataset on which it was trained. But I think where the power comes in is - if you’re able to open source the model, in some sort of way, with tooling that will allow for the fine-tuning of it, I’m sure that people will sort of fine-tune or create different versions based off of the parent using maybe its imagery for particular styles of books, or publications, or imagery, or inpainting for creative arts, or for video processing, or for all of these different things. I think people will create their own versions of these, and probably some of them will be – those fine-tuned, kind of purpose-built models will be commercially available for purchase, as we’ve seen with certain language models in the marketplace… And some will be open source for people’s usage. Just like we’ve seen kind of a general-purpose BERT, and then we’ve got like a science document BERT, and we’ve got a legal document BERT, and these sorts of things. And those are open. But also, there’s companies that are making money because they’re processing legal documents with BERT, and they have their own proprietary version; or maybe they’re using the open source version and just have good tooling around it.

So to extend kind of your answer there just a little bit, one of the things that we often ask guests when we have guests on the show is kind of that, you know, wax poetic a little bit and tell us kind of where you see some of these things going… And I know that as we were diving into this topic for today’s show, and kind of exploring what we want to share with the audience, I could see so many possibilities, as could you. So let’s wax poetic for a few minutes on where this might go, and what might come down the road.

[35:45] You talked a little bit about the marketplace where people can find resources to move forward… With these technologies, one of the first things that I thought about was - we were just talking in the last episode or two about how artists were getting frustrated with the fact that you’d have machine learning practitioners come in and create art with these things, and all that… And that’s in a very immediate, you-can-do-it-today kind of situation. But we’ve watched these multi-modality evolutions coming through these models over the months… It’s not hard to envision that at some point down the road, this will move into video, and we’ll see other modalities being added to it. I think that would be consistent with the recent history that we’ve seen.

And as we do that, you’re now really moving into that creative space that previously it took a great deal of effort… You know, if we’re talking about the entertainment industry, and movie-making, and special effects, since we started with the Lord of the Rings… This could really revolutionize how special effects are achieved, and make some amazingly phenomenal special effects, as we see iterations going forward become very accessible to people at home. You’re no longer the big special effects company - and those companies would have access, too - but I could see so many industries… There’s obviously security concerns, there’s art things, there’s business things… What are some of the what-ifs that you could see, maybe not just with this particular model, but with what we might expect to see not too far down the road?

Yeah, the two areas that I’m thinking about are one, the expansion of modalities, like you talked about… So diffusion models applied to audio, for example, and what that means for both things like speech synthesis, or even creative things like music generation, or that sort of thing.

That’s a great one.

So that area is quite interesting to me, and I think it will happen. But the other what-if in my mind is how this set of technologies will be combined with others that we’ve seen to be very powerful already, that already exist. For example, I could have a dialogue system, or I could have prompts that were not created by me and fed into Stable Diffusion, but what if I create a prompt using GPT-3? Or automate the dialogue I’m having in a chat bot with language model-generated prompts, along with imagery, or video that’s created using something like Stable Diffusion? Or you could even imagine creating a storybook with both language models and sort of visual elements from something like Stable Diffusion. So I think that the creativity or the uses are also interesting in how they’re integrated with existing technologies, both that are AI-related and maybe not AI-related. So things like chatbots could be driven by an AI model, like a dialogue system model that’s state of the art. That could also just be like decision tree-based bots that are rule-based, but maybe you integrate visual elements from something like this, in a more controlled way.

So I think that this combination of the technology with both existing technologies and other language models, other models that are out there, is an area that I think will kind of expand quite a bit, and we’ll see some interesting things happen.
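The storybook idea above, where a language model writes the text and an image model illustrates it, can be sketched in a few lines. This is only an illustration under stated assumptions: `build_storybook`, `Page`, `fake_writer` and `fake_painter` are hypothetical names, and the two stand-in functions take the place of real calls to something like GPT-3 and Stable Diffusion, neither of which is invoked here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    text: str          # story text produced by the language model
    image_prompt: str  # prompt handed to the image model
    image_ref: str     # whatever the image model returns (path, URL, handle)


def build_storybook(
    scenes: List[str],
    write_text: Callable[[str], str],
    draw_image: Callable[[str], str],
    style: str = "storybook illustration",
) -> List[Page]:
    """Chain a text generator and an image generator, one page per scene."""
    pages = []
    for scene in scenes:
        text = write_text(scene)
        # Reuse the generated text as the image prompt, plus a fixed style hint.
        image_prompt = f"{text}, {style}"
        pages.append(Page(text, image_prompt, draw_image(image_prompt)))
    return pages


# Hypothetical stand-ins so the sketch runs without any model or API key.
def fake_writer(scene: str) -> str:
    return f"A hobbit {scene}"


def fake_painter(prompt: str) -> str:
    return f"image<{len(prompt)} chars>"


pages = build_storybook(["leaves home", "crosses a river"], fake_writer, fake_painter)
for page in pages:
    print(page.text, "->", page.image_ref)
```

Because both generators are passed in as plain callables, the same loop would work with a rule-based bot supplying fixed text in place of the language model, which is the "more controlled" variant mentioned above.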

I think we’re looking at the birth of a kind of creative entrepreneurship, being able to really take some of this model and other recent models, and some of the new things that we expect to come in the not-so-distant future, and really have some amazing creative outputs on that. We started with Lord of the Rings, and so I’ll make a suggestion to the Tolkien business, if you will… It would be interesting to see maybe in a few years, when they’ve decided they need to refresh those stories again - maybe it’s done with some of these technologies, and it’s done kind of entirely with this set of creative technologies.

[40:22] And to your point, maybe it is released in many, many languages simultaneously; kind of native, instead of being translated, in that capacity, so we can all share in that experience… And maybe even variations to adapt to different cultures, and different – all sorts of different races, cultures, and everything… And stories can be – you can take a storyline and make it pretty special in terms of being multimodal itself. So I can imagine a lot of pretty cool things.

Yeah. I always think back to that conversation we had with Jeff Adams from the Cobalt Speech company, talking about how his vision for the future also was this sort of more holistic treatment of both language, and other things, because language touches everything. So I think that that’s some of what you’re meaning.

While you were talking, just to kind of show you how accessible things are, I typed into Stable Diffusion “map of the USA in Lord of the Rings style”, and there’s definitely – I’m sure you’ll recognize certain elements of that map that I just posted in our Slack channel, Chris, that are Lord of the Rings-esque. So… Pretty interesting.

I actually – there’s a book… I don’t remember what it’s called right now, but there’s a book of Lord of the Rings Middle Earth maps… So it’s the Eastern and Central U.S. that we’re looking at here… But it definitely has that Lord of the Rings style going to it. So yeah, I’m enjoying that.

Yeah. In terms of learning resources for people, I think what Chris and I would recommend that you do is just get hands-on with this model. There’s ways to do it even if you don’t code; there’s ways to do it through the app… Or on Hugging Face, you can actually download the model and use their diffusers library to run the model. If you search for Stable Diffusion on Hugging Face, you can find it.
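For anyone who does want to code, running the model through the diffusers library might look roughly like the sketch below. Several things here are assumptions rather than anything stated in the episode: the model ID `CompVis/stable-diffusion-v1-4`, the helper names `prompt_to_filename` and `generate_image`, and the default of 50 inference steps. Downloading the weights may also require accepting the model license on Hugging Face and logging in first.

```python
import re

# Assumption: one of the publicly listed Stable Diffusion checkpoints on Hugging Face.
MODEL_ID = "CompVis/stable-diffusion-v1-4"


def prompt_to_filename(prompt: str) -> str:
    """Turn a prompt into a safe, short file name for the generated image."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return f"{slug[:60]}.png"


def generate_image(prompt: str, model_id: str = MODEL_ID, steps: int = 50) -> str:
    """Generate one image for the prompt and return the path it was saved to."""
    # Imported lazily so the helper above works without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    # Use a GPU when one is available; CPU works but is much slower.
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

    image = pipe(prompt, num_inference_steps=steps).images[0]
    path = prompt_to_filename(prompt)
    image.save(path)
    return path


# Example call (downloads several GB of weights on first run):
# generate_image("map of the USA in Lord of the Rings style")
```

The example call is left commented out because the first run downloads the model weights; uncomment it once the library and model access are set up.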

Also, another post - so we’ll put this in our show notes, but we leveraged a blog post, which was quite useful, written by Marc Päpper, that described a lot of the things that we talked about here. So if you want some visuals and that sort of thing to aid your understanding of the model, we’ll link that in our show notes. So definitely take a look… It’s been fun to diffuse some of these ideas with you, Chris. I enjoyed it very much.

I did, too. I hope our audience enjoyed this as much… These shows where we get to explore bold, new places are the ones I really get excited about. So until next – I’m sure there’ll be something super-cool coming up that we’ll be talking about again, but until then, thanks for joining today, Daniel.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
