In this Fully-Connected episode, Daniel and Chris explore DALL-E 2, the amazing new model from OpenAI that generates incredibly detailed novel images from text captions for a wide range of concepts expressible in natural language. Along the way, they acknowledge that some folks in the larger AI community are suggesting that sophisticated models may be approaching sentience, but together they pour cold water on that notion. But they can’t seem to get away from DALL-E’s images of raccoons in space, and of course, who would want to?
Welcome to another Fully Connected episode of Practical AI. In these Fully Connected episodes Chris and I keep you up to date on everything that’s happening in the world of machine learning and artificial intelligence, and we help you level up your machine learning game with some learning resources and some articles and things to keep you up with the state of the art. Excited to talk with you about stuff going on in the AI world today, Chris. It seems like this season there’s just been a lot hitting the AI fan… I don’t know how else to put it.
It’s been a curious time lately. We’ve seen some interesting things arise… And yeah, absolutely. And people’s take on it is a little curious, too. I know we can go down that road as we get there.
Yeah, I guess – I don’t know if we want to sort of call it that, but the elephant in the room is that people are now calling certain AI models sentient. I’m sure this is something that probably a lot of listeners have seen on popular media, and that sort of thing. There was an engineer – I don’t know actually his full position at Google, but he was kind of referring to this language and dialogue system that Google uses/works with as having some level of sentience.
I think both you and I sort of cringe when–
Yeah, I don’t know about you, but it’s almost like, how did we get here? What is this state that we’re in, that people are calling these AI models sentient? How did we get to this state? Was it unexpected? Was it expected? What was the path that led us here? All of those kinds of questions come up in my mind. What’s your thought?
You know, a little bit of both, I guess. I guess any of our longtime listeners know that at the end of the day we are very practical and pragmatic in terms of how we look at these things. And we have been doing this long enough to where we’ve seen quite a lot of evolution in the field. And certainly, acknowledging that the goalposts kind of keep moving on various levels of kind of performance evaluation, if you will, for different models… But we’ve kind of come to a point this year where we’re seeing some models with new, more expansive capabilities than the narrow ones of the past. And so maybe this is a logical moment for people to kind of reevaluate and apply some labels to it.
[04:21] When people talk about models having sentience today, I’m really struggling with that… But I think it’s probably every time we’re hitting kind of a major set of milestones, we’ll probably have these conversations.
Yeah. So if we just sort of look back - like you say, we’ve been doing this show quite a while… And this thing kind of pops up every once in a while, and it does come into conversation, and it’s worth addressing… But I think if you look back and we look at the wider story of how kind of – we started with language models, but I think we’ll go in other areas later, and talk about kind of see where we’re at and where things have gone with vision, and other things…
But I think with language models it used to be the fact that kind of the best language models would produce language that was sort of like passable, but it doesn’t grab you, in the sense of like being incredibly coherent. Even artful type language.
So starting out, if people don’t know, these sort of language models used to be these models that would just look at the statistical frequencies of frequently occurring N-grams. Like if I have this combination of two words, how often is that combination of two words seen with this other combination of two other words, and how frequent is this combination of three words compared to other combinations…? You would kind of calculate all the probabilities of these things and be able to create a language model that would give you an understanding of how probable certain sequences of text were. Well, that’s actually very useful and still used in a lot of places.
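To make the N-gram idea above concrete, here is a minimal sketch of a bigram (two-word) language model in Python. The tiny corpus, the function names, and the `<s>`/`</s>` boundary markers are all assumptions made for illustration; real N-gram systems add smoothing and far larger corpora.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are illustrative start/end-of-sentence markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts

def bigram_probability(follow_counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency (no smoothing)."""
    total = sum(follow_counts[prev].values())
    if total == 0:
        return 0.0
    return follow_counts[prev][nxt] / total

def sequence_probability(follow_counts, sentence):
    """Multiply bigram probabilities to score a whole sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= bigram_probability(follow_counts, prev, nxt)
    return prob

# A made-up three-sentence corpus, purely for demonstration.
corpus = [
    "the raccoon looks out the window",
    "the raccoon wears a helmet",
    "the astronaut looks out the window",
]
model = train_bigram_model(corpus)

print(bigram_probability(model, "the", "raccoon"))
print(sequence_probability(model, "the raccoon wears a helmet"))
print(sequence_probability(model, "the helmet looks"))
```

A sentence containing a word pair that never appeared in the corpus scores zero here, which is one reason practical N-gram models use smoothing.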
But we kind of went from that into this zone of like recurrent neural networks, where now we’ve kind of got this element of memory, and bidirectional memory… And then we scaled that up very much with transformers, which are kind of a very computationally scalable way to look at a sequence of things coming into it, in context, and scale that up in a very computationally favorable way. And that, of course, has kind of blossomed into these incredibly large transformer-based language models trained on very, very much data.
What happens then is these models, which can model context very well, can also produce sequences of texts that are incredibly coherent and compelling, to be honest, to a person viewing them. So I don’t know if you remember when – I think we had some original conversations when the GPT models were coming out from OpenAI. Do you remember that?
Yeah, I remember really being quite surprised… I don’t know if you shared that with colleagues, or maybe people that were not practitioners.
Absolutely, I did. I mean, I shared it with my daughter, as a matter of fact, just kind of drawing her over, and she wasn’t terribly interested. [laughter] But you know, kind of pulling you over to the laptop and saying, “Look at this.” And we were talking about it at work… I mean, it was a big deal for what it was.
[07:50] Yeah. I mean, this is quite compelling. I do want to bring up this – so I have pulled up this article that I think is really relevant… And I don’t usually quote from things on the podcast, but I don’t know that I could really say this better. It’s probably worth having on the record in the podcast… But the paper is “On the Dangers of Stochastic Parrots” by [unintelligible 00:08:12.29] and crew, who have done a lot of this work kind of looking at limitations and dangers of these large language models.
They’re talking about how the output of models like GPT-3 is seemingly coherent, but they label it as coherence in the eye of the beholder. Let me just read some of this, I think it’s worthwhile. So they say, “We say seemingly coherent” - and they’re talking about the output of these models - “because coherence is in fact in the eye of the beholder. Our human understanding of coherence derives from our ability to recognize interlocutors’ beliefs and intentions within context.” So they’re talking about, hey, when you communicate with someone, you basically assume that they have some intentionality with their conversation. They’re saying something because they are communicating – it’s a two-way thing, right? They’re communicating with you. They say, “Text generated by a language model is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counterintuitive, given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do.”
So that’s sort of a bunch of words, but it’s a really important set of words: the fact that you can create seemingly coherent text doesn’t imply that there was an intent behind that, or some understanding of your state of mind, or that there was even a state of mind, like the sentience behind this…
I think that states super, super-well, the intuitive reaction that I have to the output of those kinds of models. So that represents the way I’m thinking when I’m getting that. And when people are applying these labels to it, like “Oh, clearly sentient because of this”, I’m just kind of going “Hm… Not so much.”
Yeah. But it’s probably also true that when you see certain things come out of a language model, before you go into that mode - and maybe me when I was first seeing the GPT things, it’s like “Wow. I didn’t think this was possible. There’s something sophisticated going on here.” That’s where sort of like – maybe a person’s mind would go naturally, is “Hey, this is really compelling, coherent text, and maybe there’s something more going on here than I’m thinking” because our mind, like you were talking about it, it goes to - or many people’s mind would go to the fact that “Wow, it understood exactly what I was talking about, and responded in a really coherent way, so it must have understood me.” You know, there was intent there, or something like that.
So I find it dangerous to go here, because you’re a language expert definitely on our team, and I’m certainly not, but as someone in the field though who does not have this expertise, my expectation is that, you know, languages - and I’m gonna say this using all the wrong lingo based on the kind of professional work that you do… But it’s a framework and there’s relationships between all the aspects of the language, and I don’t think I’m surprised in the large that a sophisticated model can find that. But that’s a far cry from all of the attributes that I normally associate with sentience.
[12:13] So is it impressive? Yeah, really impressive. And I’m acknowledging that. But that’s not the same thing as saying, “Okay, if you’re seeing A, B, and C here, that also correlates to X, Y, and Z”, which is a different set of criteria. And I think we’ve seen this a bunch of times over the years that we’ve been doing this… Something kind of wow will happen, and people will infer a bigger jump than it actually is. It’s an impressive jump nonetheless, but they extend it, I think, and there’s an emotional aspect to it… And I think that’s how I find this particular moment.
Yeah. Just another statement from this paper - I’ll link it in the show notes; it kind of gets to that fact, like you were saying, stitching together words and stuff to make language is not what it means, at least in my mind, to be sentient. They say, “Contrary to how it may seem when we observe its output, a language model is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.” And that’s what the reference is to, the sort of metaphor that they’re using.
It’s interesting though that this trend, this sort of like trend in language model size, the amount of data that these language models are trained on has increased this apparent coherence of the output from these models, and I think they sort of in a lot of ways predicted in this work, and maybe others, that people would start increasingly thinking that these things have some type of sentience or something, just because they’re so much more compelling in the output. But what they’re saying is “This is coming, but beware. This is not what you think it is.” That sort of message.
So that points out a problem, in that if people are going to conflate coherence with sentience…
Oh, it’ll come up again. Yeah.
Yeah, it’ll keep coming up with every new model at this point, because what is possible is already so sophisticated now. So how do you address that? How do you parse the difference in cohesiveness versus sentience, and have a way of distinguishing between the two?
Well, Chris, we’re always talking about language on this podcast, probably because of my own very biased opinion, but there’s a lot of trends happening right now that we can’t ignore in the vision space as well, and actually in ways that are also connected to the language space. And I think those are also things that are taking AI beyond the realm of practitioners into sort of the wider public’s view.
An example of this recently I saw on my Twitter was like Cosmopolitan Magazine… I don’t know if it was their latest - at the time of the recording it was one of their latest cover photos, it was generated by the DALL-E 2 model I believe it was, from OpenAI… Which is a model that takes text input, and then outputs an image. So you can say “I want a picture of an astronaut riding on a horse, on the Moon, in the style of Van Gogh” or whatever. And it will give you that, which is pretty extraordinary. And that’s also like - these images, there’s similar models; it seems like there’s all of these models coming out within weeks of one another. I don’t know exactly how the authors would prefer to say it, Imagen, from Google, and it does a very similar thing. There’s two or three other ones; sorry if one of our listeners created one of these models and I didn’t mention yours, I apologize. But it seems like five to ten of these things came out like within a month, or something. And this is also one of those moments, kind of like the Google engineer saying that a language model is sentient… It’s like, something happened here. What led us to this point? What are the key things to maybe take away from like – why did all this happen at once? That’s what I’m thinking.
[17:12] Well, we’re making these leaps. There’s this determination to say we’ve gotten there, we’ve gotten there. And if you look at what DALL-E can do, it’s amazing. As you said, you can input the text, and you can get these – they’re not just simple images. If anyone in the audience hasn’t seen the output, you need to go look. I mean, they’re remarkable. It’s art, you know, what’s possible. And they’re super-detailed and super-complex. So it’s another one of those moments where you’re like “Wow. We’ve hit a big milestone here.” And something has changed… It must be sentient. [laughter] What we’re seeing here with the visual output is the equivalent of what we saw with the language models, being able to say it’s coherent. I mean, you put in a simple thing, which is the text, and you came out with this enormously complex thing, which is the graphical image. So yes, we’re at the inflection point. So this is another one of those – it begs the question: where does that coherence distinguish itself from sentience?
This is really interesting, I think, because there’s a connection between all of this stuff that’s going on and the physics world that I grew up in. Maybe not grew up in, but that’s how I started my career, at least. And that’s this connection to like statistical physics or thermodynamics with these models, which I can describe here in a second… But it strikes me as well that there’s a whole bunch of kinds of systems that, given a very simple input, create very complicated output, and I guess this is like chaos theory, and other things. It doesn’t mean that those systems are – like, that’s not an indication of sentience in and of itself, that property. So you can have like a double pendulum, so a pendulum with a ball on the end, which is connected to another pendulum with a ball on the end, and you start it out in a very simple arrangement, and then all of a sudden the dynamics of that are extremely complicated, and pretty amazing.
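The double pendulum point can be sketched numerically. The Python below is a rough illustration, not a physics reference: it integrates the standard double pendulum equations of motion for point masses on massless rods with a fourth-order Runge-Kutta step, then compares two runs whose starting angles differ by one millionth of a radian. The masses, lengths, starting angles, and step size are all assumed values chosen for the demo.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed)

def derivatives(state, m1=1.0, m2=1.0, l1=1.0, l2=1.0):
    """Time derivatives of (theta1, omega1, theta2, omega2)."""
    th1, w1, th2, w2 = state
    delta = th1 - th2
    denom = 2 * m1 + m2 - m2 * math.cos(2 * delta)
    a1 = (
        -G * (2 * m1 + m2) * math.sin(th1)
        - m2 * G * math.sin(th1 - 2 * th2)
        - 2 * math.sin(delta) * m2 * (w2 * w2 * l2 + w1 * w1 * l1 * math.cos(delta))
    ) / (l1 * denom)
    a2 = (
        2 * math.sin(delta)
        * (
            w1 * w1 * l1 * (m1 + m2)
            + G * (m1 + m2) * math.cos(th1)
            + w2 * w2 * l2 * m2 * math.cos(delta)
        )
    ) / (l2 * denom)
    return (w1, a1, w2, a2)

def rk4_step(state, dt):
    """Advance the state by one Runge-Kutta (RK4) step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = derivatives(state)
    k2 = derivatives(shifted(state, k1, dt / 2))
    k3 = derivatives(shifted(state, k2, dt / 2))
    k4 = derivatives(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(theta1, theta2, seconds=10.0, dt=0.001):
    """Release both arms from rest at the given angles and integrate."""
    state = (theta1, 0.0, theta2, 0.0)
    for _ in range(int(seconds / dt)):
        state = rk4_step(state, dt)
    return state

def separation(a, b):
    """Euclidean distance between two states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

run_a = simulate(2.0, 2.0)
run_b = simulate(2.0 + 1e-6, 2.0)  # almost identical starting arrangement
print("separation after 10 s:", separation(run_a, run_b))
```

The rule that generates the motion is a few lines long, yet two nearly identical starts drift apart. Nothing about that complexity implies understanding, which is the point being made in the conversation.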
So yeah, I think that that was a good point that you made about this sort of seemingly simple or limited input to really, really complicated or seemingly artful output, I guess is a good way to put it… What are some of the best pictures you’ve seen from DALL-E? Do you remember any that caught your eye?
I’ve seen a lot over the last month or so. I like the animal ones, as you know…
Yeah. It’s like an animal in a certain place, doing a certain thing?
Yes, that’s unusual, and you wouldn’t typically see it. Yes, that’s definitely my drug of choice, being who I am. You know, the creation of that kind of organized and beautiful complexity that it does - do you think that’s what’s driving the leaps on analysis of what the models… You know, is this the model that’s pushed us into that? Is it just that creation of complexity, and with organization, and understanding, and as you said earlier, coherence?
I think that there’s probably several components here that when they were not put together, the output of those components was maybe limited to a specific domain, or modality of data… But now we’ve sort of had this progression in the industry where we’ve had Transformers, we’ve had things like CLIP, which allows you to have these sort of text-image embeddings… And then we’ve had another thing, which is where the connection with physics comes in, which is these Diffusion models.
[21:11] So if you look at these text-to-image things that are being produced, at least like DALL-E 2, that sort of builds on the combination of those things. So it’s almost like these major components were developed in isolation for a specific purpose, or with a specific goal, and then people were like “Well, there’s a really interesting combination when I combine these together and think about the different modalities of data that I’m working with, text and images.” I think there’s these really powerful components that have developed over time, and people are now kind of mixing different modalities, mixing these different components. Because really, a CLIP model is going to output a series of numbers; there’s nothing saying I can’t put that series of numbers into a diffusion model or into a transformer model, or… Like, what happens if I switch the order and change all of these things around? I think that’s what people are really – they’re not doing it just by sort of wild trial and error, but in a very intentional and well thought out way, they’re like “Well, now it would make sense if I combined this CLIP model and this Diffusion model in this way.” And that turns out to produce something that’s extremely profound.
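The remark that an embedding model just outputs “a series of numbers” can be illustrated with a toy example. The sketch below does not call any real CLIP model; the three-dimensional vectors, captions, and image names are invented for illustration. It only shows that once text and images live in the same vector space, comparing or passing them along is ordinary arithmetic on lists of numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings. A real text-image model would produce vectors
# with hundreds of dimensions; these values are made up.
text_embeddings = {
    "a raccoon in a space helmet": [0.9, 0.1, 0.2],
    "a horse on the moon": [0.1, 0.8, 0.3],
}
image_embeddings = {
    "raccoon_photo": [0.8, 0.2, 0.1],
    "horse_painting": [0.2, 0.9, 0.2],
}

def best_image_for(caption):
    """Pick the image whose embedding points most nearly the same way."""
    query = text_embeddings[caption]
    return max(
        image_embeddings,
        key=lambda name: cosine_similarity(query, image_embeddings[name]),
    )

print(best_image_for("a raccoon in a space helmet"))
print(best_image_for("a horse on the moon"))
```

Because the embedding is just a list of floats, the same vector could equally be handed to a different downstream component, which is the kind of recombination described above.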
But that is also, if you think about it for a moment and take it outside of just the AI world alone, and talk about technology at large, that’s the normal way that technology innovation happens. People will push down a particular modality or something, and try something out, because they’re fulfilling a need. And there’ll be a bunch of small, incremental improvements along the way, and then somebody goes, “Wow, this is a really powerful building block. What if I switch modes now and do that?” And then they get something that’s quite remarkable. And we’ve seen that in all technology development over time… So I think that what you’ve just described was a really typical evolutionary path for technology. And you know, like in the case of DALL-E, we have these amazing visual images. So it really catches you.
Yeah, that’s another side of the wider –
It’s super marketable in terms of the output… But it’s really – I see it as an important evolutionary innovation that drives the field forward, and gives us a bunch of new capabilities and stuff, but I don’t see it as unexpected.
Yeah, yeah. I would agree with that. I think just to kind of further break down what we’re talking about here, the most recent that came out was DALL-E 2. Well, maybe not the most recent of these types of models, but one of the ones that gained a lot of attention, DALL-E 2. And it’s an evolution – so there was an original DALL-E model, and it was actually transformer-based.
So you could kind of track back the steps, but if you think about coming maybe from the other direction, we had - like, we already talked about these; we said, “Hey, well, it would be useful for us to model sequences of things.” We could look at different pairs, and triplets of those things. That’s an N-gram model. And we could then kind of look at recurrent ways or bi-directional ways to look at the sequence in recurrent or bidirectional type of layers in neural networks. And then we said, “Well, there’s kind of a scaling problem there, and maybe some things that we maybe don’t want to look at neighbors, but we want to look at attention at a sort of wider level…” And so then came along attention, and then transformers… And that was all still kind of text-based, right?
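As a rough illustration of attention looking “at a sort of wider level” rather than only at neighbors, here is a minimal scaled dot-product self-attention sketch in plain Python. It is deliberately simplified and rests on assumptions: the token vectors are made up, and the learned query/key/value projections, multiple heads, and positional information that real transformers use are all omitted.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Each token's output is a weighted mix of every token in the sequence.

    Here the token vectors serve as queries, keys, and values directly,
    which is a simplification of what a transformer layer does.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every position, not just its neighbors.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(weights, tokens))
            for j in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up two-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every position is scored against every other position in one pass, with no step-by-step recurrence, which is part of why this style of layer scales well on parallel hardware.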
[25:03] Well, DALL-E, in my understanding, the original one, so not the most recent one, but the original one, was based on a transformer architecture. So they basically said, “Well, what would prevent us from saying I’m going to take sequences of tokens that are words, and concatenate them with sequences of tokens that correspond to image pixels, or an image embedding? I can still put that through my transformer model. What happens?” And then I think there was this realization of the goodness of these diffusion models. If people don’t know about diffusion models, it’s a way to kind of de-noise something. So if you introduce a bunch of noise into an image, Gaussian noise or something, you can train a model to de-noise that and kind of reverse it – so go from noisy to not noisy. And people were like “Well, what if we take some of these text encodings that we have, and either we condition the diffusion model on those, or we use these CLIP models and other things to get these text-image encoders, and then we could put that into a Diffusion model…” And you can kind of see these things start to pile on top of one another, and now we have these beautiful images.
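The noising and de-noising idea can be sketched in a few lines. The Python below shows only the forward “add Gaussian noise” step and the arithmetic that undoes it when the noise is known; it is an illustration under stated assumptions, not how DALL-E 2 is implemented. The linear noise schedule (1e-4 to 0.02 over 1000 steps) follows a common choice in the diffusion literature, the four-pixel “image” is made up, and the trained network that would normally predict the noise is replaced here by the true noise.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives up to each step."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Forward process: blend the image with fresh Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert the blend, given an estimate of the noise that was added."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
image = [0.2, -0.5, 0.9, 0.0]  # a made-up four-pixel "image"

step = 500
noisy, true_noise = add_noise(image, alpha_bars[step], rng)
# A trained model would predict the noise from `noisy` (and, for
# text-to-image, from a text embedding). We pass the true noise instead.
recovered = remove_noise(noisy, true_noise, alpha_bars[step])
print([round(v, 6) for v in recovered])
```

In a text-conditioned system, the noise predictor would also receive a text embedding as input, which is where the text encodings mentioned above come in.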
But I think you’re right that when you see these images, it’s almost like – you don’t get that full sense of the history, and you’re like “Oh, this has achieved a new level”, when in reality there’s been a lot of building blocks that have come along the way, and it shocks sort of the wider audience. But if you look back, there is a path that led there.
It’s interesting, that path spanned a couple of different times, the visual and the language model side. We really like to, from a human perspective, classify whether something is NLP, and it’s language, and it’s text, or this one’s visual… But a lot of these underlying building blocks are moving across. And so I think we’ve really seen that migration back and forth across modalities over time. And it’s interesting, there was a point in time where it kind of felt like there were different branches of machine learning going off in their own direction, to some degree. But I think what we’ve seen with these recent models is they’re all tied together, and they’re all coming back. And a lot of times it’s when you mix the chocolate and the peanut butter together that you get something a little bit new there, that has value unto itself.
Well, I often – I mean, I do get sort of in my own little NLP world, so it is good to look at kind of trends more widely. And I think the trends with the visual model, or the text to image models that we’ve been discussing - those are really instructive in terms of how all these things are connected. I think you could connect those things as well to speech models, which oftentimes speech models leverage kind of spectrograms or something, and so there’s a connection to processing those like images, with computer-vision type of models. My prediction, which will be wrong, because you know, all predictions are wrong… But my prediction would be that, like, we’ve sort of seen a lot of text image stuff… I think it would be interesting, and I think you’ll see a lot of people maybe exploring more of sort of audio in that mix, too… Whether that be audio image, or other things where there’s no textual representation; or maybe there’s music accompanying a video, and those sorts of things… I think all of that sort of fits in this direction that we’re going.
I totally subscribe to that expectation and from a prediction standpoint, because I think what you’re saying there is that there’s some modality mixes here that haven’t been explored yet. But it would make sense to go do those. I mean, as people have found some value here, I think that all of those different modalities and the transfer between them and such will be explored and created. So I think we’ll see a lot more of that over the next few years.
Yeah, I think that there’s a lot of areas to explore, that fit in this area. I had a discussion with some of our partners that work with deaf organizations, so organizations that work with various deaf communities around the world, and a lot of the hearing people in the tech world that have like thought about, “Oh, we could use AI for sign language”, the first thing that they think about is, “well, I’m going to take sign language and I’m going to convert it into text.” Which is not a bad thing to do. I’m not saying that at all. I think it’s actually quite interesting. But it’s not really – in my understanding, a lot of the deaf community, it’s not like the first thing that comes to their mind in terms of the technology that they would want as part of their community, because their language is sign language.
[30:40] So I was just on a call with them and we were sort of dreaming together, and it was like “Well, could we take sign language videos and generate sign language videos in a different sign language?” Because if people don’t know, there’s 400 sign languages in the world?
Wow, I did not realize that. At that level.
I knew there was more than one, but…
Yeah. Or just doing things like thinking about like, hey, sign language content is video, right? And we sort of take for granted that we can search through our content like it’s text, and we have text search. But if you have like hours and hours of sign language video, how do you find what you want? Are you forced to like go back into the text modality and search based on text tags, which are not your language? So is it possible to sign into a camera and have that be a sign-based search?
So I think all of this, like these sorts of things partially are unexplored because there’s communities that haven’t yet been served, or engaged from a community perspective with this technology. We were just talking with Joshua Meyer on our last episode about how beneficial it is when you can partner with the community, the end user group, and they’re the ones really driving, like “Hey, we really want this for our community, and we’re already putting work in. Could you partner with us to help?” That produces a lot of maybe unexpected opportunities.
So yeah, there’s definitely still a lot of multimodal, very interesting problems and areas to explore, I would say.
Do you think we’ve just barely scratched the surface so far?
Yeah, I mean, that’s kind of, to be honest, some of the laughability, I guess, to me of the sentient thing. And I don’t mean that to demean the person that said that, because obviously, they made a decision to be very opinionated about that, and defend it… But the world is just so complex. And to think about these models dealing with all of these things that we haven’t even thought about yet, in modalities that we’ve only begun to explore - I just view it as a really fun time to be part of this technology, because we get to explore really interesting things.
You know, it’s funny that you say that… I think it needs to be okay for us to enjoy the evolution of machine learning, deep learning, AI, whatever label you want to put on it, without having to try to always stretch… I know for me, if anything, I feel almost farther away from the idea of sentience than ever. But I say that with a deep respect for all that has been achieved so far by the global communities that are driving these technologies forward. And they’re hugely valuable. And will we get there someday? Probably, because we are just biological systems, and at the end of the day, that ability to understand what it is that makes sentient creatures sentient I think will be accomplished eventually. But I think that there’s a long, long, long runway before we get to that point.
[34:10] And we’re learning as we go, and I think it needs to be okay for us to say where we’re at today, and that doesn’t have to be the ultimate goal yet. We may have many years to go before we get there. But sometimes I think people are just overreaching for where we’re at. Appreciate the moment, appreciate the fact that we’re doing some really, really cool stuff today in research and the things that are coming out, and go do some really fascinating things with what we have now that we didn’t have yesterday.
Yeah, I think that obviously there’ll be a whole range of views on whether we would ever reach sentience or not, and somewhat driven by different religious and philosophical views on what it means to be human… And I would be opinionated about some of those things very much based on my faith, but the fact that we will create more coherent and amazing output from these models, like we’ve already talked about, is going to happen.
What I think’s interesting, and what the [unintelligible 00:35:09.06] paper kind of brings up, and discussions similar to that bring up, is we as practitioners might understand how these things are stochastic parrots, or maybe not… We have maybe a little bit more understanding about the relation between their output, the data we feed into them, the sort of variability and limitations… And I think it’s partially on us as practitioners to make sure we also tell that story well, and not just post a contextless, amazing picture on the internet, and mislead people about – you know, just promoting this idea that AI can do things you never thought it would be able to do, when in reality there is a sort of like predictable path, like you were talking about, of technology and incremental advancements towards that picture that you posted on Twitter. But when you do that, it doesn’t have any of that context.
No. You know, we’re always going to have that hype machine going. There’s a lot of reasons for that. But I guess there’s a little bit of a zen attitude of appreciate where we’re at, keep pushing forward and everything, but appreciate where we’re at for what it is, and that it doesn’t have to be more than it is for it to be a pretty wonderful thing.
Like I said, I’m a little bit zen-like in that way, but we’ve seen so many of those, and it’s one reason we’re still doing the podcast after all these years.
Yeah, I mean, it’s definitely been a ride, and always learning along the way, so I’m sure it will continue to be that way. And I very much did not expect to be talking about things like photos of raccoons wearing an astronaut helmet, looking out a window at night…
[37:07] [laughs] I’ve gotta tell the audience, before we stop recording today… So Daniel sent me this raccoon picture, and before he did that, I’m already looking at all of these raccoon pictures from DALL-E that there are out there. And so somehow we’re on this behind-the-scenes raccoon kick going here… [laughter] But yeah, so an odd, odd coincidence there.
This is a totally off-the-wall idea and train of thought, that has nothing to do with anything, but Chris, I know all that you do with animals… My wife is giving me a bit of a hard time, because I’m kind of geeking out over this… In our patio, she moved some house plants out there, and we have a bird that established a home in one of the pots out there, and has laid a couple of eggs so far, and so I’m trying to determine what is the appropriate live stream camera that I can set up to observe this, and then what – there has to be some type of like alert that I can do that’s AI-driven, that tells me when an egg has hatched, or when there’s feeding going on, or something. Yeah, this has consumed my thoughts for the past few days.
I’m not surprised at all… Although I was too lazy – we just stuck a live feed camera in there and watched it. And if there was daylight, there was always stuff going on between mom and dad taking care of the eggs.
But yeah, when I was on my last business trip away from Atlanta, I was down in Orlando and we were all around the table, and I was sharing a live feed with a bunch of people around the table of the mom and dad coming and going… And so I’m telling you, there are people who are really, really into this. And if you are the guy who puts out the model that alerts them for all the things, you’ll be a very popular man in the bird-watching world.
Maybe I’ll create a GitHub repo and a few YouTube videos describing my setup. That would be ideal.
Okay. Daniel Whitenack. He said it here.
Yeah. I mean, the people – maybe when you listen to this episode and you search for my repo and don’t find it, then you’ll realize that maybe I didn’t get as far as I’d hoped… [laughs]
Okay. Well, on that note…
Yeah, on that note, we will link some of the links that we’ve talked about in the show notes for everyone, so make sure to check those out, including the stochastic parrots paper and some information about the Diffusion models and everything. So that’s a great way to follow up on some of these topics.
We hope everyone enjoys the raccoon pictures that DALL-E produces as much as Daniel and I have.
Yes, of course. Alright, talk to you soon, Chris.