Telling what is real from what is fake on the internet can be challenging. Historically, AI deepfakes have only added to the confusion and chaos, but when labeled and intended for good, deepfakes can be extremely helpful. With all of the misinformation surrounding deepfakes, though, it can be hard to see the benefits they bring. Lior Hakim, CTO at Hour One, joins Chris and Daniel to shed some light on the practical uses of deepfakes. He addresses the AI technology behind deepfakes, explains how to make positive use of them, such as breaking down communication barriers, and shares how Hour One specializes in the development of virtual humans for use in professional video communications.
Well, welcome everyone to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How’re you doing, Chris?
I am doing okay, although I have to confess, a few days ago I took a fall while rollerblading, so I’m on Hydrocodone for the pain…
So I could say absolutely anything.
I’m really interested in this conversation even more now.
One thing I’ll finish by saying is I discovered that when it comes to rollerblading, I am a deep fake. I am not, I am not a talented rollerblader. I’ll leave it at that.
Yeah. Well, speaking of those deep fakes, we’re going to dive into a much deeper conversation around this topic that we’ve mentioned… So Chris, we’ve talked about deep fakes a few different times, I think, it’s come up. I think the wider, even non-AI people are aware of this technology and some of the things that have been created. But today, we’re really privileged to have Lior Hakim with us, from Hour One, who is the CTO at Hour One. And he’s going to be talking to us all about this technology and what they’re doing with it. Welcome, Lior.
Thank you, thank you for having me. It’s a pleasure. We can dive right in and speak about AI and deep fakes at large, and what we are doing.
Yeah, for sure. Maybe just starting with that, like - when someone asks you “What is a deep fake?”, you mentioned in a conversation, what is your explanation of what a deep fake is?
Yeah, so basically, what we’re trying to do is a virtual human, and to humanize the connection in how we can communicate with machines, basically. Because today, we are very used to text, pages, and then frozen video, and that is changing. Interactions with machines actually in the future will be different than what we experience now. And deepfakes, or other technologies, synthetic media or other names that are named like this, is basically some bridge point of how we interact or ingest information and communicate with machines. So I think that’s exciting.
[04:01] Yeah. And I guess some people are probably aware of things maybe they’ve seen on the internet, where you know – I think there have been ones with like Elon Musk or other people, where people have created a video of Elon Musk saying something, but it’s a synthesized video. And I know that that’s maybe something that comes to my mind when I think of deep fakes.
Like I was saying, there’s many people even outside of the AI community that understand that AI is able to create these very powerful and compelling videos; whether they’re for misinformation or for good purposes, AI is sort of starting to impact people’s content that they’re viewing on social media or wherever… How have you seen that develop, and from your perspective, how do you see the trend and how AI is influencing the creative side of what people are doing with video and other things?
I think this is really an exciting time, where creativity is basically unleashed, both in synthetic media, human faces, person that does not exist, and also prompt-generated images, and styles. And all of these kinds of things are coming together, and people are getting used to basically consuming stuff that is automatically generated. Creativity is basically changing.
I think that there are, of course, with every technology that comes about, people might be frightened, or are not used yet to these types of things… And I think that we will use it for good and for misuses, and we’ll adjust as going forward. But technology is moving at a fast pace. Later on we’ll talk about, of course, what we’re doing… But I think that a general thing to consider is maybe personas in the context of deep fakes.
So we have those personas… I don’t say fully subscribe to the notion of a fake and real, because I think me in real life and me in social media and other stuff like that might be different; I might present different things. So I think it’s a whole spectrum of uses and misuses. And definitely some things can be unusual. Sometimes maybe I would like someone else to be able to use my persona for some cases, I don’t know, people on commercials, or lending my image for a character in a movie that I’m playing, having the writer’s texts, speak it, and have the director direct me. And then in other cases, I might be more frightened if something that I didn’t mean to be to happen, I see myself speaking something that I don’t know. So of course, it’s a whole spectrum, and we’re exploring this spectrum, and I think it’s a very interesting place to be.
The way you started that explanation was fascinating to me, because I think so many people are introduced to the topic of deep fakes by some of the nefarious things that get popularized in media reports and news reports. But you talked about it in terms of kind of the way we’re interacting with computers, sort of that user experience to some degree maybe… And I’ve found that really interesting, because though I’m in the defense industry now, I spent over a decade in the creative digital marketing space, and we were all about personas… And I don’t think I had really adjusted my way of thinking about deep fakes to think about really focusing on the interactions versus some of these other ways that we’ve seen, which tend to be more on the negative side. So could you talk a little bit about that for a moment? Because you’ve kind of reset my perspective… How can deep fakes take us forward in the years to come, in terms of how those interactions with automated systems play out? And what’s different? And what should we expect as the normal as we’re looking at some of the possibilities over the next few years?
[08:13] So I think your likeness, the way you look, the way you sound, the tone of your voice, the style, your gestures - everything is kind of like your set of skills that you regularly put to work for other people when you go to work; you give your content, and then those types of traits, you will be able basically to digitize them and then put them to use in the context that you feel is suitable for what you want.
For example, you will be able to capture yourself, digitize yourself like an avatar. You can think about, I don’t know, Bitmoji. You directly design the avatar like yourself, and then you put it to use not as yourself. And the same thing can happen with real-life captured video, or with versions of yourself, filters that make you more attractive, or more happy, or if you don’t want to put makeup every day, you can just hop into a meeting or a sales presentation looking as your best self, from your home, with all the work from home, all those kinds of things. I’m just giving a few examples, not to be too abstract…
Well, thank you. Face for radio here, so yeah… I mean, that sounds fantastic.
Like we’re sitting now - we’re casual, and then people can see us talking in a nice studio, not just listening to audio, and enjoy everything without the need for us to cut our hair, and shave, and do whatever we do to present ourselves. So our likeness, our tone of voice, and how we present ourselves, our persona, how we perceive ourselves basically can be digitized and then put to use. If we can control – we can of course control it by ourselves, and have maximum control over the content that we deliver through our digitized character… And also, we can lend those characters to other people, to put content through us, if we think we can affiliate ourselves with that content, if we trust those people.
So it’s about creating trust, creating channels for people and communities, and then putting them to use. This is one side of the creator - I own my likeness, I own what I say, and how you’re being perceived. And then the other thing, the other side we are considering is the audience. Of course, the audience might want to see me in this podcast, or might want to see someone else giving this podcast, with another voice that he likes, with another face that fits him, or even with another language to be translated. And all of those traits, all of those modalities will be available in the future. You can consume the content at your pace, at your language, with – we call it pleasant interactions. So we will be able, by digitizing people likeness and the way we communicate with each other, for content to be delivered through machines.
I’m picking up on I guess a trend that you’ve been kind of alluding to, which is the fact that - I think where this technology has been maybe misapplied is where it’s sort of not accessible to a wider audience, and there’s a sort of concentration of who can use it and who can’t use it. But as you sort of make tooling, like the tooling that you’re developing, and give people that don’t have technical skills to spin up a GPU instance and like run TensorFlow in a distributed way, across a cluster, and all this stuff…
[12:03] As soon as you give a wider audience the ability to create with these tools, they sort of create their own persona, and they have control over that. But if they’re not able to access that technology, then there’s an imbalance of who can use it and who can’t, which kind of might produce some misusage. Because I’m thinking even of like audiobooks - maybe this predates the deep fake scenario, but for a long time with audiobooks or with things I was listening to on my phone, I could switch a voice potentially… Like, “Oh, I want to hear–” Or Google Maps is a good example of this, right? I can change the voice that’s going to talk to me from Google Maps. And that’s a preference thing on my end, right?
Now, I’m sure that there’s some complicated technology behind it, but the control decision on what is pleasant to me, like you were saying, is being made by me and maybe not by someone else. So how do you view I guess the shift between what changes when the technology gets out of the hands of maybe people like those talking here, that maybe know how to spin up a notebook and train a model, and into the hands of those that have no tech skills, but the technology is appealing to them on the creative side? What changes when we get this into the hands of that kind of audience?
I think it’s an incredible shift in creativity, in the ability of people to communicate their ideas, and basically to manifest what they know, what they think, either through text, through prompt, with reference images, everything that’s happening… And I think it’s something that is happening with our industry, with the state of the tech, not only with synthetic media and virtual humans, but also with image generation and prompt-invoked image generation. I think everything is very restricted now, because the owners of technologies know that there is risk, and they’re a little bit afraid of what might happen, and they don’t know… So it’s growing, and it’s opening, the community. Of course, I can’t avoid Stable Diffusion and everything that’s happening there. I think it’s super-interesting that it’s going to be open, and I think in the end what you said about the eBooks is super-interesting; that people can listen to – we’re coming from the angle that not only you can choose the voice, but you can also subscribe to have your voice read whatever books you are willing to read, and then Chris might listen to those books in your audio, and you might be rewarded in some way for this. So this whole marketplace of skills and traits that we have is basically I think one of the things that is being built.
And I think generally that technology is adapting, and we find good uses, and misuses, as I said before. I think the same we’re experiencing with social media, with groups, moderation and stuff like this. So it’s gradually expanding into our culture, changing and reaching our culture, and I think the future will be exciting.
Lior, I think there’s a ton of things to explore on this topic. Before we get too far, I would love to kind of give people an intuition for what is possible with this technology, and how. The scenario I have in my mind is let’s say sometimes Chris is out of town, and I want to record a podcast with Chris, but he’s not present. So I want to work with Chris to create a virtual Chris, and then when Chris is gone, I can just type replies to myself, and then talk to virtual Chris, back and forth. And then let’s say Chris has given me permission to do this.
You have very low aspirations, Daniel…
[laughs] Could you explain, Lior, what technology enables that? From a technical side, what’s needed to be put into place in order to materialize that scenario?
Yes. So we have a written language, which is very easy, and we have an easy way of inputting language into the machines, basically. And then we can take them and transform them into voice, with capturing the voice; there is voice cloning, a lot of stuff is happening in this field. There’s many companies, great companies, and other open source projects making this happen. With a few voice samples, we can have a text-to-speech engine, which basically creates the audio of Chris in that scenario. And then with other systems, we basically can take the audio in whatever language that was generated by the text-to-speech, and create this speech basically to the image of Chris speaking it in real time, if we are a vlog.
Basically, we can create – and the field is developing, but we can create types of looks for Chris, we can create more emotions, or sentiment in his speech… And it’s really – we’re in the early days, but basically, on our platform and platforms like ours, people can just jump in, write the text, choose characters, choose voices, choose languages, hit the Create button, and then, as you said, invoke a few GPUs in the cloud, and within minutes, if not seconds, you get a video, or a stream of that actual experience happening. I think it’s amazing.
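The flow described here (typed text, then speech in a cloned voice, then a video of the chosen character speaking it) can be sketched as a two-stage pipeline. This is a minimal illustration, not Hour One's implementation: `synthesize_speech`, `animate_face`, the 150 words-per-minute speaking rate, and the 25 fps frame rate are all hypothetical placeholders standing in for real text-to-speech and face-generation models.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    voice: str
    duration_s: float  # length of the synthesized speech in seconds


@dataclass
class VideoClip:
    character: str
    fps: int
    num_frames: int


def synthesize_speech(text: str, voice: str, words_per_minute: float = 150.0) -> AudioClip:
    """Stand-in for a text-to-speech model (hypothetical).

    A real engine would return a waveform; here we only estimate the
    duration from the word count and an assumed speaking rate.
    """
    num_words = len(text.split())
    return AudioClip(voice=voice, duration_s=num_words / words_per_minute * 60.0)


def animate_face(audio: AudioClip, character: str, fps: int = 25) -> VideoClip:
    """Stand-in for an audio-driven face generator (hypothetical).

    A real model would render one image per frame; here we only compute
    how many frames are needed to cover the audio.
    """
    num_frames = round(audio.duration_s * fps)
    return VideoClip(character=character, fps=fps, num_frames=num_frames)


def text_to_video(text: str, voice: str, character: str) -> VideoClip:
    """Chain the two stages: text -> speech -> talking-head video."""
    audio = synthesize_speech(text, voice)
    return animate_face(audio, character)


clip = text_to_video("Welcome to another episode of Practical AI", "chris", "virtual-chris")
print(clip)
```

The point of the sketch is only the shape of the hand-off: the audio produced by the first stage is the sole input that drives the second, which is why the voice and the face can be swapped independently.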
And for the audience - so we can see each other, even though you’re only hearing the audio… And you saw the look on my face. So the question I was wanting to ask was - you know, I love the picture you’re painting of what’s possible going forward. To get there, there’s kind of – going back to kind of the ideas of trust and authenticity and stuff to get people to see the positive on that… Because it’s very easy to see the scary sides of kind of that creativity in most people’s minds, because that’s been their first exposure to the field. But you’ve shown us that we can really take advantage of creativity and kind of optimizing situations. So how do you bring people along on that, so that they understand that it’s something worth engaging in?
And I’ll give you a brief analogy that’s not directly in this field, but it’s kind of close. The field that I’m in now, I know that we are moving into an age where automation can fly airplanes much better and much safer - and Daniel and I were just talking about this in our last episode - than human pilots will be able to. And I say that as a human pilot also. But getting people to trust their lives quite literally into that, and in the context of deep fakes to trust that they can take that step and be part of that creative process, enrich their lives and see all the benefits - it seems like that will be a challenge, to kind of bring people fully along the path in a broad sense; not just specific use cases, but in a broader thing. Any thoughts on how we navigate that kind of culturally and as humans together?
[20:20] What we are trying to do is create positive use cases, and just let people see the positive uses of such things. And actually, in our company we see a lot of people that want to become characters and have their own character, and we communicate in work with our own characters with one on one, sending Slack messages or videos and stuff like that, and having ourselves on the platform and be able to create that. And aside from that, we have a lot of people asking, “When can I be my character?” So those are early adopters.
And I think as things play out and those use cases are out there, and people see other people appear in content, and they see it’s safe, and it’s put to good use, and they are being rewarded for that, I think the general positivity is something that is built gradually, and the trust is built gradually. And I think it’s a good start that we’ve seen the misuses before, so we know what to watch out for. And going forward, we can start to see better and better uses of the technology. It started with a talk with Obama or Tom Cruise, or Trump, or whatever examples frightened people, and then we can just build from there with the good uses. People are already in a place where they might be frightened, but once they see positive uses, I think they might want to subscribe to this notion and join basically this character economy or virtual human economy that is being built.
I like the idea of thinking about this like an economy. I was thinking in my own scenario - I have recorded videos for different trainings and such around like AI and technical subjects in the past… Which I love doing, but it is a lot of work to get into an environment with the right lighting, and arrange a person to record it, and then produce the video, and all of that stuff… There’s been a lot of times where I’ll create what I think is a cool tutorial or something like that, but I’m sitting in an airport somewhere, or just on my laptop… I can’t record a nice video for that, in an engaging way, for an audience… But if I had this sort of virtual version of myself, I could see myself typing out text associated with that tutorial, and pairing an engaging video with a screencast, or something, which is much easier to produce on my laptop in an airport.
But then I was also thinking - you kind of brought up the idea of the economy, and it does seem like there is an incentive, potentially, for creators to do this, because… Well, what if I then had a group of people that – like, there was sort of a brand around the content that I was creating, but there were other people that had great tutorials, and they wanted to maybe submit them to my trainings, and put my face in front of it… And I liked their content, so I was getting more good content, but maybe I incentivize them financially for part of the trainings that people subscribe to on my platform, or whatever it is. So there’s like this nice kind of exchange of value between the two, as long as I appreciate the content that they’re doing, and they understand that their face is not going to be on it, but maybe they’re recognized in some way, right? And I understand that I’m going to recognize them in some way, but my face is going to be on it, and that’s kind of part of the brand. So all of this thinking is sort of flowing through my brain…
[24:24] Have we seen those sorts of like creators finding these new ways to incentivize this yet? Or would you say we’re still in a stage where people are exploring potential usage, I guess?
Yeah, I think people, from what we see, are in need for video content. They want rich content for their audience, and then they’re looking for ways to produce this content in an easy way. A lot of people don’t know about this technology, or that it’s even possible, like just typing in your text, your narrative, building up the scenes… Basically, they know they can make a PowerPoint, but they don’t know they can make a rich video with a character in an environment, just click a button and have a video play and put it there, embed it, share it, upload it, whatever. So they’re not there.
I just wanted to say one thing, because you made me think about something, about your tutorial example. So I think that what you’re able to do is record your first tutorial, and then get it into the system, transcribed into text, and then you can keep it updated and change characters. And the other thing is not only that some people would like to consume this content with your face, some other people might want to consume this content, the same content, at a different time, or in a different language, with another character or another presenter. So those are the things that we are dealing with.
So to extend that idea out, would you predict that the entertainment industry, and actors and actresses and musicians and such will be out there offering their brands as a form of interfacing with an audience? And so you might have – obviously, I’m making this up… You might want to have like the movie Grease, from way, way back, and young John Travolta and Olivia Newton-John are teaching –
I’m telling you, this is pretty cool. I liked this idea. I know it sounds a little silly, but bear with me. But you’re able to basically select something that has appeal to you, but you could then put content out there in that context. And you could actually have brands extended into kind of user-generated content, where you have kind of deep fake brands supporting that. I’m being a little bit silly for fun, but I’m not being too silly… Is that kind of – is that what you see in terms of this economy going forward?
Yeah. I think like everything with economies, there will be price fluctuations… And of course, famous people, A-listers, and B-listers and everyone else will take part in this economy once it’s grown, and we’ve grown the trust, and the ability to control where you’re appearing, and what you are appearing in, at what price, or what reward… I think it’s already happening today. Celebrities, Hollywood A-listers or stuff like that, they are advertising things in different countries that they might not advertise in their countries, because it’s another language, and stuff like that. And I think if this technology can help them expand that reach, and control where I’m appearing, at what prices, when I’m appearing, what is the content I’m delivering, and if we can build the structures to make these transactions flow, we can definitely make it work. I think I would participate; I’m now giving my content and my voice to this podcast, of course, and I might be able to participate with my voice in other places, and give the content that I want to give to other places, not necessarily with my voice or my appearance, or stuff like that. And then everything – the modalities of the content and the dimensions of the content will be basically just transactions, and will be assimilated to be consumed by the viewership in the best manner. And this is that pleasant interaction that we mentioned before. Everything will be more programmatic and will be consumed at the right place, in the right time, with the right delivery of the right content. This is what we think about.
So Lior, we’ve talked a lot about the sort of technology in general… We’ve talked a lot about text, sort of natural language processing on the podcast, we’ve talked also about speech-to-text and text-to-speech sort of things recently… I know we had Josh Meyer on from Coqui… They have great tooling around that. But this element of like once you have synthesized voice, and then like pairing it with an avatar that sort of has mouth movements that are matching up with the voice - that’s something we haven’t really explored from the technical side. Could you kind of catch us up on what are the state-of-the-art models related to that sort of interaction, and what sort of data do you need to have available to successfully do that sort of operation?
So we’re gathering basically video data of people speaking in different languages, different people with different appearances, in different angles, and stuff like that, and then we label the data with landmarks, and with a resolution, we align the data and we prepare everything else. And then basically what we do is we create a bridge, or a latent space that basically can encode the audio and decode the face, and then we can reconnect the audio in the back. And of course, we’re using mainly GANs, and exploring different things… And in our field, the main interesting thing is video, and its stability, and temporal stability; things that are not required in other fields. For example, image generation - now we see on different platforms, with DALL-E, and Stable Diffusion, and others… You have a seed, basically, and a seed creates the generation, and then morphing between seeds is not always flowing, so we are dealing with a lot of temporal stability and correctness of the expression.
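The remark about morphing between seeds can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the method described in the conversation: it treats each seed as a small random latent vector (8 dimensions, chosen arbitrarily), compares a hard cut between two latents with a linearly interpolated path, and uses the largest step between consecutive latents as a crude stand-in for frame-to-frame jitter. The helper names `lerp` and `max_step` are hypothetical.

```python
import math
import random


def lerp(a, b, t):
    """Linear interpolation between two latent vectors a and b, 0 <= t <= 1."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def max_step(frames):
    """Largest Euclidean distance between consecutive latents (crude jitter measure)."""
    return max(math.dist(p, q) for p, q in zip(frames, frames[1:]))


rng = random.Random(0)  # fixed seed so the example is repeatable
z_start = [rng.gauss(0.0, 1.0) for _ in range(8)]
z_end = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Hard cut: jump straight from one latent to the other.
hard_cut = [z_start, z_end]

# Smooth path: walk from z_start to z_end in equal steps.
num_steps = 10
smooth = [lerp(z_start, z_end, i / num_steps) for i in range(num_steps + 1)]

print("hard cut step:", max_step(hard_cut))
print("smooth step:  ", max_step(smooth))
```

Interpolating shrinks each latent step to one tenth of the hard cut. In a real generator, evenly spaced latents do not guarantee evenly changing pixels, which is one reason temporal stability in video needs attention of its own.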
Interesting. It’s very interesting to me how audio is represented within AI models, and oftentimes more like image than it is anything else in terms of like spectrograms, and that sort of thing. But then when I think of audio and language and video, you’re going to have a certain sampling rate for that video, you’re going to have a certain sampling rate for the audio… Those are likely going to be different, the sort of dimensionality of those things is going to be different, and maybe even different between different samples… So I was wondering, just generally, as you’ve kind of dug into this space, what are some of the data challenges that you’ve experienced working with audio and video data in the AI space? And for those out there that are kind of digging into these newer models that are either processing audio or video, or both, what recommendations could you make to people in terms of the challenges that you’ve faced in kind of really digging into this topic?
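The mismatch between audio and video sampling rates raised here comes down to a small piece of bookkeeping: deciding which audio samples belong to which video frame. The sketch below is one way to do that, assuming mono audio at a fixed sample rate and a constant frame rate; the function name `audio_span_for_frame` and the example rates are illustrative choices, not something taken from the conversation.

```python
from fractions import Fraction
from typing import Tuple


def audio_span_for_frame(frame_index: int, sample_rate: int, fps: Fraction) -> Tuple[int, int]:
    """Return the [start, end) audio sample range that plays during one video frame.

    Exact rational arithmetic avoids rounding drift for frame rates
    such as 30000/1001 that are not whole numbers.
    """
    samples_per_frame = Fraction(sample_rate) / fps
    start = int(frame_index * samples_per_frame)       # truncation == floor for values >= 0
    end = int((frame_index + 1) * samples_per_frame)
    return start, end


print(audio_span_for_frame(0, 16_000, Fraction(25)))  # (0, 640)
print(audio_span_for_frame(1, 16_000, Fraction(25)))  # (640, 1280)
```

Because each frame's end is computed the same way as the next frame's start, the spans tile the audio with no gaps or overlaps, even when the number of samples per frame is fractional and individual spans differ in length by one sample.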
[32:47] Yeah. To answer your question, the biggest challenge with data as I see it is data pipelines. Once this data is captured, which is usually kind of easy, the challenge is having the infrastructure to basically normalize, to clean, to align, and to label this data and bring it into the GPU. So I think the infrastructure for doing it and updating it is super-important for us. And aside from that, I think that clean data – like, labeling and cleaning the data is of the utmost importance.
For us, challenges that we might face is, for example, audio noises, and stuff like that, that are not necessarily – or a different person speaking, and not necessarily the person in the video that is captured or aligned. So those are the kinds of things that we are interested in. But as a general suggestion for all the listeners, I think thinking about the end-to-end pipeline of how to acquire the data, and then process it, all the way from the camera or whatever you’re gathering, from a link on the internet until it gets to the GPU through the data loader - we see it as one big challenge and trade-offs along the way.
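The end-to-end view described here (capture, clean, normalize, then hand off to a data loader) can be sketched as a chain of small generator stages. This is a toy illustration under stated assumptions: the `Clip` record, the 20 dB noise threshold, and the `speaker_matches_face` label are hypothetical, and a real pipeline would operate on actual audio and video files rather than short lists of numbers.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Clip:
    clip_id: str
    samples: List[float]        # mono audio samples (toy values)
    snr_db: float               # estimated signal-to-noise ratio
    speaker_matches_face: bool  # label: is the on-screen person the one talking?


def clean(clips: Iterable[Clip], min_snr_db: float = 20.0) -> Iterator[Clip]:
    """Drop noisy clips and clips where someone off-screen is speaking."""
    for clip in clips:
        if clip.snr_db >= min_snr_db and clip.speaker_matches_face:
            yield clip


def normalize(clips: Iterable[Clip]) -> Iterator[Clip]:
    """Scale each clip so its loudest sample has magnitude 1.0."""
    for clip in clips:
        peak = max((abs(s) for s in clip.samples), default=0.0)
        if peak > 0.0:
            clip = Clip(clip.clip_id, [s / peak for s in clip.samples],
                        clip.snr_db, clip.speaker_matches_face)
        yield clip


def batches(clips: Iterable[Clip], batch_size: int) -> Iterator[List[Clip]]:
    """Group clips into fixed-size batches, as a data loader would."""
    batch: List[Clip] = []
    for clip in clips:
        batch.append(clip)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


raw = [
    Clip("a", [0.1, -0.5, 0.25], snr_db=30.0, speaker_matches_face=True),
    Clip("b", [0.2, 0.1], snr_db=5.0, speaker_matches_face=True),     # too noisy
    Clip("c", [0.3, -0.3], snr_db=28.0, speaker_matches_face=False),  # wrong speaker
    Clip("d", [0.0, 0.4], snr_db=25.0, speaker_matches_face=True),
]

loaded = list(batches(normalize(clean(raw)), batch_size=2))
print([[clip.clip_id for clip in batch] for batch in loaded])
```

Writing each stage as a generator keeps the trade-offs visible: every filter or transform is a separate, replaceable step, and nothing is held in memory until the batching stage asks for it.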
Yeah… And I guess it’s likely that in terms of supporting the creation of an avatar for a specific person… For an actor in a studio you might have a lot of control over that, and be able to closely couple that, but as soon as you’re passing things over the internet, I’m sure that there’s degradation, and there’s like all sorts of things that you could come across… So yeah, that’s super-interesting.
I’m wondering – so we’ve talked a lot about digital humans or virtual humans, and this sort of avatar creation… You’ve kind of given a little bit of hints of what’s available and what you’ve built right now… I’m wondering if you could maybe summarize sort of like what the state of what you’ve built is right now, and then maybe a couple of things about what you’re excited about looking into the next couple of years, what you think is possible with the sort of features that might be added there.
Yeah. First of all, we have our SaaS offering live in production, you can register, you can try our system, there’s a free trial, you can create videos, you can select the avatar, you can check out the technology, with voices and everything. And we have a subscription model, and you can continue and make videos on the go whenever you need them. Well, our focus is business use; we really think the world of work is a huge opportunity to create trust, and we believe that future generations are used to consuming social media, or social video, and such, and are expecting the world of work to change from text to be more rich, and more engaging, and more pleasant and interactive. And this is where we’re going, this is what we’re building; you can definitely sign up and check us out. And can you repeat the second part of the question?
[36:22] Yeah, I was wondering, kind of looking into the future, and maybe Chris you had something as well, but there’s this sort of like text to avatar creation, and the variability and the creativity that you could do with that… What are maybe some of those things that are on your mind? And you’re not committing to anything, by any means, but what are those things on your mind, like “Well, if we could enable this in the product, that would level it up a lot?” What are the things on your mind in that regard?
Yeah, definitely. So I won’t expose everything that we’re gonna launch soon, but I’m just saying, exciting things are on the way. We are super-excited about prompt accessibility, image generation from prompts, and people being able to add media to those videos by the text or by the narrative, recognizing this narrative and making prompt more accessible. And prompt engineering, which is a big thing now in the industry, of how you create the imagery that accompanies your narrative and create a compelling video in a rich environment. This is one thing we’re very focused on.
I think, environments in general, 3D environments, and rich videos and other things that create a whole experience, like basically watching TV - I think those experiences will get closer there. And we’re super-excited about more geeky stuff, which is like inversion, if you’re familiar with that; that’s embedding – referencing words into the prompt and then using those objects, or basically translating your likeness to another domain. For example, someone that looks like me, just with hair. You’re not seeing me on the podcast, but I’m bald. So all those kinds of things and ability to change something that looks like you, but have some of your traits.
Someone who looks like me, who is able to rollerblade successfully and well. [laughter]
I wonder – I don’t know if this is some of what you’re getting at, but I could see how some of that could be used to where you have an avatar and the voice, but you could also bring in – like, if you’re talking about a car driving down a street, and then this happens, that’s sort of like generating this sort of like almost like B-roll for your video, which seems quite interesting to me, because there’s so much of that video out there… And so in the same way that a lot of this text to image stuff is happening, it seems like you could generate some really compelling kind of transition shots, or whatever that might be.
[39:11] I’m wondering - as we sort of get closer to the end here, one of the questions that our listeners might ask is, you know, from an expert in the field, who is working in this every day, as you mentioned, this technology is only going to get more compelling. Like, TV-quality, very high resolution, very compelling… As some would write in the language space, like very coherent output… How would you recommend people think about like how do I – it seems like we’re getting into a space where I can’t tell what’s fake and what’s real. What would you tell people in terms of like as I navigate the world, and I look on social media and all that, is it even relevant anymore to think about telling the difference between those two? Or how would you recommend people kind of think about that certain aspect of kind of the cultural shift that this technology is causing?
I don’t know how philosophical to get with this question, but basically, a lot of the discussion that we’re having is, of course, Photoshop retouching, and how people appear on social media, and filters and all that stuff… It’s a build-up to this discussion. But we’re thinking about AI at large, and we like to think about not like “What are we teaching AI? And what is it learning?”, we think about what AI is teaching us, in some sense. And then we think about what is it learning. And then the bias in the models - it’s actually a reflection of the culture.
So we think basically, it’s trying to show us or to teach us, in some sense, what we are, and then we can choose and build our culture in with creating – it’s a two-way communication from these new technologies to our culture, and I think it will definitely be exciting. And as a culture, we will decide where we moderate it, in some sense.
Yeah, that’s really, really good input. I think this is one of those conversations where the possibilities seem many, and there’s definitely going to be some things, that like you say, cultures, governments, societies will have to wrestle with. But I think on the whole, I’m very excited to dive into some of these things. I’m really excited to jump over and create a few videos. I want to share a couple things in my own Slack channel and see what people’s response is, and if they recognize that this is a generated video.
But yeah, I really appreciate you taking time, Lior, to join us. It’s been an awesome conversation, and looking forward to the amazing things that Hour One is coming out with, and I hope to stay in contact and have you back on the show, both in real life, or as a virtual human, however you prefer.
Thank you for having me. It’s been a pleasure talking to you guys.