Practical AI – Episode #246
Generating product imagery at Shopify
with Russ Maschmeyer, product lead for spatial commerce at Shopify
Shopify recently released a Hugging Face space demonstrating very impressive results for replacing background scenes in product imagery. In this episode, we hear the backstory and technical details behind this work from Shopify’s Russ Maschmeyer. Along the way, we discuss how to come up with clever AI solutions (without training your own model).
Sponsors
Advent of GenAI Hackathon – Join us for a 7-day journey into the world of Generative AI with the Advent of GenAI Hackathon. Learn more here!
Traceroute – Listen and follow Season 3 of Traceroute starting November 2 on Apple, Spotify, or wherever you get your podcasts!
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:08 | Welcome to Practical AI | 00:27 |
2 | 00:35 | Sponsor: Advent of GenAI Hackathon | 01:19 |
3 | 01:54 | Daniel's Shopify experience | 02:13 |
4 | 04:07 | Prepping for the shopping season | 01:28 |
5 | 05:35 | AI and e-commerce | 01:45 |
6 | 07:20 | Surprising AI in retail | 01:46 |
7 | 09:06 | Retail's multi-modal uses | 03:23 |
8 | 12:29 | Chris' new store / raccoon socks / new merchant's perspective | 04:08 |
9 | 16:37 | Sponsor: Traceroute | 01:38 |
10 | 18:15 | Product photography | 06:06 |
11 | 24:21 | Why open source? | 04:09 |
12 | 28:30 | Traversing the AI jungle | 04:57 |
13 | 33:27 | Grounding generative models | 09:13 |
14 | 42:40 | Discovering little hacks | 03:36 |
15 | 46:16 | New technology and commerce | 03:04 |
16 | 49:29 | Outro | 00:45 |
Transcript
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the founder at Prediction Guard, and I’m joined as always by Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing very well, Daniel. How’s it going today?
Oh, it’s going great. As our listeners know, or at least the ones that have listened to a lot of our shows, my wife owns a business which is an e-commerce business…
Quite a business.
…and next week as we’re recording this - for those listening at a future date - next week is Thanksgiving here in the US, which means next Friday, basically… Well, it’s already sort of started, but next Friday to that following week, Black Friday, Cyber Monday is a huge retail and e-commerce extravaganza in our world… And so I’m really excited, because leading up into that, today we’ve got the expert with us, Russ Maschmeyer from Shopify, who is product lead for spatial commerce and Magic Labs at Shopify. Welcome, Russ.
Hey, thanks, guys. I’m super-excited to be here. It’s been cool to follow along with the podcast and just super-stoked to chat with you guys today.
Yeah, for sure. Well, I’m coming into this super-excited, because over the past - well, it’s been 10 years, my wife’s business, and at least… So nine of those years they’ve been on Shopify, which means I have been in Shopify, I’ve dug into all the data behind, I’ve worked with the Shopify API, I’ve built chatbots on top of Shopify to sign up wholesale customers, I’ve dug into the liquid code on the site… So I’m all about whatever I can learn from you today and hear about what’s going on at Shopify. I am super-excited.
I didn’t even know he was that much into it, Russ. [laughter]
Well, when you’re the husband of an e-commerce entrepreneur, and you’re also a data scientist, occasionally favors are asked, and…
I’m feeling very third wheel now, I just want you to know…
Well, over that time, it’s been cool to see how Shopify has added so many amazing features, and is really powering a lot of huge brands; not only small brands, but larger brands. I’m sure you all are gearing up for a creative – first off, I just have to ask… What is the week leading up to Black Friday/Cyber Monday like at Shopify?
You know, it’s very busy, as you can imagine…
I was gonna say, what a loaded question…
It’s the kickoff to the biggest shopping season of the year. Shopify powers just an enormous amount of that holiday shopping season, so you can imagine the teams internally are prepping for it. They are getting products locked in place, and just operating at their optimal, maximal performances, just to support the load that’s coming in this upcoming weekend. But every year we also launch this really cool live globe, that’s a 3D visualization of all the live data, and orders happening all around the globe in real time. So you see orders streaming around this globe… And so this year, I’ve been also helping to lead some of those efforts, so I’m really excited for that to get its annual debut this year. You might see some ideas we talk about today appear there.
Cool. Yeah, it’s like the live view of Santa going around the world at lightspeed.
Totally. Totally [unintelligible 00:05:29.18] for entrepreneurship. [laughter]
There’s all sorts of interesting things that Shopify is doing, but specifically here we’re talking about AI and what you’re doing in regards to that. Maybe as we jump into that, could you describe a little bit from a person that’s embedded in this kind of e-commerce world, and seeing what a lot of people are doing in various industries, various stores, how do you view the impact of AI on e-commerce specifically right now and kind of where it’s headed? Where are we seeing the biggest impact in terms of AI right now in e-commerce? And I know we’re going to be talking about some of the recent things you’ve done, but kind of across the board, how does it look to you and what are people thinking about?
[00:06:21.08] Yeah, I mean, Shopify was pretty early in kind of this new wave of AI capability to say “Hey, whoa, this is a completely new class of possibility for the tools that we make for merchants, and the shopping experiences that our platform provides on the other end to shoppers.” And Shopify is really just kind of here to make commerce accessible and entrepreneurship accessible to everyone. And we’re really excited about these tools as a way to kind of further democratize entrepreneurship with people. There are so many things you have to create and produce, and ideas to develop, and knowledge to gain about markets, and positioning, and strategy, and branding. There are so many tasks that entrepreneurs have to learn and develop along the road to building a successful business. And LLMs, generative AI are all incredibly powerful tools to help accelerate that learning curve for new merchants, and to help them kind of get up that curve faster and build better businesses.
I’m curious, us consumers in the world, we hear about AI impacting retail and stuff, but for those of us who don’t have as much of a view on it, can you kind of talk about how you’ve seen retail change in these years, the last few years with AI really kind of getting in everywhere? What are some of the things that might surprise people? They kind of know there’s AI there in the background, but they don’t really know how it plays, or what it is. Surprise us a little bit; what’s something where you go “Oh, wow, I didn’t realize that”?
Well, we started by just adopting these tools in our engineering practice to begin with. We got some of the early previews of Copilot, and started using that to help accelerate some of our development work early on… But really, the place where we’ve seen it have the biggest impact in the near term is on tools for merchants. When we think about who our core customers are at Shopify, it’s the merchants who we power with our platform, and enable them to do really creative, amazing things, at the scale that they never maybe thought was possible for them. And AI is, again, sort of a way to accelerate that work, and give them more time back to - you know, instead of spending an hour and a half trying to craft the perfect product description… Because you’re not totally sure exactly what makes a good product description. Last year at our winter edition we shipped a really simple tool where you just like enter in like a couple of raw details about your product, and hit the Magic button, and it just writes a well-crafted narrative product description that speaks to product benefits, and all the great standard practices of writing a good product description. And you get that in seconds, versus an hour of human toil. So the place where we’ve seen AI really have the biggest impact early on is just in accelerating the work that merchants are already doing, and allowing them to do more.
Well, I guess it’s e-commerce, but also like web content development… It’s a very multi-modal thing, right? Like, you’ve got these product descriptions. That’s part of it. You’ve got product imagery, you’ve got website layout, you’ve got potentially ads, and integration with like other platforms… Talk a little bit about - like, within that space, because as you mentioned, there’s so many tasks to address within that space… As Shopify kind of looks at the merchant experience, how have you narrowed down on the particular problem sets that merchants really want to hand off, versus like those things – I know also from just being in it, marketing teams love to get in there and tweak things, and be part of the process… But they also really don’t want to do certain things too, or things that are just kind of grunt work, essentially…
[00:10:06.11] That sounds like it’s coming from experience right there.
Yeah. [laughter]
Yeah… I mean, we have a word that we use at Shopify, “toil”. This idea of work that kind of has to be done, but isn’t desirable work to do… And so we look for toil that merchants do. And so we spend an enormous amount of time sitting down with merchants, talking with them about how they use our platform, what they want more out of our platform, what they wish they could be doing with their business, what they are doing with their business… And from that, we’ve learned a ton about what are the ways that merchants would like to spend their time, and then what are the ways that they just kind of have to, because that’s the way the world works right now. So I think the opportunity for us is to find those moments, and to build tools, particularly magic tools into those spaces, that just sort of like make that go away. And when we do that, what we hope is that merchants will take that extra time that they have, that hour that they got back not spending on that one product description, or that one blog post, or that one email headline, like “Ah, should I use A or B? I don’t know…!” and just give them a really easy tool to generate that content, make it really high quality, give them the control to adjust if needed… And then publish it really quickly.
You already mentioned one of those, this sort of product description thing. Are there a couple other ones that you could kind of highlight, just to give a sense of the breadth of how this technology is applicable in this space?
So we’ve launched this suite of tools that we call Shopify Magic. It’s our free suite of AI-enabled features across our whole Shopify admin. And these things crop up in a few different places. It can sort of help you take the power of your own data, and make it work better for you… And we’ve applied that in places like email headline subject writing for marketing emails, and things like that, we’ve leveraged it in the context of generating blog content, obviously product descriptions is another… And we’re obviously really excited about some of the early work that we’ve also done in the image generation space. We recently released a Hugging Face space that I’m super-excited to dig into more, I’m sure a little bit later.
When you think about a storefront and the kind of content merchants need to produce, it really falls generally into one of two categories. It’s either text, or it’s images. We’re really excited about both of these spaces, and helping accelerate merchants there.
I know Daniel has used the tools a lot, but if you had someone who was a novice and they were getting into business, and let’s say they’re starting it now; so they haven’t been doing it –
It’s a new store. Chris is going to sell socks to fund your animal charity…
Raccoon socks. There you go.
Raccoon socks, there you go. I like it.
Christmassy raccoon socks. How’s that?
Will it keep the raccoons out of my trash cans? Will the socks – is that what they do? [laughter]
If you want them to, that’s no problem. For those who are going “What just happened on the show?”, in the world away from technology and AI, I’m a wildlife rehabber, and right now I have 20 raccoons at my house. So that’s what that’s all about.
It’s a full Christmas party.
Oh, it’s quite a Christmas party. You put 20 raccoons loose in a room… Oh, boy. Yeah. Okay, so back to my new store that I just opened up. I’m excited, I don’t have Daniel’s depth of experience at this… What are all of the amazing things I’m either by myself, or I don’t have a lot of help… Everyone’s tossed me to the wolves… I’ve come to Shopify because I know you have all these magical tools. Could you tell me a little bit about that experience from a merchant’s standpoint? Like, on day one, what am I getting into? How should I think about it a little bit? And how do those AI tools directly impact what I want to start doing today?
[00:13:51.29] Yeah, totally. Well, I think from a merchant’s perspective, if they were to log into the admin today, I don’t think they’d be overwhelmed with the amount of AI tools sort of all over the – I think today we really started with a focused approach that feels super-seamless, and integrated into just the activities that merchants are already doing. For example, product descriptions. Let’s say you’re a merchant, you’ve just started building your storefront, you’re super-excited to get that up, put a great face out on the web, and you’re starting to build out your product catalog, you’re starting to think about “How do I merchandise my products? How do I talk about them?” You’re a new merchant, you haven’t really done this before; you don’t know what the best practices are for product descriptions, or if you want to create some SEO content to kind of market your brand and product expertise in the space, you can go into your product detail editing page for a product you want to add, and just drag and drop your images. One of the really cool things - and I’ll say this because I’m also product lead for spatial commerce - is you can also drag and drop 3D models into that image bin, and it’ll handle it beautifully [unintelligible 00:14:56.04]
Raccoon 3D models. That’s awesome.
This is gonna be an awesome site, Chris.
I’m looking forward to it.
So if you’ve got 3D models, drop them in there, too. Those will display on your product detail page on your web storefront. And then when you get to that challenge of like “Okay, now I’ve got to write a product description… Oh, gosh, I haven’t thought this through. I’m not really a copywriter… I went to business school, and maybe I can write things, but I don’t know what – what’s a good product description? What does that even look like?” And I could go and spend an hour or two hours doing Google searches and combing through results, and sort of like collating my own idea of what makes a good practice for product descriptions… Or I could just click on that lovely little sparkle button, after entering in like “Oh, it’s white, it’s these dimensions, it’s got these materials”, and just like “Boom!” and you’ve got this incredible text description of your product. It pulls from your product title. So if you’ve mentioned that it’s like this kind of product, or this category, it gathers all that context initially, and then brings that to bear on the description that it writes, and you’ll have the ability to pick what tone you want that description to have. So we give the merchant some ability to kind of shape “Oh, do I want this to feel sophisticated? Do I want it to feel fun? Do I want it to feel like there’s deep expertise behind this product description?” And so I think those really simple tools just kind of placed seamlessly into the UI, exactly where the merchant is kind of doing these activities today anyway, is really kind of the powerful first step that we want to take to introduce merchants to these new tools. And then we’ll expand from there in some pretty powerful ways.
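As a rough illustration of the “raw details in, polished description out” flow described above, here is a minimal sketch. The prompt template, helper function, and OpenAI model choice are assumptions made for illustration only - this is not Shopify Magic’s actual implementation.

```python
# Minimal sketch of generating a product description from raw attributes.
# The model, prompt template, and helper are illustrative assumptions;
# this is not how Shopify Magic is implemented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def draft_product_description(title: str, details: list[str], tone: str) -> str:
    prompt = (
        f"Write a product description for '{title}'.\n"
        f"Key details: {', '.join(details)}.\n"
        f"Tone: {tone}. Lead with benefits, keep it under 120 words, "
        "and end with a call to action."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are an e-commerce copywriter."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content


print(draft_product_description(
    title="Christmassy raccoon socks",
    details=["white", "organic cotton", "one size", "festive raccoon print"],
    tone="fun",
))
```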
I think how we initially started chatting back and forth was at least partially because seeing this Hugging Face space, which is really cool, that you all put up, and I know got a lot of attention, partially in the merchant world, but also in the AI world, around the community that’s being built around open source AI tools on Hugging Face, and you had a space there that had to do with product photography… Before we go into the technology, the space, kind of how this works, could you describe a little bit of the motivation behind this project? Because you mentioned product photography, but people might not – if they haven’t been exposed to kind of e-commerce as much, or worked on their own e-commerce store, they might not realize what product photography means, and some of the challenges around it. So could you kind of set up the motivation for this, before we hop into the technical pieces?
Yeah, absolutely. So merchants spend an enormous amount of time and money generating visual media that’s compelling, that gets people excited about their products; either the details in the design, or the lifestyle that it might afford whoever buys it… And these images are really core to what drives a lot of commerce online, whether it’s advertising, or whether it’s building an attractive storefront, a web storefront, or whether it’s appearing in various channels, in different marketplaces as well… But not least of which is on the product detail page, where somebody has landed, a shopper, and ostensibly they’re interested in this product… And the job of those images in that context is to do the best job possible painting a picture of what that product will look like in somebody’s life, as well as all of the details about the product.
Early on last year, when Stable Diffusion and other open source image models started to land, I got really excited about a future where you could imagine merchants just being incredibly more agile and cost-efficient in how they create these images. And so we started digging in pretty quickly; we played with DreamBooth as soon as that was available, and Stable Diffusion, and we started to see actually “Could we train a DreamBooth model that could encapsulate the concept of a product, and recreate it in high fidelity, over and over again?” That’s like the dream, right? And we’re getting closer and closer to that, but we weren’t quite there yet. But some of those early explorations proved beneficial to understanding the space, understanding the technology, and thinking a little bit more deeply about some simpler ways that we might be able to bring this to market in the near term.
When we think about the opportunity for image gen in commerce - I mean, it’s massive. And the ability, the promise of being able to recreate your product in high fidelity in any scenario is kind of the dream. You can imagine requesting any kind of lifestyle or product detail image and just in seconds getting that out the other end to use in your storefront, or to use in blog content, or to use in advertisements about your product. And that’s incredibly powerful, because commerce is always changing, taste is always changing. Seasonality is a huge piece of commerce, and thinking about how do you merchandise and market your products differently in the spring versus the fall… And keeping up with the amount of imagery just required to drive that part of your business is really challenging.
I think the reason that we got really excited is that we saw an opportunity to take the existing imagery that merchants had, either from past photoshoots, or from humble at-home photography with their kind of mobile cameras set up on their kitchen counter, or whatever they might have access to, and give them a tool that could not really change any pixel of the product itself, but otherwise completely reinvent the reality around that product.
[00:22:14.01] So we started to work on – you’ve seen a lot of examples of this out in the market, but I think the key problem that we saw with a lot of these early examples of this, where you do object segmentation to select the product and keep it sacred; you don’t touch any of the pixels that you sort of guard and mask there… But then all the pixels in the background, you reimagine with AI. And what we saw with most of these early tools, as I said, was that there was this real disjoint appearance between the product that got masked and safeguarded, and the reality that was created around it. The camera angle – it looked like “Oh, well, this one was taken from above, but the original product image was from straight on”, and there’s no grounding shadows, and there’s no realistic reflections of the product in the environment… The pixels of the product and the pixels of the environment aren’t speaking to each other. One hand doesn’t know what the other is doing, and so they can’t knit those pixels, those moments of grounding around the product, that really sell the illusion that it’s part of this other reality; those shadows, those ground reflections. Seeing maybe some of the light of the scene hit the product object itself.
And so we wanted to really tackle those grounding problems that we saw in a lot of these early examples. I’m happy to dig into all the technical details of what went into that, but that was really the opportunity that we saw, was to begin to bring some of this magic to merchants really early, before we’re even yet to that perfect personalization of “You upload a bunch of images of your product, and now it produces them again perfectly at the other end.” We can begin to bring really powerful tools to merchants in this space already, even with techniques like this.
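For readers who want to experiment with the general approach described here - segment the product, leave its pixels untouched, and let a model reimagine only the background - a minimal sketch follows. The segmentation helper (rembg) and the inpainting checkpoint are illustrative stand-ins rather than Shopify’s pipeline, and this naive version exhibits exactly the grounding problems Russ describes: no contact shadows or reflections tying the product to the new scene.

```python
# Naive "mask the product, regenerate the background" sketch.
# rembg and the inpainting checkpoint are illustrative stand-ins,
# not the models used in Shopify's Hugging Face space.
import numpy as np
import torch
from PIL import Image
from rembg import remove  # off-the-shelf background removal
from diffusers import StableDiffusionInpaintPipeline

product = Image.open("product_photo.png").convert("RGB")

# Segment the product: rembg returns an RGBA cutout whose alpha channel
# separates product pixels from the original background.
alpha = np.array(remove(product))[:, :, 3]

# Inpainting masks are white where the model is allowed to paint, so invert
# the alpha: product pixels stay "sacred", only the background is reimagined.
background_mask = Image.fromarray(255 - alpha)

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="a ceramic mug on a rustic wooden table, soft morning light, "
           "commercial product photography",
    image=product,
    mask_image=background_mask,
).images[0]

# Typical failure mode (the grounding problem): the product keeps its original
# camera angle and lighting, but the new scene has no shadows or reflections
# anchoring it, so the composite looks pasted-on.
result.save("naive_background_swap.png")
```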
And just as a point of clarification, when you say grounding in this case, you’re kind of talking about that visual context going across different – as opposed to technical grounding with a model, and such… Just because we talked about both on the show.
Yeah. No, it’s a great clarification. Yes. I’m purely talking about sort of the visual aspects of the output image, and making that product feel seated in the new reality in some visual way.
Before we hop into more of the details about how you actually accomplish this, I’m wondering how you see the kind of state of open source generative models in comparison to maybe some of the other platforms that are out there, that do enable amazing things, but not in an open source way. It sounds like at least for your team – I don’t know if it was kind of personally important to you and your team to leverage some of this open technology, or if it’s just like these things are openly available, they’re licensed permissively for our use, and they’re enabling things that we couldn’t do before… How do you view kind of the state of generative AI on the image side specifically? Because we’ve talked a lot in recent weeks about the text side of things, and how maybe text generation models that are open compare in certain ways or other ways to closed models… But I’m wondering from a team that’s actually used this kind of image generation models that are open and licensed permissively, what was that experience like for you? It sounds like this grounding element was one thing that you had to deal with, but what was it like for you generally to kind of work through the details of getting the model, figuring out how to run it, figuring out how to scale it maybe, that sort of stuff. Could you kind of describe a little bit of that process?
This is a really early field, so we’re still figuring out what the tools need to look like, and how to work efficiently… We were working on some of these early ideas in a very sort of falling over ourselves way in some notebooks, trying to collaborate and work together, and just not sort of seeing the pace that we wanted to see in our iteration speed.
Our team works really quickly. We work on kind of these three-week sprints to just very rapidly prototype and understand a new technology space, and develop some kind of potentially useful concept there.
[00:26:10.04] We needed a way to move faster, and Tobi, the CEO at Shopify, is incredibly technically adept. He’s an incredible developer in his own right, and was really interested in some of the image gen work that we were doing in the early phases, and suggested that we pick up this new tool called Comfy UI, which is an open source tool. We’re big fans of open source at Shopify; it’s why we shared to Hugging Face, because we want to contribute back to that community. You can go take our pipeline and do something with it. It’s up on Hugging Face.
So we’re really excited about open source, and obviously the capability of other providers as well. Our objective is always to bring the best technology to our merchants, whether it’s open source, or by a closed provider. So we’re really excited about all contributors in this space and what tools we can build for merchants with them.
We focused a lot on SD in the early days, and we were excited when Stable Diffusion XL launched. That’s actually the model that underpins our Hugging Face space. We’ve done a lot of work with Stable Diffusion in all of its iterations as we’ve explored this space, and are excited to continue to work with it, and obviously, build amazing, new stuff with it.
But yeah, I mean, we used Comfy UI, we dug into it… I think what we loved about it is that it’s this Node-based UI. I come from the design world originally, for product and UI design, and there was this much-loved tool, originally from Apple, but it got hacked by a bunch of prototypers, called Quartz Composer. It’s a Node-based interface with a bunch of little modules that do little conversion jobs, and wire them together in these sort of larger machines, and recompose and move things around really quickly, and rewire them and change the sort of constant values, and very quickly build these very complex computing machines, in a visual way. And so for me and for our team, that was a really powerful tool for us to accelerate our process, and we began building these machines, this pipeline that we ended up putting up on Hugging Face in Comfy UI, and iterating there. And when we had it to a great place, we pulled that code into Hugging Face, sort of rebuilt everything ground up with the models hosted on Hugging Face, and sort of encapsulated the pipeline there. But we were able to iterate super-quickly and visually this way, and see exactly what every piece of the machine was doing at each run.
It’s really interesting, because you’re taking new capabilities in the AI space with a large business that’s running, and you’re trying to do the uptake while absorbing the technology at the same time… And as you pointed out, your CEO brought Comfy UI to your attention… As you’re doing these activities as a business owner in general, the folks that are there, how do you decide to make investments in certain areas with these new technologies, and decide – because there’s the pull and push of “Well, direct AI isn’t our business. Our business is to make the best platform for all these merchants.” And yet, there’s all these new capabilities out there, but they’re not mature enough.
You brought an example to bear a second ago… That’s a complex set of business processes to work through and figure out what’s the right level. How does Shopify think about that, you and others there, in terms of is this a step too far to go on a particular leap, or this is appropriate, like Comfy UI turned out to be? How do you make those choices?
It’s a jungle, and one of the tools that we’ve used is really our Magic Labs team. So early on – well, actually rather at the end of 2022, as we began to see some of the rapid advancements in LLMs begin to take shape, and the product possibilities became clear, we started our early efforts around product descriptions, and generating those on the fly for merchants. And early on it was really about saying “Okay, what are the things that this tool, this new technology is going to obviously be capable of? …with maybe a little prompt engineering, we’ll figure it out. But what seems to be well within its grasp, but also have maximum time-saving value to merchants?” And product descriptions was like that perfect Venn diagram out of the gate. It was just kind of obvious to everybody.
[00:30:24.00] We already know so much toil is spent on just creative writing; it’s something that can probably be written pretty quickly, if you have the necessary context, and best practices, and all that stuff. So we got to work on that, and we shipped that super-quickly. I think we turned that around from concept and team assembly, to launching it at our Winter Edition in about two months. So it was just an incredible, accelerated – one of those moments, where just the right people, and the right technology, and the right opportunity come together.
And pretty rapidly out of that, we formed the Magic team at Shopify, to sort of help invest more deeply in these AI technologies, and figure out all of the places and all of the ways we wanted to leverage these new capabilities across the admin, to help accelerate what merchants were doing. And so we’ve continued to work on a bunch of different ideas there, not least of which is some of the image gen work that we’ve done. And the way that we’ve kind of worked through this space - because there is so much going on… Every week you’ve got to weed through at least a half dozen groundbreaking papers, all over the map. And so a big part of the process is connecting to that firehose of what’s happening, so that you never lose sight of like a paper that might completely change how we think about serving our merchants… And then weeding through those and just sort of logging them as you go. I’ve got a Twitter bookmarks folder that’s just so deep, that I get back to periodically and sort of pull things that sort of feel like they have remaining value out of, and surface to the team, and surface to the company and start discussions around.
And within Magic Labs, our small team has been iterating on this three-week cycle to just digest all of these new technologies, all of these capabilities. Every three weeks we pick up a new one; we have no roadmap, we just have areas of curiosity, and every three weeks we look at what’s out on the table in the world and we say what’s the most exciting or potentially impactful thing for commerce [unintelligible 00:32:19.02] based on what we see here, and we pick what we want to work on within a day, and we’re prototyping by day two or three, after having picked up either a new piece of hardware, or some open source code on GitHub to get started… And we’re prototyping. And within three weeks, we’ve gotten to the end of that process, we’ve got a deliverable that either disproves that something we hoped could be possible, it’s actually not possible, and here are the reasons why, and here’s now what we’re looking for in the next iteration of this technology… Or, quite often, actually there is a path here; here’s what it looks like. Here’s how we might shape this. And from there, tons of internal teams are eager and interested and hungry in sort of like how to rethink their products or how to leverage these technologies for their particular challenge with merchants.
We’re really lucky to just be working in an organization that gives us really fertile ground to just kind of bring these technologies and what we’re learning about them to a really wide set of problems that all seem very tractable based on the trajectory and what we’re seeing in the tech right now.
Well, Russ, I really appreciated your perspective on how your team is thinking about processing a lot of these advancements that we’re seeing in technology and tools so rapidly… Which is definitely hard to keep up with, but I love how you’re thinking about these short cycles of work, and thinking about what could be impactful. I’m wondering if we could revisit this problem of grounding with these product images, because I think some people might really be interested in that. And I’m wondering, to start with that, could you just rephrase the kind of main problem of this grounding, for people that are new to this? And walk us into how you identified this problem and thought about coming up with a solution to it… Because I could see a whole spectrum of things here. There’s a hierarchy of ways to do this, everything from “Oh, we just need to make our prompts better”, to “We need to retrain a model from scratch that’s Shopify Stable Diffusion or Shopify GPT for this”, right? And –
[00:34:32.29] Shopifusion?
Yeah, exactly. Shopify Fusion. Yeah, obviously that latter end of the spectrum - I think there’s very few people that get to that level of the hierarchy when they’re solving these problems… But it is hard sometimes for people to parse out where along that spectrum - from playing with your prompts, to maybe chaining, to creating some pre-processing and post-processing, to fine-tuning, to training your own model - it is reasonable to land; that’s something that, in my experience, is really hard for people to parse out… So how did that work out for you all in this case? Maybe starting with rephrasing that problem of grounding, and then getting into how you started thinking about how you might solve it.
From the highest level, it’s like, you’re a merchant, and you’re just starting out, and you’ve got some products that you’re really excited about; either you’ve sourced them from a really great provider, or you make them yourself… You don’t have all the resources that somebody with an operating business, and scale, and lots of employees, and tons of capital can deploy to build a business, by hiring contractors and employees and all those things. And if you’re this merchant, and you can take photos at home, or you’ve got maybe some photos from a previous shoot, you paid your friend to do them and they’re pretty good, but they’re not quite helping your brand sing, you’re looking for something to help you get over that. You’re looking for that tool that’s going to help take the media that you have and turn it into the media that you want. And your first thought is “Oh my gosh, AI. Of course AI is gonna unlock this.” This was our first thought. It was like “Well, if we can just train a model to know exactly what your product looks like [unintelligible 00:36:09.25] over and over.” And again, we’re getting closer to that, but we’re not quite there.
And short of that, we’re looking for ways to help merchants still realize this creative elevation of the creative materials that they do have. And it turns out that a lot of merchants have pretty good photography. It’s almost there, either because they took it at home with their mobile camera, and they just don’t have a whole lighting studio set up, or they’re just not sure how best to art-direct an image so that it feels tantalizing to look at, and drives purchase behavior… And so we saw a path where you can, of course, crop out and save the product pixels from your original image that you took, and keep those sacred, and eliminate the challenge of getting AI to recreate the product again, which is a very specific thing; it’s got details, it’s got your logo on it… And AI has a hard time holding on to some of those details at times. But it can be fantastic at creating the background, right? The not centerpiece of your image. To create a new elevated environment around it. And so we saw an opportunity to kind of take that path, and give merchants an early tool, as personalization matures, as we get to that point eventually, that can begin to help them unlock some of the value in their existing image media, their humble kitchen countertops photography.
And so by building this pipeline where we’re able to hold and keep your product pixels intact and not change those - we keep those details intact, and yet we can magically create this world around it. When we started this journey, we were like “Okay, great, we’ll take ControlNet and Stable Diffusion, and we’ll combine these things. We’ll use the depth of your original image, and then we’ll just ask it for a new background, and it’ll come out the other end, and it’ll be great.” But what happens when you segment out your product image, and keep ControlNet from really understanding what’s in there so it doesn’t change anything, is that it begins to lose an understanding of how to fill in the details around that object to make it look like it’s a part of the environment that it’s been creating.
[00:38:16.29] So it loses its shadows, it loses its tabletop surface reflections, because it’s actually kind of forgotten, in a weird way, that there’s even an object there at all. So in whatever way you can say an AI model thinks, it doesn’t have the triggers to generate those ground shadows and surface reflections in the scene it creates around the product.
And so that was the first key obstacle that we saw when we started moving down this pipeline of trying to help merchants take their existing product photos and just kind of create new, rich realities around them. So we had to solve that grounding problem - there were no shadows, there were no good ground reflections, the camera angles were off… You’d get a kind of tabletop scene background for a front-on product photo, and it just looks wrong. And you see it immediately, of course.
And so we started tinkering and trying to figure out how to get that to work, and it actually turned out to be kind of a multivariate approach. We had to think about prompting, we had to think about “How do we structure a good prompt, just so that we get a good result, even without all the fancy stuff we want to do in the in between?” And it turns out, one of the key things we learned was that you need to start with a declaration of what your foreground object is, what your product is in the shot. And if you can get a really good description of that, then your prompt is already starting out in a really good, grounded spot. Obviously, adding stylistic language, like commercial product photography, high quality, all those sort of like little tricks of early image models that will eventually like pass away… But we injected a few of those into the prompt as well.
You start with that product description, and then the next key line has to be some kind of grounding description of how it’s been placed in the environment. Without that description, you don’t get those shadows, you don’t get those table reflections, even with all the cool additional sort of support for that functionality we’ve built in. And so you need that grounding, and then finally, you can describe the scene that you want in that background.
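In code, that three-part prompt recipe - declare the product, ground it in the scene, then describe the background - might look like the sketch below. The helper and the example strings are made up for illustration; they are not Shopify’s actual template.

```python
# Sketch of the three-part prompt structure: product, grounding, scene.
# The template and example strings are illustrative, not Shopify's.
def build_prompt(product: str, grounding: str, scene: str) -> str:
    style_tags = "commercial product photography, high quality, sharp focus"
    return ", ".join([product, grounding, scene, style_tags])


prompt = build_prompt(
    product="a matte white ceramic mug with a bamboo lid",
    grounding="resting on a marble countertop, casting a soft shadow",
    scene="bright minimalist kitchen, morning sunlight, shallow depth of field",
)
print(prompt)
```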
It’s a great description.
It really is. I’m really enjoying this. But I want to ask a couple of questions to make sure… It sounds like the way you were starting, when you were kind of talking about the product pixels, and pulling those out, in my mind I was almost thinking of it in an old-fashioned way like a Photoshop mask or something, where you’re masking out the product, and then you’re trying to bring all the goodness of the contextual understanding of the models in. The thing that I think surprised me in there was kind of if you talk about that initial masking, I wasn’t surprised when you talked about finding the description for the background and everything, but I was a little bit surprised about the thing being masked, if you will, the product itself. How do you think about that? As you’re going through the process and you’re saying “I need that description”, could you describe that step a little bit? Because I’m trying to kind of really grok that one… But it sounds really interesting to me.
I think it’s probably helpful to work backward from really what we deliver to Stable Diffusion as a model to generate the output that we get from it in the end, and then kind of work backwards. Okay, well, then how do we assemble all of that input to then get it [unintelligible 00:41:12.20] right?
Sure.
So at the end of our pipeline, once we processed all the prompts you’ve put in, and the image you uploaded of your original product, all that stuff, really what we’re delivering to Stable Diffusion in the end is a masked depth map of your original product, with a little bit of a bloom at the very bottom, where it might make connection with the original scene around it.
Could that be like a shadow, when you say that?
[00:41:41.04] Kind of, yes. Sort of like a little bit of that shadow. If it’s a table reflection, you’ll get a little bit of that table reflection. And what we’ve found was that little hack is just enough context; that little gradient of additional depth info as you leave the product pixels is just the right amount of grounding information, that Stable Diffusion and ControlNet need to be like “Oh, there’s a shadow there. Oh, and I see the angle of the table is this way. And oh, I see the camera angle is kind of like this.” And all of that together collectively gives Stable Diffusion the context that it needs to then paint a grounded scene around that product in high fidelity. And of course, we’re generating a new product in that resulting image, but we do a composite in the end. And those pixels, because we’re using depth ControlNet, adhere very closely to the original product pixels. So when we do the paste-over in the end, you never see the sort of like hallucinated product pixels in the background.
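Putting those pieces together - a depth map masked to the product, a soft “bloom” of depth just below it, a depth ControlNet to paint the scene, and a final composite of the untouched product pixels - a rough sketch could look like this. The depth estimator, the SDXL depth ControlNet, and the bloom parameters are assumptions chosen for illustration; the published Hugging Face space encapsulates its own pipeline.

```python
# Rough sketch of the grounding pipeline described above. Model choices
# (DPT depth estimator, SDXL + depth ControlNet) and the bloom parameters
# are illustrative assumptions, not the exact pipeline from the space.
import numpy as np
import torch
from PIL import Image, ImageFilter
from transformers import pipeline as hf_pipeline
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

product = Image.open("product_photo.png").convert("RGB")
product_mask = Image.open("product_mask.png").convert("L")  # white = product pixels

# 1. Estimate depth on the original photo.
depth_estimator = hf_pipeline("depth-estimation", model="Intel/dpt-large")
depth = np.array(depth_estimator(product)["depth"], dtype=np.float32)

# 2. Mask the depth map to the product, then add a soft downward "bloom":
#    a blurred copy of the mask shifted down a few pixels, leaking just enough
#    depth below the product to hint at contact shadows and table reflections.
mask = np.array(product_mask, dtype=np.float32) / 255.0
blurred = np.array(product_mask.filter(ImageFilter.GaussianBlur(25)), dtype=np.float32) / 255.0
bloom = np.roll(blurred, shift=15, axis=0)  # push the halo a little below the product
masked_depth = depth * np.clip(mask + 0.4 * bloom, 0.0, 1.0)
control_image = Image.fromarray(masked_depth.astype(np.uint8)).convert("RGB")

# 3. Generate the new scene, conditioned on the masked depth map.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
scene = pipe(
    prompt="a matte white ceramic mug, resting on a marble countertop casting a "
           "soft shadow, bright minimalist kitchen, commercial product photography",
    image=control_image,
    controlnet_conditioning_scale=0.5,
).images[0].resize(product.size)

# 4. Composite: paste the untouched original product pixels over the generation,
#    so no hallucinated product detail survives in the final image.
final = Image.composite(product, scene, product_mask)
final.save("grounded_background_swap.png")
```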
It’s so interesting. I think one question that would be really interesting for our listeners and for me, from a selfish perspective, is how does one – because I think a lot of people play with these models; they can pull down and figure out ControlNet reasonable enough. But then this connection of this little hack, as you described it, in some ways, after you have something like that, it seems simple enough to describe why that would work, and it’s like a cool hack… But to get to it, it’s like, “How do you come up with that?” I think is what is in a lot of people’s minds. And some of it - for me, I know a lot of times I’m banging my head against the wall one day, and I sleep on it, and in the shower in the morning that idea comes. But I’m curious from your perspective, from your team’s perspective, how did that happen, and what sort of environment exists that would promote this sort of hacking? Because you’re not retraining a whole model here; you’re kind of using what is off the shelf, but using it in an extremely powerful way, but in a very creative way, that is creative not in the sense of training a new model, but creative in the sense of how you’re using the existing model, which I think is really intriguing.
This really was just kind of the perfect workshop product, where we had just a bunch of like brilliant people who kind of understood these models enough, had played with them enough, knew and had seen enough of what they were capable of, from different demos and other things, to have a real opinion about what was possible and kind of what wasn’t. And when we started the journey, and started building the machine, and trying to figure out “How should this work? How do all the pieces fit together?” We knew that there were hundreds of these little amazing AI machines that could be plugged into and turned into bigger machines, that do even more powerful things. And it’s just about figuring out what’s the sequence? What are the pieces? What are the core problems? And it’s just how do you get to that iteration speed, where you can try something. It’s why I fell in love with web dev in the early days of web standards and 2.0. You could code something and see it immediately. “Okay, that didn’t work. Go back, code it again.” See it immediately. “Okay, no, that didn’t work. Go back.” Code it immediately, see it again, and getting into that state. And that’s really what the open source tool Comfy UI really unlocked for us. And GPUs still take a few seconds to deliver images, so it wasn’t a perfect, rapid-fire iteration… But way faster than trying to do it all remotely, and trying to – you know, Comfy UI dramatically accelerated our ability to kind of like build this more complex machine, because it was so easy to configure, and reconfigure, and try a thing, and wire it a different way, and then that didn’t work, and wire it a different way, and see the results… You’re like “Oh wait, that’s new, but different, but not what I want… But isn’t that interesting?” And then “Oh, maybe I have a hunch about why that happened”, and you pull that back into something else, and now you’ve unlocked something, not because you’ve had some amazing insight, but just because you’ve tried enough stuff, and you’ve seen enough weirdness, and then there was something there. That was weird, that shouldn’t have happened, and something surprised me, and I want to understand it… And that’s how it just unfolds that way. And eventually, you start to connect all these little discoveries you make, as you’re like “Why did that happen? Why did that happen?” And sooner or later, you end up with something that works. It’s kind of magic.
[00:46:02.00] It is a kind of magic. As a Queen fan, that fit right in there as well. I’m still almost stuck on that creative epiphany that you had a moment ago. I’ve found that really interesting, that you came upon that. As you’re looking at this set of technologies evolving over the years ahead, as the organization is maturing with these technologies, and you have this amazing creative capability in your humans in your organization that can use these tools, where’s all this going? As we wind up this conversation, how are you thinking about the future? What are you excited about? What do you not have yet that you wish you had at your fingertips right now? How’s your thinking about that?
When people think about shopping, they don’t always like jump to think about technology… But if you think about how technology has impacted commerce over the years, commerce and our culture around it, and how it works, it’s always inherently tied to the wave of technology that we’re experiencing… Whether you’re talking about IBM cash register adding machines, or whether you’re talking about mass media, and the creation of kind of mega brands, or you’re talking about the evolution to the internet, and sort of the democratization of commerce and connecting with niche audiences, the culture around commerce always evolves around technology, and it’s why I’m so excited to be working at the intersection of these new technologies at a company like Shopify, because they are so directly related. And as we kind of think about the future of technology, and sort of where this is all going, I get really excited about AI being an incredible driver of personalization in commerce.
When I go to some of my favorite stores, the person behind the desk knows me, they recognize me, they remember what I bought last time, we have a conversation about it… I can ask questions of the new line, or the new products, or ask them to help me find stuff that I might really enjoy based on what they know I’ve bought in the past… And that’s all an experience that I can get at an in-person store today, because the person there knows me.
I’m really excited about a future where our online e-commerce experiences become a little bit more like that, where we visit an online store and it knows who we are, and it helps us find the stuff that we’ll be most interested in. And even really exciting things like being able to visualize myself in different clothes that I might want to buy, live, in a browser - that kind of stuff is in the future out ahead of us. And so I’m really excited about a future where AI helps bring these kinds of personalized, one-to-one, customized shopping experiences to merchants, and helps them bring that to their shoppers.
That’s awesome. Well, I’m definitely looking forward to seeing the things that your team comes up with moving towards the future, and I just really appreciate you taking time out of what must be an incredibly busy week leading up to Black Friday, Cyber Monday at Shopify… But yeah, thank you so much for the work you and your team are doing, Russ. We hope to have you on a future show to see some of those things you’ve just mentioned become reality. Thanks so much for joining us.
Awesome. Thanks, Chris. Thanks, Daniel. I really appreciate it.