In this Fully-Connected episode, Daniel and Chris discuss concerns of privacy in the face of ever-improving AI / ML technologies. Evaluating AI’s impact on privacy from various angles, they note that ethical AI practitioners and data scientists have an enormous burden, given that much of the general population may not understand the implications of the data privacy decisions of everyday life.
This intentionally thought-provoking conversation advocates consideration and action from each listener when it comes to evaluating how their own activities either protect or violate the privacy of those whom they impact.
Well, welcome to another Fully Connected episode of the Practical AI podcast. This is where Chris and I keep you fully connected with everything that’s going on in the AI community. We’ll discuss some recent AI issues or news and dig into some learning resources to help you level up your machine learning game. I’m Daniel Whitenack, I’m a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist with Lockheed Martin. How are you doing Chris?
Doing very well, Daniel. It’s a good day. I’m looking forward to having a fun conversation with you. I hope our listeners are, too.
Yeah. Have you been flying much recently? For listeners, Chris is a pilot. Have you been up in the air very much?
Well I did… We took a vacation with my daughter a little while back, and I did a lot of flying for that. And then - ironically that you asked this today… Tonight, pilots have to do what’s called currency flying to keep your night rating going every three months. Tonight is the night, so I’m gonna go fly tonight, a little while after dark, and do some night landings. I always enjoy those. The lights are beautiful.
Well, in terms of some of the things that I’d like to discuss today - this might seem like a random question, but I think it’s relevant… So I know you’re doing these certifications and other things, and you’ve got to keep things up… If you were told that the FAA or whoever, they wanted to have a camera mounted in your plane and monitor all of your whatever is going on in the cockpit during each of your flights, to judge whether you’re a good pilot or not, and there would be constant monitoring of you… Maybe an AI model identifying certain things you did wrong, or something… How would that make you feel?
Oh, not good at all. Not good at all. I mean, aside from all of the moments where maybe I take liberties that the FAA wouldn’t go for, just in general, every bad landing noticed, that kind of thing… Oh, boy. That doesn’t appeal to me at all. It would feel like a fairly substantial invasion of my privacy.
[03:41] Yeah. But I think one could argue that if you wanted to know and certify only pilots that did the right things a certain percentage of time or something, I guess in that case there’s maybe a balance between, “Hey, on one side I’m going to make an argument about some type of safety over privacy, or accuracy over privacy, or something like that.” And on the other end, of course it’s a natural, maybe what most people would consider in this sort of hyperbolic situation, most people would consider an invasion of privacy.
Yeah, I think there’s a balance to be struck there, certainly. I mean, when you raise public safety… That’s a legitimate concern. But I know that it is a topic that – in the use case that you brought up, pilots do talk about that, because with current technologies, the oversight is increasing for pilots. And I think that that’s very important if you were an airline pilot, and you have passengers in the back; that’s super-important. For me, I worry about “Do I really need that level of oversight if I’m doing the mountain flying?” I tend to do low mountain flying [unintelligible 00:04:57.05], but if I were to pass a hiker on a ridge top without realizing they were there, technically, I would be breaking a regulation, and I could get in trouble. And frankly, I think that might be like a step too far. So I think the privacy concerns are something we need to figure our way through. I’m guessing that there’s an AI angle going on this one…
Yeah, I bring up this topic – and in these episodes that it’s just you and I, of course, we get a chance to discuss some of the things on our mind… And this has been one of the things on my mind recently; not so much the cameras in the cockpit sort of scenario, because I’m not a pilot, but general sort of privacy concerns, and thinking about, even for my own team, what are the balances that we need to strike, and where are the privacy concerns within our own workflows, in terms of making sure that we’re comfortable and responsible with the ways in which we’re handling data, the data that we’re feeding into our models, the types of data that we’re storing in certain places, and that sort of thing… It has definitely been on our mind recently.
When I got into this stuff – I don’t know if we’ve talked about this, Chris, whenever you got into things like this, but when I got into this sort of stuff, it was sort of the beginning of data sciency hype; not so much the AI hype yet, right? Like, there was this hype around sort of data science is the new thing, and so getting a job as a data scientist… And I remember at that time there was sort of this thinking, “Well, you don’t know what data you’re going to need, so just make sure you store it all, and you have it all.” That was kind of the mindset; I remember very distinctly at the time, that was the mindset. Do you relate to that? And how do you think that’s shifted over time?
Oh, I remember that. You’re showing your age, Daniel, by the way… Because that’s certainly changed dramatically over the last couple of decades. When you talk about those early days of data science, everyone was pioneering their way through that. And yes, you were trying to find data to use, and there often wasn’t enough data around… And when you found it, you collected all you could to combine with others. And obviously, today things are somewhat different. And with the capabilities we have now, privacy, and things like data bias and such (and they’re all interrelated), have changed the landscape dramatically, especially when you consider all the use cases out there.
[07:40] Yeah. I bring this up because the – like, let’s just say that we want to strive for privacy, or a reasonable amount of privacy; let’s make that argument first. There’s probably a separate argument of like, “Well, maybe we don’t need the privacy that a lot of people are after.” Maybe that’s another discussion. But let’s assume that we’re striving for some level of privacy. I would say the first thing that comes to my mind in terms of making something “private” is if you don’t collect or store the data, then that’s just about as private as you can get. Now, maybe there’s like other logs and certain things that we maybe wouldn’t think immediately of as data, that are revealing certain things… But I think one principle is – I even saw this term… So I was looking through several things leading up to this, and one of them that I look at occasionally is - Google has this Responsible AI Practices page, and they use this term “data minimization”, which… I know probably listeners are thinking, “well, what would we have to learn from Google about privacy? Because they know everything, and have all the data.” So it’s kind of interesting to think about Google talking about data minimization. But I find this term interesting in the sense of like one way to improve privacy is to just plain not have data. Have you been in those sorts of discussions within your career, around like, “Do we actually need to store this data? Or should we not store it?”, those sorts of conversations?
Yeah, I think the burden has flipped to the opposite side from those early days that you talked about. I think when people talk about data now, in terms of data that affects personalization and identification, I think the argument to be made now by any data scientist or AI practitioner, is the argument on what you need and why you need it, and being able to justify that going forward, in general. I would say, there are many exceptions to that, obviously.
But yes, I think the burden has changed to us to show not only why we need it, and what we need it for, but why that’s a good thing, and why it does not cause damage unintentionally. And so we’ve come a far cry from the early “collect everything” days. I think only intelligence agencies these days collect absolutely everything, the way the world works now.
Yeah. I think there has been a shift. I think there are a lot more conversations going on within companies talking about whether they should store certain pieces of data, maybe about a user, let’s say a name, or a location… Something that is useful in maybe marketing purposes, or whatever it is, right? Do we really need to store that to do our marketing the way that we want to do our marketing? That’s like a question that comes up probably. And it comes up, I think, in relation to like Facebook and others have - or Meta, or whatever I should refer to them as - changed their APIs and other things to where you don’t get some of that data in many scenarios. So maybe some of that is just we don’t even get it anymore.
But I think that as much as I love Hugging Face, and Hugging Face Hub, and that community, I think there is this sort of shift with the recent AI, more AI-related hype around “What are all the AI datasets we can create?” And there’s definitely bias concerns that have come up with that. I think there’s probably privacy concerns as well though.
I remember very distinctly… I tried to actually find if there is like a blog post about this or something, but I remember Jim Klucar, who used to work for Immuta - I attended a talk by him, and he showed how you could reconstruct a real person’s face from the parameters of like a facial recognition model, because the parameter space was so large… So there’s a very large model, there’s a lot of information encoded in the parameters of that model, and he could sort of reconstruct - or he showed some research where someone did; I forget the exact details… But you could sort of reconstruct something from that.
[12:02] So even these like very, very large models that are released, and the parameter spaces of those models, could even have privacy concerns. So I think this sort of proliferation of like, “Let’s get all the datasets on the hub. Let’s get all the models on the hub”, I think that overall is like 99 – I don’t know, I don’t want to put a percentage on it, but I think overall, it’s a very, very good thing. And obviously, I think if you’ve listened to this show very much, you know how much I love that effort. But I think with it, there’s sort of this - maybe a shift back in thinking, towards like, “Let’s accumulate all the data, let’s release all the models”, and these models themselves may even have sort of privacy – certainly bias concerns, but privacy concerns as well within them. So yeah, that’s one thing that I don’t really have a definitive statement on, but I’ve been thinking about as I’ve seen the community grow around that.
You raise a really important point in terms of the implications of what you’ve just described… And that’s the fact that as the capabilities are evolving over time, the way we’re choosing to make evaluations about how our privacy is affected is also changing. So it’s not a static decision. It’s a decision where if you look back a few years and look at where it’s at, you’re like, “I’m okay with that. I could see that.” But at this point, the sophistication level is becoming so much higher, and the fact that you can do that reconstruction that you’ve just described - it makes one reevaluate. And then if you add in the fact that there are also considerations like “Who is it that’s doing it, and why, and what…?” And that changes depending on who it is. We all are making decisions every day about what privacy compromises we’re willing to make, and we all have different profiles in that capacity.
If you choose to install security cameras, like the doorbells that everyone has now, you now know that every time you walk in and out of your front door, you’re on camera, it has a model there, it knows who you are; it’s recognizing you even before anything is done with the data. And I’ve made that choice. I have a Nest on my doorbell, and I have other devices around my house that know who I am. So there’s some level of that… But it also depends on whether or not I have some level of control of that data in terms of its usage, what the rights that I have as a consumer are, and whether or not it’s from a public sector perspective or a private sector perspective. So all those are considerations that we can delve into.
So Chris, the first term that I had run across that I wanted to bring up was that term “data minimization”, which is, maybe you do need data to do something, maybe you don’t; that’s one consideration with privacy: certainly the easiest way to deal with the privacy concern is to not have the data. I think though in many cases either we step into a project and data exists already, and is maybe stored within our organization, or we have some dataset that we’re interested in working with that maybe we don’t know what the sort of identifying information within that dataset is, or the privacy concerns with it…
[15:49] The next term that I ran across as I was sort of probing this space was data de-identification. I was reading a blog from, again, Immuta, which - I think we’ve had Immuta on the show before here, and they’ve of course done a lot of thinking in this space… But they have a nice blog post, which we can link in the show notes, about data de-identification, and they talk about various sort of pieces of data that you might want to de-identify within datasets. I think for practicalities purposes I’ll just mention a few of those, since this is Practical AI. So they have a long list; I won’t read all of them, but they talk about names, dates, telephone numbers - those are probably ones that would be immediately assumed… Maybe ones that people might not be thinking about immediately would be a device identifier, or serial number… So like maybe that’s a MAC address, or maybe that’s like a browser fingerprint. Web URLs might be identifying…
There’s such a proliferation of analytics data within URLs these days. That’s one thing I was thinking about… all the query strings that are added on to a URL to track you in various ways. Or there could be an account ID in some URL, or something like that, which is, something that could happen. And they list out a bunch more, but those are the types of – when we refer to identifiers, the types of identifiers that we have in mind. As you look at that list, Chris, do these things come up in your mindset in datasets that you work with?
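As a rough illustration of the URL point, here is a minimal Python sketch that drops tracking-style query parameters before a URL is stored. The parameter names in the deny-list (`utm_*`, `fbclid`, `gclid`, `account_id`) are assumptions chosen for the example, not a list taken from the Immuta post; a real dataset would need its own review of which parameters are identifying.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list for illustration only; adjust to the data at hand.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "account_id"}


def strip_tracking(url: str) -> str:
    """Return the URL with tracking-style query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters that survived the filter.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    print(strip_tracking("https://example.com/page?utm_source=news&id=7&fbclid=abc"))
```

A deny-list like this only catches parameters you already know about; an allow-list of parameters you actually need is the stricter option.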
Absolutely… going through the process of trying to get them removed, to de-identify them, while not losing the potential value of what you’re trying to create from a model. Because, let’s face it, many of the models we create, humans are central to the output, to the inferences of those models. And so if you’re going to deal with humans, you’re going to be dealing with these identifying traits. But if you take out too many, too much, sometimes you run the risk of the model not being able to be productive, even for the best use. So it’s a bit of a challenge for the data scientist of today to try to – there’s this balance of a bunch of hard things that we need to go accomplish from an ethical standpoint, and we do the best we can with the tooling available.
I also think that the person giving you their data needs to have agency to give you their data, right? But I also think that the general public doesn’t understand the implications of some of the data that they might give you… So I think that you as maybe a practitioner in the AI space probably could also not just assume “Because the user gave me this, it’s going to be okay”, or at least not have any issues if I use this identifying field, or something.
I listened to a podcast about the… Have we talked about the boarding pass thing… This is another flight thing.
I don’t think so. Go for it.
So I listened to – I think this is another Darknet Diaries. I love that podcast. I’ve mentioned it a couple times on the show. But what had happened was - people that go on a trip, right? And they like post a picture of their boarding pass on Instagram, or something. Like “I’m going – I’m on my vacation. Look at my boarding pass” or whatever. It’s very common; #boardingpass. Well, there was a guy that said, “There’s gotta be something on this boarding pass that is –” Like, the airline doesn’t tell you that your boarding pass is a security risk and should be private, right? And so people post them all… But what this guy learned was that the booking ID – so it was like a Qantas flight, and he saw the booking ID was on the boarding pass. And what’s interesting is that he found – I think it was the Australian Prime Minister posted a picture of one of his boarding passes, somewhere he was going…
[20:03] So he took the booking ID from the Australian Prime Minister, took it to the Qantas website, and it turns out all you needed was the booking ID and a bit of personal information, like their name, where you were from, which is obviously all public record for a prime minister… And he just logged right into Qantas as the Prime Minister of Australia… Of course, at that point the flight had already happened… But then he was like, “Well, I wonder what else is here”, and then he just did “page view source” on the logged in Qantas site, and in the source of the page there was a JSON field which included all the info about the account holder, including passport number, phone number etc.
And of course, the podcast is really great. Maybe I’ll link to that in the show notes, too… But it’s like, who would have thought that posting a picture of a boarding pass, which the airline doesn’t tell you is a security risk, but obviously, there was a security risk there, and a privacy concern, because there’s sort of passport information and such… But sometimes the companies don’t even understand how people might put this data together, which I guess influences like maybe the scope of the concern here, and how you really want to consider both data minimization and data de-identification, at least in many cases.
Yeah, you really raised the point about the burden being on us as the data scientists of goodwill and good ethics… Because the general public doesn’t understand a lot of these things. I mean, any of these documents – the whole purpose of a boarding pass is to identify you as the rightful user of that airline seat, and to admit you to the plane and such. So by definition, it’s an ID thing. And anything that serves an identification purpose should be treated pretty carefully.
It’s hard to do today for the public, not only in the context of how data can be used in an AI context, but just in the broader world; there are so many opportunities for data leakage that affects us in that personal way. I have gotten probably more insight into that than most people because of two things. a) I’m in this world that we’re talking about; AI/ML and data science, but I’m also in the defense industry, and we go through classes about how to protect yourself, for obvious reasons… with nefarious folks out there. So there’s so many opportunities…
So it really does raise the need for the data science and AI/ML community to kind of step up to meet those needs, because you can abuse it and you can get away with what you want to get away with probably in many cases, but that hurts us all in the long run. It causes harm, not only to others, but to ourselves in this industry. So definitely something to be thinking about in every possible part of your life that has any form of identification.
Yeah. There’s a big concern here, but there is a lot of good thinking and tooling around this sort of de-identification side of things as well. In the Immuta article they talk about - okay, well, if we assume that, as you mentioned, us as practitioners want to be responsible with the data that we’re processing and the way that we’re handling it, one scenario… Let’s say that we couldn’t do or didn’t do data minimization; we have data, we need to use it for a specific purpose, but we also are maybe concerned… Maybe it’s text fields, and we’re concerned that there’s names or phone numbers, these sorts of things, account numbers… Maybe it’s individual structured data, but maybe it’s just raw data, and we don’t exactly know. There are de-identifying methods out there…
[24:02] Of course, this is a lot easier probably in the language space; if you’re using English, for an example, you have an advantage, because you could, for example, take a named entity recognition model and figure out where the names are, and replace the names with pseudonyms, right? For your AI model, it probably doesn’t even care what the name is, as long as it’s a name. So you can sort of do pseudonyms, or fake phone numbers, and this sort of thing… Or hash certain fields, or obfuscate them in certain ways… So that’s like using a replace type of method for these fields. You could just identify them… I know there’s Python tooling… I’ve used – I forget what the update is on the best one to use. We’ve used one called Scrubadub, I think. There’s Python libraries to find these things, and identify them, or replace them.
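A minimal Python sketch of the "replace" style of de-identification described here, using only the standard library. It is an illustration under stated assumptions, not the Scrubadub API: the list of names is assumed to come from some upstream step (for example a named entity recognition model), the phone pattern is a deliberately loose hypothetical regex, and `SECRET_KEY` is a placeholder.

```python
import hashlib
import hmac
import re

# Placeholder secret for the keyed hash; an assumption for this sketch.
SECRET_KEY = b"replace-with-a-real-secret"

# Loose, hypothetical pattern: a digit, then 7+ digits/separators, then a digit.
PHONE_RE = re.compile(r"\+?\d[\d\-\s().]{7,}\d")


def pseudonymize(value: str) -> str:
    """Map a value to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"id_{digest[:12]}"


def mask_phones(text: str) -> str:
    """Replace anything that looks like a phone number with a fixed token."""
    return PHONE_RE.sub("[PHONE]", text)


def replace_names(text: str, names: list[str]) -> str:
    """Replace each known name with its pseudonym.

    `names` is assumed to be supplied by an upstream detector such as an
    NER model; this function does not find names itself.
    """
    for name in names:
        text = text.replace(name, pseudonymize(name))
    return text


if __name__ == "__main__":
    raw = "Alice called Bob at 555-123-4567 yesterday."
    print(mask_phones(replace_names(raw, ["Alice", "Bob"])))
```

Because the pseudonym is deterministic, the same name always maps to the same token, which keeps records linkable for modeling. That linkability is also a residual privacy risk, so whether to use stable pseudonyms or random ones depends on the use case.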
The Immuta article emphasizes this type of masking, or pseudonyms, and that sort of thing… And it probably – again, it depends on the data type. Maybe if you’ve got an image with people’s faces in it, maybe that’s a different scenario than if you have sort of a text field with a name in it, and you can replace the name. It’s maybe more difficult to replace a – I mean, there are ways now, of course… Maybe this is another positive use of the deep fake sort of methods; you can replace faces and images, and that sort of thing… But there’s probably certain methodologies, like facial recognition, which by their very nature are identifying methodologies. The whole point of facial recognition is to identify someone, right?
So there’s probably a range of scenarios as well where if I’m just trying to predict a marketing campaign or something like that, maybe the sort of obfuscation and masking methods are really relevant. If I’m actually though trying to identify a face for a security reason in my building or something, I am actually trying to identify someone… And that probably brings up other issues of how you log that and store that identification, which we can talk about.
Yeah, it gets complicated in that way. Kind of going back, building on your last point a little bit there, it goes back to the use case. It goes back to who is using that data. Is the government that you happen to fall under, in whatever country you’re in, are they looking for facial recognition? Or is this your Nest doorbell, and you’ve made an accommodation? It’s pretty crucial, and it’s pretty hard.
From an identification standpoint, I think your airline example a few minutes ago was really pertinent, in that it’s very easy for a user who may be making a choice about offering their data to misunderstand that they may look at the data that they’re giving up and go, “This is okay. This isn’t too much.” But if the model creator is combining that data they’ve chosen to give up with other data, a lot of privacy can be compromised by combining different data types together, that may not be part of that initial thing; it may be something that you already have available, or from another source… So it gets challenging.
The challenge that you brought up, Chris, around the expectations of users, of how their data is going to be used or combined with other things - it’s a really challenging one. That can get really complicated. I’m thinking of, even in my own scenario - we’ve had discussions before, because maybe we’ve got a recording from someone across the world, some language recording in our archives, and they gave permission for that data to be used, or collected and stored in the archive, for language documentation purposes, or something like that, right? Maybe we no longer have access to that person, so we can’t get their explicit permission to use that in any other way, even though we know, “Well, this would be useful to add to an AI dataset.” We’re talking about that all the time internally on our team, like “When the data was collected - that’s a very crucial time to help the company express to the user how their data is going to be used, and have the user understand and have agency over that.” But also, that brings up the additional point that you could give them a long list, a Terms and Conditions thing that no one’s going to read. Right? Is that really giving them control over how their data is being used? Because for any reasonable person, you could assume that they’re not going to read through all that.
Everyone will assume they’re not going to read it but the lawyers involved, of course.
The lawyers, yeah. [laughter]
The lawyers are assuming they’ve read every word. I mean, you raise a great point… I confess – I probably shouldn’t do it in such a public way, but I have agreed to many terms and conditions where I have not read the full verbiage. There might have been more than a few where I didn’t read any of the verbiage. So we are often making these choices of convenience that may have some fairly long-term repercussions, as you’re pointing out.
[29:57] The other major category within the data de-identification that Immuta brings up, and actually many other places as well, including that Google Responsible AI Practices, is something having to do with randomization and differential privacy. A case of this that we’ve been talking about internally is if we have a device in the field, and we’re gathering either text, or audio, video, one choice for us would be to send all of that audio back to a central location, store it in S3, and do a bunch of things with that. That’s probably the worst-case scenario, because now we’ve got just recordings of audio from some random place, and maybe people don’t know – hopefully, they knew that they were getting recorded and understood what was happening, but still, that’s a very hard situation, because you’ve actually got the raw data, and it’s sent to a central location.
I think one thing in that scenario that is a best practice - if you can do any of that processing at the edge, if you can push your models out to the edge, and let’s say I’m doing transcription of the audio, and then I’m detecting like something about what is said in the audio - maybe I don’t even want to send the transcript back; I just want to send metadata back about like, “Hey, I did a transcription. And I’m not sending the audio, I’m not sending the transcription.” Of course, that’s a much better scenario, because the audio is staying on the device, the model was run at the edge, and the only thing you’re sending back is metadata. Of course, that’s still probably a tricky situation, because you’re knowing maybe something was said at a certain time, at a location, from a device, which brings up this randomization piece, right?
So the other thing you can do is take those messages that you would send back to the central location and randomize their timestamps, or their ordering, or that sort of thing, to where, for example, if someone said something that had political implications at a certain time, at a location, whoever had access to that central source of data - they couldn’t really tie it back to a certain location, at a certain time, and maybe identify the person that said that, and persecute them for saying that.
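To make the randomization idea concrete, here is a small Python sketch that jitters each message's timestamp and shuffles the batch before it leaves the device. The event shape (`label`, `timestamp`, `device_id`) and the six-hour jitter window are assumptions for illustration. Jitter and shuffling on their own are not a formal privacy guarantee; they only make it harder to line a message up with a specific moment.

```python
import random
from datetime import datetime, timedelta


def blur_batch(events, max_jitter=timedelta(hours=6), rng=None):
    """Return a shuffled copy of `events` with jittered timestamps.

    Each event is assumed to be a dict with "label" and "timestamp" keys
    (hypothetical field names). Any other keys, such as a device ID, are
    deliberately not copied into the output.
    """
    rng = rng or random.Random()
    limit = max_jitter.total_seconds()
    blurred = []
    for event in events:
        # Shift each timestamp by a random offset within +/- max_jitter.
        offset = timedelta(seconds=rng.uniform(-limit, limit))
        blurred.append(
            {"label": event["label"], "timestamp": event["timestamp"] + offset}
        )
    # Randomize ordering so arrival order does not reveal the true sequence.
    rng.shuffle(blurred)
    return blurred


if __name__ == "__main__":
    batch = [
        {"label": "keyword_a", "timestamp": datetime(2022, 9, 1, 14, 3), "device_id": "dev-17"},
        {"label": "keyword_b", "timestamp": datetime(2022, 9, 1, 14, 9), "device_id": "dev-17"},
    ]
    for item in blur_batch(batch):
        print(item)
```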
So this sort of randomization comes in… It can be taken as far as this idea of differential privacy, which offers a mathematical guarantee around sort of privacy and the masking of direct identifiers. And that’s come up also with federated learning. So I think the edge computing side of this comes in, and actually, to a lot of benefit to the privacy situation, if you’re able to do things at the edge, and the things that you’re communicating over a network are randomized in some way, there’s some guarantee around privacy, and maybe you’re just communicating metadata and not the raw data that stays at the edge… So that, of course, makes infrastructure a lot harder to deal with, but it’s overall a better situation.
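For a flavor of what the mathematical guarantee looks like, here is a minimal Python sketch of the Laplace mechanism, one standard building block of differential privacy, applied to a simple count. The function names and the choice of a counting query are assumptions for illustration. For a count, adding or removing one person changes the answer by at most 1 (the sensitivity), so Laplace noise with scale `1 / epsilon` gives epsilon-differential privacy for that single release.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Smaller epsilon means more noise and stronger privacy. This sketch
    covers one query only; repeated queries consume additional privacy
    budget, which is not tracked here.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    print(private_count(128, epsilon=0.5))
```

In practice a maintained differential privacy library is a safer choice than hand-rolled noise, since details such as floating-point behavior and budget accounting matter.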
As you are saying that, I’m struck with the fact that it takes a good actor to be willing to do these things. By way of example, so many of the laws that we have, both here in the United States and in other countries, are not sufficient to kind of enforce these things that we’re talking about here in this episode as good practices, as ethical practices.
I know that here, where I’m at physically, in the state of Georgia, I can record a phone call legally, and only one party of the phone call has to know what’s being recorded, and that’s me as the recorder. So I can record a phone call without the other person having any knowledge that that call is being recorded. And that data is data that I have available to me as their voice, who knows what they say on the call… Kind of going back to your point about political comments, whatever… And how I use that data - what I’m getting at is as we kind of build this ethical framework as good actors in the data science community, we really need to find ways of having these techniques kind of acknowledged beyond our community, and be able to be integrated in as best practices in a legal framework to help enforce it.
[34:40] Because I know I’m not going to do anything nefarious, and I know you’re not going to do anything nefarious, but there are a few people out there that might do something that is nefarious, and it raises a fairly challenging kind of enforcement or compliance concern in terms of implementing these techniques that are going to be necessary for us to be responsible with this data going forward.
Yeah. And I think that as a person that builds like tools that maybe various clients will use, one thing – like, if you’re in that situation, if you’re creating software products that might be used by a variety of organizations, I think it’s your duty to take into account how you can ensure that your software product isn’t going to be used for malicious purposes, rather than assuming or writing in a Terms and Conditions or a licensing agreement that you agree not to do this.
For example, like in that scenario of communicating audio back to a central place, if you only make it possible for your software product to communicate metadata back to a cloud location - I mean, someone could hack it and maybe do something else, but at least you’re making it much harder. Whereas if you make it to where there’s an option to send the audio back as well - well, then you’re in a whole other scenario, where people could do all sorts of things with that.
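One way to picture "only make it possible to communicate metadata" is to give the wire format no place to put audio or transcripts at all. The Python sketch below is a hypothetical example: the `EdgeReport` type and its two fields are invented for illustration, and a real product would also need to enforce the same constraint on the receiving side.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EdgeReport:
    """The only message type the device is able to send (hypothetical).

    There is intentionally no field for raw audio or transcript text, and
    the class is frozen so extra attributes cannot be attached later.
    """
    device_region: str      # coarse region, not a precise location
    keyword_detected: bool  # result of on-device processing


def to_wire(report: EdgeReport) -> str:
    """Serialize a report; only the declared metadata fields are included."""
    return json.dumps(asdict(report), sort_keys=True)


if __name__ == "__main__":
    print(to_wire(EdgeReport(device_region="north", keyword_detected=True)))
```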
So I think that also understanding what might be possible, whether you think you are working with good actors or bad actors, is within the sort of duties of us as practitioners to think… Because even our managers or executives that are promoting the things that we build - they might not understand the implications of what could be done with what we’re building. So at some point we have to kind of own that, and hope that over time the sort of regulations and guidance we get from maybe governing bodies or other places will catch up to where the technology is.
That’s a great point you’re making, and that is do what you have the ability to do to kind of police the set of circumstances that are out there. So if you don’t have a strong legal framework to fall back on, that will protect your users in that capacity, as a data scientist being able to say, “Well, this is the software I’m gonna give you, this is the capability that I will provide”, and eliminating some of those cases that can be used nefariously is really important. I would love to hear examples from our listeners about how they might be doing some of these things, and maybe share some of their ideas with us on some of our social media outlets for the show… Because this is important. Our community is leading the way in the sense of how to affect privacy with all of these new technologies coming through, and all of the capabilities that AI has surged forward on in the last few years. We’re the vanguard, we’re the tip of the spear.
Yeah, I totally agree. Well, in the last few minutes here, it may be worth just quickly mentioning a learning resource and maybe a couple of things happening in the AI community. One interesting thing, Chris, I wanted to mention before we close out here is that you can now run this Stable Diffusion model, which is one of these recent text-to-image models that does really amazing things when you put in a variety of texts, on Hugging Face Spaces. In our Slack channel I sent you a message, I put in “two cool guys recording an AI podcast” - and maybe I can post this in our show notes or something, but… They don’t look like Chris and I, but…
…at least in a couple of the photos they’re wearing sunglasses; maybe we should consider that. I do notice that there’s a trend where at least three of them, one of the guys is bald and the other one is not bald…
Well, I have very short hair.
Maybe I should shave my head. We need one bald guy, and another… Yeah. Also, there’s some interesting text going on “Tu-Koes-Koju-Ka-Kas?”
I don’t know what it means.
Yeah. Anyway… Very interesting AI-generated “two cool guys recording an AI podcast”. But they have other examples too, like “a small cabin on top of a snowy mountain in the style of Disney, ArtStation”, “an insect robot preparing a delicious meal”… So anyway, something to play with if you don’t have access to the OpenAI DALL-E 2 model yet, and you’re on the waitlist - wait no longer, you can use Stable Diffusion on Hugging Face and have some fun there.
I also saw a pretty cool release from spaCy, from the NLP world… They’ve been on the podcast before. They released floret, which is an extended version of fastText which uses Bloom embeddings. So Bloom is this huge language model that was a collaborative effort, and a big language model that came out recently, and spaCy has implemented this sort of combination of fastText embeddings and Bloom embeddings, which is efficient and built right into the spaCy ecosystem. I’m excited to try those things out, where this will allow you to compare words to other words and see their similarities, but also build models on top of these sort of embeddings, which could allow you to do things like text classification, or named entity recognition, and these sorts of things. These are really the building blocks of modern NLP, these embeddings. And it’s interesting, these combine both word and sub-word embeddings, which could handle things like misspellings, or rare occurrences of words, and that sort of thing. So a really cool effort from spaCy to make this sort of cutting-edge NLP building block really an easy-to-use piece of their really user-friendly packaging. Really cool to see that.
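As a rough illustration of why sub-word embeddings tolerate misspellings, here is a toy sketch in Python. It is not floret’s or fastText’s implementation: the table holds untrained random vectors, and the table size, n-gram range, hashing scheme and function names are all assumptions made for the example. Each word is split into character n-grams, each n-gram is hashed into a small shared table, and the word vector is the sum of the rows it hits, so words that share many n-grams end up with similar vectors.

```python
import math
import random
import zlib

DIM = 256        # vector width (assumed for the sketch)
ROWS = 1000      # shared table size (assumed)
NUM_HASHES = 2   # each n-gram is mapped to this many rows (assumed)

# Random vectors stand in for trained embeddings; seeded so runs are repeatable.
_rng = random.Random(0)
TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(ROWS)]


def char_ngrams(word, min_n=2, max_n=3):
    """Character n-grams of the word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def word_vector(word):
    """Sum the table rows selected by hashing each n-gram of the word."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        for seed in range(NUM_HASHES):
            # crc32 is deterministic across runs, unlike Python's built-in hash()
            row = zlib.crc32(f"{seed}:{gram}".encode("utf-8")) % ROWS
            for j, value in enumerate(TABLE[row]):
                vec[j] += value
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    base = word_vector("receive")
    print("receive vs recieve:", cosine(base, word_vector("recieve")))
    print("receive vs banana: ", cosine(base, word_vector("banana")))
```

Because “receive” and “recieve” share several n-grams, the misspelling should score noticeably closer to the correct word than an unrelated word does, even though the misspelled form was never seen as a whole word.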
The last thing that I saw, which is more of a learning resource - I see there’s an upcoming Intel workshop on FPGAs, which seems pretty interesting to me. I’ll link that in the show notes. I don’t know anything about FPGAs, but I hear them mentioned occasionally, and so maybe I’ll join the workshop and find out some more.
Sounds good. It looks interesting. And I will say, without jumping into detail, there are a lot of really cool things happening in that space, with hardware and processors and single-board computers right now… And so a lot of new AI capabilities are coming out from various companies… So yeah, I would imagine this workshop is a pretty cool place to go.
Cool. Well, thanks, Chris. It’s been a fun discussion. Let’s think more about our privacy, and I promise I won’t install that camera in the cockpit of your plane.
Thank goodness! Oh, boy. Thanks, Daniel, on that note…
Alright. See you, Chris.
Talk to you later.