AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment, and is accelerating research in nearly every field of biology. Daniel and Chris delve into protein folding, and explore the implications of this revolutionary and hugely impactful application of AI.
- AlphaFold reveals the structure of the protein universe
- AlphaFold: Timeline of a breakthrough
- AlphaFold Protein Structure Database
- GitHub: deepmind / alphafold
- Oxford Protein Informatics Group: AlphaFold 2 is here: what’s behind the structure prediction miracle
- Nature: How AlphaFold can realize AI’s full potential in structural biology
- Nature: ‘The entire protein universe’: AI predicts shape of nearly every known protein
- Nature: Highly accurate protein structure prediction with AlphaFold
Welcome to another Fully Connected episode of the Practical AI podcast. In these fully connected episodes Chris and I keep you Fully Connected with everything that’s happening in the AI community. We’ll take some time to dissect a little bit of the latest AI news and dig into a few learning resources to help you level up your machine learning game. I’m Daniel Whitenack, I’m a data scientist with SIL International. And I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?
Doing really well today, excited about the thing that you’re about to tell our audience we’re going to talk about… And I just wanted to put a tiny bit of context around it.
We’ve gone through the pandemic, and there are major wars that we’ve talked about, you know, ongoing as we record this, and you know, monkeypox is now out…
Wasn’t it just called – I don’t know the designations; it was just designated as an emergency status somehow, or something like that.
Yeah, both WHO, and then now the United States has declared it such, as of yesterday, as we record this… So we’re going to be talking about a topic today that reminds me that we live in the most interesting time in human history. And things are changing faster than they ever have, and there’s actually a lot of reason to have hope in the world. As we talk about the possibilities that we’re going to talk about today, I just want to kind of remind people that there’s a lot of things that are really worth being positive about, and I think today’s topic is frankly one of them.
Yeah, a lot of people really doing things with tech that are beneficial, or at least the intention is that they would be overwhelmingly beneficial, right?
So I think that this factors in the topic that we’ll be talking about today, is AlphaFold and the corresponding database that they’ve released of protein structures. This came up, and you know, I was seeing – I don’t know about you, Chris, but I’ve seen it pop up in my news feeds various times over the past couple of years, and most recently, just this week, I think it was coming up in the news because of some of the things that they’ve released… I think in particular the sort of recent news is that they have this database of protein structures, and we can talk about kind of what that means and how it was generated etc. here, you know, over the course of the podcast…
[04:12] But this database of protein structures, and they’ve just released and expanded that from one million structures to 200 million structures. So that’s a pretty big increase in terms of the size of this database. And I don’t know – we were just talking even before the episode, Chris, about proteins, and maybe how those can be important for the study of various things. I don’t know if you want to chat about that at all, but…
…it was definitely interesting to look at this project and understand a little bit more about that field, which I’m not actively participating in.
And it’s important that we note that we’re exploring this as non-experts.
Yeah, come with us along our journey of learning about AlphaFold. [laughter]
So obviously, we’re here with our listeners because we all love AI, and we’re exploring things… But often the use cases are things that we don’t have expertise in, and this is one of those episodes that we call Fully Connected, where we’re just exploring and we’re bringing people along on the journey as we talk about this. And I have no particular expertise; I took some biology in high school and college, but I have no particular expertise. But I do know that proteins are the foundation of all life, and is incredibly important to understanding how they can be used in their application, their 3D structure; they can be incredibly complex. I will actually give a – I know I’ve relayed this to you privately, but I’ll give a quick setting on kind of why 3D structure is so important.
Many of our long-time listeners will know that I’m really into animal welfare causes, and particularly I handle venomous snakes quite often, with appropriate safety gear and such… But a friend of mine, named Dr. Brett Siegel, who has a Chemistry PhD from Harvard, he and I will often talk about - just for fun; it’s not what either one of us primarily do - we’ll talk about snake venom as just a fun two guys chatting thing… And Brent with his expertise in chemistry can literally look – we’ll be talking comparing two species, and he will be able to pick up, he’ll go and look at the protein makeup of snake venom, and then he’ll look at the protein molecules and the folds and where they’re at, and right there off the cuff he can just tell me exactly how those proteins are affecting if someone’s bitten, what that will do, and what that particular combination of proteins… And so protein folding may sound really esoteric to those of us who are not in biology professionally, but it’s crucial to understanding chemistry and life itself. It really gave me an appreciation for this topic before we get to this episode, and so I’m pretty excited about the possibility and I think it’s going to really revolutionize medicine.
Yeah. And I think in this episode at least what we’re going to try to do is kind of talk through how – the context for AlphaFold, the data, how it sort of works, and what the implications are. And so getting in the weeds a little bit with how this is actually operating. We’ll get there at a certain point. But yeah, I think setting that context is good. I was looking through some articles - again, because I’m not a chemist or a biologist… But looking through some articles that we’ll link in our show notes as also good learning resources for you, talking about the sort of reason why proteins and protein folding is useful.
[07:59] This is from the National Library of Medicine, which sounds very official. I don’t actually know a lot about the National Library of Medicine… They talk about how the proteins are basic building blocks of all cells in our body and living creatures, and that we kind of often think of DNA as being at the core – or DNA and genes as sort of being at the core of the information needed for life, which is true… But then the sort of dynamic processes of life, like the things that happen in our bodies, like the functions and the processes, defense mechanisms, and reproduction of certain things in our bodies - all of those sort of dynamic processes are carried out by proteins, which, you know, do this kind of folding and assembly into all of these complexes to actually perform functions. So it’s like a functional process.
Exactly. And to really get that tangible – and these are examples we’ve seen in many of these articles on this… You know, the fact that your eye and the retina can receive light and process that light to your brain, the mere fact that it can do that is protein-based. The fact that right now you’re probably – even if you’re sitting down, you’re probably moving some part of your body, and that movement that you’re engaged in right now is based on proteins; it’s just impossible to escape that fundamental function that proteins provide, times a billion different things. So this kind of technology - it’s going to be really fundamental to life going forward.
I was joking to you earlier that I wish I was younger than I am now, not just from an age standpoint, but because then these kinds of technologies could positively influence me for more years than they’re currently going to be able to. It’s like, every time I see these great advances coming out - I’m in my early 50s, and I’m looking at it kind of going, “God, why couldn’t that have happened in my 20s?” or something like that. So it is pretty cool stuff here.
One of the interesting things to me is like they’re releasing – so we kind of talked about how protein structure is important, and how it’s tied to the basic functions of life, and why that, like you’re saying, is important for advances in medicine and other things… What’s interesting is that all of this complicated function and processes that are carried out by proteins are fundamentally driven by sequences of what’s called amino acids. And there’s 20 of these amino acids.
I was trying to think of like a metaphor, and I don’t know if this has been used - I’m probably stealing it from someone - but when I was going through it and looking at this stuff, these sequences of amino acids, there’s 20 of them… You know, you can think about how much complexity we can see formed out of 26 letters of the Roman alphabet, in all sorts of languages. And you know, you can express innumerable things with that kind of small set of characters… Here we have this sort of sequence of amino acids, there’s 20 of these acids, and that’s what forms proteins and drives how they fold, and how they assemble, and how they do all of these functions.
So when we’re thinking about how does this intersect with AI, the process or the data transformation that we can think about is like, on one end you have sequences of amino acids that you might know about, and then on the other end you have the folds and the assemblies and the geometric structures, the 3D structures that are driven, the protein structures that are driven by the sequences of amino acids, or that you could predict from these.
[11:59] So an AI model, as we’ve talked about many times on this show, is at its core a data transformation. You take an image in, and then you get a label out, or something like that. Here, you’re taking these sequences of amino acids in, and out of it you’re predicting a 3D structure of one of these proteins. That’s really the fundamental kind of data transformation that we’re talking about, which is what AlphaFold is addressing, is sequences to 3D structure. That’s at the main core of what we’re talking about.
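To make the "sequence in" half of that transformation concrete, here is a minimal sketch of how a sequence of the 20 standard amino acids could be turned into numbers before any model sees it. The alphabetical ordering of the one-letter codes, the function names, and the example fragment are all assumptions made for illustration; this is not taken from the AlphaFold code.

```python
# The 20 standard amino acids by one-letter code.
# Alphabetical order is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Map a protein sequence to a list of integer indices in the range 0..19."""
    return [AA_TO_INDEX[aa] for aa in sequence.upper()]


def one_hot(sequence):
    """Return one row per residue, with a single 1 in the column for that residue."""
    rows = []
    for idx in encode_sequence(sequence):
        row = [0] * len(AMINO_ACIDS)
        row[idx] = 1
        rows.append(row)
    return rows


example = "MKTAYIAK"  # made-up fragment, not a real protein
print(encode_sequence(example))  # → [10, 8, 16, 0, 19, 7, 0, 8]
```

The "structure out" half would then be one (x, y, z) coordinate per atom or residue, which is the part the model has to predict.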
And I think in some of the materials that we reviewed ahead of time, if I’m understanding them correctly, those different amino acids, the folding itself is kind of amino acid to amino acid. So even though we’re talking about sequences, and you tend to think about a line of amino acids with the word sequence, but it’s being folded in 3D, with those different amino acids connecting to each other in different ways, and lots of different shapes. So even one sequence can have many, many different possibilities there, going back to your point; you know, even different folds with the same amino acids is the impression I’m taking away. So there’s a lot to happen there.
And as I’m referencing back what I talked before about my friend Brent - he can look at that and see a functional, kind of what it will do after that. So it’s very, very practical AI that we’re talking about here; we’re talking about something that the output can be put in the hands of an expert who can immediately see, in many cases, where this is going and what the effect will be. So super-practical medicine we’re talking about here.
Yeah, definitely. And I guess, to kind of bring home the importance of the methods that we’re about to go into… Previously - I mean, it has been known that knowing the structures and the folding process is important, and so people have done experiments over time. And you can find out the structures via experiment; I don’t know all the details of that. Maybe we can find a link to share in our show notes. But experimentally, you can find these things out. But of course, anything that involves, you know, a chemistry and biology experiment is going to be limited in terms of the pace and capacity that you can do, as we’ve all learned in terms of lab testing, you know, COVID results and that sort of thing in recent years. So there’s a limiting factor on that, which means that were you to be able to predict protein structures with a computer, which is maybe not – it still has a cost, in terms of computational cost and environmental costs and other things… But were you to do it, you’re no longer constrained by your sort of experimental capacity. You’re constrained maybe by your computational capacity and that sort of thing. So the scaling mechanism is quite different.
And to that point, I believe there was roughly - correct me if I’m not remembering this accurately, but I think that it was trained on roughly 150,000 known protein folds that had all been human-determined. You know, this was before the AI was applied. So that was the baseline. And to talk about the leap that we’re describing here, what was announced on July 28th, which was just a few days ago as we record this, was the fact that from that training set of 150,000, they went to 200 million, which describes nearly the entire universe of known folds. And I’m sure that there are more that they’re going to continue to work on, but that’s everything that we currently know, for all practical purposes. So you’re going from a fairly small subset, to most everything in this one big release that we’ll talk about with the database, and everything. So I’m pretty excited about what comes next.
Okay, well let’s maybe give just a little bit of context for AlphaFold, and then talk about the database that they’ve released a little bit. So my understanding is that AlphaFold kind of – it first started getting notoriety because of these shared tasks that were… Like, what I would think of in the AI world as shared tasks; maybe they’re called something different in the biology world. But there’s these shared tasks within a certain community, Critical Assessment of Techniques for Protein Structure Prediction, or CASP, I guess, assuming I’m saying that correctly. And they’ve had these over time, over the years, and CASP14 was one of those shared tasks where AlphaFold really kind of stood out from the rest of the pack in terms of what it was providing, and really showed the ability to very closely replicate the accuracy that you could achieve via experiment with predicting these structures. Because experiment in and of itself also has error related to it, right? So when you do an experiment to get these structures, you also don’t get like 100% accuracy. There’s error bars, and all of those things.
So what they were showing, which is quite extraordinary, is that this AlphaFold thing, which we’ll talk about more and get into the weeds of, is able to take these sequences, a sort of database of sequences in, and output structures that are of the same kind of level of quality as experiment in many cases, which means, “Hey, well, now you have a sort of choice. You could run experiments, but if you’re getting about the same accuracy out of the simulation, then that scales–” Like, you’re talking about the scale that you can achieve with that is something wildly different.
Yeah, I think all of the outputs are - obviously, being from an AI model, they’re all predictions; the accuracy of those predictions has proven to be something that is significant enough to where further research based on those outputs can proceed, rather than a lot of kind of going back and trying to figure out if the output of the model is sufficient in terms of accuracy to be able to base further research on it.
So it’s not just turning out a lot of outputs, it’s also the fact that they’re very high quality. And those two features together are what’s going to really propel things forward in the larger biology and chemistry space here to drive medicine forward for all of us.
The method that they’re doing has created these predictions, and so it’s really this bank of predictions that is part of this release that has been, you know, getting a lot of attention. We’ll link to a blog post about the release in our show notes, but one of the things that I thought was really interesting - Chris, I don’t know if you saw this - was there was a figure of like one circle, which was the experiment today… Like, how many structures do we have in our database of experiments, and then the database, when it was originally released - because they originally released the AlphaFold database with about a million structures. And then they have kind of the circle of AlphaFold database today and the scale just sort of like, for our listeners who aren’t seeing this in front of them right now, it’s like one big circle, which is the database today, and experiment is sort of like a little dot within that, in terms of what it represents… Because experimental structures in a database - one of these I understand is called PDB; it has about 190k structures, and Chris, that’s what you’re saying, these sort of supervised examples that they used in training. And then AlphaFold today, the database has 200 million plus. So that’s pretty crazy.
[20:18] They also give these circles representing how much is from different places, and you’ve got kind of a circle for animals, and plants, and bacteria, and fungi, and other… Animals is the biggest category, but then you have plants, bacteria, fungi and other things. So it’s pretty interesting, both the diversity and the size of this, I would say… And again, I’m near the field, but my understanding in terms of what’s offered here is actually, you know, 3D structures. So you can look up - AlphaFold itself is open source, so the inference pipeline is open sourced; as far as I know, the training pipeline isn’t, but the inference pipeline is open source, and you can look kind of in 3D at the structures that are coming out. So it’s like 3D Cartesian coordinates that are coming out, you put the sequence of amino acids in, you get these 3D Cartesian coordinates out, which are really just this 3D structure representing the 3D structure of the proteins.
Yeah. You know, as a dataset, the ability to do that, and then combined with previous technologies… So if you go back a few years and you talk about how big it was to release the human genome… And that provides a different set of capabilities, you know, in terms of understanding what our genetic predispositions are, and all sorts of different use cases… But now with the protein folding, to be able to maybe start with the genome, and understand what’s likely to happen and what your predispositions are, and then you can go use protein folding from this database, and be able to solve for some of those issues is pretty remarkable.
Yeah. I think also it’s like – when you think of the scale, 200,000,000, one of the other things that comes to my mind - and I’m sure people are exploring this, and our listeners, please share links with us in our Slack, or Twitter, or LinkedIn, or wherever, of studies that you know about… But you have now this dataset of 200 million; I’m thinking like, “Oh, what does it look like to do clustering sort of techniques on top of that?” Can you learn about the sort of structures, now that all of the proteins are kind of mapped to these 3D structures? What can you learn at a more aggregate level about like clusters of folding patterns or structures? What can you kind of post-process this dataset into and maybe build models off of these 3D structures?
We all know that the graph neural networks now are a huge thing that’s coming up, and people are exploring that more and more… So obviously, these are 3D spatial graphs, and it would be interesting to know what are people doing with these structures on the backend after they’re formed. I think that’s an interesting direction to study as well.
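As one illustration of treating a predicted structure as a spatial graph, the sketch below links any two residues whose coordinates fall within a distance cutoff. The 8 angstrom cutoff is a common convention for contact maps but is an assumption here, as are the function name and the made-up coordinates; a real workflow would read coordinates from a structure file.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Build adjacency lists: residues i and j are linked if within `cutoff` angstroms."""
    n = len(coords)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Made-up coordinates in angstroms, one point per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 10.0, 0.0)]
graph = contact_graph(coords)
print(graph)  # → {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
```

A graph like this is the kind of input a graph neural network or a clustering method could consume downstream.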
Yeah, I’m looking at these same documents that you are, and I can’t help but think about the fact that hopefully this is unleashing this revolution in this type of research. When you talk about that, I’m wondering how many high school and college kids today who have an interest that crossover might leap into this. I think this is a moment we’re going to remember, just like the release of the human genome was.
Yeah, yeah. And they already talk about the impact that AlphaFold is having, even just a couple of months after this sort of release. I see here that after they open sourced AlphaFold and the database, it’s already been cited more than 4,000 times in academic research, and there’s things related to – here they do AlphaFold predictions, references in publications, there’s a large complex that acts as a gateway, and then out of the cell nucleus… From something having to do with malaria, which is a protein for including in vaccines… There’s something having to do with the rate of mRNA degradation, which I think a wider audience is now more familiar with mRNA, after all of the vaccine stuff…
Yeah, yeah. There’s something having to do with causing frost-damaged plants, which is obviously an agricultural thing… So even outside of medicine, you could think about agriculture and other things.
That’s a really good point you’re making, because I think we’re focused in our conversation very much on medicine. But agriculture, food supplies… There are so many different areas. Pretty much everything in life, not just us walking around, are impacted by this.
And I know with your interest, Chris - I noticed this one too, about something involved in the immune system of egg-laying animals, including honeybees. Of course, you’re probably even more familiar than I am with sort of how honeybees and bee populations are in decline and –
Yes, it’s a huge crisis we’re in. Yeah. So who knows how this could impact many of those things. Well, maybe we could jump now a little bit and start talking about how does AlphaFold do this. So I think that we’ve established it’s caught the attention of many people because it does a really good job at this. They’ve open sourced the inference pipeline, so people can use it… But what does AlphaFold do? I mean, this is Practical AI, so we could probably all learn – even if we’re not all doing protein folding, maybe there’s elements of the way that they’re processing this data that are useful in our own creativity, in our own problems.
I think it’s interesting that in their processing pipeline you see sort of a number of really interesting things popping up from other domains. So the transformer architecture pops up within this. There’s what they’re calling an Evoformer, which we can get into why it’s maybe Evo - evolution-related in terms of how it is also iterative… But there’s this Evoformer architecture, there’s this element of joint embeddings, and also, in the training they use sort of supervised and like semi-supervised methods; they also use these BERT style - not in a pre-training way, but they use a BERT style masking in their training as well. All of those things - I think we talked about this on a similar episode… This sort of innovation is built off of a number of things that have just been sweeping across the whole AI world, including - you know, when you’re thinking about transformers, these joint embeddings, semi-supervised methods, masked language models… All of these elements kind of contribute somehow to how the data is processed in this pipeline.
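For listeners who have not seen BERT-style masking, here is a minimal sketch of the idea applied to aligned protein sequences: hide a random subset of residues and keep the originals as targets the model must recover. The 15% rate follows the usual BERT convention and is an assumption here, as are the mask symbol, the function name, and the made-up alignment.

```python
import random

MASK = "?"  # hypothetical mask symbol chosen for this sketch


def mask_msa(msa, mask_rate=0.15, seed=0):
    """Randomly hide residues; return the masked rows and the hidden originals."""
    rng = random.Random(seed)
    masked_rows = []
    targets = {}  # (row, column) -> original residue the model should recover
    for r, row in enumerate(msa):
        new_row = []
        for c, residue in enumerate(row):
            # Alignment gaps ("-") are left alone in this sketch.
            if residue != "-" and rng.random() < mask_rate:
                targets[(r, c)] = residue
                new_row.append(MASK)
            else:
                new_row.append(residue)
        masked_rows.append("".join(new_row))
    return masked_rows, targets


msa = ["MKTAYIAKQR", "MRTA-IAKQK", "MKSAYLAKQR"]  # made-up aligned sequences
masked, targets = mask_msa(msa)
print(masked)
print(targets)
```

During training, a loss on the `targets` positions would push the network to infer hidden residues from the rest of the alignment.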
Yeah. A few episodes back we had quite a conversation about that, and the fact that it is an analogy; if you think about these different approaches that you’ve just enumerated, you can think of them almost as Legos… And the creativity then of scientists and researchers being able to say, “Well, I’m going to try this one, and then combine it with that one, and maybe do it in a completely different domain.” And then you’re getting these interesting outputs.
[27:56] Before this episode, I was kind of thinking about the fact that it’s almost like about a year ago we almost entered, I think looking back, kind of a new era of AI. There was kind of the development of those models for a while, but now we’re seeing the mixing and matching of them and such. And I think that this is one of the outputs of that. So yeah, cool stuff.
Okay, Chris, so I think, if I’m understanding this right - and you know, we’ve looked through a bunch of things here… Even just you and I are learning about AlphaFold, but it seems like the network or the architecture that’s driving AlphaFold is kind of split up into a few different main components. The first of those kind of takes an input sequence, and then develops two kind of encodings of that input sequence; one which is called multiple sequence alignment, and one which is a pair embedding or a pair representation.
So there’s this first stage, which is input sequence to encoding (encoding or embedding) and then there’s a second stage, which takes that initial representation through a transformer-inspired architecture to develop a sort of hidden representation. And then those hidden representations are then fed into a last stage, which is a structure module, which outputs the actual kind of predicted Cartesian coordinates of the protein.
So we’ve got kind of encoding this transformer-based architecture which produces a different representation or embedding, and then we’ve got a structure module which produces the Cartesian coordinates. And what’s interesting, and one of the reasons why I think they’ve used some terms related to evolutionary algorithms, Evoformer and stuff, is there’s actually an iterative piece of this. So those last two stages, kind of putting the representations through the transformer-based architecture, and then out the other end to generate the structure - those actually cycle. At least in their paper, they say that they do that three times. So they kind of refine – they make an initial prediction of the structure, and then refine that by passing it back through the network, so that it kind of goes through this loop a few times, and then outputs a refined protein structure.
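The three stages and the loop over the last two can be sketched as plain control flow. Everything here is a stand-in: the function names, the toy arithmetic, and the choice to pass only the previous coordinates back are assumptions for illustration, and the default of three cycles simply mirrors the number mentioned in the conversation.

```python
def predict_with_recycling(sequence, embed, trunk, structure_module, num_cycles=3):
    """Run stage 1 once, then cycle stages 2 and 3, feeding each prediction back in."""
    msa_rep, pair_rep = embed(sequence)  # stage 1: initial representations
    coords = None                        # no structure exists before the first pass
    for _ in range(num_cycles):
        msa_rep, pair_rep = trunk(msa_rep, pair_rep, coords)  # stage 2
        coords = structure_module(msa_rep, pair_rep)          # stage 3
    return coords


# Toy stand-ins so the loop runs end to end; they do no real biology.
def toy_embed(sequence):
    return [0.0] * len(sequence), None


def toy_trunk(msa_rep, pair_rep, coords):
    return [value + 1.0 for value in msa_rep], pair_rep


def toy_structure_module(msa_rep, pair_rep):
    return [(value, 0.0, 0.0) for value in msa_rep]


coords = predict_with_recycling("MKT", toy_embed, toy_trunk, toy_structure_module)
print(coords)  # → [(3.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
```

The point of the sketch is only the shape of the loop: the same network weights are reused on each pass, and each pass starts from the previous pass's output rather than from scratch.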
Yeah, it kind of has a recurrent network aspect to it there, in the diagrams that they show there.
Yeah, exactly. There’s this kind of looping that happens… And from what I was reading, using deep neural networks to predict protein structure in and of itself is not an innovation of this work. So people have tried this for quite a while. But I think that there’s two kind of main pieces here that really kind of set this apart. One is this Evoformer architecture, which is unique to what they’ve done… And the second is this kind of iterative process, which kind of helps the network learn across these representations and the predicted structure in a really powerful way.
[31:47] So yeah, it’s interesting in… We can kind of dive into a couple of these things, but the first one - it kind of reminded me a lot of some NLP things to some degree, because you’ve got this input sequence, which again, is just a sequence of amino acids, and they generate two representations from this. Maybe people are more familiar with NLP - you might have a sequence of characters, and you might assign a number to each of these characters; because you have to represent text as numbers to a computer, because a computer knows how to calculate numbers, right? So here, they’re in some ways doing a similar thing. They’re taking this input sequence and they’re representing it by numbers, but in two kind of really interesting ways. One which kind of tries to identify - not identical, but other sequences that have been identified in living organisms, and it kind of creates what they’re calling this multiple sequence alignment. So it’s actually an alignment of this sequence with other sequences; a multi-sequence alignment. And then they have this pair representation where they’re actually trying to identify proteins that have a similar structure, and construct an initial representation that’s kind of a pair representation of these two things, thinking that “There’s similar things maybe in the whole database that we’ve learned about, and similar proteins, so maybe we can learn from those things.”
So the initial sequence goes in these two representations. The multiple sequence alignment, and then this pair embedding. So one which is kind of a matrix of sequences, and one which is a pair representation of one sequence with another.
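To picture the two representations as arrays, here is a minimal sketch. It assumes the MSA is a matrix with one row per aligned sequence and one column per position, and it assumes the pair representation is a square array with one entry per pair of positions in the query sequence, which is how the Nature paper lays it out. The clipped relative-position feature, the gap symbol, the small clipping value, and the function names are illustrative choices rather than the real model's features.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def msa_to_matrix(msa):
    """One row per aligned sequence, one column per position, entries are symbol ids."""
    return [[INDEX[symbol] for symbol in row] for row in msa]


def initial_pair_features(length, max_offset=2):
    """Square array: entry (i, j) is the relative position i - j, clipped to a range."""
    return [
        [max(-max_offset, min(max_offset, i - j)) for j in range(length)]
        for i in range(length)
    ]


msa = ["MKTA", "MRTA", "M-TG"]   # made-up alignment; the first row is the query
matrix = msa_to_matrix(msa)      # 3 sequences x 4 positions
pair = initial_pair_features(4)  # 4 x 4 positions

print(matrix)   # → [[10, 8, 16, 0], [10, 14, 16, 0], [10, 20, 16, 5]]
print(pair[0])  # → [0, -1, -2, -2]
```

In the real model each entry would be a learned feature vector rather than a single integer, but the shapes are the useful part: sequences by positions for the MSA, positions by positions for the pairs.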
Let me ask you a question. It’s more from your NLP background than this, but do you think it would be fair to say going through that two-step process is sort of like pursuing the probabilities iteratively as it goes, and kind of constantly working on where it’s more likely going to be, between having the multiple versions that it’s producing in that intermediate step, and then looking for other proteins that may have exhibited the same sequence, and therefore you already have a sense of what that folding might look like?
So in NLP, we leverage a lot of pre-training, which isn’t leveraged here, and to some degree, learn “Hey, language behaves in a certain way. So I can kind of pre-train some things and learn some things that I can transfer in.” I think the idea is slightly similar here. I think what they’re trying to say is, you know, proteins are different one from the other. But if you have similar sequences or similar templates of your protein, they’re not going to be quite the same. But some fragments and structure is going to be conserved across them. So I think they’re leveraging this existing database of knowledge and sort of these paired representations to kind of understand that “Yeah, there’s something unique about this single inference, but we also know a lot about other protein structure, and nothing’s completely sort of new.” So the contact between proteins or amino acids - yeah, the contact between amino acids, if that’s similar in this case to another case, it’s likely that some of these fragments of structure will be preserved as well.
I’ve gotta say, Dr. Whitenack, for someone who is not trained in this field, that is quite a good explanation.
I’ll let our listeners who have some type of chemistry and biology background correct me in our Slack channel, or something… But I am very thankful to – I should give a shout-out actually to… There’s a series of blogs that I looked at from the Oxford Protein Informatics Group. So if you’re listening out there, if we’ve got any listeners from that group, thank you for your blog posts and your work in explaining many of these things, because they’re very useful. We’ll make sure and link those in the show notes as well.
[35:54] But yeah, you’ve got this representation, this initial representation, and then that, as we’ve learned, is useful basically everywhere, whether we’re talking about images, or text, or whatever; these initial representations, the MSA or multiple sequence alignment and then this pair embedding are passed through a transformer-based architecture, which is this Evoformer, which is a unique architecture. You can read more about some of their choices that they made with that architecture in their paper in Nature. But it passes through this Evoformer architecture, which exchanges information between the two representations, so between the multiple sequence alignment and the pair embedding, and then outputs a kind of updated representation of both the multi sequence alignment and the pair embedding, the sort of hidden state of the model. And then that’s what’s passed into this third stage of the structure module, which takes those embeddings, takes that hidden representation, and then maps it to 3D coordinates, 3D Cartesian coordinates, which is the output structure. And then like we say, there’s a looping thing that goes along. So actually, this structure is fed back into the frontend of the second step, the transformer step, and you do this a couple of times where, you know, after generating one structure, it’s passed back, and that information is passed back to refine the structure.
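One way information can flow from the MSA representation into the pair representation is an averaged outer product over sequences, which the Nature paper calls the outer product mean. The sketch below is a heavily simplified version that assumes a single number per MSA entry; the real model uses feature vectors and learned projections, and the function name here is hypothetical.

```python
def msa_to_pair_update(msa_rep):
    """msa_rep[s][i] is one number for sequence s at position i.

    Returns a square array where entry (i, j) is the average over sequences
    of msa_rep[s][i] * msa_rep[s][j].
    """
    num_seqs = len(msa_rep)
    length = len(msa_rep[0])
    return [
        [sum(row[i] * row[j] for row in msa_rep) / num_seqs for j in range(length)]
        for i in range(length)
    ]


msa_rep = [[1.0, 2.0], [3.0, 4.0]]  # 2 sequences x 2 positions, made-up values
update = msa_to_pair_update(msa_rep)
print(update)  # → [[5.0, 7.0], [7.0, 10.0]]
```

Positions whose values rise and fall together across the aligned sequences get a larger entry, which is one intuition for how co-variation in the alignment can inform which residue pairs matter to each other.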
I’m curious, and I’m gonna throw another tough question to you… And it’s fine to say “Too far, Chris.” As you looked at the Evoformer and kind of how it’s approaching, do you have any thoughts on – as we’re talking about this era of using these different components in different ways and combining them, and going across domain, any thoughts on what an Evoformer might be used for in other contexts? Do you have any– I know that’s getting out there a bit…
Yeah, it’s a very interesting question. I do wonder, like, one sort of random idea… And you know, this is a random idea that I haven’t thought about until this moment, so there’s probably flaws in it… But I wonder if certain things like this could be used for, you know, multi-lingual models, and that sort of thing… Because you’re taking these multi-sequence alignments, which are sequences of different proteins… And I wonder – and they’re kind of labeled accordingly. I wonder if you could have this sort of multi-language alignment between different languages, and then factor that in. That’s a random thought… But I definitely think that this sort of idea that you would take a single input and represent it in two initial representations that have a slightly different character and represent different things about kind of your problem space, and then combining the information of both of those representations in the transformer - that could be applied in a number of different ways… You know, whether it’s text input, or image input, you could represent that in a couple of different ways that are useful and then mix those representations in this sort of Evoformer type architecture. I’m sure that even after AlphaFold, some of those 4,000 citations do a much better job at postulating possibilities than myself. So maybe –
That wasn’t too bad for off the cuff.
Yeah, maybe one homework assignment for all of us would be to look at Semantic Scholar or something and look at the 4,000 citations and see which are the ones popping up that are related to reuse of the Evoformer architecture. I’m sure there’s a few things that have already come out.
[39:51] I think it is interesting that – we can just say something briefly maybe about the training of this before we close out, because I think that is an interesting bit of this. We are Practical AI after all, and I think we can maybe learn a little bit from the general training structure that they’ve set up for AlphaFold… And that is that they have this initial set of supervised examples from this P – I was going to say PBR, but that’s definitely not the right domain. What is it? PB-something, the protein… PDB, that’s it. So PDB, not Pabst Blue Ribbon, but PDB is this – like, 175,000-190,000, whatever it was – set of existing protein structures, right? So they have supervised examples, but what they did was actually train the AlphaFold architecture on the supervised examples, and then use the trained model to generate new structures, sort of like a bunch of different guesses. And from the high-confidence ones they took 350,000 of those generated samples and combined them back in with the supervised, the gold standard samples, to create this mixed dataset, which they then retrained AlphaFold on.
So you have this mix of supervised learning with what they’re calling this “Noisy student self-distillation”, which is basically this process of, “Hey, I’m going to use my model to generate new things, and then we’re going to add the high-confidence ones back into my dataset”, which is a really interesting structure that a lot of people could use. You know, you don’t have to be using AlphaFold to use that idea, right? You can do that when you need to augment your dataset somehow. And so I think that that’s maybe another learning to be taken away here, that they’re using some creative elements in the training as well, which help them kind of boost the performance.
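As a minimal sketch of that idea outside of AlphaFold, the loop below trains a "teacher" on labeled data, pseudo-labels unlabeled data, keeps only the high-confidence guesses, and retrains a "student" on the mixed dataset. Everything here is an assumption for illustration: the toy nearest-centroid classifier on 1D points, the confidence score, the threshold, and the small Gaussian jitter standing in for the "noisy" part. None of it is AlphaFold's actual training code.

```python
import random


def fit_centroids(examples):
    """Train a toy classifier: the mean x value for each label.
    `examples` is a list of (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence). Confidence is between 0.5 and 1.0 and is
    higher when x is much closer to the best centroid than to the runner-up.
    Assumes at least two classes."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    (d0, best), (d1, _) = dists[0], dists[1]
    total = d0 + d1
    confidence = d1 / total if total > 0 else 0.5
    return best, confidence


def self_distill(labeled, unlabeled, threshold=0.8, noise=0.05, seed=0):
    """One round of self-distillation with a confidence filter."""
    rng = random.Random(seed)
    teacher = fit_centroids(labeled)
    pseudo = []
    for x in unlabeled:
        label, conf = predict_with_confidence(teacher, x)
        if conf >= threshold:
            # Jitter the input so the student trains on a noisier view.
            pseudo.append((x + rng.gauss(0.0, noise), label))
    student = fit_centroids(labeled + pseudo)  # retrain on the mixed dataset
    return student, pseudo


labeled = [(0.0, "a"), (0.2, "a"), (1.0, "b"), (1.2, "b")]
unlabeled = [0.05, 0.6, 1.15, 0.45]
student, pseudo = self_distill(labeled, unlabeled)
print("kept", len(pseudo), "of", len(unlabeled), "pseudo-labeled points")
```

The threshold is the key design choice: set it too low and the student learns from the teacher's mistakes, set it too high and almost nothing gets added to the dataset.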
So as we wind up, I’d like to challenge – we have so many practitioners in our audience… I would love to hear about some of the novel ways that they’re taking these techniques and using them across other domains, and combining them. That has really been fascinating in recent months, to see some of the creativity in the space across different types of use cases. So I’m looking forward to hearing what people are doing with Evoformers and some of the other combinations that are present in the architecture here to do completely new things, particularly those things that benefit the world at large.
Yeah, yeah, definitely excited to hear about that. I’ve kind of already mentioned some learning resources for people, and we have a bunch of links we’ll add into our show notes that people can explore, but if you’re looking for something to start with, DeepMind does have a really good, brief explainer video about protein folding, and AlphaFold, and how that fits together. So we’ll include that; that’s a really good starting point. And if that sparks your curiosity, they actually have published a Colab version of the inference pipeline… So you can actually spin up Google Colab and try to predict some structures yourself. I think that would be maybe the best way to learn about this, is just to try it. So we’ll link to the GitHub, to AlphaFold, and then yeah, you can try out that Colab on your own.
Well, I’ll finish with this… You might share – I started with the idea that there’s a lot of reason to be optimistic about the world and the future, despite the fact that there are plenty of things to bring us down… If you’ve enjoyed this episode, you might go share some of this with the people in your life, whether they’re into AI or not, just because it’s worth knowing that the world is still moving forward in a really positive way, even when other things are a bit challenging. So share this with people whom you might not otherwise think about…
And that’ll be it. I’ll talk to you next week, Daniel.
It’s been good to chat, Chris. See you soon.