Drausin Wulsin, Director of ML at Immunai, joins Daniel & Chris to talk about the role of AI in immunotherapy, and why it is proving to be the foremost approach in fighting cancer, autoimmune disease, and infectious diseases.
The large amount of high dimensional biological data that is available today, combined with advanced machine learning techniques, creates unique opportunities to push the boundaries of what is possible in biology.
To that end, Immunai has built the largest immune database called AMICA that contains tens of millions of cells. The company uses cutting-edge transfer learning techniques to transfer knowledge across different cell types, studies, and even species.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined as always by my co-host Chris Benson, who is a tech strategist at Lockheed Martin. How’re you doing, Chris?
I’m doing very well. We’ve just come across summer solstice, we’re into summer, we’ve got hot weather all over the United States… So I’m just figuring we need some hot AI topics to go with that, don’t we?
Some hot AI topics…
If we’re burning up 24 hours a day, we’ve gotta burn up with some AI too here.
[laughs] Good one. Good one. I don’t know many things that are maybe hotter than sort of AI applied within healthcare, or within pharma, or within genomics… I actually just saw a tweet today that someone had just released the first open source version of AlphaFold, the protein folding thing, and that’s pretty cool. So yeah, we’re really privileged this week to have someone who’s an expert in the field of AI as applied to immunotherapy, and genomics… We have with us Drausin Wulsin, who is the director of machine learning at Immunai. Welcome, Drausin.
Thank you. So good to be here. Thanks for having me.
Yeah, for sure. Well, maybe before we sort of jump into your specific work and some of the things that you’re doing at Immunai, maybe it would be good from an expert in the field to just hear a little bit about how have you seen AI sort of creeping its way into immunotherapy, or maybe applications to genomics, or this sort of field - how have you seen that history progress, and where are we at right now?
Yeah, so it’s actually sort of crazy to think about it. My first introduction to AI and genomics was about a decade ago, when I was in graduate school, and I was studying statistical methods for understanding epilepsy, so something very different than immunotherapy. But I saw a bunch of papers starting to explore microarray data. What microarrays are is basically you want to run an experiment, often on a 96-well plate. And so you get to profile 96 genes. And all of a sudden, we were able to measure 96 genes in a single experiment, in a single little well, and the sort of AI people were going crazy about this, because it was this brand new, big(ish) – now we would call it small data, but back then it was big data. This was sort of maybe 2010 or so.
[04:25] And people were using this really rich data to start both to test new statistical algorithms and ML models, and also to try to understand associations between genes, which finally we were able to measure enough genes and proteins that we could actually sort of start to peel these back… And I just always thought it was very – I always kind of wished that I could get into that area, of getting into microarrays… But then my path took me elsewhere, and then, fast-forward eight or ten years and all of a sudden no one uses microarrays anymore, because you only get 96 genes, and meanwhile, now we can get all 20,000 genes in the human genome.
So I think the thing and the high-level story of AI helping understand biology over the last decade is this sort of intertwined story between algorithms getting better and more capable, and also critically our experimental techniques getting much better. Actually, I would say the experimental techniques might even be getting better faster than the algorithms, because we’re just able to profile so many more cells and genes and aspects of biology… And here I’m just talking about one small area of the larger biological understanding landscape; we can talk a little bit more about others. But I think this is the exciting piece about our field, is this sort of race track of both the experimental techniques, and also the computational techniques are racing against each other.
It’s interesting that you say that, and we’ve heard that from other people in other fields, but just the acceleration you get from kind of – this is a whole new tool set. We’ve been doing this for a while in this, but if you look at the scheme of these fields that it’s being applied to, it’s getting new tools, and figuring out how to use them effectively… I find it interesting, because we love talking about how fast the AI field is evolving, and you’re actually saying that you’re getting a bigger impact just from the fact that you’re learning how to best use those tools, which I think is a great lesson for folks.
So in the last decade, let’s say just sticking with this area that I know a little bit better, from microarrays to where we are now, we’ve seen basically three revolutions in sort of profiling of individual cells and biology. We went from microarrays to something called bulk RNA sequencing, where you can take a bunch of cells and understand what they’re doing across all the genes in the human genome… And then from bulk, maybe five years ago we started the single-cell revolution. There, instead of looking at what’s the average of 1000 cells or 10,000 cells, all of a sudden we can look at what’s going on in an individual cell, for each individual cell. And that is tremendously exciting.
There’s a lot more that I could go into if we want to go down this route of sort of the single-cell profiling opportunities we have… But this is, as I said, just one area of biology that has sort of been revolutionized by just the data capture that we’ve been able to perform in the last decade. And it’s the data that’s the thing that’s driving it that I’m most excited about. If I think about what the next decade looks like - yes, the algorithms will sort of be right there along, but the data is going to be leading the pack, and that’s the thing that’s most exciting.
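As an editorial illustration of the bulk versus single-cell contrast drawn above, here is a minimal Python sketch. The counts are made-up values for a single hypothetical gene, not data from Immunai; the point is only that an average over many cells can hide two very different groups of cells.

```python
from statistics import mean

# Hypothetical expression of one gene across six cells (illustrative values only).
# Three cells express the gene strongly; three do not express it at all.
single_cell_counts = [0, 0, 0, 12, 12, 12]

# A bulk measurement sees only the average over all cells in the sample.
bulk_average = mean(single_cell_counts)

# A single-cell measurement keeps the per-cell values, so both groups stay visible.
expressing = [c for c in single_cell_counts if c > 0]
silent = [c for c in single_cell_counts if c == 0]

print(f"bulk average: {bulk_average}")
print(f"cells expressing: {len(expressing)}, cells silent: {len(silent)}")
```

A sample where every cell had a count of 6 would give the same bulk average, which is why the per-cell view carries extra information.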
[08:05] So on that front, I love how you brought up this side, this data-centric side of what you’re doing. You brought up - okay, we can go down to the single-cell level, we can get a bunch of data about this single cell… I’m wondering if you could describe, for those of us who don’t have this sort of experience with biology and such, what is measured, or what does the data look like for a single cell, and why is that data important? Why is that connected to anything we would want to care about?
And if I can extend that to one other dimension, what’s the difference in the single-cell versus the bulk in terms of what you’re getting out of it as well?
Oh man, I could spend the whole chat just on this.
You can tell you piqued our interest…
Yeah. Okay, let’s first cover just what is the data, and then we’ll get to what’s the difference between individual cells versus what’s sometimes called bulk, which is the average of many, many, many cells. So what is the data - you have to actually recall your high school biology, which is how does a cell work?
Exactly. This is why it’s wonderful - actually, all of this is relevant. So you’ve got DNA, and the DNA is your master set of instructions for a cell. But you don’t crack the master set of instructions every time, right? So normally, when a cell needs to do its thing, it just needs a page here, a page there, it needs to make a bunch of copies from your master set of instructions.
So you can think about - there’s a process of making a bunch of individual copies of individual pages, and then it’s giving those copies “Okay, here’s how to make this protein, or here’s how to make that protein” to some factories. These factories are called ribosomes. And the ribosomes are the things that actually take the copies of instructions for individual proteins and actually make the proteins.
So with that – so just covering that, there are three things you can measure right there. You can measure one, what pages are you copying in the master set of instructions? Think about what page is the master book open to, that you’re about to make some copies on? Imagine that you can then say “Okay, what are all the loose pieces of paper floating around, that are encoding proteins that we care about?” And that’s going to tell you something about what a cell is doing. Basically, what instructions is it passing around to its factories?
And then you can also observe, okay, what’s the product of these ribosome factories, the proteins? What’s coming out? How can we measure how many of this kind of protein or that kind of protein? And so you can measure all of these, and they’re all complementary. So you can measure some proteins a little bit better, but not every RNA molecule – so basically, the middle thing, which are these individual pages that you’re sort of copying from your master book, this is what’s called messenger RNA. And messenger RNA are these little pieces of instruction that tell ribosomes like “Here’s how to make this protein.” They code for a specific protein.
And so messenger RNA is – so you can imagine reaching into a cell and grabbing a bunch of random… Like, you know, whatever pages are on the floor, so to speak… Sweeping those up and then tabulating them, and counting them and say, “Okay, this is a page for this protein. This is a page for this protein.” And that’s one way to profile what a cell is doing, is by sort of counting the number of mRNA molecules. And literally, with some of these techniques, we are counting individual molecules of mRNA. And that allows us to sort of – if we can see, “What things is the cell doing?” it gives us a sense of what’s the function of the cell, where has it been, where is it going… But each of these data - sometimes we call them modalities - it has its biases, and its problems and its weaknesses… And the idea of a lot of modern genomics profiling is sort of capturing multiple of these modalities. At Immunai, where I work, we actually have a technique that can capture all three at the same time, and they sort of complement each other well.
[12:20] Well, since you had started off kind of enumerating the three, and as we’ve kind of gone through, can you tell us why would you do all three at the same time? Why isn’t the most recent one the best one? What are you getting from each one?
So each has its trade-offs and benefits, right? Let’s take the simpler example of what’s called RNA-seq, which is basically these individual copies of RNA. So in this, you can actually count up the molecules of RNA, and you can basically get an expression profile for every gene in the human genome. Usually, about 20,000 genes actually have activity here. So you have amazing scope in covering the entire human genome in a single readout. But the trade-off for that is that 90% of the sort of counts - basically, what you get here is for each… Now I’m talking about a single cell; for each individual cell you get to count how many molecules of RNA for a particular gene did I observe. And the problem with RNA-seq, at least single-cell RNA-seq, is that 90% of the genes have zero counts, because you just can’t observe all of these molecules of RNA. So you’ve got a lot of zeros you’ve got to contend with. It’s a very sparse readout. So let’s contrast that with proteins.
So proteins, which basically can exist both within the cell, but especially on the surface of the cell, are what biologists have traditionally used, sort of for the last 50 years, to characterize and identify biology, certain cell types.
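To make the "very sparse readout" concrete, here is a small Python sketch that simulates the count vector for one cell. The 20,000 genes and the 90% zero rate come from the conversation; the way the nonzero counts are drawn is an arbitrary assumption for illustration, and real single-cell RNA-seq counts follow different distributions.

```python
import random

random.seed(0)  # fixed seed so the toy example is reproducible

N_GENES = 20_000     # roughly the number of genes mentioned in the conversation
ZERO_FRACTION = 0.9  # assumed zero rate, taken from the "90% zero counts" remark

# Simulate one cell: most genes get a zero count, the rest a small positive count.
counts = [
    0 if random.random() < ZERO_FRACTION else random.randint(1, 5)
    for _ in range(N_GENES)
]

observed_zero_fraction = counts.count(0) / N_GENES

# Sparse storage: keep only the nonzero entries as {gene_index: count}.
sparse_counts = {i: c for i, c in enumerate(counts) if c != 0}

print(f"fraction of zeros: {observed_zero_fraction:.3f}")
print(f"stored entries: {len(sparse_counts)} of {N_GENES}")
```

Storing only the nonzero entries is one common way to handle data like this, since roughly nine out of ten values would otherwise be zeros.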
So in the modern era, I can, say, choose 100 proteins to profile, maybe 200. And I sort of choose them carefully, and I often choose them so they capture all the biology that I would want to measure. And for each of these, I get a pretty high-quality readout. The data is high quality, but it doesn’t cover all the possible proteins. I have to place my bet.
And so sometimes what we’ll do is we’ll say, “Okay, I want to use the proteins to identify, let’s say, cell types, or to identify which cells are dead or alive” or there are many things you can do. And then I’m going to use the RNA to identify what’s going on under the hood in the cell, that I may not be able to as clearly see with all the proteins. But sometimes I can say, “Okay, I really don’t care about all the stuff that’s going on under the hood. I just want to know, is the cell really energized to go kill some tumors? Or is it feeling exhausted, just coming off the field of killing a bunch of tumors?” And sometimes there are a couple of individual surface proteins that we know about, that will indicate the sort of state of the cell. So it’s sort of you have this menu of things you can read out from a cell, and you have to be very thoughtful about choosing, “Okay, what are you going to read out here or there?”
And there are lots of other very interesting trade-offs, because each of these modalities has a cost. It has a cost in money, it has a cost in human effort, it has a cost in time. So I could say, “Okay, would I rather have a million cells of this sort of single-cell RNA-seq data, or 200 proteins worth of data for each individual cell? Or would I rather have 5 million cells with just 10 proteins?” And depending on what I’m trying to do – but that 5 million cells with 10 proteins, I can get that data tomorrow, right after I did the experiment in the lab, whereas the stuff where I have all the RNA molecules, or I just have 200 surface proteins - that maybe I have to wait a couple of weeks, because it has to go get sequenced, and stuff like that.
I think one of the things that’s really hard, but also fun and interesting about biology and biological data is that there are so many options, and you have to think about trade-offs a lot more, I think, than if you’re just working with vision or text or some other sort of modality that – or some other type of data that many of us in the AI world are used to thinking about.
So we’ve talked a bit about cells, we’ve talked about genes, we’ve talked about this sort of measurements within a single-cell, or bulk… I’m wondering if you could kind of connect this to what I’m learning about on your website, which is more sort of related to immune profiling or immunotherapy. How do the cells and what we know about the cells connect to kind of immunotherapy, and what exactly does immunotherapy mean maybe for some people that are new to that?
Great. It’s a great question. I’ll be honest, before I started working here, I had no idea about immunotherapy. I only sort of knew at the very highest level, so I can walk you through sort of my own learning process as well. But first, even before we get to immunotherapy, let’s talk about what is the immune system? So the immune system - think of it as a combination of the security guard force, it’s the police force, and it’s the army, and it’s the Air Force.
Speaking in terms Chris can understand…
Yeah, I work in the defense industry. I’m all on board with this. Keep going, I’m sorry to interrupt…
So your immune system is the defense industry for your body, and it is insanely good at its job, because 99.99% of bad things that happen in your body are crushed, right? The immune system dispatches them, no problem. And this is the product of hundreds of millions of years of evolution, and lots of predecessors of ours dying, in order to naturally select for this beautiful thing that is the immune system. It is actually like a defense force, where you have different specialized players that are good at different things.
You’re not going to send a SEAL team to monitor your apartment building; you have a security guard for that. And the security guard is maybe even better than a SEAL team in some cases, for various reasons. And so the immune system has evolved to have all these really good different players that work together really well, to sort of crush viruses and bacteria when they sort of come into your body… And also to crush things that happen inside your body. So the two biggest problems that happen and that the immune system is balancing between are cancer and autoimmune issues.
So cancer is when basically there’s some mutation in your body that - it’s like a runaway train. All of a sudden, some of your own cells start replicating, and all of the brakes and emergency brakes that usually keep this from happening have broken, and there’s this sort of out of control growth that’s happening.
[20:01] Almost all cancer - again 99.99% of cancers are dispatched summarily by your immune system; your immune system sees what’s going on, it goes and kills those cells; you’ll never know. At the same time, in autoimmune issues, what’s happening here is the immune system is sort of going overboard. It thinks there’s a problem, but there’s actually not a problem, so it’s just attacking sort of like civilians, healthy cells. And this is also very problematic. So I’d say the immune system is this wonderful balance – like checks and balances. And there are lots of cell types that work and signal to each other to keep each other in control.
Okay, so what is immunotherapy? So immunotherapy is this wonderful sort of revolution, really, in oncology treatment that’s really taken place and blossomed in the last decade, where we basically really have realized that with just some very minor coaching, we can get the immune system to be way more potent at killing some types of cancer. And the coaching that I’m talking about is basically just binding one little kind of antibody to a certain type of immune cell. And when you do this, all of a sudden the immune cell sort of has like a forcefield on, and the cancer cells can’t turn it off in the way that they’ve evolved to be able to do. And so immunotherapy is really the art of coaching the immune system to be better than it already is. It’s sort of using an existing tool, just better, and giving it the pep talk that it needs to go and fight the cancer.
So I’m just curious, and I don’t know if I’m going to ask this the right way, but… By putting in that antibody, which is essentially plugging it in right there, that’s the light switch that the cancer cell would be flipping on or off to have its impact, and you’re just kind of taking that away; you’re using your force field, or you’re putting the kids’ safety cover over the light switch, so to speak…
Exactly. It’s like putting that piece of tape over the light switch.
There you go.
So in the sort of technical term here, in the immunotherapy literature it’s called immune checkpoint blockade.
And so what you’re basically doing is sort of putting that tape over the light switch that the tumor has evolved to be able to use to turn off your T cells. Your T cells are your main fighters; they’re like your Marines, or your SEAL team. They’re super-elite, and they go in, and they’re killers.
There are a lot of other immune cells that are involved. But you know, your Marines - they need an off switch. You don’t want them running wild throughout your body. So that’s why there is this off switch, and tumors evolve the ability to sneak by and actually sort of flip that off switch when the T cell doesn’t realize what’s going on. It’s kind of crazy that – really, in immunotherapy, in the first five years from, say, 2010 to 2015, there were really just a couple of targets. PD-1 is one, and CTLA-4 is another. And just between the two of these, I can’t tell you how many lives have been saved because of the cancer treatments that just these two targets have enabled.
I’m kind of thinking about, in my mind, while you’ve been talking about what these things are, what the data looks like, I’ve been thinking in the back of my mind about kind of how I would expect AI to fit into this. When I teach a workshop or something, often a common question is “When should you apply AI to a problem, and when shouldn’t you?” And I often put that in terms of scale and complexity. So it’s very easy for a human to identify a cat in an image, it’s very hard for a human to identify a cat in a million images, just time-wise, even though that’s a simple task for a human. But there’s other things, like maybe it’s certain time series forecasting, or maybe it’s things where the data is very complex, language-related things or something like that, where the data, the problem, is very complex for a human to even make a single inference. And so that doesn’t even necessarily require scale.
[24:25] And here you’ve kind of talked about – it seems there’s complexity around the data on one side in all of these different things that you can measure selectively against single cells, or bulk cells… But then there’s a whole complexity on the other side, of all the different ways the immune system works, and its different targets within the immune system… And I’m wondering if that’s a good way to represent it, or how you would kind of make the case that – or for the place, I guess, that AI fits within this problem.
So one of the things that is really challenging about using AI in this biospace is there is so much that you could do. Because now there is just so much data that the space of possible things that we could do is just ginormous. And so you have to think about, what is the thing you really care about? What is the question or the problem that really matters to you, and go solve that. And even once you’ve identified the question or the problem, as you said, Daniel, a highly trained immunologist can’t look at a 20,000 dimensional sparse vector and tell you what kind of cell type it is. They can look at it – we can all look at a picture of a cat and a picture of a dog, or a face bounding box and see “Did the algorithm do a good job or a bad job?”
And so I think in sort of the bio/AI combination, you’re sort of in the hardest of both of these regimes, and the data is hard to understand and hard to sort of know what the ground truth is sometimes, because also biology is super-messy, and the problem space is so large and unbounded that it’s really easy to sort of get lost in the woods.
Yeah, and I know on your website and some of the things you’re involved with, some of that involves this sort of transfer learning. I know some parallels were drawn between large language models, things going on in the NLP space… How do you chisel down to – like you said, there’s so many things you could do. Immunai is specifically interested in immunotherapy… How do you go about saying, “Okay, here’s the space of what we’re interested in, here’s this really complex data. Here’s a bunch of things that are going on in the AI industry more broadly, to other types of problems.” How did you kind of come into the place where you understood how to connect certain of these things, whether it’s transformers or whatever, to the specific problems that you’re thinking about? And maybe you could give us an example of one of those problems.
Absolutely. Actually, this was one of the founding insights that Luis and Noam, our co-founders, identified: there’s a lot of data being generated and a lot of problems being posed in the immune system space, but none of those datasets and none of those problems are benefiting from other people posing really similar problems, using really similar datasets. Each person is sort of doing it in their own narrow, little lane. And I think one of the things, especially with transformers and this idea of a foundation model – but it goes back earlier; even back in 2010-2013, people were talking about semi-supervised learning, where you train an unsupervised model, and then you fine-tune it for specific tasks. And so what our founders realized is that basically there is this opportunity to do this transfer across tasks.
[28:10] And there’s lots of ways that – and actually, this is essential in our world, because of the multiplicity of data modalities and readouts, and different experimental conditions and contexts. So again, in vision and text - I’m sure that people who do vision and text would laugh at me, but I have to say that they have it easy, right? Because all the images, they’re all RGB… Maybe they have some different focus, or out of focus, or different sizes, but basically, it’s all the same. And text - you’ve got maybe some language differences, but fundamentally, you can go and scrape all of Wikipedia and all of Reddit and get a pretty good dataset from those two sources. And this is actually what enabled a lot of the AI revolution of the last decade, is just pulling data from the web.
Whereas in bio, instead of having a couple of really high-volume sources of data, you’ve got many tens, or maybe even hundreds of smaller pieces of data. And so if you want to benefit from all of those smaller blocks of data, you have to have models that are more like Legos, right? That you can sort of learn an embedding from this dataset over here, and bring it to some new model that can benefit from, let’s say, that embedding, or that inference that you learned.
At Immunai we’re doing a lot of work on this, and we’re probably at the front of the pack, I would say, but we have not figured it out, and no one else has. But I think this is the thing that’s exciting over the next decade, is we will as a community figure out how to do this… And it’s really helpful to be an AI person in the biospace to be able to point to the successes that, let’s say, in natural language understanding we’ve been able to have using this transfer learning approach. Because before Transformers or GPT 1, 2, 3, you had 25 years of computational linguistics, people building really finely-crafted models and context-free grammars and things like this, and it actually took a certain amount of data, a certain scale of data and a certain class of algorithm to sort of trample all of that work. And I think we’re just about there in the bio world. But there’s still a lot of earlier things that I think are around.
But an example - let me give you a really specific example of a transfer learning task. So at Immunai we’re primarily a single-cell company, which means that we take, let’s say, a tumor of yours or a vial of blood, we process it, and then we’re profiling each individual cell that we’re getting. And when I do this, I want to know, “Is this cell a SEAL Team Six cell, or is this an apartment security guard cell?” Because it depends on how I think about what is the cell doing, or not doing, and is this good or bad.
So this is something called a cell type annotation, and it usually involves an expert immunologist who understands what all the immune cell type profiles we should see are… But ultimately, it can be boiled down into a classification problem. So we can do this classification problem for, let’s say, immune cells that we would see in the blood. But what about immune cells we would see in the bone marrow, or a tumor? Those are related, but they’re not exactly the same; different profiles. So is this a separate problem, or is this just a sort of very similar flavor of the blood profiling problem? And so there’s this problem – so traditionally, people think of it as a completely separate problem, completely separate tools and approaches and atlases… But I think the reality is, it’s just another instantiation of a similar problem.
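As an editorial illustration of cell type annotation framed as classification, here is a minimal Python sketch using a nearest-reference-profile rule. This is not Immunai's method; the cell type names, marker values, and the `classify_cell` helper are all hypothetical and chosen only to show the shape of the problem.

```python
import math

# Hypothetical reference profiles: mean expression of three marker proteins
# for two made-up cell types. Real references come from expert annotation.
reference_profiles = {
    "cell_type_A": [9.0, 1.0, 0.5],
    "cell_type_B": [0.5, 8.0, 1.0],
}

def classify_cell(profile):
    """Return the reference cell type whose profile is closest (Euclidean distance)."""
    return min(
        reference_profiles,
        key=lambda label: math.dist(profile, reference_profiles[label]),
    )

# A new cell, e.g. from a different tissue, with a shifted but similar profile.
new_cell = [7.5, 2.0, 1.0]
print(classify_cell(new_cell))
```

The new cell does not match either reference exactly, yet it is still much closer to one of them, which is the intuition behind treating blood, bone marrow, and tumor annotation as flavors of one related problem.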
[32:03] And then there are a bunch of other technical problems, of like how do I know whether the cell that I’m observing is one cell or two cells? Because sometimes the technical hardware - there can be two cells that sort of get smooshed together, and you can’t resolve them. So this is actually an interesting, very in the weeds technical problem that you have to solve if you’re going to do it with single-cell work, and this is also a simple binary classification problem.
And so when you think about all these individual – and here, I’m talking about just single-cell problems, but there are analogs for bulk and sample level, and there are lots of very related problems that up until, let’s say, the last year or two people have been solving independently, and now we’re working to solve them together.
I’m pretty intrigued by this whole idea of a generalist approach to applying this kind of pre-training and transfer in multiple domains, with multiple modalities of data… It’s something I’m personally really fascinated by right now. I’m wondering if you could describe how you might approach pre-training and self-supervision for biological systems… Because I could think in the Natural Language Processing space, for example, I know I have this text, I know this word goes here, and I can just remove it, and then I have a blank, right? Is that some of the inspiration for what you’re doing in terms of pre-training, or how do you think maybe about self-supervision and what might be the more relevant things to think about in the biological side?
So in many ways, the single-cell – and again, we’re going deep on the single-cell world, because especially it’s what I know the best, and it’s what Immunai focuses on… In the single-cell world, if you look at the trajectory of models and techniques let’s say over the last five years, the field literally didn’t exist five years ago, or it barely existed five years ago. Five years ago people published papers on 200 cells, now we’re publishing on 2 million. So just the scale of the data in the last five years has enabled certain new flavors of models that just didn’t exist, or we couldn’t do before.
But early on, in the first two or three years, people trained auto-encoders, paralleling sort of the earlier work, both in vision and in text. So initially, it’s just train your auto-encoder as basically each cell is an observation, and maybe you select 5000 genes that are the most variable or the most active for a cell, or maybe you do all 20,000, depending on how much data you have… And you run that through a bottleneck where you’re just trying to reconstruct the gene expression. Then you take that middle bottleneck layer and you do something with it; maybe you fine-tune it for a specific task. Often, people use it for data exploration in an unsupervised way, maybe visualization, or clustering, and things like this.
So this is sort of where it started. It’s just now in the last year starting to happen where people are like “Huh, now we have enough data, and there’s this fancy transformer thing that I’ve been hearing so much about… Maybe we can start building some of those.” And in that, the task can vary a lot, but probably one of the most – sort of analogous to the language world is just masking, right? So instead of masking words, as we often do in language, or parts of sentences, or the second sentence after the first sentence, you’re masking individual genes, and you say, “Hey, model, I’ve masked this 15%, or 25%, or 50% of the genes, and I’m gonna give you the other genes, and I want you to reconstruct.” So that’s probably the simplest formulation, but there are a lot of alternatives that you can do.
[36:17] And the cool thing now is that – you know, two years ago if you wanted to build some big transformer or big foundation model, you sort of had BERT as your template. But now even just the transformer world has completely blown up, and so we have BERTs, and GPT-3, and the perceiver, and lots of options to sort of choose from and customize, which is just – you know, at Immunai we’re not leading the edge on the brand new transformer architecture, we’re benefiting from other people doing this, like OpenAI, and Facebook, and Google, and DeepMind, and we sort of get to be like “Okay, yeah, this is the one I think that most benefits our application.”
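The gene-masking setup described above can be sketched in a few lines of Python. This only shows how a training example might be built (which genes are hidden and what the model would be asked to reconstruct); there is no model here, and the `mask_genes` helper, the use of `None` as a mask token, and the toy counts are assumptions for illustration.

```python
import random

def mask_genes(expression, mask_fraction, rng):
    """Hide a random subset of genes; return (masked_input, targets).

    masked_input has None at masked positions; targets maps each masked
    gene index to its true value, which a model would be trained to predict.
    """
    n_masked = round(len(expression) * mask_fraction)
    masked_idx = set(rng.sample(range(len(expression)), n_masked))
    masked_input = [None if i in masked_idx else v for i, v in enumerate(expression)]
    targets = {i: expression[i] for i in masked_idx}
    return masked_input, targets

rng = random.Random(0)
# Toy counts for a cell with 20 genes (made-up values).
cell = [0, 3, 0, 0, 7, 1, 0, 0, 2, 0, 0, 5, 0, 0, 0, 4, 0, 0, 1, 0]
masked_input, targets = mask_genes(cell, mask_fraction=0.15, rng=rng)
print(masked_input)
print(targets)
```

With 20 genes and a 15% mask fraction, three genes are hidden per example. A real pipeline would draw a fresh mask each time a cell is seen, so the model learns to predict any gene from the others.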
So I’m curious, because you kind of downplayed it a little bit at the end, but these are pretty fascinating approaches, maybe because we’re talking about it outside of some of the more common topics that you tend to have in the ML space in this way… But at the end of the day, you’re still having to run a pipeline, and there’s all these kind of practical ML tasks that you’re going to be engaging in… But with this interesting dynamic about the thing that you’re addressing specifically, both similarities and differences… So what have you learned about running a practical machine learning pipeline in a company that’s doing these kind of interesting techniques. It’s definitely a realm that most people are not thinking about machine learning in, and that has some of the uniqueness of that. I’m wondering if that’s given you some insights that might benefit all of us?
Well, I can say that – so in my own personal journey, I had a period of time where I did a lot of ML. Then I got burned out of ML, and I did no ML for four years. Just software and data engineering; building pipelines and getting data – I was a data plumber, and I loved it. Then I sort of missed the research in the ML stuff, and so I got back into it, and here I am. But I think that the biggest thing that I’ve learned is that when you have a problem that you want to solve, you always solve it without ML first, right? So you solve it with getting data, and you solve it by analyzing the data. And then once you do this, then you solve it with something really simple, like a logistic regression or an XGBoost. For classification tasks, these two things probably cover 80% of all the things you would want to do reasonably. I think it’s really hard, even at our current place, to sort of do the responsible thing sometimes… Because you know, you want to go train the fancy transformer, play with lots of data… But the problem is you can just burn so much time. Unless you’re one of these huge companies, it’s so easy to burn so much time getting infrastructure, training infrastructure, GPU, parallel GPU training up and running to support these big models, when often – you know, maybe you’re solving the wrong problem.
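The ordering described here (no model first, then something really simple) can be sketched as a hand-written rule followed by a logistic regression baseline. This is a toy under stated assumptions: the two "marker gene" features, the labels, the rule, and the hyperparameters are hypothetical, and in practice one would likely reach for an existing library implementation rather than this hand-rolled gradient descent.

```python
import math

def rule_based_label(cell):
    """No-ML baseline: call a cell type 1 if marker A outweighs marker B."""
    return 1 if cell[0] > cell[1] else 0

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic_regression(features, labels, epochs=500, lr=0.1):
    """Fit weights and bias by full-batch gradient descent on log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one cell."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0

# Hypothetical data: expression of two marker genes per cell, with labels.
features = [[2.0, 0.1], [1.8, 0.3], [2.2, 0.0],
            [0.1, 1.9], [0.2, 2.1], [0.0, 1.7]]
labels = [1, 1, 1, 0, 0, 0]

rule_predictions = [rule_based_label(x) for x in features]
weights, bias = fit_logistic_regression(features, labels)
model_predictions = [predict(weights, bias, x) for x in features]
```

If the rule and the simple model already agree with the labels, that is a signal to check whether the problem is defined correctly before investing in GPU infrastructure for a larger model.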
So solving the problem without models first, or just with data analysis… Actually, when I joined Immunai about two years ago, my boss who was the CTO, they said, “Okay, Drausin, I need you to just understand cell type annotation.” So I didn’t train a single model for six months. I just deeply understood the data, and analyzed it, and understood the problem domain… And it was actually really fun, because I got to work with our immunologists, and this was – I was sort of grumpy about not getting to get in and train models right away… But in retrospect, it was a really fortuitous thing, because now I just have a much richer understanding of the problem. And I think especially in bio, the problems are hard to define, and defining the problem well is the most important thing. And this comes with data and analysis before you have to do any ML.
[40:07] And along with that, what is this sort of interaction between maybe those with expertise in machine learning, or in even software engineering, and expert biology doctors, this sort of other side of things? How have you found kind of good synergy between the domain experts and technical experts?
It’s challenging, even when everyone wants it to work. This is one of the big differences in the bio-AI world, you just need a lot more different flavors of experts in order to make a therapy, to make something that’s going to help people. And so getting the immunologists, and the software engineers, and the data engineers, and the computational biologists, and the machine learning people to all communicate effectively is hard. And the only thing that works is there are two critical things. One, you have to find people who are interested in doing this. Find people who like getting out of their discipline areas. And not everyone is interested in doing this, and that’s completely fine. But I think in the bio world you have to be interested in what the immunologists are doing, and be excited by those problems, rather than just wanting to make your ETL code even better and more efficient.
So finding the right people and the right team is critical. And also, building the team such that – my sort of quantitative brain thinks about it like overlapping Gaussian distributions. So each person, or really I think team, each specialty has a higher density area of specialization. But if those densities don’t overlap, then you’ve got cracks, you’ve got holes where things get lost. So what you need is basically to form teams where the sort of tails of your distribution – I said a Gaussian, but maybe it’s a t-distribution, which has heavier tails – where they’re overlapping, and so it’s easier to speak each other’s language a little bit. And this is essential, and it’s again, one of the challenges of doing AI in the biospace, but it’s also one of the best parts about it, because you get to work with all of these brilliant people, who are working together towards a shared mission. It’s not just a bunch of engineers.
Now, I love engineers, and I am an engineer, but I wanted to work with other people who are not engineers. So that’s been a learning experience, for sure, but it’s actually one of my favorite things about the company where I work, Immunai, and also sort of the field in general. It attracts these hybrid people, who like to be in the marshlands, or in the hinter regions of these different specialties.
That was a perfect segue, because we’re kind of coming close to the end, and I want to ask you where this is going, in the sense of not only your organization, but the field at large. You’ve said that you are working, at least currently, on single-cell, but there’s the work that you guys are doing, and there’s the larger field, and it’s a fascinating topic, that’s very different from most of the folks that we talk to, in terms of how you’re applying it, and you’re kind of having to pioneer not only the techniques, but the identification of the problem sets to begin with. And so where do you see both your organization and the larger field going over the next few years? What’s possible here?
So we’re already seeing it, and what’s going to be happening is - think of it like assembly lines from patients, and cells, and animal, mice, to therapies in people. So this is the beginning and the end of the AI drug discovery, therapeutics, biotech assembly line… And what’s happening is little pieces of this assembly line are being productized. And because we’re getting enough data, we understand the problem well enough, they’re not being solved yet, right? And we’re far from commoditization, but they’re sort of coalescing. You’re starting to see some companies saying, “Oh, we have the first fully AI-generated therapy.” And I will tell you, and I feel okay saying this as an insider, there’s a lot of AI marketing that happens in the biotech space.
Yeah, but especially in biotech, where it’s white hot – there’s a lot of AI… So is it really a fully AI-controlled drug? No. There are a lot of people in the middle. But certain core components - the running of in vitro experiments in a Petri dish, the understanding of which patients are good for which therapies… These things, we are starting to have enough data where we can actually – the problems are blossoming into well-founded problems on their own. And one of the things that’s exciting that we’re beginning to see is people from outside of the biospace are getting excited by these problems. So we have people – and Immunai has collaborations with top-tier professors and research institutions who are not biologists, but are excited by the data and the mission that we have, and want to get in on it. And to me, the fields that can attract top talent to come and join the sort of fight to build out the core building blocks of this ultimate pipeline - this is very, very promising.
So what I think we’re going to see is just more maturation of the handful of problems that need to be solved in order to develop a therapy. And at Immunai ourselves, we are on this journey. We are one biotech company, we’re about 150 people, which is tiny compared to some of the big guys, but big compared to some tiny biotechs… And we’re trying to lead, be at the head of the pack here of identifying what are the problems that need to be wrapped up and solved cohesively. But there are a lot of people out there, and no one company is going to do it. But I couldn’t be more excited about the next decade. I think a decade from now – my goal, and I think many people at Immunai and probably lots of other people in the biotech world… You know, you look at HIV. Now, 30 years ago, a generation ago, HIV killed people, and you had months to live when you got HIV. And now it’s a chronic condition. In some cases, we’ve actually been able to cure it.
I think that with cancer, this is within striking distance. We can do this. And AI is not going to solve this. It’s not a magic wand. But the combination of AI, data, good people, amazing experimental methodologies - all this is going to come together; our better understanding of the immune system, I think, over the next generation… I want to be able to tell my grandkids that I helped solve cancer. And I think that it’s this sort of grand challenge, but… What kind of challenge do you want to work on in your life…?
Yeah, I think that’s a really encouraging and inspiring way to end my day, and this conversation. I’m really excited about the things that you’re doing, Drausin, and the whole team there. It’s really interesting to get a sort of view into AI within this space, and how it’s also being influenced by things happening elsewhere in the industry… So thank you so much for sharing these insights with us, and looking forward to following your work.
Thank you so much, Chris and Daniel. It’s been a real pleasure.