Practical AI – Episode #86

Exploring the COVID-19 Open Research Dataset

with Lucy Lu Wang from Allen AI

In the midst of the COVID-19 pandemic, Daniel and Chris have a timely conversation with Lucy Lu Wang of the Allen Institute for Artificial Intelligence about the COVID-19 Open Research Dataset (CORD-19). She relates how CORD-19 was created and organized, and how researchers around the world are currently using the data to answer important COVID-19 questions that will help the world through this ongoing crisis.



Welcome to another episode of Practical AI. This is Daniel Whitenack. I am a data scientist with SIL International, and I’m joined as always by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How are things down in Atlanta, Chris?

Doing very well down in Atlanta. I’ve got a bit of a cold, so I may cough my way through the episode, but other than that, doing great. Spring has sprung, it’s beautiful.

Hopefully just a cold.

Yeah, we’re just crossing our fingers. I took my daughter a couple of weeks – I’m hoping it was her cold… I took her to a pediatrician a couple of weeks ago; we actually had to go in, because it kept going… And it was a frightening thing to say “Well, it could be strep, it could be this, or it could be Covid-19. We can’t exclude that.” As a parent, that was like “Whoaa…” So I’ve gotten through that; just a cold, we’re good. Rolling forward.

Good, good. Well, we’re surviving here in lockdown in Indiana. It’s actually pretty nice outside. It’s mushroom season here, so there’s these wild mushrooms that come out in Indiana just around this time; they’re called morel mushrooms, and we go every year hunting… My family has some property that’s all forested, no one else is there, so we’ve found some good times just going out there and walking through the forest and getting outside. That’s been nice.

Yeah, that sounds nice, whether you find any mushrooms or not.

Exactly. Well, I guess a related topic - because it’s really the only topic these days, affecting us in all of our lives in a big way… We’ve had another episode a couple of weeks ago about the Covid QA system, which is a question and answer system related to Covid-19, and they were also using this dataset called CORD-19. Today we’ve got Lucy Lu Wang from the Allen Institute for AI, she’s a research scientist there, and we’re gonna be talking all about the CORD-19 dataset, the ins and outs and the story behind it. Welcome, Lucy.

Hi! Thank you, Daniel and Chris, for having me on this show.

Yeah, it’s great to have you here. I appreciate you joining us. This is, of course, a big topic. Everyone on Twitter and all around is talking about this dataset and how it’s being used, so we’re really excited to talk about it a little bit more here. But before we do that, I’d love to hear a little bit about your background, how you got into AI-related things and ended up at the Allen Institute.

[04:20] Sure, yeah. My background is maybe a little less traditional. I started out more in biomedical engineering and physics, and worked in a host of biomedical startup companies, creating medical devices. Over time, I was doing more simulations, and incorporating more data science and machine learning techniques into my work and found that that was very motivating for me… So I decided to pursue a Ph.D. in biomedical informatics, where I focused primarily on biomedical applications with natural language processing techniques, and creating models to try to connect these automated methods with the type of improvements in clinical care and biomedical text mining that we so desperately need these days.

When you’re talking about NLP for biomedical applications - are we talking mostly here about medical records, and doctors notes, or whatever that is, and trying to extract relevant information from those, and patterns, and mine those for useful things? Is that the main drive there?

That’s definitely one aspect of things. I am also very interested in looking into the scientific literature, trying to extract entities and relationships and useful information out of that body of work. I think that’s really what I’m working on at the Allen Institute for AI, too. I’m part of a team called Semantic Scholar; I think a couple of weeks back you had an episode about Semantic Scholar. It’s a literature search engine project.

For Semantic Scholar we’ve indexed 180 million papers. There’s a really rich corpus of texts to work with, and as part of the research team there I’ve created a number of tools and worked on a number of projects to understand more about the content of that text. And that’s kind of what brought us to the CORD-19 dataset. We have this underlying infrastructure for processing scientific texts, and we were asked to contribute some of that expertise to creating the dataset.

Awesome, yeah. And I’m curious – so I’ll definitely reference the Semantic Scholar episode in our show notes, so people can listen to that, because I think that provides a really good baseline for the sort of data that you came into this recent work having, which is a really great way to find and discover scientific information and related data across scientific literature, which is amazing.

I was wondering if you could comment before we jump into CORD-19 specifically – I know we’re in a really interesting time where a lot of people are publishing a lot of things about Covid-19 very rapidly. What does that situation currently look like? Are we talking thousands of papers, over how much time…? How rapidly are they coming out?

I think the scientific engine has really spun up to handle this current situation. As far as I know, there’s been more than four thousand papers released since January on Covid-19…

The number of papers continues to grow, but more importantly, the number of papers released every day continues to grow. We’re up to maybe several hundred new papers a day… And it’s kind of intimidating to look at this source of information and see what people are discovering.

[07:56] So the fact that you have so many new coming in every day - are you refreshing the dataset? Is the dataset static at some point in time, or is it something that you’re constantly updating and refreshing?

I guess folks already know what the CORD-19 dataset is, but it’s kind of a collection of papers about Covid-19 research, including historic Coronavirus research. So we have a collection of historic research, and then we also have all the new research that is being released daily. We update the dataset currently at a weekly cadence, but we are rapidly moving to a daily cadence, since there’s just so many new papers released every day.

Since we went that direction, I was wondering if you could maybe tell a little bit of the story of how this dataset came about. Obviously, you have data about scientific literature within Semantic Scholar, and you’re already doing certain things as relating to tracking entities or topics covered in those… How did the idea for CORD-19 come about? I know that there’s others involved in this, too. So there’s Allen AI, but there’s also Microsoft, and the Chan Zuckerberg Foundation, and others… So how did this come about?

Yeah, so the entire project is a coordinated effort by the White House Office of Science and Technology Policy. I think some time in early March a group at Georgetown, the Center for Security and Emerging Technology (CSET), reached out to us at Allen AI to help coordinate the release of this dataset, along with a couple of different organizations. You mentioned MSR (Microsoft Research), Chan Zuckerberg, Kaggle was also involved, and the National Library of Medicine, which is part of the NIH. So all these groups were going to come together to essentially create this dataset to help create text mining and information retrieval tools that could assist medical experts in understanding more of what was going on with the epidemic.

For Allen AI, the way that we got involved is we had recently created a new pipeline to revamp our open research corpus. We had a pipeline for essentially taking these paper documents, which are traditionally in a PDF format, not very easy for text mining, not very accessible, and converting them into a structured full-text format, where you could run these natural language processing models on them more easily.

So that’s our major contribution to the dataset, is the pipeline for both harmonizing the paper metadata that we’ve collected over the years, and also producing these structured full-text parses, so that we can run our models over that text.

I know one of the big things that we talked about when we talked about Semantic Scholar was the ability to find relevant data that might be buried in the wealth of scientific literature that we have, about a certain subject that is of interest. When you came to CORD-19, there’s the extraction of the metadata and the actual content of the paper, but then how do you even go about saying “These are all the papers related to Coronavirus”?

I’m a little bit ignorant on the subject, so you’ll have to forgive me. I know that Coronavirus is a family of things, it’s not just this Covid-19 which is associated with Coronavirus; there’s Coronavirus associated with the common cold, and all these things… So how do you go about saying “This is what we’re scoping down our dataset to”, and finding that, and along with that, deciding what you’re going to exclude, I guess, as well?

It’s a great question, and I think it’s a question with very open answers. So what we started with were a couple of trusted sources that we knew needed to be included in this dataset. Those sources were a collection of papers curated by the World Health Organization on Covid-19 specifically, and we also performed searches over PubMed Central, which is a biomedical paper repository run by the National Library of Medicine… As well as these pre-print servers, bioRxiv and medRxiv, which were publishing the latest research on Covid-19.

[12:20] We went out and collected papers from these sources using a set of keyword searches to make sure that they were relevant to both Covid-19 or the family of Coronaviruses in general. Because I think historical Coronaviruses like SARS and MERS are also extremely relevant in the current case.
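As a rough illustration of the keyword-based scoping described here, the following is a minimal sketch in Python. The keyword list, the `is_relevant` helper, and the sample records are all hypothetical; the episode does not give the actual search terms or tooling used to build CORD-19.

```python
import re

# Hypothetical keyword list; the real CORD-19 search terms are not given in the episode.
KEYWORDS = ["covid-19", "coronavirus", "sars-cov", "mers-cov", "2019-ncov"]

# One case-insensitive pattern that matches any of the keywords.
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_relevant(title: str, abstract: str) -> bool:
    """Return True if any keyword appears in the title or the abstract."""
    return bool(_PATTERN.search(f"{title} {abstract}"))


# Made-up records standing in for papers pulled from the source repositories.
papers = [
    {"title": "Structure of the SARS-CoV spike protein", "abstract": ""},
    {"title": "Soil bacteria in temperate forests", "abstract": "A survey."},
    {"title": "Clinical features of patients", "abstract": "We describe COVID-19 cases."},
]

relevant = [p for p in papers if is_relevant(p["title"], p["abstract"])]
```

A filter like this keeps historical coronavirus work (SARS, MERS) alongside Covid-19 papers, which matches the scoping decision Lucy describes, but a plain keyword match will also miss papers that use unlisted synonyms.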

I’m kind of curious, as you reached out and made the dataset available, and you look across some of your partner websites - Kaggle has a call to action, and stuff - and you’re trying to get AI practitioners and data scientists to focus on important questions that need answering for this purpose, how do you provide guidance in that way, for people who are gonna engage on the dataset? Is it something where people just grab it and do whatever they want? Is there any kind of organization across teams? There’s a lot of human factors involved in this, so how is that conceived?

Yeah, so there were a lot of challenges. For Kaggle, when we opened the challenge initially, the CORD-19 challenge, there was a set of ten slightly open-ended clinical [unintelligible 00:13:23.09] questions which were given to the community. And the engagement at Kaggle, the response we’ve received has been absolutely incredible. There’s been millions of views on the landing pages, the dataset has been downloaded many thousand times or more… There’s been lots of teams that have cropped up and self-organized to work on this dataset.

I think there’s a group called CoronaWhy, that’s like several hundred data scientists and medical experts who have bonded together to work on the CORD-19 dataset and other Coronavirus datasets. We really want to just offer support to these community members. So there’s a couple of sources of information that we’ve created to help facilitate these things. On Kaggle the forums have been super-active, there have been a lot of people answering questions for each other, including from the organizations that have created these datasets.

We’ve also established a Discourse to answer questions specifically about the CORD-19 dataset, so that’s a great place to get answers.

And finally, for these Kaggle challenges and for these shared tasks, one of the things that we’re really trying to do by hosting these shared tasks is to connect ML experts with the medical community, and experts who can judge the answers that are being retrieved and extracted by these machine learning experts, and see whether they have practical application in the clinic. So that’s been a challenge.

Lucy, you’ve just brought up something that I think is really interesting, which is the interaction between the AI community and the medical community… And I was actually wondering while you were talking about “Okay, this CORD-19 dataset exists, and I know I have some AI expertise, but I don’t necessarily have a lot of medical expertise, outside of knowing that I should wash my hands, and these other things, the top five that have been going around…” I guess I was wondering - as you’ve got more experience with this kind of intersection between the AI community and the medical community, what has that interaction been like in the past? Has there been much overlap between the AI community and medical practitioners? Then secondly, as we enter into this new CORD-19 challenge, has that changed in any sort of way, or been rapidly advancing in any sort of way?

I’ve worked at the intersection of these communities for a number of years, and I think there are a lot of great collaborations going on. I think a lot of folks in the computing community are incredibly motivated by these very practical questions that need to be addressed - ways to improve patient care, ways to help with drug development, or vaccine development, and questions of this nature.

As for Covid-19 specific initiatives - I can give you two anecdotes for ways that we’ve had medical experts interact with computing experts. For the Kaggle challenge, it seems that what is happening is a lot of people are developing different systems, different information retrieval, different information extraction systems, and those systems need to be reviewed by experts for usefulness.

In Kaggle there is essentially kind of like an army of medical students and other people who are willing to provide/volunteer their medical expertise, who are actually going through and manually reviewing a lot of the extractions that are coming out of these Kaggle challenges, and creating these living systematic review pages, with the answers to some of these questions. So if you go to the Kaggle page, you can see these reviews being created in real-time, and updated in real-time, as new literature is released.

Another thing that I’ve been involved in lately is we’re running a TREC challenge on this dataset. TREC is the Text REtrieval Conference, and it’s been a project at NIST, which is the National Institute of Standards and Technology, for the last 20 years. These folks are really good at information retrieval, and judging information retrieval systems.

The way that these systems are judged is by having expert medical annotators review all the results and provide gold rankings of what is most relevant to a query, and what is least relevant. So there is a lot of this incorporating experts in the loop, incorporating humans in the loop, to bolster our machine learning systems. And that is not something that we’re gonna be moving away from any time soon.
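To make the idea of scoring a system against expert relevance judgments concrete, here is a minimal sketch of one common information retrieval metric, normalized discounted cumulative gain (nDCG). The choice of nDCG, the graded labels, and the document IDs are assumptions for illustration; the episode does not say which metrics the TREC organizers use.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gains lower in the ranking count for less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, judgments, k):
    """Score a system ranking against graded expert judgments (0 = not relevant)."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical annotator judgments for one query: higher means more relevant.
judgments = {"doc_a": 2, "doc_b": 1, "doc_c": 0}

# Hypothetical system output, best guess first.
system_ranking = ["doc_b", "doc_a", "doc_c"]

score = ndcg_at_k(system_ranking, judgments, k=3)
```

A ranking that matches the annotators' ordering exactly scores 1.0; the ranking above swaps the top two documents, so it scores somewhat lower.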

[20:12] As you’re talking here, I’m looking at the various questions that are listed on Kaggle, the tasks to go answer… And kind of extending this thing about this collaboration between the AI community and the medical community - the questions themselves, where do they originate from? How were they decided as the important questions that we could all give a shot at going and answering with the dataset?

I might be wrong, but I believe this set of questions originated from the White House Office of Science and Technology Policy, in collaboration with Kaggle. You have to understand – so this challenge and the dataset, during the early days we literally had just a few days to turnaround this dataset, put it out there, and publish this challenge. We wanted people to start looking at this as quickly as possible.

A lot of the questions that you see on Kaggle right now are very open-ended, they can be interpreted in different ways, and as time has gone on, as we’ve learned in this last month, some of those questions are more useful in clinic, some of those questions are less useful. Clinicians already know the answers to some of these questions. So now, as we move into the second month of this challenge, there will be a new batch of questions released, to motivate new work… Questions that have not yet been answered by the community.

I’m curious with that - it seems like you could have various bottlenecks in this situation, and one of those I think you highlighted is this useful interaction between medical practitioners and the AI people that are trying to do something with the dataset… So I was wondering, do you have a sort of healthy community of medical practitioners that are very deeply involved in looking at what’s coming through Kaggle, or these other teams that are self-organizing? Because having worked at a non-profit for a bit, one of the things I’ve seen is people really get behind the social good challenge, and they work on a hackathon on the weekend, and then the project kind of dies.

So what are the ways for people – if I want to step into some CORD-19 related work, what are the ways that I can step into that, but also get connected with the right medical people to make sure that what I’m doing is useful, and not just kind of an interesting weekend thing?

Of course. I think now that Covid-19 has sort of taken over all of our lives, a lot of people are feeling very motivated to do something in this direction, contribute their skills. And I think I mentioned some groups earlier, groups like CoronaWhy, which is self-organizing to analyze this type of data. That’s a group made up of data scientists, machine learning experts, medical practitioners and so on. So it’s a good place to get feedback on one’s work.

And the Kaggle forum, similarly – I think there are a couple of threads out there, essentially continuing this discussion of how to connect to medical experts, how to verify results, and so on. A lot of people have taken it upon themselves to build systems; there’s tons of systems out there for searching CORD-19, for extracting information out of CORD-19… And to be perfectly honest, not all of those systems are gonna be highly usable or used by any kind of clinical audience… But we as a community need to essentially figure out which of those systems are most promising, figure out where to expend additional energy, more development time, and so on. And that’s where our annotators come in.

[23:59] Going back to what you said a moment ago, you have put this together so quickly, and gotten it out there, and the whole world has kind of dived into it; the whole world of data science, at least. And that’s very different from how most communities form, and I’m kind of wondering - with hindsight, totally recognizing that you had no choice, you had to get it out there, and you did a fantastic job with the kind of pressure that you were under… If you could go back, knowing what you know today, what are some of the things around community that you would like to have done, and that maybe going forward as you’re looking at the tasks and how to move us going into the second month, how to move us in the next set of directions that you need people to take, what are some of the ideas that you’re planning to implement there to evolve this process?

Let me try to unpack that… [laughs]

No worries… Any way you want is fine.

Sure. I think we’ve learned a ton over the last month. Speaking with some of our collaborators at Kaggle, Anthony Goldbloom was mentioning how Kaggle has fallen into this place with this challenge where they’re in new territory. The CORD-19 challenge is very unlike most of the challenges hosted by Kaggle, and there’s this very open-ended nature to it. We’re trying to discover answers, but there’s no sense of what a gold answer is. And they’ve reacted to that in a very wonderful way, by essentially harnessing these medical students as a resource to make judgments on people’s extractions, and putting an effort there, where it seemed like the results were most useful, or were going to be the most useful.

So I guess that’s one thing, which is trying to figure out as early as possible where the most useful results are, and putting in additional effort there, and maybe even abandoning things that are not worth pursuing. And then… I don’t even remember what the second half of that question was.

Things that you might be thinking going forward, but no worries. We can come back to that.

I mean, I have lots of thoughts on that as well… Certainly we’re gonna be supporting CORD-19 for as long as it makes sense to do so; certainly until the epidemic seems to wind down a bit. There have been lots of requests for additional features and additional content, so that is one of our priorities.

Additional content comes in two forms. One is simply providing more faithful parses of the papers. Right now, first things might be including things like inbound and outbound citations, tables and figures, other places where the answers might be. And these have been requested by a lot of folks.

Another is content in the sense of more papers. One thing that we’ve been very grateful for is that a lot of publishers have made their Covid-19 articles open access. And by making them open access, they’ve allowed us to release a dataset like CORD. But the fact is if you look at the dataset and at the papers that are being cited by the papers in the dataset, there’s actually a lot of content that is outside of this direct core set of articles on Covid-19 and Coronaviruses, and that is also very relevant to the content of the dataset.

So it would be great if we could work with publishers - or they could work with us, essentially - to provide additional content that could be useful for discovering information about Covid-19 and its treatments.

It’s like you’ve amassed this sort of central hub, but from that hub of papers - obviously, those papers cite other papers, and those papers cite other papers, and there’s other related work, and you can kind of go down a rabbit hole. I know we talked on the Semantic Scholar episode about this graph of relations and that sort of thing, as papers cite each other. There is a wealth of papers already in the dataset… What is the current size of the dataset? You mention a few different sources - could you give us a sense of the descriptive statistics of the dataset at this point, I guess?

[28:18] Sure. I probably should have started with that. So the dataset currently consists of more than 50k papers, and approximately 40k of those papers have full-text content available. These papers, as I mentioned, come from a diversity of sources. There’s a list of WHO Covid-19 papers, several hundred of those that have been curated by the WHO… And there’s pre-prints from bioRxiv and medRxiv; that’s numbers in the thousands. And actually, the vast majority of papers come to us via PubMed Central. This number will actually continue to grow, because many publishers are now depositing all Covid-19 content into PubMed Central.
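Descriptive statistics like these (total papers, share with full text, counts per source) can be computed from a metadata table in a few lines. The sketch below uses made-up rows; the field names `source` and `has_full_text` are illustrative assumptions and not necessarily the column names in the released CORD-19 metadata.

```python
from collections import Counter

# Made-up metadata rows; field names are assumptions for illustration only.
rows = [
    {"source": "PMC", "has_full_text": True},
    {"source": "PMC", "has_full_text": False},
    {"source": "bioRxiv", "has_full_text": True},
    {"source": "WHO", "has_full_text": False},
    {"source": "PMC", "has_full_text": True},
]


def summarize(rows):
    """Count papers overall, papers with full text, and papers per source."""
    by_source = Counter(r["source"] for r in rows)
    n_full = sum(1 for r in rows if r["has_full_text"])
    return {
        "total": len(rows),
        "full_text": n_full,
        "full_text_share": n_full / len(rows) if rows else 0.0,
        "by_source": dict(by_source),
    }


summary = summarize(rows)
```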

Lucy, I’m kind of curious… Recognizing that your responsibility has been putting this together and getting it out into the world, I imagine that you’ve probably talked with various teams, or at least observed some of the efforts and work that’s going on… So I’m kind of curious, what are some of the more interesting, or innovative, or *pick your adjective* of the efforts that you’ve heard of and kind of gone “Wow, that’s a pretty cool way of approaching this”? Any stories to tell us on that?

I think folks have done a really great job diving deeply into this dataset. There’s so many different search engines that have cropped up over this dataset. If you go to the dataset landing page on Semantic Scholar, there’s a list of maybe several dozen that we’ve heard of, and are enumerating there. I’m sure there’s many more that we aren’t aware of.

People are pursuing lots of different technologies for these search engines. Some are using the latest, state-of-the-art transformer models, newer models for ranking. I think Covidex from Waterloo and NYU is using the latest [unintelligible 00:30:14.16] T5 model; very cool stuff. And some of the search engines are actually using very traditional methods, using Lucene, or Elasticsearch, and focusing more on how to search and filter using entities, or other paper features.

A funny thing that we heard from Kaggle is that for many of the questions on the CORD-19 challenge, the simpler methods, the more traditional methods have actually worked better for extracting answers. This came as a surprise to me. [laughs]
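For readers unfamiliar with the "traditional" side of this comparison, the sketch below implements BM25, the lexical ranking function that Lucene and Elasticsearch use by default. It is a simplified, self-contained illustration with whitespace tokenization and made-up documents, not the code behind any of the CORD-19 search engines mentioned in the episode.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real engines use richer analyzers."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, length = self.tf[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Document indices sorted from best to worst match."""
        return sorted(
            range(len(self.tokens)), key=lambda i: self.score(query, i), reverse=True
        )


# Made-up documents for illustration.
docs = [
    "risk factors for severe coronavirus infection",
    "vaccine candidates for coronavirus",
    "morel mushroom foraging in indiana",
]
bm25 = BM25(docs)
ranking = bm25.rank("coronavirus vaccine")
```

Here the second document ranks first because it contains both query terms, and the unrelated third document scores zero. Exact term matching like this is cheap and transparent, which may be part of why it holds up well on precise biomedical vocabulary.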

I have a follow-up on that. For “working better”, what does that mean? Who out there is looking at the various results that are coming back from teams, and making those evaluations? Obviously, you’ve already said that you’ll be steering the new tasks on to Kaggle, learning what we know. Who’s making those evaluations and the decisions associated with that to keep everything focused?

I think this is primarily work that has been taken on by organizers at Kaggle, and medical students that they’ve had come in and evaluate some of these answers. So really they’ve put in a ton of effort in curating these results. As for the metrics that we use to judge these results, I think currently they’re mostly information retrieval-based metrics of success.

[31:51] To give people an idea, I’m on the Kaggle website, and I’m just looking at some of the tasks… Maybe for those listeners out there that don’t have a good idea of the scope of these, some of the tasks that are listed are “What do we know about Covid-19 risk factors? What do we know about vaccines and therapeutics? What has been published about medical care?” And if I dive into each of these, there’s a number of submissions, things like CORD-19 analysis with sentence embeddings, Covid-19 literature clustering, full-text search of research papers… And here’s another one - BERT SQUAD for semantic corpus search… So you get a sense of what you were talking about, Lucy, that some people are kind of going after these transformer-based, maybe extractive QA sort of things, others are maybe using full-text search capabilities that have been out there for a while, like Elasticsearch sort of capabilities, like we talked about on a previous episode…

Also, if I’m looking at the data, just to kind of have something in people’s mind, I’m seeing – of course, you have the categories of source, like bioRxiv, and that sort of thing… But if I’m looking at the individual papers, I can see the abstract of the paper, and the body text… And if I’m looking at the body text of this one, I’m reading things like “to assess the effects of truncation of the poly(C) tract on replication…”, which some of that doesn’t mean a lot to me… But this is the sort of data that’s in there.
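To give a feel for working with one of these structured full-text records, here is a minimal sketch that pulls the abstract and body text out of a JSON parse. The record below is made up, and the field names (`paper_id`, `metadata`, `abstract`, `body_text`, `section`) are an assumption about the layout; check them against the files in the actual release before relying on them.

```python
import json

# Made-up record; the field layout is an assumption for illustration.
raw = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An example paper"},
  "abstract": [{"text": "We study replication.", "section": "Abstract"}],
  "body_text": [
    {"text": "Background on the virus.", "section": "Introduction"},
    {"text": "To assess the effects of truncation of the poly(C) tract on replication, we ran assays.", "section": "Methods"}
  ]
}
"""


def paper_text(parse):
    """Flatten a parsed paper into title, abstract, and body strings."""
    return {
        "paper_id": parse["paper_id"],
        "title": parse.get("metadata", {}).get("title", ""),
        "abstract": " ".join(p["text"] for p in parse.get("abstract", [])),
        "body": " ".join(p["text"] for p in parse.get("body_text", [])),
    }


paper = paper_text(json.loads(raw))
```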

I was wondering, as a complete newb to a lot of this medical terminology, what are maybe some good ways – I know you’ve got this CoViz from the Allen Institute, which is helping explore some of the genes, and cells, and diseases and chemicals that are connected throughout the dataset… What are some good ways of onboarding into the CORD-19 work, for people that might be new to medical terminology? Are there ways to pick up some of that, or explore some of it and inform some of those connections in a reasonable way?

The domain knowledge…

Yeah, I think the domain knowledge is one of the greatest barriers for working on this dataset… And thanks for mentioning the CoViz project. That’s a tool that was released by Allen AI for exploring this dataset in a slightly more meaningful way. And for that tool, we essentially ran models to perform extractions of these entities from the text; entities of different classes, like drugs, genes, diseases, phenotypes, things of that nature… And created a visualization to allow you to browse the relationships that are most prevalent between pairs of these entities.

So that’s a great way to explore what’s in the dataset… There’s also just exploring the articles; we have a CORD-19 explorer to help you do that. But in general, I think unless you’re willing to spend a couple years of your life in medical school, it is very hard to understand what some of these terms mean. Certainly knowing what class or category of entity is being mentioned is important… So knowing something is a protein, knowing something is a receptor, knowing something belongs to a particular biological pathway - these are key for gaining an initial understanding of what is being said in a text snippet. But that’s also why we need medical experts to assess the actual utility of some of these extractions.
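One simple way to surface "the relationships that are most prevalent between pairs of entities" is to count how often two extracted entities appear in the same sentence. The sketch below assumes an entity extraction model has already run; the placeholder entity names and the `pair_counts` helper are hypothetical, and this is not a description of how CoViz itself is implemented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical model output: one list of (entity, class) pairs per sentence.
sentences = [
    [("GENE_A", "gene"), ("DISEASE_Y", "disease")],
    [("GENE_A", "gene"), ("DISEASE_Y", "disease"), ("DRUG_X", "drug")],
    [("DRUG_X", "drug")],
]


def pair_counts(sentences):
    """Count how many sentences mention each unordered pair of entities."""
    counts = Counter()
    for entities in sentences:
        # Deduplicate and sort so each pair has a single canonical key.
        names = sorted({name for name, _ in entities})
        for a, b in combinations(names, 2):
            counts[(a, b)] += 1
    return counts


counts = pair_counts(sentences)
top_pair, top_count = counts.most_common(1)[0]
```

Co-occurrence only says that two entities are mentioned together, not how they are related, which is one reason expert review of the extractions still matters.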

[35:45] I happen to have a daughter who is a 3rd year medical student… And I’ve told her very recently - because we had the other episode - about this, but she hadn’t been aware of it. Is there any need to connect with medical schools? Has anybody taken that on to try to gather those together, and stuff? Because obviously, there’s been an enormous effort in a very short amount of time, totally recognizing the constraints of the reality that we’re in today… Is that something that you all are thinking about, in terms of going forward, maybe for stage two, or whatever you wanna call it?

Yeah, absolutely. I’m sure your daughter knows that – I mean, I don’t know if medical school is continuing as usual, but I think during the 3rd and 4th years you’re mostly in clinic.

So I know of a lot of medical schools where there are these more senior medical students who really want to contribute how they can, but really aren’t able to be in clinic at this moment.

Yeah, they’ve been kicked out of the ER, for her. They’re working with the Health Department locally, and I think that kind of alternative work is really common right now for advanced medical students.

Yeah, exactly. So for the TREC task that I was mentioning that we’re hosting, we actually are enlisting medical students from a number of institutions - from the Oregon Health and Science University, and the University of Texas, and University of Washington, to help and provide annotations on some of these extractions.

So I think depending on where your daughter is, or where some of these medical students are, there’s probably gonna be other initiatives like this one, that really need their help. So I would definitely encourage anyone to look out for that.

I have one more follow-up to what you’ve just mentioned. Do you think, recognizing that and recognizing that we’re gonna be past this moment at some point, this very unique moment in our history - but just as the widespread introduction of open source software really changed the industry itself from being highly proprietary to being… You know, open source became not only a part of business models, but even an underlying part of a lot of commercial software that’s out there, and it fundamentally changed how that works… Do you think this is a moment, just because Covid-19 passes us and we get past this, that maybe there are other challenges? …whether they be things that we’ve been dealing with a long time, like cancers, or new things that may come - that this may fundamentally change how we attack really hard medical challenges with AI, and that integration with the communities that has happened out of necessity?

I’m certainly hoping that to be the case. We’ve definitely seen what people can do when they come together for a month or two… And it’s incredible. There’s so many people being engaged, and building interesting tools, and useful tools.

[38:46] I think there’s a couple of things that I’d love for us to be able to extend into the future. One thing is definitely publishers coming together to release more open access content on really important topics such as Covid-19… And then the community coming together, especially crossing boundaries between the computing community, the medical community and policymakers to really build something useful.

A little bit of a follow-up to that question - I know there’ve been things as I’ve worked on related work with SIL, where I was thinking “Oh, if I would have done this prior to this crisis, I would be able to do something better than what I’m able to do now.” In hindsight it’s easy to see those opportunities… I’m curious on your side, with your own research and work - I mean, I’m assuming you were working on various things related to Semantic Scholar prior to this crisis, and now your head’s down working on CORD-19, and getting this in shape… What are you interested in exploring in the future? Not necessarily CORD-19 related, but how has this whole process shifted what you want to work on in your own research in the future?

I think this brings up a really good point, which is we really became involved in the creation of this dataset because we had at Semantic Scholar built a bunch of infrastructure for scientific papers. A collaborator of mine, Kyle Lo, and I had also been working for nearly the past year on a way of essentially creating a full-text extraction pipeline for some of these papers that we are using in CORD-19 today.

So a lot of this was infrastructural work. It’s not particularly glamorous, but it is really important, and it really became more important in light of what happened in the last few months.

I guess one thing is infrastructural improvements can be really important, even if they’re not particularly sexy. And then going forward, there’s certainly things I care about besides creating datasets of papers… My research focuses on making scientific literature more available to biomedical researchers, and more understandable.

As you mentioned before, there’s so many entities, so many very domain-specific words and relationships that exist in the biomedical literature. And even for someone who is a domain expert, some of those terms can be very hard to parse through and understand.

So a lot of my ongoing projects are trying to create systems that understand particular types of relationships… For example, those that understand drug-drug interactions, or can mine them out of the literature, those that can understand medical images better… These are the types of projects that I’m hoping to continue to work on in the future.

Awesome. We’re for sure going to have the links to the main dataset website in the show notes, along with the Kaggle challenge and the various other projects and groups that you’ve talked about. I really appreciate you coming on the show and describing a bit more about the dataset, how it came about and your own work with it.

I’m really encouraged by the work that the Semantic Scholar team and collaborators are doing, and thank you for your hard work on this, and taking time to talk to us.

Thank you so much for having me, and we really hope that other folks are encouraged to contribute and become involved in this project.

Yes, please do.


