Daniel and Chris explore Semantic Scholar with Doug Raymond of the Allen Institute for Artificial Intelligence. Semantic Scholar is an AI-backed search engine that uses machine learning, natural language processing, and machine vision to surface relevant information from scientific papers.
Welcome to another episode of Practical AI. I'm Daniel Whitenack, a data scientist with SIL International. I'm joined, as always, by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin. How are you doing, Chris?
Hey, doing great, Daniel. How's it going today?
It's going pretty good; a busy week and lots to work on, which is good and tiring all at the same time. But mostly good. What about on your end?
Nothing much, just the usual work. We're finally back to some nice weather here in Atlanta, so I'm enjoying that.
Cool, cool. I know that in our last Fully Connected episode one of the things that we talked about was the increase in AI-related publications and also increases in publications on arXiv, and I think that's true in general. I alluded to the fact that we would be talking about that more in a future episode, and that's this episode. So I'm really happy to introduce Doug Raymond, who is joining us from the Allen Institute for AI, where he is the General Manager of Semantic Scholar. Welcome, Doug.
Thank you. I'm really glad to be here.
Yeah, great to have you. Before we jump into Semantic Scholar and all about scientific publications and searching them and all of that stuff, I'd be interested to just hear a little bit about your background and how you ended up at the Allen Institute, and working with AI and Semantic Scholar.
Yeah. Thanks, Daniel. My background is mostly in the product and business of artificial intelligence and machine learning. Before I was at the Allen Institute, I was at Amazon working on the Alexa machine learning platform. Prior to that, I've done a series of startups in the machine learning space and advertising, and commodity trading, and then had a five-year stint at Google, working on the AdWords platform.
Yeah, that's quite an experience with AI. What stands out to you over the years in terms of how AI or applying AI in the product sense or in an applied way has changed during that time?
Well, from my perspective, what's changed is that we've become, as consumers, more conscious about how these models are influencing our lives and replacing various aspects of human cognition.
[00:03:57.12] When I started out in the advertising space at Google - this is now close to 15 years ago - we didn't think of it as AI, we thought of it as an efficient way to match supply and demand. But as the models have gotten more sophisticated and more capable, we as business and product people are thinking more carefully about how we can actually help users with some problem. That's where the AI part of the technology becomes super relevant, because if you're not solving a problem that a user actually has, it's not really artificial intelligence; it could just be an interesting feature.
On the other side, I think the concerns about how these products and these AI models are using our data and potentially influencing us in unforeseen ways has become a much bigger part of what we think about and the considerations for what we build in the product.
So are some of those considerations as related to how we use AI and think about it - was that what motivated you to join with an organization like the Allen Institute? Maybe for those that aren't familiar, the Allen Institute is a different organization than a tech company like Google. So could you explain that and how you got involved with them?
Absolutely. So, I've been at the Allen Institute for almost two years now, and what motivated me to join was the mission. We were founded by Microsoft co-founder, Paul Allen, about five years ago, with the mission to build AI for the common good. So it's a core part of our mission to identify areas where AI can help the public in general. I've found that to be really compelling.
I definitely enjoyed startup life and working at Amazon and Google, and those were great experiences, but my real motivation, I think, to continue my career is to do something that has a positive impact, especially with so much political discord and challenge in the world, especially with respect to the environment and other areas where citizens really need to have accurate and relevant access to information to make good decisions. So that is a part of our mission, and it's something that seemed like a much more impactful use of my time than continuing to work on commercial products.
Got you. Just out of curiosity, do you have any insight into why Paul Allen wanted to make this investment into AI, especially at the time that he did? Any insight into that?
Yes. Paul Allen was a visionary man. He had a variety of passions and interests. When AI2 was founded– obviously, I joined later, but the story of AI2's founding is related to Paul's interest in how AI could solve really fundamental problems in terms of how people access information. So one of our first projects at AI2 is a project called Aristo, which is a project designed to create an AI model that could answer scientific questions in a conversational format.
We recently reached a milestone where the Aristo project is able to ace the eighth-grade New York Regents science test. So the vision that Paul set years before has resulted in an AI that can actually help answer scientific questions, and that project continues and is taking on new challenges.
With respect to Semantic Scholar, Paul's vision, and as expressed by our CEO Oren Etzioni, was there's so much scientific literature out there, and it's so difficult to access and understand what's relevant, that the cure for cancer might be latent in the scientific literature. But with AI tools, potentially, we could make the connections and allow scholars to discover those connections and lead to breakthroughs.
[00:08:13.11] So, I'm curious now that you've got into Semantic Scholar… Given the Allen Institute's mission and how it's structured in general, why would it be important or why should Allen Institute maybe be the one that provides this assistance in parsing through the scientific literature, versus maybe a for-profit organization or something like that?
Absolutely. So I think that Semantic Scholar exists in a really unique place. Of course, there are many tools out there designed to help access the scientific literature. But when you think about broad coverage tools, in the sense that they cover all scientific domains and try to solve this discovery problem in a generalized way, there's Google Scholar, there are tools like ResearchGate, and they don't have a really robust business model. So Google Scholar does continue to release features, but the pace of innovation has been quite slow, and our users always tell us that they want a better discovery experience than what they can find in Google Scholar.
Other smaller startups, like ResearchGate, have the imperative of the business model, so they tend to be focused on social networking aspects or other ways to generate ad revenue, and not really on solving this fundamental discovery problem that all scientific disciplines face, which is there's just an information overload in terms of the number of scientific publications that are published each year.
And then I guess, just to add to that, on the other dimension, there are a lot of special-purpose tools, which try to solve a problem in a particular scientific domain, and they tend to be point solutions, but aren't well integrated with other domains or the rest of the research lifecycle. So we think we're in a relatively unique place where our mission is to have the greatest impact possible on science, and that with Paul's backing, we're able to pursue that in a generalized way, which makes me think that we have a great opportunity to have a huge impact on the progress of science overall.
That sounds cool. So what would you say were the main problems that you were targeting to solve with Semantic Scholar when you started out? How were you trying to make it different from what was already out there and what people were using? How did you choose the type of interface that you wanted to realize that in?
Yeah, absolutely. So, let me start by talking about the problem that's core to our mission. We define the problem as information overload in science. The characteristics of that problem are that as the number of scientists around the world has grown and the number of research institutions and publications have grown, the number of potentially relevant scientific papers for each individual scholar to read has grown at an exponential pace since World War II.
So we're now at 3.5 million new publications each year, it grows about 5% or 6%… Yeah, it grows at 5% or 6% a year, and the number of new journals also grows, so the proliferation of new publications… And if you're a scholar in a particular domain, your ability to read papers is somewhat static, at least in the short-term. Our research indicates that the average scholar reads approximately 250 papers a year; the time they spend per paper is about 30 to 45 minutes, and that comes out to up to 15 hours a week just trying to understand what's new or relevant in their domain…
[00:11:58.23] So they don't really have a good way to read more papers without the help of tools like Semantic Scholar. So the way that we think of our solution to information overload is if the scholar's attention is fixed and we want to– or at least, the amount of time they have is fixed, we want to make it possible for them to overcome information overload by discovering the relevant papers much more easily and with much higher quality in terms of what they decide to read. And then we want to make it easier for them to understand what's interesting and salient to their research in each paper they read.
While you were talking, I was just contemplating some of the numbers that you mentioned. At least if I did my calculation right– so 45 minutes per paper, 3.5 million per year, that would take me about 300 years to just read all the papers for a single year. So obviously, no one's going to do that. You mentioned that scholars read about– what was it, about 250 or something per year?
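As a rough check on the arithmetic Daniel describes, the figures quoted in the conversation can be multiplied out. This is only a sketch: it assumes nonstop reading (24 hours a day, all year) and uses the upper end of the 30 to 45 minute range.

```python
# Back-of-the-envelope check of the reading-time figures quoted in the episode.
PAPERS_PER_YEAR = 3_500_000   # new publications per year, as quoted
MINUTES_PER_PAPER = 45        # upper end of the 30-45 minute range
PAPERS_READ_PER_YEAR = 250    # what an average scholar reads, as quoted

total_minutes = PAPERS_PER_YEAR * MINUTES_PER_PAPER
total_hours = total_minutes / 60

# Assumption: reading nonstop, 24 hours a day, 365 days a year.
years_nonstop = total_hours / (24 * 365)

# Share of one year's output that a single scholar actually reads.
share_read = PAPERS_READ_PER_YEAR / PAPERS_PER_YEAR

print(f"{total_hours:,.0f} hours, or about {years_nonstop:.0f} years of nonstop reading")
print(f"A scholar reading {PAPERS_READ_PER_YEAR} papers covers {share_read:.4%} of them")
```

Under those assumptions the total comes to roughly 300 years, which matches the figure mentioned above.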
So it's called Semantic Scholar, so is the idea really around a semantic or a text-centric, natural text-centric way to search through the literature?
Yeah. Partially, yes. If it's okay, I'll take a minute to explain why we're called Semantic Scholar. So you hit the nail on the head in terms of describing the challenge. 300 years' worth of reading every year is obviously untenable. So Semantic Scholar - we think of semantics as the science of how do we understand, extract the meaning from the scientific literature. When I talked earlier about the evolution of AI throughout my career, Semantic Scholar is an AI application because we're trying to use our AI models and technology to survey and read the papers in advance for you, so that as a scholar, instead of spending 300 hours reading a bunch of papers, most of which aren't relevant, you can focus on only the papers which are most relevant to your interest at that moment. That's our vision in terms of how AI can solve this problem of information overload.
I'm just curious, how do you match up the user who's using Semantic Scholar with that process? How do you know what is the right research and how to present it to them?
Absolutely. So we think of our product as having three core attributes that help the user find the relevant science. At a high level - and I can go into more detail in terms of how we use AI in each of these areas - one thing that we've done is create a very rich knowledge graph that represents all of scientific literature. So through mapping all the papers and citations and indexing full-text PDF of the scientific literature, we created a very rich representation of science; at this point, over 180 million scientific papers.
The second aspect of that, which is I think more related to your question around how does a scholar use us to find the relevant literature, is our discovery experience. Semantic Scholar is the initial experience; it's pretty much like a traditional search engine. However, because we've extracted the semantics from all the underlying literature, and it's in a structured knowledge graph format, it's much easier for the scholar to define their interest in terms of this area of science from these journals, in this date range, and have a comprehensive representation not only of what papers meet that interest, but all the other extracted information that we build with our models, such as the influence of the paper, how that paper has been discussed in social media, the associated datasets and GitHub repositories that are used in that research…
So we try to create a very rich representation of not only what's in that scholar's scope of interest, but within each paper what are the points that would allow them to understand, is this paper relevant, what is new and interesting about my area of interest that's expressed in this paper?
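The kind of structured query Doug describes (an area of science, a set of journals, a date range) can be illustrated with a small filter over paper records. This is a minimal sketch, not Semantic Scholar's implementation: the `Paper` fields, the `filter_papers` helper, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Paper:
    # All fields are hypothetical stand-ins for metadata stored in a knowledge graph.
    title: str
    field_of_study: str
    venue: str
    year: int
    citation_count: int


def filter_papers(papers: Iterable[Paper], field_of_study: str,
                  venues: Set[str], year_from: int, year_to: int) -> List[Paper]:
    """Keep papers in one field, from the given venues, within an inclusive year range."""
    return [
        p for p in papers
        if p.field_of_study == field_of_study
        and p.venue in venues
        and year_from <= p.year <= year_to
    ]


# Made-up records, for illustration only.
corpus = [
    Paper("Paper A", "Computer Science", "Venue X", 2019, 120),
    Paper("Paper B", "Computer Science", "Venue Y", 2015, 40),
    Paper("Paper C", "Biology", "Venue X", 2019, 300),
    Paper("Paper D", "Computer Science", "Venue X", 2012, 900),
]

matches = filter_papers(corpus, "Computer Science", {"Venue X", "Venue Y"}, 2014, 2020)
for paper in matches:
    print(paper.title, paper.year, paper.citation_count)
```

Because the metadata is structured, extra signals such as citation counts or linked repositories can be shown alongside each match rather than being used only for ranking.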
So, Doug, it was really great to hear about the underlying structure of Semantic Scholar and how you start to access its features, but then get some of the interesting benefits of it. I was wondering as you were talking through that – so there's this sort of knowledge graph that you mention, and then there's the search and discovery… I don't know if it's right to call it a recommendation or notification type of stuff. I was wondering – at least I'm aware that there's some work going on pretty widely around using AI to generate or automatically build knowledge bases or knowledge graphs, so I'm wondering if that's one place where you're utilizing AI to extract this and automatically build the knowledge graph, but then it sounds like maybe there are other opportunities for AI usage on the user interactivity side… I was wondering how much effort you are placing in those two areas and where you think the main benefits - or at least in this application - are for AI, at least that you've leveraged so far?
Sure. So it's true that we do use models to build our knowledge graph, and there are some efforts going on there to increase the quality and coverage of it. In terms of what are the areas that we think are most exciting to us from a research standpoint, we're focused on this discovery experience, in terms of how do we help you identify what's new and relevant. We have several research efforts in terms of creating a personalized representation of what's new, of creating explanations or recommendations that are actionable in terms of how we explain to you why we've recommended particular papers. We have a number of other research areas there that I'd love to talk about, that I think are quite exciting.
Yeah, definitely. I'd love to hear more about those. I know specifically – because I've used Semantic Scholar, and maybe you can describe this a little bit more in how it fits into some of those discovery things… But as you're searching things, you can, I believe, tag certain content and create tagged collections of content that you're looking for and organizing.
Does that fit into this discovery model?
It does. We have a library where we enable our users to organize and tag their research. I think that's part of how we help them use it in their work and have a greater impact in their work with the research they found through Semantic Scholar. There's also a lot of work we've done to help you identify out of the thousands of new papers published each day, which of those are relevant to me, which of those are worth reading. That's another area of research for us.
[00:20:04.18] So does that fit more within a traditional recommender system thing? Or in what ways are those being generated? One of the things that are going through my mind too is, it seems there are these giant benefits to this approach, but then also, if you're amplifying certain signals within the scientific community, you have to be pretty right on with those, because you could – like you said, the cure for cancer could still be sitting somewhere below these amplified signals.
Yes. So maybe I could talk about how do we recommend papers. If I understand your point, Daniel, there's a phenomenon in science which I would describe as "the rich get richer", which is if you have an institutional backing, if you're publishing your papers in prestigious journals, and you're generating a lot of citations, that citation counts can be used as a proxy for quality. Those scholars who don't have the institutional backing, don't have as many citations, their research will get overlooked. I think we are very conscious of that phenomenon, and it is one of the challenges we hope to overcome with our approach to discovery and recommending papers. So I could go a little bit more into that…
Yeah, that'd be fantastic if you would.
Yeah, absolutely. So I think in the Semantic Scholar experience, we do use citation count as an interesting piece of metadata. But from our perspective, one of the challenges with the growth of the scientific literature is that there are great scholars out there and great science being done in places where they aren't in a prestigious conference or a prestigious institution or published in a prestigious journal, and so therefore they may be overlooked if that's the only thing you're looking at. So for us, we really think a lot about how can you discover the science that's relevant before it has a rich citation history or science that's relevant without a citation history.
One of the big efforts that we've made is trying to understand the relevance of papers at a very fundamental level. That starts with us, at the language model level. I think perhaps in a previous podcast you talked about some of our work on different language models. At AI2, we developed a language model called ELMo, which was subsequently developed further into a model called BERT, through Google. We have created a pre-trained language model called SciBERT, which is trained on three billion words from a host of scientific documents, so it is particularly good for trying to understand what a paper is about in a way that a model trained on Wikipedia or other texts would not be quite as good at. We've used that scientific language model, which we call SciBERT, to build a host of discovery experiences that helps scholars find relevant papers, even if the citation count doesn't necessarily indicate that that paper is highly regarded.
One of the things we've done with that is create a personalized feed of papers and sort of a Spotify for research, if you will, which is available in Semantic Scholar now. The idea here is that if we use this language model, and create a neural network to understand the similarity of papers, we then allow the user to indicate what they like and what they don't like. Through that process, in a few clicks, they can create a highly relevant feed of research papers that are tuned exactly to their personal interest in a way that you could never do it with a search engine or just looking at citation counts.
How do you make that available to the user? How are they able to actually specify what their interests are?
Certainly. So the way the product works is you go to Semantic Scholar, you select papers that you believe are relevant to you. You can use the traditional search interface to do that or just type in the title of the paper you already have high regard for. Then we will automatically generate recommendations of related works.
[00:24:21.01] By indicating 'I like this paper; I don't like this paper,' you can tune that feed in real-time to be highly relevant to your interest, so you're only seeing papers that are directly relevant to the interest you're pursuing.
I'm curious, in this process, how long approximately, or how many papers do I have to go through and tag before I start seeing some of this benefit? Also on that front, is this amplified between users? So I guess there's personalization at the user level, but there's also, in science, there are communities of people that are working together and collaborations… Does that fit into the recommendation at all? Or is it mostly at the language model level?
Yeah, at this point it's really at the language model and what that individual user has indicated. I think what you described is very interesting in terms of how do we build recommenders that are of service to a community. But what we've done so far is create a model that looks at paper similarity based on SciBERT, a model trained on scientific text, and then tune it so that scholars can get papers that are relevant to their interest.
In terms of your initial question, how many papers does it take, what we find is that most of our users are able to get a highly relevant feed by rating between three and five papers. Depending on how generalized or specific your interest is, it could take more. So if you have a very specific, narrow interest, you might need to rate more papers before your feed becomes highly, highly relevant. But in most cases, it takes a minute or two to identify three to five papers that match your interest.
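One simple way to picture a feed that is tuned by a handful of "like" and "don't like" ratings is Rocchio-style relevance feedback: average the liked paper vectors, subtract a fraction of the disliked average, and rank candidates by cosine similarity to that profile. This is a swapped-in textbook technique used here for illustration, not the model Semantic Scholar uses; the toy three-dimensional vectors, the `beta` weight, and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_profile(liked, disliked, beta=0.5):
    """Rocchio-style profile: liked centroid minus beta times disliked centroid."""
    positive = mean(liked)
    if not disliked:
        return positive
    negative = mean(disliked)
    return [p - beta * n for p, n in zip(positive, negative)]


def rank_feed(candidates, liked, disliked):
    """Order candidate paper ids from most to least similar to the user's profile."""
    profile = build_profile(liked, disliked)
    return sorted(candidates, key=lambda pid: cosine(candidates[pid], profile), reverse=True)


# Toy embeddings; a real system would use vectors from a language model.
candidates = {
    "nlp_a": [0.9, 0.1, 0.0],
    "nlp_b": [0.8, 0.2, 0.1],
    "bio_a": [0.1, 0.9, 0.1],
    "vision_a": [0.1, 0.1, 0.9],
}
liked = [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0]]   # two papers the user liked
disliked = [[0.0, 1.0, 0.0]]                 # one paper the user rejected

feed = rank_feed(candidates, liked, disliked)
print(feed)
```

With only three ratings the two NLP-like papers rise to the top and the paper closest to the rejected one sinks to the bottom, which is the intuition behind needing just a few ratings for a broad interest.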
So I'm also curious – because when I'm thinking about this similarity matching with the language model, I'm thinking about it like, "Oh, you provide input data to this language model, you get out some representation, maybe you compare distances" or something like that, but it's not really related, in my mind, to the sort of graph-structured data that you mentioned before. Are those both utilized in– is the graph stuff mostly utilized for just search? Like you type in a query and then you get entities out of that and match those entities to entities in the graph, versus the language model is utilized mostly for recommendation? Or is there any interplay between those?
There is. To be clear, the graph structure is our core data structure. So when you search Semantic Scholar, you're essentially trying to identify a vector within this knowledge graph that's within your scope of interest. For the recommendation experience, the adaptive recommendations I was describing, we do use the graph information, but what we do is use a citation graph of these different papers as a feature in that similarity model. So understanding what papers have cited each other helps determine how close they are in that similarity space based on what the user has indicated is of interest to them.
Gotcha. So am I correct in saying the language model would give you a learned representation of a paper, and then you're matching that in terms of distance in some space and using a feature from the graph like the citations you're mentioning to further refine that? Or is it different than that?
[00:27:53.04] I think that's pretty close. Yes, in the sense that the language model is just how we understand what the paper is about. In terms of understanding how similar one paper is to another, the language model, the SciBERT is one aspect of understanding, "Okay, here's the vector that represents the meaning of this paper." But similarity is also indicated by the citation graph, a neural model that we built to map those papers in some vector space to understand how similar they are. Then the aspect that makes this a personalized experience is the user being able to indicate what papers are of interest and are not of interest. That becomes an input to the model to define what papers should be presented.
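The idea of combining a text-based vector with citation information can be sketched as a weighted blend of two scores. Doug describes a neural model; the fixed weighted sum below is a deliberately simpler stand-in, and the `text_weight` value, the use of shared references (Jaccard overlap) as the citation feature, and the toy data are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap between two sets of cited paper ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def paper_similarity(p, q, text_weight=0.7):
    """Blend text similarity with citation overlap (a stand-in for a learned model)."""
    text_sim = cosine(p["embedding"], q["embedding"])
    cite_sim = jaccard(p["references"], q["references"])
    return text_weight * text_sim + (1 - text_weight) * cite_sim


# Toy papers with identical text vectors but different reference lists.
paper_a = {"embedding": [1.0, 0.0], "references": {"r1", "r2", "r3"}}
paper_b = {"embedding": [1.0, 0.0], "references": {"r2", "r3", "r4"}}
paper_c = {"embedding": [1.0, 0.0], "references": {"r9"}}

sim_ab = paper_similarity(paper_a, paper_b)
sim_ac = paper_similarity(paper_a, paper_c)
print(round(sim_ab, 2), round(sim_ac, 2))
```

Even when two papers read alike, the citation term separates the pair that draws on the same prior work from the pair that does not.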
So I'm curious, as we are talking through the ins and outs of discoverability in Semantic Scholar and also how things are working under the hood, I got to thinking, given that you've processed so many scientific papers, are there any efforts within Semantic Scholar to analyze in a more exploratory way the scientific community as a whole? I'm guessing from this knowledge graph, you're able to maybe extract collaborations and other things that might be under-described in a general search interface if you're just searching on a journal or something like that, where you're representing more about a paper than is known… So have you explored anything like that in terms of exploring what collaborations exist and how to represent those? I was thinking of duplicate or highly related work in terms of reviewing new work, and that sort of thing - are there any efforts like that?
Yes, that is an interest of ours. A few efforts come to mind. So one thing that we've done recently is use our summary of authors as a way to help conference organizers disambiguate reviewers. This is a problem in academic conferences where, if many papers are being submitted, you need to select, in an optimal way, the people who will review those papers to decide whether they will be accepted at the conference.
You can imagine how that problem becomes exponentially more difficult as the number of papers submitted increases exponentially year over year. The submissions to conferences in computer science have grown multiple-fold just in the last five or ten years in fast-growing areas of AI. So one partnership we did in the fall was with ACL, which is a large conference that's going to be based in Seattle this year. They use the Semantic Scholar knowledge graph to help disambiguate reviewers from the papers that are being submitted for review.
Because you don't want someone that you've co-authored with or someone who's potentially on your same faculty to be reviewing your paper for a conference, because it creates conflicts of interest. That's an example of a recent application. I could talk about the other uses of Semantic Scholar in terms of understanding science overall…
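The two conflict rules mentioned here (past co-authorship and a shared institution) can be sketched as a filter over a reviewer pool. This is an illustrative sketch only, not the process ACL or Semantic Scholar actually used; the data structures, function names, and identifiers are hypothetical.

```python
def has_conflict(reviewer, submission_authors, coauthors, affiliation):
    """Return True if the reviewer is an author, a past co-author, or a colleague of one."""
    for author in submission_authors:
        if reviewer == author:
            return True
        # Check both directions in case the co-author map is not symmetric.
        if author in coauthors.get(reviewer, set()) or reviewer in coauthors.get(author, set()):
            return True
        reviewer_org = affiliation.get(reviewer)
        if reviewer_org is not None and reviewer_org == affiliation.get(author):
            return True
    return False


def eligible_reviewers(pool, submission_authors, coauthors, affiliation):
    """Keep only the reviewers with no detected conflict for this submission."""
    return [r for r in pool
            if not has_conflict(r, submission_authors, coauthors, affiliation)]


# Made-up identifiers, for illustration only.
coauthors = {"reviewer_1": {"author_1"}}
affiliation = {"reviewer_2": "University U", "author_1": "University U",
               "reviewer_3": "Institute V"}

eligible = eligible_reviewers(["reviewer_1", "reviewer_2", "reviewer_3"],
                              ["author_1"], coauthors, affiliation)
print(eligible)
```

The hard part in practice is the step this sketch takes for granted: knowing that two name strings refer to the same person, which is where a disambiguated author graph helps.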
Oh, please do. That would be good. I was actually gonna ask something very similar to that.
[00:32:14.04] Oh, absolutely. So because we've created this rich representation of science, it allows us to do what we would describe as meta-research, in terms of the trends in science and what potential opportunities or challenges are emerging. We've published articles about the growth of open access publishing, about the trends in gender equality in computer science publishing; we looked at biomedical research and identified areas of bias in clinical studies… And in each of these cases, because we've created this rich and structured representation of science, we're able to do research on orders of magnitude larger datasets than any previous research.
So do you have any– I'm just curious. You just started to go there, but my mind's wondering on different possible use cases of that… Because really in your structured graph you're capturing the shape of science, if you will, based on the papers and the citations and where it's flowing and where it's not flowing a little bit… Do you have some ideas when you're talking about that meta-analysis? Any thoughts on different use cases where you guys have thought that that would be particularly useful?
Yes. So we've already published a number of studies where we've identified areas of bias or potential insights of social impact, for instance gender equality in computer science… I think in the future we'd like to do more of these studies and focus them on areas where we can identify opportunities to increase the impact of science overall.
A big area of interest for us is climate change research, in terms of what's being funded, where are the areas that are either overserved or underserved, and how can we surface that information in a way that helps scholars, but also potentially policymakers or politicians invest in the areas that can have the greatest impact?
I'm curious as well, as you've done this work, are there – and you've obviously processed a lot, so there's a lot there already, but I was wondering if there are certain areas of science or areas of research that are harder to probe with this approach than others. So I'm thinking– I work with a bunch of linguists at my organization, and I've found that there's all of these archived systems that are really hard to access and search and all of those things, but that's where a lot of the linguistic research is, and it's all documented in really odd and conflicting ways in terms of what languages it applies to, and all these things… So I was wondering if there are systems like that or areas of science that have proved harder to integrate with this approach? What sort of ways you're approaching the diversity of how science is represented for different areas?
Yeah… I would say that there are definitely opportunities to increase our coverage in certain areas of science. At the highest level, we are optimistic that our generalized approach seems to work pretty well across all domains of science that we cover.
[00:35:48.12] There are definitely issues that you alluded to in terms of older publications, where we may not be able to get access to a PDF, or the data that allows us to figure out how to integrate it in a knowledge graph is hard to come by, but I wouldn't say that there's an obvious major problem to overcome. There are a lot of smaller problems to overcome, which we address in our planning based on how much impact we think we can have for scholars.
Yeah, and I guess that there are various considerations in terms of how active and rapidly developing areas of science are, and how they may be applied to certain things that the Allen Institute is also interested in, like climate change and that sort of thing… I guess you have to start somewhere and put your effort somewhere, but I think probably just assuming that you can get a PDF, that covers a large majority of cases, is that right?
Yes. A lot of our effort is focused on partnerships with the major academic publishers on integration with some of the major preprint servers like arXiv, and open access journals. So if we get a high-quality PDF, in most cases, we're able to fully index that content and make it discoverable to our users.
There are other challenges in terms of quality of extraction and how our models work to fully extract the content from different fields of science, but they tend to be fairly minor to the challenge of just getting access to the science and making it possible for scholars to discover it.
Awesome. Well, I was curious, in addition to the Semantic Scholar product and the discovery tool, I think, if I'm not mistaken, some of what has happened within Semantic Scholar has been open-sourced, in terms of things that people can use… So are things like pre-trained models, and maybe tools– I think there's a PDF parsing tool, is that right? What's come out of Semantic Scholar in terms of open-source things that maybe others can build on as they're thinking more about science, and PDF parsing, and those things?
Absolutely. I think that's a great illustration of what makes AI2 and the Allen Institute for AI and Semantic Scholar special, is that our mission is to have a positive impact on society at large. So most of the things that we build, that we think are valuable and unique, are available as open-source projects. So our knowledge graph, the aspects of it that we can release that are not restricted by the various agreements we have with publishers, is something we release as a public resource for the research community. This language model SciBERT that I mentioned earlier is available on GitHub… So we try to make it possible for others to build on our work.
[00:38:53.15] In addition to that, we have a public API where other projects can access Semantic Scholar features and our knowledge graph to further science in their own way.
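For readers who want to try the public API mentioned here, a request might look roughly like the sketch below. The endpoint path, the `query`, `fields`, and `limit` parameters, and the `data` key in the response are assumptions based on the public paper-search API and may have changed since this episode; check the current API documentation before relying on them. The network call is defined but not executed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public paper-search API; verify against the current docs.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, fields=("title", "year", "citationCount"), limit=5):
    """Build a search URL; the parameter and field names are assumptions."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return SEARCH_URL + "?" + urlencode(params)


def search_papers(query, limit=5):
    """Fetch matching papers. Requires network access and may be rate limited."""
    with urlopen(build_search_url(query, limit=limit), timeout=10) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": [{"title": ..., "year": ...}, ...]}
    return payload.get("data", [])


# Only the URL is built here; call search_papers(...) to actually hit the network.
url = build_search_url("scientific language model")
print(url)
```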
So I guess as we wind up, and to bring it back to a very practical side of things, I'm kind of curious, as users - maybe they've heard the episode here and they decide to try it out and get into some of the features that we've been talking about, what should they be expecting on your roadmap ahead? In the relatively near term, over the next few years, where do you expect to grow the product, so that they can take advantage of some of the things we've talked about, as you've talked about these big problems that you're trying to tackle and expanding that?
Absolutely. Our vision is to be a solution for information overload, so we'd like scholars to come back to us whenever they're trying to understand what the scientific literature says about some issue that is of interest to them. So a lot of our work in the upcoming year and beyond is around making that discovery experience higher quality, adding new models and new AI-driven features that allow you to understand the highlights of a paper, to understand the intent and to get a summary of it in a very succinct and high-quality way… By doing that, we hope that we can help every Semantic Scholar user have a higher impact in their work.
They'll be able to spend– if they still want to spend 15 hours a week, they can do that, but it'll be a much higher quality of reading. If they want to spend less time, so they can focus on other aspects of their work, we hope to enable that, too. So our research is really designed to make information overload a problem of the past and allow scholars to focus on what they can do best, which is delving into new unknown areas of science and creating breakthroughs.
Awesome. Well, I'm excited about that future. Definitely, I'm excited about Semantic Scholar, and a lot of the things that the Allen Institute for AI is doing. I know we had Joel Grus on a previous episode, talking about AllenNLP, which I've used personally… So thank you so much for working on Semantic Scholar, but also pass along our thanks to the Allen Institute for all the great work that they're doing and the contributions to the community.
I think it's a really great thing to see so many efforts that have made contributions, practical contributions that people can use. So yeah, thank you so much. Thank you for taking time to join us.
Excellent. Thank you, Daniel. Thank you, Chris. It was a pleasure.