Practical AI – Episode #27

IBM's AI for detecting neurological state

with Ajay Royyuru and Guillermo Cecchi from IBM Healthcare


Ajay Royyuru and Guillermo Cecchi from IBM Healthcare join Chris and Daniel to discuss the emerging field of computational psychiatry. They talk about how researchers at IBM are applying AI to measure mental and neurological health based on speech, and they give us their perspectives on things like bias in healthcare data, AI augmentation for doctors, and encodings of language structure.





Well, hey Chris. How are you doing?

I’m doing fine. How’s it going today, Daniel?

It’s going really well. I’m still in the midst of grading for my Purdue class, but I see the finish line. How about with you?

Just started the new job at Lockheed Martin a couple of weeks ago and have been heads down in that, and obviously the holiday season is coming up with the family, so… A great time of the year.

Yeah, definitely. Today we actually have two guests from IBM healthcare. I’m really excited that Ajay Royyuru and Guillermo Cecchi are joining us. Welcome, guys!

Hey, hi!

Hi, Chris. Hello, Daniel.

As I mentioned, they’re both with IBM healthcare. Ajay is a VP of IBM healthcare and life sciences research. Guillermo is a principal researcher of computational psychiatry and neuroimaging. I’m really excited to hear about what they have to tell us here on Practical AI today and how AI is related to healthcare and psychiatry and mental health. It’s gonna be a really exciting show, but before we jump into those things, I’d love to give our guests a chance to introduce themselves and give us a little bit of background about how they eventually got to this place of integrating AI and healthcare and psychiatry. Ajay, do you wanna start us out?

Sure. Thanks for the opportunity to chat. This is Ajay. I am leading our healthcare and life science research portfolio at IBM. I’ve just completed 20 years working at IBM.


Thank you. My background is in molecular structural biology. Prior to coming to IBM I was a post-doctoral scientist at Memorial Sloan Kettering Cancer Center, but that was a while ago. Moving to IBM, a lot of my research interest has become entirely computational, so the work that I do now is actually at the intersection of healthcare biology and all things information technology.

It’s really interesting how you’ve kind of gone through that path and eventually landed at all of these integrations of computation and IT. I’m excited to hear more. Guillermo, do you wanna give us a brief intro? How did you get eventually into this world of computational psychiatry?

My background is in physics and neuroscience, but I was always interested in philosophy, and then after completing my Ph.D. I did a fellowship in psychiatry before coming to IBM… Naturally, mental health became very clearly for me an intersection between all of my interests, so this is what I’m doing now - just trying to understand how we put together mental health with AI.

How did you really decide that mental health was a good target to start using AI technologies on?

Well, one clear reason is that mental health needs it, right? If you look at the daily practice of mental health, it’s very constrained by the fact that you have neurologists, psychiatrists, healthcare providers that need to make judgments about the mental state of a patient, or a prospective, possible patient… And the way it’s done today relies to a large extent, among other things, on the interaction between the patient and the clinician/person who evaluates them. That interaction is, to a very large extent, determined by language patterns - how the patient is speaking to the clinician.

Outside of mental health we have an incredible wealth of tools to study language, that at the moment unfortunately are not being used for the purpose of helping clinicians do the evaluation, and actually in the end helping patients to have better healthcare. There’s a really dire need of help from mental health practitioners, and that’s perhaps the main motivation.

Yeah, it’s interesting that you’ve brought up the idea of analyzing language, because actually when this topic was first brought up to me, I guess it wasn’t the first thing that came to my mind. I was thinking, “Oh, we’re studying mental health computationally - maybe we’re studying brain waves, or something like that”, but from what you’ve said it’s the motivation to combine, like you said, these NLP techniques and AI with language as related to mental health; is that really spawning from the patterns that you’ve seen in clinics? They’re using language as a primary means to measure and identify mental health issues - is that the primary motivation, or was it because maybe you also are able to get data more easily than some other ways, or something?

Well, yes, of course, it’s in principle easier to get speech data and language data in general, because we don’t need any special machines to do that, but fundamentally - you were talking about brain waves… Well, speech is a brain wave, and it’s very important, because it is important for our behavior. This is one of the most essential tools that we humans use to interact with the world and with each other, and it’s a very clear way in which most psychiatric conditions, but even neurological conditions, are expressed. Disrupted patterns of behavior go hand in hand with disrupted patterns of language.

In some cases it’s obvious, like in psychosis. It’s directly mapped to language. But we see that even in conditions such as Parkinson’s there is a clear trace of the disease in the language patterns that are produced by the patients, and in other cases even the language patterns that can be or cannot be processed by the patients. So it’s more than just availability of the data, it’s just really at the core of what defines a mental dysfunction.

Ajay, could you tell us how you’re tying together this process, these techniques of using NLP for speech into kind of a practical – I mean, what is your goal here, what are you actually trying to produce in terms of usability?

Yeah, so we should really talk about how this becomes very practical, but just to examine the context first - the clinical encounter that used to occur entirely in the clinic, where the individual, let’s say a patient, is actually coming with a scheduled appointment, is meeting with an expert practitioner, and they’re having a dialogue or a clinical exam, and clinical evidence is gathered in the course of that discussion, maybe through a physical exam, or a psychiatric evaluation, as Guillermo was explaining… So that’s a typical clinical encounter.

It used to be that that was the only way in which you the practitioner would know something about the patient. But what has occurred in the last decade or so is with the availability of many different forms of technology, including audio recording of speech, we are actually able to take the evidence gathering from the clinic to something that is of similar quality, but outside the clinic. It allows the observations to move from an episodic encounter in the clinic, to possibly a more continuous measurement that is occurring in addition, outside the clinic as well, in the life of the person.

So this is not necessarily a clinician that is using this mobile app that you’re talking about. So this is used outside the clinic by non-medical personnel, non-medical people, between clinic visits? Is that accurate?

It is being used by a clinician, let’s say, to do a research study involving human subjects, but instead of just observing and recording while in the clinic, a clinician is actually able to use technologies like a speech recording device on a phone to actually observe outside the clinic as well.

So it’s still under the direction of the clinician, essentially, but is it fair to say that the person who is being measured is also using it outside the clinic environment between sessions?

Right. The sessions could be anytime during the day, could be initiated by the subject, and a conversation is happening; it could be a monologue or a dialogue that is getting recorded, and then being analyzed by the techniques that Guillermo will describe.

It extends the observation window from the 20-30 minutes in the clinic to the entire day, from in the clinic premises to wherever the subject is. And as we all know, when you have a mental health condition, sometimes even showing up for an appointment in the clinic is not something that will achieve 100% compliance. So extending the observation - physical location, as well as time window, allows better participation… And of course, the mental health status of the individual is not constant, right? So let’s say you have the opportunity to initiate a conversation with the mobile app, and record it - you would do it in instances where you want that to be captured, and when it is not in the clinic, doing it in this manner actually allows the subject to actually provide more information about the condition, that may or may not always be reproduced in the clinic.

Yeah, I think you’ve brought up a few really good points here. I know that in previous shows and in my conversations outside of the show, when I’m talking about AI and healthcare, a number of things come up, the first being like “Well, we don’t want people just using a smartphone app to diagnose themselves and not going to a doctor.” We don’t wanna get rid of doctors, or automate them away… But there’s also privacy concerns. So it sounds like in your case you’re not just having a recording of all the conversations, all pointed to improve diagnosis, but they are kind of like clinical sessions, but you’re recording them at the participant’s indication between clinical visits, but then also it’s being reviewed by a doctor, right? Do you view this as kind of like an augmentation to the doctor’s current workflow, or something that could turn into a completely different workflow for helping diagnose and treat and measure mental health?

Right, it is actually deployed really as an augmentation to how the clinician observes and makes decisions for the patient or subject. It is with informed consent, and it is with the ability of the participant to turn the observation on or off. So it is not always on, and the participant is actually deciding when they want to actually allow the observation to take place.

After the observation is done - let’s say that the conversation you and I are having, if I had subjected this conversation to that consent, after I am done speaking, me as a subject I’ll get to review what it is that has actually been observed from this. Then I choose whether the clinician is now being provided this input or not. Every session has that rigor of consent.

That’s fascinating to me. I’m trying to imagine if I had this app on my device, going around through daily life - I’m curious how do people choose to turn it on and off, in terms of… You know, if you’re looking at lots of different use cases, do people tend to have it on most of the time, knowing that that’s recording? Does it make them nervous, does it change their behavior? I’m trying to imagine if I was that patient how I would react to having this tool.

We have done some analysis with retrospective data. That means sessions that have previously been recorded already in a clinician’s office, for example, and built the analysis methodology based on such retrospective data. Then we moved into the very carefully constructed prospective studies that you’re asking about.

In the prospective studies not only is the individual first informed what it is that every session will be about, and how they have to participate, but for each session they are actually taking some steps. For example, in one study the technology is actually deployed as an app on the phone, and they are actually starting the app, the app will prompt them with certain questions, and Guillermo can walk you through actually what the example questions are and what an example session is like… So it’s initiated by the individual; they go through it, maybe a few minutes, five minutes, ten minutes, and then they conclude the session and that’s the information that then gets used to analyze.

I’d love to turn to Guillermo actually on that same point… I was already thinking of a follow-up to this in terms of – on the technical side, the people that are actually implementing the models, and the interaction of the models with the app and all of that - it sounds like there’s a real importance between those technical people and the doctor’s expertise. You just mentioned developing this question and answer session. Could you speak more to that interaction and the importance of that, Guillermo?

Yeah, it’s a great point, and it’s something that we have developed very carefully in all the studies that we have published and we are conducting; we are working very closely with clinicians - psychiatrists and neurologists - and that’s very important, both because we want what we develop to eventually be adopted by the field of mental health, but also because we are interacting in a very productive way…

We can think of this in two parallel avenues. One is the typical AI, big data science approach. We try to create features of all colors and shapes, and throw them against the wall and see what sticks. But at the same time, of course the space of features is, for all practical purposes, infinite, so you always need knowledge. So at the same time, what we are doing is by interacting with clinicians and medical researchers, we are trying to open up their minds and trying to understand how the features and the symptoms that they have found to be most relevant can be turned into algorithms. I can give you a very concrete example of both.

In the first case, when we create features, we have results showing that we can discriminate [between] parkinsonian patients that are on the medication (Levodopa) or off the medication, using features that include frequency components of the voice that are not detectable by the human ear, but they are still there because the drug is psychoactive, so it affects your nervous system, and of course, trivially, it affects your voice.

On the other side, we study psychosis, and one essential component of what defines a psychotic state of a person is what psychiatrists call flight of ideas. That is the notion that these patients maybe are talking about something, and very dramatically jump the topic to something completely unrelated. So what we did there was use NLP techniques to create an algorithm that will detect those jumps using another technique called semantic embedding, that is very commonly used in NLP.
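To make the idea of detecting topic jumps with semantic embeddings concrete, here is a minimal sketch. It is not the algorithm the guests built: it simply averages word vectors per sentence and flags consecutive sentences whose cosine similarity falls below a cutoff. The tiny embedding table, the function names, and the 0.5 threshold are all hypothetical choices for illustration; a real system would use embeddings learned from a large corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_vector(sentence, embeddings):
    """Average the vectors of the words we have embeddings for."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def coherence_jumps(sentences, embeddings, threshold=0.5):
    """Return (index, similarity) for each sentence that is semantically
    far from the sentence right before it. The threshold is an assumption."""
    vecs = [sentence_vector(s, embeddings) for s in sentences]
    jumps = []
    for i in range(len(vecs) - 1):
        if vecs[i] is None or vecs[i + 1] is None:
            continue  # no known words, so nothing to compare
        sim = cosine(vecs[i], vecs[i + 1])
        if sim < threshold:
            jumps.append((i + 1, sim))
    return jumps

# Hypothetical 3-dimensional embeddings, made up for this example.
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.1, 0.0],
    "walk": [0.9, 0.2, 0.0],
    "park": [0.8, 0.3, 0.1],
    "satellite": [0.0, 0.1, 1.0],
    "orbit": [0.1, 0.0, 0.9],
}

sentences = [
    "I took the dog for a walk.",
    "We went to the park.",
    "The satellite is in orbit.",
]
jumps = coherence_jumps(sentences, TOY_EMBEDDINGS)
print(jumps)
```

With these toy vectors, only the third sentence is flagged, because the first two share a topic and the third does not.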

This is one way in which we interact between both worlds… Learn and formalize as much as possible decades or even centuries of knowledge in psychiatry, psychology and neurology, and at the same time to leverage all the power of AI and the signal processing of computer science in general. I hope that gives you an idea…

Yeah, definitely. Following up on what you were just saying, Guillermo, it sounds like a ton of different knowledge from psychiatry that you’re trying to infuse in these algorithms and these techniques, it sounds like there’s a bunch of different applicable NLP techniques - you were just talking about semantic embedding, and other things… I was wondering if you could just walk us through what the data is like that you’re actually gathering as far as both the features you’re using for inferences, and also the training… For example, if you’re getting audio, does that mean you’re kind of gathering the audio in this question and answer session, and then doing speech-to-text, or using a first model to get the text, and then the text is input features to other models to do the semantic embeddings, or other things? Could you give us a little bit of a sense of that data flow and the structure and type of data?

Absolutely, yes. We work, as Ajay was saying, with either clinical interviews, or speech samples that are gathered having clinical evaluation in mind. We have monologue speech samples, we have written text in some cases, and we also have dialogue in other cases. The context is that we either have semi-structured clinical interviews - those seem to be the most effective; by semi-structured I mean it’s not following a very precise structure flow of questions and follow-ups, but trying to nudge the patient into talking about something and expressing themselves.

In other cases we have a monologue with anchor subjects. In some cases it can be very short, and we typically target naturalistic samples. For instance, we ask the patients to talk about a typical day in their life, or how their week was, or where they would like to go for vacation, because the idea is that with those types of prompts - we can reuse them, as Ajay was saying, on a weekly or even daily basis, so we can monitor their state.

Then what we do with the data is - yeah, of course, in the case of speech we have audio files, and we process them as such. We extract voice features that are very well established in the field of voice processing, we extract features related to, for instance, the pause distribution between words, the phoneme structure, something that’s called the “vowel space” – it’s how you pronounce your vowels – that might be different for instance across different accents even in the same language.
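As one illustration of the pause-distribution feature mentioned above, the sketch below computes simple statistics over the silent gaps between words. It assumes word-level timestamps are already available, for example from a speech-to-text system or a forced aligner; that input format, the function name, and the sample timings are assumptions for this example, not details from the episode.

```python
import statistics

def pause_features(word_times, min_pause=0.0):
    """Summarize gaps between consecutive words.

    word_times: list of (word, start_sec, end_sec), sorted by start time.
    min_pause: gaps at or below this length are ignored (assumed cutoff).
    """
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(word_times, word_times[1:]):
        gap = next_start - prev_end
        if gap > min_pause:
            pauses.append(gap)
    if not pauses:
        return {"count": 0, "mean": 0.0, "median": 0.0, "max": 0.0}
    return {
        "count": len(pauses),
        "mean": statistics.mean(pauses),
        "median": statistics.median(pauses),
        "max": max(pauses),
    }

# Hypothetical timestamps for a four-word utterance.
words = [
    ("my", 0.00, 0.20),
    ("week", 0.25, 0.60),
    ("was", 1.40, 1.60),
    ("fine", 1.65, 2.00),
]
features = pause_features(words)
print(features)
```

Here the long gap before "was" dominates the maximum, while the median stays close to the short gaps, which is why a distribution is more informative than a single average.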

Then on the lexical side we extract the expected low-level features, so we can parse sentences into their grammatical components, so we can understand how verbs and nouns and adjectives are used and where in the sentence; that has shown to be important in certain conditions.

We also extract – as I was saying, the idea of semantic embedding… That allows us to take a word or a sentence and have a notion of how similar that word is to other words. We can use target words that are of interest for the particular condition, and understand how the patient in their discourse is getting closer in meaning or farther in meaning for certain concepts that are relevant.
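The "closer or farther in meaning" idea can be sketched as a similarity trajectory: for each word the speaker produces, measure its cosine similarity to a chosen target word. The target word, the two-dimensional vectors, and the function names below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def proximity_to_target(words, target, embeddings):
    """Return (word, similarity to target) for each word with a known vector."""
    target_vec = embeddings[target]
    return [
        (w, cosine(embeddings[w], target_vec))
        for w in words
        if w in embeddings
    ]

# Made-up 2-dimensional embeddings for illustration only.
TOY_VECTORS = {
    "sad": [1.0, 0.0],
    "lonely": [0.9, 0.3],
    "tired": [0.6, 0.6],
    "beach": [0.0, 1.0],
}

trajectory = proximity_to_target(["beach", "tired", "lonely"], "sad", TOY_VECTORS)
for word, sim in trajectory:
    print(f"{word}: {sim:.2f}")
```

In this toy sequence the similarity to the target rises from word to word, which is the kind of drift toward a concept that the passage describes.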

And then we also extract higher-level features… Those are more aligned with, as I was saying, concepts from psychiatry. Just to give you an example, we have algorithms that can measure how metaphorical the content of a phrase is, and that is relevant in psychosis because one of the symptoms of psychosis is a disruption of your appreciation of metaphors, both in terms of how you understand them and how you produce them.

That gives you an idea of the full spectrum of features that we analyze, we study, from the audio and from the text side of language.

Guillermo, that is quite a list of features that you’re extracting, going from the phoneme structure, vowel pronunciation, accents, a lot of the lexical stuff you’ve just covered… Are there certain patterns that you have found through the data that have been more relevant than others, that you’re noticing seem to be weighted heavier in your analysis through NLP? Are there things that are sticking out as particularly important, or has that been established?

Well, what I would say is that language, and even more speech production, is such a complex phenomenon. We know from computer science how difficult it is to deal with it, how difficult it is to use coherent language.

It comes naturally for us humans to do it, but any disruption in the health of your brain will have immediately an effect in language. Like I said, even for conditions that traditionally have been considered motor disorders, like Parkinson’s, we know and we’ve found (and we are not the only ones who have found) very clear effects in language, and even in content. Even if you have something that supposedly is a motor dysfunction, the content of what you are producing as you speak is affected.

We can talk about [unintelligible 00:24:29.00] a feature that seems to be popping up often. One is the one I mentioned, that we originally developed for psychosis - the idea of measuring flight of ideas as semantic coherence. That seems to be useful to analyze different conditions, and even situations in which, for instance, a patient may take a psychoactive drug like Ecstasy or methamphetamine.

But if I had to answer your question, I would say that every single aspect of language is affected… Of course, differently, but it’s affected because again, language is a very complex phenomenon that involves many, many different aspects of brain function. Any tiny disruption will have an effect.

Another interesting thing that Guillermo focused on early enough, and that has been very instructive for us, is to really emphasize the spontaneous production of speech… Basically, not go in the direction of some rote answer but rather have the individual create an answer; a preexisting context and answer doesn’t exist in that person’s mind yet, so that spontaneous production is actually eliciting some of these features that he’s describing, and it’s enhancing the visibility of those features quite well.

Guillermo, maybe you wanna describe actually the picture test, which really is a very nice spontaneous production…?

Yeah, that’s a very good example. We are studying actually a number of conditions using this approach, that was initially developed decades ago, to study cognitive decline. You can look it up, it’s called “the cookie theft task”, and there are variations of that. Essentially, you’re shown a picture - a hand drawing - of a typical 1940s-1950s Americana household situation; there is someone who seems to be a mother, doing the dishes, but she seems to be absent-minded. And there are two kids, a girl and a boy, and the boy is standing on a stool, trying to get a cookie from a jar. The task is just to describe that in your own words. It’s something that takes 2-3 minutes at most, it’s very natural, and variations of that can be used to be repeated very often, so you don’t get bored…

What happens is that when you analyze the content of that description of the task, what you say, what type of words you use, but also the structure - even the syntax of what you’re saying, how you’re constructing the sentences, and how florid or how simplified your speech is, that contains a huge amount of information about your cognitive state. That has been used by manual raters, like I said, over decades, to have an estimate of your cognitive state… But now we can do that in a completely automated way, and we have shown that we can infer the clinical scales that are produced by the human evaluators with a very high accuracy, with the advantage that we can do this remotely, and like I said, we can do this at a very high frequency and without having to bring the patient to the hospital, or the clinician to the house of the patient. And it has value that goes even beyond the idea of measuring or estimating cognitive decline, because it can be applied to many other conditions… Because as I was saying, even something that on the surface looks so natural such as describing a picture, requires a huge amount of brain real estate, and any failure will leave an imprint in the way that you perform this task.
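As a rough illustration of what automated features from a picture description could look like, the sketch below computes a few simple lexical measures: token count, type-token ratio (vocabulary variety), and mean sentence length. These particular measures are assumptions chosen for the example; the episode does not say which features the team uses, and mapping features to a clinical scale would additionally require a model trained on rated samples, which is not shown here.

```python
import re

def description_features(text):
    """Compute simple lexical measures from a transcribed description."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return {"n_tokens": 0, "type_token_ratio": 0.0, "mean_sentence_length": 0.0}
    return {
        "n_tokens": len(tokens),
        # Share of distinct words: lower values mean more repetition.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Average words per sentence, a crude proxy for syntactic simplicity.
        "mean_sentence_length": len(tokens) / len(sentences),
    }

# Hypothetical transcript of a short picture description.
sample = "The boy is on the stool. He is reaching for the cookie jar."
feats = description_features(sample)
print(feats)
```

Because the same function can be run on every new recording, measures like these can be tracked at whatever frequency the samples are collected.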

I think that leads into a question that’s been in the back of my mind through this whole conversation… You’ve mentioned that the way in which you gather data and kind of the spontaneity of it is really important, and that immediately kind of leads me to think about bias in data, both in terms of the way that you gather it, but you’ve also already mentioned accents, and language variety, and that sort of thing, and we’ve already seen disasters in healthcare scenarios where maybe you’re trying to diagnose skin lesions or something, and your data only has data from light skin people, or something… And I would guess that the same sorts of things exist in language, in the sense that both education level maybe, but also regional accents, second-language-speaking people (not speaking in their first language), all of those things kind of come into play.

When we start thinking about language, I know IBM has also done a lot of work around fairness and bias, so I was wondering if that has entered into this work yet, or is it something that you want to probe further in the future?

Yes, of course we take that into consideration, and we try to account for those – I don’t know whether to call them biases, but it’s the context of the person, right? The personal context, and even maybe the group context.

Now, we have several cases in which we can track the patient over time, and for those, we have the best way of accounting for variations because we have the history of the patient. In some of the studies that we have conducted we know that if we didn’t have the story, the context of the person, we could not get any results. Trivially speaking, for instance, if you don’t know that the person is male or female, the acoustic content would be confounded, right? So when possible, we try to precisely have studies that track the individual, and that accounts to a large extent for those biases, as you mentioned… But also, it’s really part of one of the goals that we are pursuing - the possibility of personalizing the evaluation, and eventually the treatment for a person… Just being able to track someone on a daily basis, that is taking a certain medication and following a certain treatment - it’s one of the ultimate goals that we want to do. And in those cases, we have ways to account for the biases; this is much easier to account for the individual biases.
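One simple way to picture "accounting for the individual" is to compare each new measurement against that same person's own history rather than against a population average. The sketch below does this with a z-score; the feature being tracked, the sample values, and the function name are hypothetical, and the episode does not describe the team's actual method.

```python
import statistics

def deviation_from_baseline(history, new_value):
    """Return how many standard deviations new_value is from this
    person's own past values, or None if the baseline is undefined."""
    if len(history) < 2:
        return None  # not enough sessions to estimate variability
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return None  # no variability observed, so the z-score is undefined
    return (new_value - mean) / sd

# Hypothetical per-session coherence scores for one person.
history = [0.82, 0.80, 0.85, 0.79, 0.84]
z = deviation_from_baseline(history, 0.55)
print(z)
```

A value that would look ordinary across a population can still be a large departure for one individual, which is the point of tracking people against their own baseline.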

Ajay, I’m curious - can you describe what the output looks like here? Are we really talking about – is there one diagnosis, or do you have multiple diagnoses as an output, and what do your models look like to support that output? Is it different models for each diagnosis, or one model to rule them all, as you might say? What is that output and how are you structuring your models to get to that output?

Sure, yeah. Actually, just to continue for a second on the issue of language and bias…

…the retrospective work that Guillermo has done… He already looked at several different languages and people speaking in those native languages - English versus Spanish versus Portuguese, and so on. I think that’s very important, to actually think of this science, as well as its eventual use as being close to what the person already experiences, and not actually take the person into some new territory where that distortion or bias is actually more pronounced. I think that’s a research goal that we have to maintain - make the technology work for the person, and not the other way around. That is a quest that we are continuing on, but the retrospective work already shows us that that actually is possible… So we are encouraged by the fact that we should be able to bring these technologies into different languages.

To your question of “How do results actually get reported and exactly what are we describing in that report?”, first I would say that this is not really a diagnosis. There’s nothing clinically diagnosis-like that is being generated here… Rather what we are doing is surfacing features that the clinician is already trained to look for, and make sure that those features are actually visible to the clinician.

The diagnosis and possible help to the patient, whether it is in terms of diagnosing or in terms of treating, is being done by the licensed expert practitioner. So all that we are doing is using this tool we are making sure that the patient’s own experience is being captured sufficiently well. Features that are clinically relevant, like the ones that Guillermo was describing are actually being captured and surfaced, and it is on the basis of those features that a trained practitioner would actually then be prompted to do what they are trained to and licensed to do already.

So this is augmentation, it’s not attempting to do what the practitioner does already, which is diagnose and treat. So the report actually has both graphical, as well as numerical and textual form of these features being surfaced, whether they are in a graph, or the disjoint thoughts that Guillermo was talking about can typically be presented in a graph form. So you either have disjoint graphs, or an extremely complex graph that is actually demonstrating the complexity of the word choices in the context that the person is talking about. And a trained psychiatrist is actually then able to look at that, and as they’re accustomed to, use those features to actually then be able to make better decisions.
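One possible reading of "disjoint graphs" is a word graph in which consecutive words are linked, so that unrelated stretches of speech end up in separate connected components. The sketch below builds such a graph and counts its components. This construction is an assumption made for illustration, not the report format described in the episode, and the toy sentences omit function words such as "the", which would otherwise link everything together and would need filtering in practice.

```python
def build_word_graph(sentences):
    """Link each word to the word that follows it within a sentence."""
    graph = {}
    for sentence in sentences:
        words = [w.strip(".,!?").lower() for w in sentence.split()]
        words = [w for w in words if w]
        for w in words:
            graph.setdefault(w, set())
        for a, b in zip(words, words[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def count_components(graph):
    """Count connected components with an iterative depth-first search."""
    seen = set()
    components = 0
    for node in graph:
        if node in seen:
            continue
        components += 1
        stack = [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(graph[current] - seen)
    return components

# Hypothetical utterances: the first two share "ball", the third shares nothing.
utterances = [
    "dog chased ball",
    "ball rolled away",
    "satellites orbit quietly",
]
word_graph = build_word_graph(utterances)
n_components = count_components(word_graph)
print(n_components)
```

In this example the first two utterances form one component and the third forms its own, giving two disjoint pieces.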

The easiest way in which this might get used is for screening purposes. A psychiatrist in the future might actually be getting this kind of report, just to keep tabs on what is happening in the life of a person who needs to be watched, and you are using that to actually just watch and screen. And when it gets to a threshold of some concern, you’re actually indeed intervening; the practitioner at that point is intervening, and doing what they are trained to do, but now you actually have extended the observation to the life of the person, and you’re able to observe more thoroughly and act upon it before it becomes catastrophic. I think that’s the more likely usage here. Diagnosing and treating by itself is actually the hardest problem, and that we really need to have practitioners do.

Yeah, I think that that brings things together really well and gives us a lot of great context for the use. And as we wrap up here for the episode, I’d love to first off just thank you guys for working on some application of AI that really is making a positive difference for people; that’s something that Chris and I always want to promote as much as we can… But I’d love to just get you guys to share, as we kind of wrap up, what you’re excited about as far as either results that you have now, or maybe next steps that you’re going to, and then also for the listeners who are maybe more interested about this subject either on NLP side or the application side, where can they find out more about your work, or the techniques that you’re using? If you guys could give us a little bit of that perspective, that’d be great.

Well, I have to say that what is keeping me up at night with excitement is the work that we are developing around doing something similar to what we were describing, but in the context of therapy, and therapy sessions… With, again, the same idea of expanding and providing additional tools to the therapist to track the evolution of a patient that is undergoing some type of therapy, and being able to integrate information from different sources, that are relevant to the particular individual that is undergoing this therapy. I think this is one of the next frontiers, and it’s challenging, but at the same time very exciting.

What excites me the most about this is that a lot of the mental health, as well as neurological, conditions that individuals experience have really been either not attended to, or been misdiagnosed without the right kind of help provided, or help has been provided only in worse and acute situations, but not really more continuously.

What we are witnessing through this work, as well as all other things that are happening with the internet of things and how technology is intersecting with our daily lives, is a change that we are seeing and experiencing where the technology actually allows us to do things in a different way, but in this case going from episodic encounters in the clinic to a continuous measurement done in the convenience of your home and in your daily routine.

What that does is it actually brings attention and it allows practitioners to actually address real issues that are beyond what is happening in the clinic. So it might extend the reach of help that people get, and that is a change; that is such a huge change for the positive, because the unmet needs for mental health are huge, and using these kinds of technologies one is actually able to hopefully increase the aperture to which these needs are addressed.

That change I think is very much for the positive, and people who do experience these conditions, whether it’s anxiety, depression, cognitive decline - they need that help, and we are conceivably moving in a direction where that becomes possible for them.

Ajay and Guillermo, thank you very, very much for coming onto this episode. We will definitely put links to your papers out in the show notes, so that our listeners can access those. If people want to find out more or reach out to you, how would you like them to reach out to you?

One place to get to is the IBM Healthcare and Life Science Research website, which features a lot of our breaking scientific news, and including the work that Guillermo is talking about. That’s at That’s a good place to go.

Fantastic. Thank you both for coming on the show. I wish you very well in this work. Goodbye!

Thank you. It was a pleasure talking to you.

