Practical AI – Episode #284

Metrics Driven Development

With Shahul, co-founder of Ragas

How do you systematically measure, optimize, and improve the performance of LLM applications (like those powered by RAG or tool use)? Ragas is an open source effort that has been trying to answer this question comprehensively, and they are promoting a “Metrics Driven Development” approach. Shahul from Ragas joins us to discuss Ragas in this episode, and we dig into specific metrics, the difference between benchmarking models and evaluating LLM apps, generating synthetic test data and more.

Sponsors

Assembly AI – Turn voice data into summaries with AssemblyAI’s leading Speech AI models. Built by AI experts, their Speech AI models include accurate speech-to-text for voice data (such as calls, virtual meetings, and podcasts), speaker detection, sentiment analysis, chapter detection, PII redaction, and more.

Chapters

1 00:00 Welcome to Practical AI 00:43
2 00:43 What is Ragas 04:36
3 05:19 General LLM evaluation 04:51
4 10:10 Current unit testing workflow 04:27
5 14:37 Metrics driven development 02:33
6 17:20 Sponsor: Assembly AI 03:26
7 20:59 Most used metrics 05:28
8 26:27 Data burdens 09:23
9 35:50 Exciting things coming 04:59
10 40:49 Thanks for joining us! 00:36
11 41:25 Outro 00:46

Transcript

Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the founder and CEO at Prediction Guard, and I’m really excited to talk to another founder today in the AI space. Today we have with us Shahul from Ragas. He’s one of the co-founders. Welcome, Shahul. How are you doing?

Hey, I’m doing good. Hi, Daniel. Hey, folks. How are you doing?

Yeah, yeah. Well, thanks for joining at a late hour in India. I appreciate that. But yeah, I would love to hear a little bit - maybe for those that aren’t familiar with what you’re doing, maybe share a little bit about what that is. And also maybe how you came upon the types of problems that you’re solving with your current work.

Sure. That’s a very good place to start. So Ragas is an open source library for evaluating LLM applications. What we are trying to do with Ragas as an open source library is to provide the developers or AI engineers who are building LLM applications with the tools and workflows necessary to automate, or partially automate, the process of evaluation, using the different methods and techniques that we bring out through Ragas. And how we came up with the idea of Ragas - basically, my co-founder Jithin and I have been working in ML for the past six or seven years, and when LLMs came out, we were already working with large language models. Jithin mostly worked on the inference and infrastructure side of it, and I was working as an affiliate researcher.

So practically, we loved it; we were doing a lot of experiments with it, we were part of different open source initiatives, building LLMs and also using LLMs to build different frameworks at that point. Even LangChain and LlamaIndex were just coming out as some of the earlier frameworks at that point. This is around early 2023. And we were also working with different applications, and RAG was one of the most popular applications that went into production very easily with LLMs. It’s one of the first things that LLMs actually opened up as a possibility for enterprises - building something on top of them that can save a lot of time and money.

So we were also building RAGs, and initially, after a couple of experiments with some clients, we found out that, okay, we are able to build these LLM applications, these RAG applications, using any LLM. It has different moving components, and going forward, there will be a lot more moving components in any LLM application. It will be something like a compound system where there are – as you can see now, it’s not only RAG, now it’s tool use cases. There are different [unintelligible 00:03:41.01] And we thought, okay, why don’t we build some evaluation metrics that can be used to understand the quality of any RAG application that an AI engineer is building? We also found out that going through these answers or intermediate results manually is not a scalable approach. I could rephrase that to say it’s a very boring thing to do, and nobody is really keen to do it. It’s something everybody will push onto someone else, but it is also something very important.

So we thought of identifying methods, or coming up with solutions, that could help developers do the evaluation, but also save them a lot of time while doing it. Evaluation of an LLM application - regardless of whether it’s RAG or an agentic workflow - is basically a very tedious process that takes up a lot of time if you go through it manually. We want to make sure that this tedious, manual, time-consuming process is cut down to one tenth of the time, and that it gets you the same insights as going through it manually.

So that’s how we came up with the initial MVP of Ragas, and we released the open source library in the middle of 2023. Since then we have been continuously iterating, and we have been getting organic growth and usage from there. That’s what we are trying to do.

Yeah, that’s awesome. I noticed that you very specifically refer to the evaluation of LLM applications, not necessarily the evaluation of LLMs. Could you explain what might be the difference between those mindsets for people that are maybe getting into this? Maybe they have looked at benchmarks, let’s say, or like a leaderboard for LLMs. And there’s a certain level of evaluation or benchmarking there… But then you’re talking about the evaluation of LLM applications. So could you help us understand kind of some of the differences there, some of what needs to be thought of at the application level, versus at the model level?

[00:06:04.27] This goes to the ideology of AI as a consumer product as of now… Because pre-LLMs, most of the enterprises or startups who were building AI-powered applications used to have their own models; they would obviously also have a team of data scientists who were building and managing these models. So in that spectrum, the building of the model itself and the evaluation of the model itself were the responsibility of the researcher/engineer/data scientist who was working on it. But now, [unintelligible 00:06:45.06] building an LLM application - so, AI applications - is not really building their own models. They are consuming models from an external endpoint, or even open source models that are being released, and building applications on top of them.

Now, this actually forms a wide spectrum, where at the left end of the spectrum there are people building the LLMs itself, they have a separate list of targets, loss functions, metrics and benchmarks to evaluate on, and towards the right end of the spectrum there are people who don’t really care about how the LLM is built itself; they only care about “Can this thing which I’m consuming using an API do what I’m trying to do here, using AI as a technology?”

Now, when a researcher or an organization who builds LLMs evaluates LLMs, they don’t really know the exact use case this AI or LLM is going to be used for, or the type of data it will be used against. So when the evaluation is done at the researcher or LLM-builder end of the spectrum, they are limited to evaluating or testing this LLM’s capabilities on a general-purpose basis. They are not really tailoring that evaluation or testing to your application, because they don’t really have control over what you’re trying to build with it. They can only say, “Okay, this is the general capability of the model” - and we know that even those general-capability benchmarks are often leaked into the training data; so even that is questionable, but that’s a different discussion.

This is what a researcher or an LLM builder does at their end. But when it comes to an application builder - if someone builds a RAG or a tool-using agent on top of an LLM, you can’t really say, “Okay, the LLM builder reported X accuracy, so I should get X.” That would be a wild assumption to make. So what we are trying to do here is give that user, the application builder at the right end of the spectrum, the power and the tools to evaluate their application without knowing so much about ML, or getting into so much jargon.

We want to make it as easy and intuitive as possible, because most of the people who are building AI applications are not from an ML background; they are from a software engineering or application-building background, where their specialty is building and scaling these applications, not building [unintelligible 00:09:18.07] So we are trying to be as intuitive as possible, and we also don’t want to take up a lot of their time while doing the evaluation. Their time is valuable, and we want to do the heavy lifting of this evaluation for them.

Yeah, that makes a lot of sense. And also, I’m realizing there’s this kind of spectrum that I was thinking about while you were talking… On the one side, you have data scientists or researchers who are building out benchmarks or metrics for models. On the other end of the spectrum you have maybe software engineers, who are used to writing unit tests or integration tests for their software… And then, what we’re really talking about is integration of LLMs into software applications, or into certain workflows.

[00:10:09.13] You talked a lot about this distinction between LLM benchmarks and evaluating LLM applications. Could you talk a little bit about the differences? Maybe there are software engineers in the audience who are used to writing unit tests and integration tests for their software. Now, as a consumer, as you said, they’re integrating some LLM functionality, or maybe a chain of reasoning with LLMs, into their software. From a practical standpoint, what are the new types of things they might need to consider that are different from the way they’ve unit-tested or written tests in the past, now that they’re working with these LLM workflows?

Sure. That’s a very interesting question. So when it comes to the application builder/software engineer who is building with AI, as you said, most of them are already familiar with the unit tests and integration tests they regularly write for their software. Now, the major difference is that - I have also worked with many software engineers who are [unintelligible 00:11:19.03] the evaluation using this analogy: okay, this is testing; I will learn or understand evaluation using my understanding of testing. The thing is, that’s the fundamental issue with the analogy itself. You are using an analogy to understand one thing, but these two things can have very different properties. For example, evaluation versus traditional software testing: traditional software testing is mostly a discrete space, where you have an input, you have an expected output, and if you give this input, you are supposed to get this expected output from the software. There is no variation that satisfies the test. If the whole logic is basically addition, and you give the input one plus one, the output should be two. There is no other possibility that would satisfy the case as correct. But when it comes to natural language, it is more of a continuous space. If you have an LLM that does the same thing - let’s say one plus one, or addition - the LLM could say “two” in natural language, or output the number 2. Both are actually correct. So here there’s a continuous space, where the output cannot be exactly matched against or [unintelligible 00:12:49.08] but you should have an understanding. In this continuous space of possible outputs, there’s obviously a subset that can be regarded as correct.

Obviously, if the LLM gives an answer as three in natural language or something, that means it’s wrong. That is out of the way. [unintelligible 00:13:06.09]

So what software engineers should really understand is that when you deal with ML, when you integrate AI into your application, you should try to think about it as a continuous space, rather than a very discrete space, in a black-and-white, yes-or-no manner. There is a lot of room in the gray area here. That is what a software engineer should understand.

And then there is also the non-deterministic part of the whole AI thing. Basically, if you have a system which is part code and part AI, the system is going to be somewhat non-deterministic, whatever you do. Traditional software is not like that; traditional software is very deterministic. You have an input, you will have an output, given that the intermediate states remain the same. So here, again, you have this non-determinism to take into account. Those are the two major differences between traditional software testing and testing these combined AI applications.
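To make the contrast concrete, here is a minimal sketch in Python. The `ask_llm` and `judge_equivalence` helpers are hypothetical stand-ins (not Ragas APIs); the point is only that the LLM check scores the answer inside a continuous, tolerant space instead of asserting one exact string.

```python
# Traditional unit test: discrete space, one exact expected output.
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(1, 1) == 2  # any value other than 2 fails


# LLM-application check: continuous space, non-deterministic output.
# `ask_llm` and `judge_equivalence` are hypothetical stand-ins, not Ragas APIs.
def ask_llm(question: str) -> str:
    return "two"  # placeholder; in practice this would call your model

def judge_equivalence(answer: str, reference: str) -> float:
    """Return a score in [0, 1] rather than a hard pass/fail.

    A real implementation might normalize text ("two" vs "2"), use
    embedding similarity, or ask an LLM judge to grade the answer.
    """
    normalized = {"two": "2", "2": "2", "2.0": "2"}
    return 1.0 if normalized.get(answer.strip().lower()) == reference else 0.0

def test_llm_addition():
    answer = ask_llm("What is one plus one?")
    score = judge_equivalence(answer, reference="2")
    assert score >= 0.8  # accept a region of correct outputs, not one exact string
```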

[00:14:18.00] Yeah, super-interesting. And I know in the software world there’s all of these sort of frameworks for development around test-driven development, or data-driven development, or these different things… I notice in Ragas, in your core concepts, in your documentation you talk about metrics-driven development for these LLM applications. So for those that are maybe developing LLM applications out there, could you describe a little bit the mindset of metrics-driven development, what you mean by that? And then maybe we can get into a few more of the details of Ragas itself and how you enable that framework.

Yeah. Metrics-driven development is a concept we took from test-driven development itself. The idea is that we want to educate software developers - who are not familiar with metrics, but are familiar with testing - about what we’re trying to do here. So, metrics versus tests: a metric is something that delivers a value, something that helps you understand the performance of an application on a scale of, let’s say, zero to one.

Now, whenever you have an application and you want to iterate or change something - let’s say you are building a whole LLM application that consists of different agentic workflows, plus [unintelligible 00:15:44.17] workflow etc. Now let’s say you want to change one single prompt, or one single function, or something - [unintelligible 00:15:55.01] how would you understand the effect of that change on your pipeline? That’s the question metrics-driven development tries to answer.

So if you have a metric - if you have a way to objectively quantify or understand the performance of your system before and after this change - you can also understand and analyze the system’s responses, the behavior of the system, using those numbers. For example, if you are switching out the retriever, and let’s say you have a set of metrics that effectively quantify the performance of your system, you could swap the retriever or any function call, run the entire pipeline once again on the given test set, and understand the change in those metrics along particular dimensions. And once you observe that change, you can easily dig down - you can fetch the samples for which the change is reflected the most, and then easily analyze and understand [unintelligible 00:16:56.01] That actually helps a lot - it reduces a lot of time when it comes to debugging and testing these applications. So that’s the idea of metrics-driven development.
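A rough sketch of that before/after loop, in Python. `run` functions and `score` here are hypothetical placeholders for your application and whatever metric you have chosen; the idea is just to evaluate the same fixed test set on both versions and surface the samples whose scores moved the most.

```python
# Metrics-driven development loop (sketch): score the same fixed test set
# before and after a change, then surface the samples whose metric moved most.
from typing import Callable, Dict, List

def compare_versions(
    test_set: List[dict],
    old_pipeline: Callable[[dict], str],
    new_pipeline: Callable[[dict], str],
    score: Callable[[dict, str], float],  # any metric scaled to [0, 1]
    top_k: int = 5,
) -> List[Dict]:
    deltas = []
    for sample in test_set:
        old_score = score(sample, old_pipeline(sample))
        new_score = score(sample, new_pipeline(sample))
        deltas.append({
            "sample": sample,
            "old": old_score,
            "new": new_score,
            "delta": new_score - old_score,
        })
    # The largest regressions (most negative deltas) are the first to debug.
    return sorted(deltas, key=lambda d: d["delta"])[:top_k]
```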

Break: [00:17:11.12]

There’s a lot that you’ve already integrated into Ragas, which is really interesting… And maybe some people have heard of these, or some people haven’t - like faithfulness and context recall, noise sensitivity, aspect critique, summarization score, and more that I’m not listing. I’m wondering if you could share a couple of those that you think are most utilized, or maybe most interesting from your standpoint, just to give people a sense of what types of metrics we are talking about here.

I think it would be beneficial to answer that at an abstract level. The whole concept of us devising these metrics and giving them to developers is that it’s not extremely hard to come up with these metrics - but when a software developer or an application developer thinks about evaluation, because they’re not already familiar with this business of putting up metrics, deciding the right workflows and everything, it could take them two to three days to figure out the right metrics.

So what Ragas does is, if you’re an application developer - okay, you are developing this particular type of application - you can come to Ragas, go to the Ragas metrics, and there you will find enough workflows, enough parts of the documentation, to help you navigate according to your use case and requirements and land on the right metrics that you should use. The documentation will also provide an intuitive understanding of how each metric is calculated underneath.

So those are two value props that we provide there. It’s not that a data scientist couldn’t easily make up these metrics on their own - but if you’re a software developer or application builder, it could be very hard for you to figure out how to think about all this stuff. So we are effectively taking that load off of you, and providing ways to easily navigate and understand the right metrics to evaluate your application, tailored to your application. Now, we are also expanding [unintelligible 00:23:10.25] workflows, and we are also going to reformulate the whole [unintelligible 00:23:18.10] particular, so that people can easily find the metrics. For example, metrics can be LLM-based or not LLM-based.

Whenever a developer comes to Ragas, they obviously ask a lot of questions. Even just thinking about metrics, they will ask a lot of questions - for example, “What is my use case? What are the parts of my application that I want to evaluate? Should I go for LLM-based metrics, which have high correlation with human judgment, but also have their own issues, like non-determinism? Or non-LLM-based metrics, which are traditional - they have less correlation with human judgment, but they are more reproducible?” All these questions, all these doubts pop up when a developer thinks about metrics. We are trying to abstract all that into Ragas metrics, and provide the developer a way to think about metrics.

And when they adopt a metric from Ragas, they also get a related list of features they can use, because they adopted a metric from Ragas. For example, let’s say you designed a metric for English. Now let’s say you are evaluating Portuguese, or Spanish - you’d have to convert the metric to the other language, and you’d have to do that for every language you evaluate. If you’re using a metric from Ragas, Ragas can take care of that.

Then there are also issues like - if you are building an LLM-based metric, which is the trend here; most of the metrics are LLM-based, because they have high correlation with human judgment. Now, let’s say you are evaluating with LLM-based metrics and you find that they are not performing well enough in your case - you should have a way to align these LLM judges… You know, people will have different expectations for different metrics. For example, let’s say you are measuring something like faithfulness. Faithfulness is basically the amount of hallucination that is happening in the generated answer, given a context. And if you look at it, different developers from different domains have different strictness towards hallucination.

I could have a statement like “I have a blue car” versus “I have a car”, and for me, maybe in my domain, I could say those are two equal statements. But for someone who is working in FinTech or something, they might not be equal statements. So there is also a kind of domain bias when it comes to metrics and judgments, because developers from different domains expect different levels of strictness.
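To make the faithfulness idea concrete, here is a deliberately naive sketch: split the answer into claims, check each claim against the retrieved context, and report the supported fraction. Both helpers are crude placeholders - Ragas’ actual metric uses an LLM for claim extraction and verification - and how strictly a claim must match is exactly the domain-dependent judgment described above.

```python
# Deliberately naive faithfulness-style score: the fraction of claims in the
# answer that are supported by the retrieved context. Ragas' actual metric
# uses an LLM to extract and verify claims; these helpers are placeholders.
def split_into_claims(answer: str) -> list[str]:
    # Placeholder: a real implementation would use an LLM to extract
    # atomic statements from the answer.
    return [s.strip() for s in answer.split(".") if s.strip()]

def claim_supported(claim: str, context: str) -> bool:
    # Placeholder: a real implementation would ask an LLM judge whether
    # the context entails the claim. Here we only check word overlap.
    claim_words = set(claim.lower().split())
    return claim_words.issubset(set(context.lower().split()))

def faithfulness_score(answer: str, context: str) -> float:
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    supported = sum(claim_supported(c, context) for c in claims)
    return supported / len(claims)

# Whether "I have a blue car" counts as supported by "I have a car" is the
# kind of domain-dependent strictness decision discussed above.
```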

[00:26:04.26] So bringing out this alignment of the metrics depending upon your domain is one extra thing that we are trying to tackle here - basically, we call it asymmetric alignment; that is, trying to align large language model judges to your specific judgments, using the feedback that you give to the LLMs. It is also an upcoming feature in Ragas.

I want to dig into a couple of those things… So one is - you talk about alignment, but there’s also the idea that it’s very important in this case for me to be thinking about data and examples from my domain in terms of how I’m evaluating. Just to make things concrete for people, I’m looking at your example around answer relevance… There are data samples in there that have a question, an answer, and then a set of contexts, and then you can evaluate with Ragas using the answer relevancy metric. So obviously there’s a data component there, with answers and contexts in it. So one question would be, how much and what type of data will people need to put together so they can appropriately evaluate their LLM applications? And are there metrics that require data upfront, versus metrics that are reference-free and wouldn’t require data upfront? What’s the perspective there in terms of, I guess, the cold start? If people are starting an LLM application from scratch and they haven’t run it in production, what’s the data burden, and what’s the path towards getting that data in place and getting the alignment in place?
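For reference, the answer relevancy example described here looks roughly like the following sketch, based on the Ragas quickstart around the time of this episode. The exact module paths and column names vary between Ragas versions, so treat this as an approximation rather than the canonical API, and note that running it requires an LLM judge to be configured (an OpenAI API key by default).

```python
# Sketch of scoring one RAG sample with Ragas' answer relevancy and
# faithfulness metrics. Column names follow the 0.1-era quickstart; newer
# Ragas versions may expect a different schema.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

samples = {
    "question": ["When was the first Super Bowl played?"],
    "answer": ["The first Super Bowl was played on January 15, 1967."],
    "contexts": [[
        "The First AFL-NFL World Championship Game, later known as "
        "Super Bowl I, was played on January 15, 1967."
    ]],
}

dataset = Dataset.from_dict(samples)
result = evaluate(dataset, metrics=[answer_relevancy, faithfulness])
print(result)  # dict-like mapping of metric name to score
```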

So regarding the data itself - the number of samples people generally use to evaluate is around 100 to 500 when it comes to offline evaluation. It really depends on how you formulated your test data. For example, you could get the same results, or the same kind of ranking, with a 100-sample dataset. Or, in many use cases, even 100 may not be enough to cover all the use cases or all the distributions that you see in production. So it really depends on the variety of inputs that you see, or expect to see, in production.

Then let’s say you have a very niche use case, and variety is very skewed - your test data can be very small, yet still serve the purpose. But if your use case is very broad, and you see a wide variety of users coming into your application, or a wide variety of requests, your test data has to be broad enough to include all these different distributions in the dataset itself. So that’s about the number.

And I think it’s not really about hitting a large number; it’s about making sure that you understand the different distributions of queries coming into your system, and making sure those distributions are also represented in the test dataset, so that when you change a prompt, you know which types of queries are being affected. It could be that when you swap out a particular retriever or a particular tool, a specific set of queries is affected. The overall pipeline almost never gets affected as a whole, even if you make a huge change; for small changes, it’s mostly a subset of queries, a subset of the distribution, that gets affected, and it is very important that you understand those cases.
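As a toy illustration of that point, here is a hedged sketch of stratified sampling. The `production_queries` input and its category labels are hypothetical; the idea is simply to build the test set so each query distribution you see in production is represented, rather than sampling uniformly.

```python
# Build an offline test set that mirrors the mix of query types seen in
# production, instead of sampling uniformly. The category labels are
# hypothetical - you would assign them via rules, clustering, or a classifier.
import random
from collections import defaultdict

def stratified_test_set(production_queries, per_category=25, seed=0):
    """production_queries: iterable of (category, query) pairs."""
    random.seed(seed)
    by_category = defaultdict(list)
    for category, query in production_queries:
        by_category[category].append(query)

    test_set = []
    for category, queries in by_category.items():
        k = min(per_category, len(queries))
        test_set.extend(
            {"category": category, "query": q}
            for q in random.sample(queries, k)
        )
    return test_set
```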

[00:29:43.27] So that’s how to formulate a test dataset for evaluating AI applications. Regarding the second question, about reference metrics - we usually provide both reference-free and with-reference metrics, but there is a big shortcoming to reference-free metrics, because with reference-free metrics we can estimate things, but there is an error to that estimation. For example, with answer relevancy you can estimate in some way whether the given answer was correct, but if you don’t really have the exact reference, there isn’t really a way to say, “Okay, the LLM application arrived at the right thing at the end of the day.” So even with production data, it’s very hard to [unintelligible 00:30:33.08] a test dataset, because production data is very, very messy. If you ask any researcher, or anyone who has put ML into production, about their method of acquiring a test dataset, obviously they are going to look at production data - but there is a long way to go from production data to formulating test data, because production data is incredibly messy. It’s an uncontrolled environment, not a human-controlled environment; people can come and say anything there, with zero consequences.

So the thing is that production data becomes incredibly messy depending upon the application. If your application is serving an internal set of users, like internal company employees or something, you have an amount of control there. But if your application is strictly B2C and you are opening up the whole application to anyone using your application, there can be users who are basically trolls, who just like to use your system and leave false feedback and so on.

So with production data being incredibly messy, there is a long way to go from production data to a good test dataset. What we are trying to do there, again, is provide a way to synthetically create these test datasets. This synthetic creation of test datasets is grounded on things like production data, and also on your internal documents, which should be taken into account when creating a synthetic test dataset… Because again, you are trying to create a test dataset that’s really tailored to your use case, not a generic one.

So there are two things we ground it on: basically, the set of internal documents (or whatever it is you ground your application on), and also the production data, where users engage with your application. And this is again one of the upcoming features… Test dataset generation is already there, but we are also trying to extend it so that we can ground it in production data. Basically, we call it seeding from production data. If a user has already been using our [unintelligible 00:32:43.27] generation, but wants to take cues from production so the test dataset imitates more of the behavior that is actually happening in production - now, there is a lot that has to be done to first understand what is happening in production. There can be different distributions, as I said - different sets of users; we have to understand the different sets of queries coming in, the different interactions happening, and everything. Once we understand that, we can have a way to synthesize these types of data points using LLMs.

Now, the developers will not need to annotate these data points, only to verify them. Once this test data is synthesized by Ragas, you can export it to a simple UI tool - or even an Excel sheet; most users can export this data to an Excel sheet and basically go through it. And once they go through it, they can easily cut out the bad data points where they think the LLMs messed up, or something. Because again, we can’t guarantee 100% efficiency while synthesizing these data points.

[00:33:57.26] So what we are trying to do is improve that efficiency… Let’s say you generate 100 data points - our goal is to make sure that all 100 data points are equally valid and good. But it might not happen like that in every use case. So the developer can take 10 minutes of his or her time and go through this dataset manually. That is a much quicker process than annotating or creating these datasets from scratch, because creating them would easily take a day or two, or even a lot of money when you give it to [unintelligible 00:34:26.02] So the developer can basically go through the data points that have been synthesized, and then cut out the points they think are not valuable.
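One lightweight way to do that review step is sketched below: export the synthesized samples to a spreadsheet-friendly CSV with a `keep` column, let the developer mark rows, then filter. The `synthesize_test_samples` function is a hypothetical placeholder for whatever generator you use (Ragas ships a test set generator, but its exact interface depends on the version).

```python
# Export synthesized test samples to a spreadsheet-friendly CSV with a
# "keep" column, let a reviewer mark bad rows, then filter them out.
import pandas as pd

def synthesize_test_samples() -> list[dict]:
    # Hypothetical placeholder for whatever generator produces your samples
    # (Ragas ships a test set generator; its interface depends on the version).
    return [
        {"question": "What is our refund window?", "ground_truth": "30 days"},
        {"question": "Who founded the company?", "ground_truth": "..."},
    ]

def export_for_review(path: str = "testset_review.csv") -> None:
    df = pd.DataFrame(synthesize_test_samples())
    df["keep"] = True  # the reviewer flips this to False for bad rows
    df.to_csv(path, index=False)

def load_reviewed(path: str = "testset_review.csv") -> pd.DataFrame:
    df = pd.read_csv(path)  # assumes the keep column round-trips as booleans
    return df[df["keep"]].drop(columns=["keep"])
```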

So again, these are the things that really save a lot of developer time in evaluation… Because formulating a test dataset is a very cumbersome, very time-consuming, boring process that nobody wants to do; it mostly falls upon one developer on the team to do it, and it’s a messy process.

So these are the types of innovations that we are trying to bring to the evaluation space to save the developer a lot of time. And this ideology, this philosophy, is why we have been getting this organic growth - it is really kind of hard to come up with these kinds of solutions, but when we do come up with them, there are a lot of developers who badly want them. And they are the people who motivate us to continually bring these kinds of innovations to the evaluation space.

Yeah, that’s great. I love the innovations around using synthetic data in evaluation, and utilizing LLMs and these models to help in the evaluation process… but in a way that’s validated, and that still fits with improvement over time.

So yeah, as we kind of close out here - this has been a fascinating discussion… But maybe just to close out here, what are some of the things that you’re excited about moving into the next six months or so to either explore, or maybe it’s things that you see happening in the AI ecosystem more broadly? What really excites you about the direction that people are going with their LLM applications?

So with LLM applications - in early or mid-2023, RAG was a big thing. And if you remember the time when RAG became popular, there were a lot of limitations. There were only [unintelligible 00:36:29.10] and everything. I think with agentic tool-use cases we are at the same level now. We have been trying to bring in more and more tool use cases.

Tool use is actually incredibly useful when it comes to building a whole LLM application experience, because it combines with internal knowledge - internal knowledge is what you infuse with RAG, and taking actions is what you infuse with tool use, with tool bindings. So on tool binding, I’m very excited to see the next class of models performing better on tool use. Even these recent models have been bringing in abstractions, or using tool binding, to facilitate this, but it’s still a little bit shaky as of now. The next class of models I really expect to be better.

And then with the whole RAG-plus-tool-use combination, we will see a lot more enterprises adopting and using LLM applications in different capacities, saving a lot of time and resources both for the people behind these applications and for the people interacting with them. So that’s something I’m really excited about when it comes to the next six months of [unintelligible 00:37:52.24]

[00:37:58.23] And also, when it comes to the frameworks and libraries being built around applications, I’m seeing better and better abstractions these days… Because people now have almost a year and a half of experience building LLM applications. That experience is yielding more and more understanding of the best abstractions to use to build these applications. And there’s an overall agreement emerging on how to format outputs, how to build these compound systems. Earlier, in early 2023, we were really, really confused about how to build these applications, how to use AI and everything. Now I can see more clarity happening at that end, too. And at the model building stage - the model building end of the spectrum - we are also getting more clarity on things like the behavior of the models and the type of data needed; more and more papers, more and more research happening at the data processing and preprocessing stage… You know, what type of data is needed to train the highest quality LLMs. Because we think we have mostly settled on the architecture itself, from the LLM point of view. Now the main thing people are working on is data. And when it comes to data - we will finish up the free data that’s available on the internet very quickly, and now it’s synthetic data that has a really good chance of improving these models.

So the idea of models’ output being used to feed models, improving the models themselves for different use cases, is itself a fascinating thing at the model building stage. When it comes to evaluation, there need to be more and more innovations happening in that space, whether as open research or as an open approach to bringing down all the time involved in testing and assessing these applications. Because that’s one big pain point - it’s a big barrier for big companies to adopt LLMs across their entire system. These people have a high responsibility, and with high responsibility you have to do this testing and evaluation. If there isn’t a very good way of doing it, a [unintelligible 00:40:23.29] way of doing it, it becomes a pain point. And that’s something we are also trying to bring in. We are trying to get all the research, all the innovations happening in the evaluation space, to come together and build an open source standard for evaluating LLM applications, so that there is agreement between everyone on how to evaluate LLM applications. That’s the long-term mission of the company.

Awesome. Yeah. Well, thank you, Shahul, for taking time to join, and again, joining at a late hour where you’re at. I’m really excited about what you’re doing with Ragas, and this is a really interesting space that we’ll be interested to follow. So thanks for taking time, and I hope you can have a good rest of your week.

Sure. Thanks, Daniel. This was fun chatting with you, and I hope your listeners learned something from this conversation.

Yeah. Thanks. Bye-bye.

