Practical AI – Episode #195

Production data labeling workflows

with Mark Christensen, CEO of Xelex


It’s one thing to gather some labels for your data. It’s another thing to integrate data labeling into your workflows and infrastructure in a scalable, secure, and useful way. Mark from Xelex joins us to talk through some of what he has learned after helping companies scale their data annotation efforts. We get into workflow management, labeling instructions, team dynamics, and quality assessment. This is a super practical episode!



Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m normally joined by Chris Benson, who is a tech strategist at Lockheed Martin, but he’s doing great tech strategy things and traveling as part of those things, so he won’t be joining today… But I’ve got a really, really wonderful guest and topic to talk about today. We’ve sort of been diving deep into a number of modeling-related things, in terms of Stable Diffusion and various things coming out… And I think it’d be good to shift and talk about – again, we’re Practical AI, so talking about some practical data-related things would be worthwhile… And I’m really pleased today to have the CEO of Xelex with me, Mark Christensen. His expertise is all in the area of data labeling, and workflows around that, and bespoke data processes… So welcome to the show, Mark. It’s great to have you.

Thanks, Dan. Glad to be here.

Yeah. Well, could you give us a little bit of a background about how you got interested in this space of data labeling, and producing custom training datasets, and eventually built a business around that? How did that happen?

We didn’t come out of a data science discipline, actually; we came out of healthcare. And we’ve spent the last 17 years managing healthcare data at scale, specifically in the area of dictation and transcription. So an entirely different field. But the thing we had in common was managing large amounts of data at scale. And in healthcare, what we would do is we would record audio, and for 17 years we recorded audio from healthcare providers, and then moved that audio through an enrichment workflow, which just essentially was transcription. So we’d have skilled medical transcriptionists in the States and around the world who would take the audio and transcribe it into the completed healthcare note.

And a few years ago, we met with a friend of ours, who was an owner of an NLP company, and he really liked our platform; we happened to be working with him on a speech recognition project… And he said, “We really have a need for this in managing training data for NLP workflows.” And so that launched a discovery process that took about two years, and we investigated the use case and determined that there was a really neat fit. And so we modified the application for the next two years, and then launched our training data services workflow called So that’s how we got into it.

[00:04:27.22] That’s super-interesting. And I know data – specifically in the healthcare space, there’s some very interesting restrictions and very specific processes that you have to make sure that you’re following in that healthcare space. I’m wondering if you think that that perspective on data, and the security, the compliance things around that data - did that sort of shape maybe how you think about handling data for some of these use cases? Any thoughts there?

Yeah, that’s a great question, and you’re totally right. Data security is so paramount in healthcare. And my colleague at the NLP company cited that as one of the specific reasons why the workflow that we had in healthcare was a great overlay for data training. The data has to be audited, so data should have a couple of different audit trails on it; data should be encrypted, both in transit and at rest. Data shouldn’t reside on the devices of people that are involved in data labeling… So all those things were just a perfect fit and carryover between our healthcare workflow and an AI workflow. Yeah, you’re right.

Interesting. Yeah. I’m wondering - maybe as you talked to this NLP colleague, or as you’ve worked with clients, around the world, working on data labeling projects, from your perspective, how are data scientists most often labeling their data these days, and where do they encounter challenges because of how they’re approaching data labeling?

Yeah. I mean, the greatest challenge we always hear - and it’s an obvious one - is about getting data that’s accurate enough to improve the model, especially in specialty use cases, or let’s say new language modeling, where a click worker approach doesn’t hold up; it just doesn’t work as well. And for that reason, for smaller projects, maybe the size of a few hundred to a few thousand data objects, a lot of our clients try to do the work in-house, just for the sake of, I’d say, primarily retaining – for the sake of quality control. But for larger projects, it’s just too hard to do everything in-house, and so it winds up being a combination of in-house team members and outside vendors doing the data labeling.

For our purposes and the approach that we took, we decided that rather than commoditize the role of the editor or the annotator, we’d invest more in training and compensating our labelers as a means of building long-term relationships. And for us, we’ve found that’s an essential part of maintaining the consistency of the data quality, and making sure that the data quality remains at the accuracy levels our clients require. And that’s to be able to have those relationships with annotators that we can trust, and that aren’t just commoditized relationships.

Interesting. So have you encountered cases where maybe clients come to you and they say, “Hey, we tried to throw up like a crowdsourced task and get a bunch of labels, we invested a lot in that, and then it didn’t really help us that much”? Do you think that those cases are maybe due to unclear instructions to the labelers, or a sort of variety in the motivations of those data labelers? Or what do you think leads to some of those quality issues, from your perspective?

[00:08:28.22] Yeah, the commoditization of the annotation workforce - I think it can be a project killer. And a very high percentage of projects that launch stall, and never complete. And that’s one of the key reasons. We’ve talked with companies that try that approach, and they wind up iterating the data so many times to try to get an accurate set of data that they can use, that they ultimately wind up going to a more bespoke approach where the teams are more handpicked and more highly trained, even though the costs are higher, in order to finally wind up with a dataset that is useful. So yeah, I think that is one of the key problems that plague data aggregation projects, and that is to wind up with a clean set of data that can be done on time and on budget.

Yeah, I know, Mark, that – so in our projects, and we’ve done some speech projects as well… We’ve struggled with this also, in terms of the data quality… And I remember in one case, really, we were saying, “Well, we need five labels for each sample, because the variability between labelers is such that we need either a majority vote, or we need to analyze how much they agree, one labeler to the other, or something”, and of course, that gets really expensive over time. Could you speak a little bit to – like, you mentioned this training, focusing on training and upskilling these data labelers… What does training annotators look like in your projects, and what maybe have you learned about what’s important as you are training data labelers?
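As a rough illustration of the majority-vote and agreement ideas mentioned in this question, here is a minimal Python sketch. The function names and the sample votes are hypothetical and were not discussed in the episode; pairwise agreement is just one simple way to quantify how much labelers agree.

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the most common label, or None when the top two labels tie."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear winner; the sample needs another look
    return counts[0][0]


def pairwise_agreement(labels):
    """Fraction of labeler pairs that gave the same label to one sample."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single labeler trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical sample: five labelers tag the intent of one utterance.
votes = ["warranty", "warranty", "billing", "warranty", "returns"]
print(majority_label(votes))      # warranty
print(pairwise_agreement(votes))  # 0.3
```

A low agreement score alongside a clear majority, as in this example, is one signal that the labeling instructions may need tightening before more labels are purchased.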

I recently did a paper called “Improving Model Accuracy Through Better Translation”, and it was really just an attempt to lay out some tips for translating source texts for natural language processing models. And one of the items I mentioned - and it’s something that we’ve seen as we worked with teams around the world doing language projects - is that it’s important for the editors and those involved to understand the use case. And that might seem like it’s perhaps too much information, or an unnecessary amount of information to share with the editors or the annotators, but the better they understand the project, oftentimes it really does translate into higher-quality data. And so I encourage companies to share that information with annotators, so that they are more vested in the work that they’re doing. As an example of how a project description might be written up as part of the guidelines for the annotators, I’ll just cite briefly a paragraph out of the document: “This project involves training a software application to automatically assist call center agents with their tasks, and increase their efficiency. For example, if the customer says ‘I have a warranty issue’, the agent software application can respond by automatically opening the customer’s warranty clause, reducing the time required for the agent to assist the customer.

[00:11:58.17] The translation project consists of a set of English language scripts that reflects some of the typical conversations that occur between call center agents and customers. The purpose of this project is to translate those scripts into a target language in order to add NLP-driven process automation into the call center’s workflow, thereby adding new efficiencies to the agents and company.”

And so by giving that insight and detail to the translators and the editors, it enables them to have more buy-in to the project, and have a better understanding of how their work is going to be used.

I think that there’s a lot of content about hyped AI data science things, but in reality, what people are really wanting more content around is this sort of practical concerns of like “Hey, my data labeling isn’t actually working. How can I fix that problem?” And so I think, from my perspective, at least, there’s an eagerness for this sort of conversation, where people are actually - they have a lot of the other, but they don’t have enough practicality in their content. So I think that that bodes well for this sort of conversation, from my perspective.

Okay, that’s encouraging to know, because we’re the guys on the process side… So the sexy work is being done by you and the data scientists. We’re more the guys down in the boiler room. We’re the operations team that makes the process happen, but doesn’t necessarily know a whole lot about the data science side of it.

Yeah. I think that in reality though, the data side is what is driving things… So yeah, I think that’s good.

Well, Mark, we’ve talked a little bit about the importance of training annotators, we’ve talked a little bit about specific data concerns around healthcare and other things… I’m wondering, from your perspective, since you’re really plugged into the area around how people are managing their data workflows, how they’re managing their data labeling - from your perspective, what does the current data labeling sort of annotation and tooling landscape look like? What choices do people have, and what does that landscape look like right now?

From my perspective, the landscape seems to be rapidly changing… But I would say that off-the-shelf models are being used more often; they continue to improve, and they’re used either, I’d say, as is, or with in-house tuning. The projects we see and the projects we’re getting more involved in are specialty applications where off-the-shelf models aren’t accurate enough, or they don’t exist. Cases might be places like medical documentation labeling, sentiment and intent projects that have a highly customer-specific language or vocabulary that can’t be picked up by off-the-shelf models… And in specialty models, I’d say training data is needed in cases where unique vocabularies warrant highly specialized, bespoke model tuning.

[00:15:59.26] An example might be gathering business intelligence from call center interactions, where the client is seeking to obtain business intelligence through an NLP automation process, and they need the model to be custom-tuned to meet their business objectives.

Another area would be, I guess, new language modeling… And that’s exciting to me, and encouraging because we’re starting to see an uptick in interest in other major world languages where models don’t exist in a production environment.

On the tooling side, I’d say we’ve seen companies both big and small relying on a hybrid of in-house data labeling, and in-house plus click worker driven labeling, and fully external third-party labeling. But what we’re not seeing is AI companies that have systems in place to manage those different approaches in a cohesive way. So there’s a lot of manual aggregation, there’s a lot of one-off coding that gets done to unify the results from those hybrid sources… So to answer your question on the tooling side, this is one area where the tooling is broadly not keeping pace with the growth of the industry.
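To make the "one-off coding" problem concrete, here is a small Python sketch of normalizing labels from two differently shaped sources into one record type. Everything here is an assumption for illustration: the `LabelRecord` fields, the source names, and the input shapes are hypothetical and do not describe any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    item_id: str
    label: str
    source: str  # hypothetical source tags: "in_house" or "vendor"


def from_in_house(row):
    """In-house rows are assumed to be dicts with 'id' and 'label' keys."""
    return LabelRecord(str(row["id"]), row["label"].strip().lower(), "in_house")


def from_vendor(row):
    """Vendor rows are assumed to be (item_id, label) tuples."""
    item_id, label = row
    return LabelRecord(str(item_id), label.strip().lower(), "vendor")


def unify(in_house_rows, vendor_rows):
    """Group normalized records by item so all sources can be compared."""
    records = [from_in_house(r) for r in in_house_rows]
    records += [from_vendor(r) for r in vendor_rows]
    by_item = {}
    for record in records:
        by_item.setdefault(record.item_id, []).append(record)
    return by_item


merged = unify(
    [{"id": 1, "label": " Warranty "}],
    [(1, "warranty"), (2, "Billing")],
)
```

Once every source lands in the same shape, downstream checks such as agreement scoring can treat all labels uniformly instead of being rewritten per source.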

And I know it’s like one thing - and this comes from personal experience - it’s one thing to get data labeled, like gather a label, it’s another thing to develop a workflow around that, that’s integrated into your systems, and integrated into your backend. What do you think are the challenges facing data scientists around this workflow side of things, and the bespoke sort of things that they have to do to integrate data labeling into the sort of wider set of things that they’re doing?

Yeah. Data prep is so challenging. It’s probably the most challenging part of a project, and it’s oftentimes because of the sheer volume of data that is required. And what we see not just at small companies, but even at big companies, is that highly skilled, and oftentimes really highly paid and talented data scientists are managing projects in a highly manual way, where their time and talent just isn’t being utilized as efficiently as it could.

Senior data scientists are doing things like vetting samples from annotators, and doing quality scoring on annotators… And I’d say that’s probably one of the biggest challenges we hear data scientists describe, is that they’re spending too much time manually managing project minutiae. And oftentimes, it’s the use of automation tools and project management platforms that can help them to refocus their energies on higher-level priorities, and allow the software application or the platform to automate a lot of the workflow, and allow other team members to manage a higher percentage of the workflow. So I think that’s one of the things we’re seeing.

And along with that, how does Xelex specifically approach the data labeling problems that you’ve described? We’ve talked about sort of workflows, we’ve talked about the custom setups that are needed for certain tasks, we talked about a variety of things… How has that filtered down into your approach specifically, and the approach that Xelex takes?

Yeah. Workflow platforms are all about moving off of spreadsheets and manual processes into processes that scale better. That’s what we do. We’re focused on the production process. So everything from training and managing the skilled labor, to meeting deliverables on time, and at quality levels that clients expect… I mean, keeping projects on budget - those are all things that training data services companies like us do, and that we bring to the table. Our focus is on making complex workflows easier.

[00:20:03.28] And another part - this was interesting, that one of our clients said to us one time, is that they needed all of the stakeholders at the company to be able to see what was going on with the project. And our platform - and other platforms, too - enabled stakeholders to do that. And there’s all kinds of stakeholders at the production level and the commercial level for projects. Because projects, on the commercial side - they’re typically done at the request of a client, and in service to a client. And so all kinds of different people outside of the data science team are involved: the sales team, the ops team, the procurement team, the quality assurance team. And everybody needs to know what’s going on. They want to see if the project is on budget, they want to see if the project’s on time, they want to see if the quality thresholds that the client has set are being met… And so a platform gives everybody that visibility, and I really enjoy and appreciate being able to do that for a company, because it does then keep all the stakeholders in the loop. At the same time, it allows the data science team to not get bogged down managing minutiae manually. So that’s one of the neat things that we like to deliver.

And then on the services side, the approach is always about managing the workforce successfully. And success in data science and in projects like this is measured in being able to deliver a project on time and on budget, and at the accuracy levels that have been determined or set by the client and by the service provider as being the goals or the project’s objectives.

An example of how this can backfire is when service providers like us enter let’s say a new language, or a new project area, and maybe their client has come to them and said, “Can you do this? Or can you do a project and data labeling in this language?” And, of course, the knee-jerk reaction is always “Sure, we can do that”, but if saying “We can do that” involves hiring a third-party vendor in that target country or target language to do the project, and it’s done in a scramble, it can really backfire. And so hiring a third-party vendor in cases like that can result in a black box approach where you’re unable to adequately measure quality, and where you’re unable to adequately manage deliverables, so that projects wind up running late, or projects are delivered with poor quality data. And then you’re left scrambling to do those corrections on the data internally, or finding another source to do those corrections for you, and it’s a recipe for disaster.

So for us, the way we mitigate that is when we move into a new language, for example, the initial step is to do the hires and do the training ourselves, so that we have our own team and we’re not dependent on a third-party vendor source for that labeling effort. And that way, even though it’s going to take us longer, and the cost might be higher - and there are cost sensitivities that are realities, but the truth is, if you’re using a third-party vendor and working out of a black box, chances are you’re not going to be able to deliver the project on time and at cost. And so your cost and timeline are going to be affected anyway. So we’ve opted for taking an approach that’s more expensive to our clients, but that ultimately delivers projects with a higher quality, and consistently higher quality data, that are on time, that meet the turnaround deliverables, even though the price might be a little higher.

[00:23:59.22] Yeah. And I think that’s really good and practical advice for the whole community that’s trying to do a variety of these data labeling projects, is really at the start of these data labeling projects not only thinking about gathering samples, but thinking about how is your workflow going to be managed, and how are your annotators going to be trained… Because thinking about that stuff upfront and taking time, or spending more money on getting that in place from the start might actually save money in the longer term, if you’re not doing as many iterations of labeling. If you start and you do a bunch of labeling, and then you don’t get the quality that you need, or you get some sort of unexpected biases or other things in your data, that could cause more problems down the line.

And one of the things – maybe this isn’t a specific… I guess it could lead to specific quality issues, but one of the things that is hard for me as a technical introvert person who’s not maybe the most people-oriented person in the world is thinking about all of the team dynamics that happen on a data labeling project, and setting up maybe a disparate and distributed set of labelers and vendors for a data labeling project. How can the problems associated with those sorts of dynamics be addressed in this online distributed labeling environment?

Yeah, you’re totally right; there are inherent challenges in managing an online workforce… But many of those can be mitigated through a well-developed, robust workflow application. Things like centralized controls, giving managers total visibility to what’s happening in the workflow at any given moment; the status of data objects as they’re moving through the workflow, and how you’re doing against your timeline for deliverables… Those are the kinds of things that software is really good at managing.

As I mentioned earlier, we’ve seen cases even with really large companies where pretty complex projects were still being managed on a spreadsheet. And when you’re doing that, there’s almost no ability to manage the workflow effectively.
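As a minimal sketch of the kind of visibility described here (the status of data objects in the workflow, and progress against the delivery timeline), the Python below tracks objects in code rather than in a spreadsheet. The status names, the `DataObject` fields, and the dates are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    IN_EDIT = "in_edit"
    IN_QA = "in_qa"
    DELIVERED = "delivered"


@dataclass
class DataObject:
    object_id: str
    status: Status
    due: date


def status_counts(objects):
    """How many data objects sit at each stage of the workflow."""
    return Counter(obj.status for obj in objects)


def overdue(objects, today):
    """IDs of undelivered objects whose due date has already passed."""
    return [
        obj.object_id
        for obj in objects
        if obj.status is not Status.DELIVERED and obj.due < today
    ]


# Hypothetical batch of three audio clips moving through the workflow.
batch = [
    DataObject("clip-001", Status.DELIVERED, date(2022, 10, 1)),
    DataObject("clip-002", Status.IN_QA, date(2022, 10, 3)),
    DataObject("clip-003", Status.IN_EDIT, date(2022, 10, 10)),
]
late = overdue(batch, today=date(2022, 10, 5))
```

Here `late` contains only `clip-002`, since `clip-001` was already delivered and `clip-003` is not yet due.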

Mark, given the sort of team dynamics that can happen, that we’ve been talking about, this sort of variety of tasks that Xelex is exploring, and other people are exploring in the space, from standardized machine learning tasks to more custom ones… I’m wondering, what would you say about proper ways to set up maybe manual and/or automated QA-type workflows associated with your data labeling?

Well, I can tell you a little bit about what we do… The first is to establish the ground truth version of the data object, and for all data objects as they’re moving through the workflow. And once we establish the ground truth data object, then we’re able to measure the distance between that and the work that the editors are doing. And that helps to generate a whole lot of different metrics for us - who needs additional training, how pay might be affected, how our costs are affected, if data objects are moving through the QA workflow more for some editors than others…
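One common way to measure the distance between a ground truth text and an editor's version is a word-level edit distance, often reported as word error rate. The sketch below is an illustration under that assumption; the episode does not say which distance measure Xelex actually uses, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ground_truth, edited):
    """Word-level edit distance divided by the ground truth length."""
    ref = ground_truth.split()
    hyp = edited.split()
    if not ref:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical transcription example: one wrong word out of five.
truth = "patient reports mild chest pain"
edited = "patient reports mild chess pain"
print(word_error_rate(truth, edited))  # 0.2
```

Aggregating a score like this per editor is one way to surface who might need additional training.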

The second thing that we do is a multi-level QA workflow, so that work gets automatically routed. And that could be in cases where we’ve got new hires, or maybe editors are being flagged via our auto-check process for certain error types… And then thirdly, we run an error script that dynamically checks against the known error list, so that those items are routinely recycled through the workflow, to be re-edited and re-QAed.

So those are some of the typical things we do. Multiple judgments on data objects are also really important, and we handle that through the QA workflow process as well.
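The routing and error-script steps described above could be sketched roughly as follows. This is an illustrative guess at the shape of such a workflow, not Xelex's implementation: the error patterns, the queue names, and the routing rules are all hypothetical.

```python
import re

# Hypothetical known-error list; a real project would maintain its own.
KNOWN_ERRORS = {
    "double_space": re.compile(r"  +"),
    "placeholder": re.compile(r"\[(?:inaudible|unclear)\]", re.IGNORECASE),
}


def auto_check(text):
    """Names of the known error types found in the text."""
    return [name for name, pattern in KNOWN_ERRORS.items() if pattern.search(text)]


def route(text, editor_is_new_hire):
    """Pick the next queue for a data object (queue names are made up)."""
    errors = auto_check(text)
    if errors:
        return "re_edit", errors     # recycle to be re-edited and re-QAed
    if editor_is_new_hire:
        return "qa_level_2", errors  # extra review layer for new hires
    return "qa_level_1", errors


print(route("The agent opened the warranty clause.", editor_is_new_hire=False))
print(route("Customer said [inaudible] about billing.", editor_is_new_hire=False))
```

Keeping the known-error list as data rather than hard-coded logic makes it easy to add new error types as they are discovered during QA.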

It’s been extremely helpful for me to think through some of the dynamics and the workflows associated with data labeling. I think it’s extremely practical and very useful. I’m wondering, as you continue to be more and more involved in this space of data labeling and interacting with clients in the data science and AI space, what excites you about the future of data science and AI practice? And maybe within that, what could easier data labeling enable in the longer term?

Well, when you look at the number of datasets and models that have been developed so far, it’s overwhelmingly all English-based, and in that regard, probably largely focused on the US market. And we’re overwhelmingly the largest economy in the world, so it makes sense that it would be that way. But what I’m excited about is seeing the tools and expertise that have been developed in English modeling now being used in other major world languages, and specifically in developing economies, where AI can be used to help those economies move forward. All of those nations are generating customer and employee experience data in the form of things like customer behavior data, and online reviews, and sentiment and intent data, or medical data; things that are in a structured format, where AI can be used in a beneficial way.

Well Mark, I’m really happy that you brought up the side of the impact of data and NLP across the world’s languages. As our listeners will know, I’m very passionate about this topic, and I’m really excited anytime we get to talk about that; it’s something that excites me for the future as well. I’ve really appreciated you taking time out of your work with Xelex to help us parse through some of these data labeling challenges and the workflows associated with them. I really, really appreciate you taking time, and looking forward to continuing our conversations over the coming months, as I have my own data labeling issues.

Thanks, Dan, very much. I really enjoyed it.

