José and Ricardo joined Daniel at EMNLP 2022 to discuss state-of-the-art machine translation, the WMT shared tasks, and quality estimation. Among other things, they talk about Unbabel’s innovations in quality estimation including COMET, a neural framework for training multilingual machine translation (MT) evaluation models.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International, and I’m joined this week by Ricardo Rei and José Souza from Unbabel, here at EMNLP 2022 in Abu Dhabi. How are you doing, guys?
Hi, we are fine.
Hi. Good. Yeah.
How’s EMNLP for you?
So far we have been mostly attending the WMT workshop…
Yeah, and what’s WMT? What does that stand for?
Right, WMT stands for Workshop on Machine Translation. This is a historical acronym, because it’s actually now a conference; I would say that it’s the main conference on machine translation, and it has been happening for several years. And it’s always co-located with EMNLP. So it’s nice, because it’s one of the biggest NLP conferences together with the biggest MT conference.
It’s mostly attended by researchers. So not so much by people in the localization industry, but it’s interesting to know what’s happening in terms of research, the latest approaches and methodologies for evaluation as well.
Yeah, and is that the industry that Unbabel is in? Could you just give people a little bit of an understanding of what Unbabel is?
Sure. So Unbabel is a translation company. We provide translations, trying to unite the best of both worlds, which is using machine translation and professional translators to provide these translations. And it’s the best of both worlds because if you only rely on translators themselves, it’s very difficult to scale this process of translation to different volumes of content. And that’s why we use machine translation to speed up this process, and then use the translators to correct, if necessary. And that’s, I think, the biggest difference between Unbabel and other companies, which is that we pioneered the use of something called quality estimation to actually decide whether or not we should post-edit the translations. And I guess we are also big on evaluation technology, and I think Ricardo can talk about COMET…
[03:59] Yeah, what José just explained about combining humans and MT - so if you have a mechanism that tells you that your machine translation output is perfect, then you don’t need the human. But for you to do this, you clearly need a very reliable quality estimation system, a system that receives that translation and is able to give you an accurate score for that translation. And that’s why Unbabel has been focusing for so many years specifically on quality estimation, and also evaluation.
Evaluation is a little bit more general. It can also include things like metrics, where you compare the translation output with a reference translation that you believe to be perfect, and it’s what people typically use when training models, and stuff like that. For the past few years, we have been developing a metric that is being widely adopted by the research community, and also the industry, which is called COMET. COMET has been very successful in the last two years. It was developed by us. We also developed a quality estimation framework that also gained a lot of traction three years ago, I think…
Yeah. 2019 it was…
Yeah. Called OpenKiwi, which is basically similar in terms of the modeling approach and everything, but it does not rely on a reference. So it’s what we use internally for performing quality estimation. Yeah, I think this sums it up a little bit…
That said, just one thing is that all of this is only possible because over the years Unbabel established some quality controls for the translations… And this started by using a framework called MQM, which stands for Multidimensional Quality Metrics, which is basically a typology, and then guidelines on how to use this typology, covering the different phenomena that happen when a translation is made, ranging from accuracy, whether the translations are adequate, to whether they’re fluent… And then there is a whole taxonomy about that.
So this kind of evaluation enabled us to accumulate data about the quality of translations over time, that we can then use to train quality estimation or metric evaluation models.
Yeah, so this seems different… I think some listeners probably in their experience with like modeling in other domains, or with other data, are probably familiar with like a confidence score, or a probability… So this goes like way beyond that, right? So just to clarify, this is not like just a confidence score coming out of your model of translation, but this is actually a metric that you’re running on the output of your model. Is that right?
Yeah. So maybe explain COMET a little bit, because it has gained so much traction… What is maybe different about COMET? Another popular one I know for machine translation is called BLEU… So what distinguishes COMET from that, or from other metrics that are out there?
So like you were saying, BLEU is a very well-known metric, but BLEU is a lexical metric, and this means that BLEU will take the MT output and compare it with a reference that was created by a human. And usually, the typical setup is that we only compare that MT output with a single reference. And as we know, there are multiple ways to translate a specific sentence. So a lot of times BLEU will give a very low score for a very good translation because of that. Sometimes it also gives you a very high score for a very bad translation… Because another aspect of BLEU is that it’s gonna give the same weight to all words. So if you have a named entity that is not correctly translated, it’s going to be just one word away from being perfect, and BLEU will give a very high score. If you miss a punctuation mark, the score penalty will be exactly the same, although the errors are completely different in terms of severity.
[08:16] Just one thing, to explain a little bit more about BLEU: the way that it looks at both the translation hypothesis and the reference is by looking at each word and trying to understand if there is an overlap of each word with the reference. And it does that for combinations of one word, or for combinations of two, three and four words, usually, which are called n-grams. And then it has a brevity penalty, which is basically there to penalize the translation if it is too short… So that’s basically the rationale. And there is a class of metrics that I think we are calling lexical metrics…
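The n-gram overlap and brevity penalty described above can be sketched in a few lines of Python. This is a simplified, single-reference, sentence-level illustration, not the reference BLEU implementation: it assumes whitespace tokenization and no smoothing, and the function names `ngrams` and `sentence_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Toy BLEU: geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by how often it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch: one empty order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(sentence_bleu("the cat sat on the mat", reference))        # exact match
print(sentence_bleu("the feline rested on the rug", reference))  # paraphrase
print(sentence_bleu("the cat sat on", reference))                # too short
```

The paraphrase shares no three-word sequence with the reference, so this unsmoothed version scores it zero even though a human might accept it. That is the weakness of single-reference lexical metrics mentioned earlier.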
Yeah, lexical metrics.
Yeah. So TER, which is translation error rate… It’s similar to that. chrF is similar to that, but chrF works at the character level… So this is a class of things. That is very different from COMET, I think.
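To make the character-level idea concrete, here is a rough sketch of a chrF-style score: an F-score over character n-grams instead of word n-grams. It is an approximation for illustration only. The defaults (n-grams up to 6, beta of 2, whitespace removed) are assumptions chosen to resemble common chrF settings, and `chrf_like` is a hypothetical name, not a library function.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf_like(hypothesis, reference, max_n=6, beta=2.0):
    """F-beta score over character n-gram precision and recall, averaged over n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # Counter intersection keeps minimum counts
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# A word-level metric sees no match here; a character-level one gives partial credit.
print(chrf_like("translated", "translation"))
```

Working on characters gives partial credit for shared stems and inflections, which a word-level overlap would miss entirely.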
Yeah, COMET takes advantage of the representations coming from large language models like XLM-RoBERTa, which is what we have been using. And basically, those representations allow you to compare words in an embedding space; so for two words that might not be exactly the same, but have the exact same meaning, COMET will use those representations to output a score.
Now, the other thing that we add on top is that we train those representations to be more suitable for the specific task of machine translation evaluation. And I’m saying this because this is a very important difference from other metrics that have also been proposed, like BERTScore, where, because you don’t have any fine-tuning on top, if you use BERTScore and you say “I love you” or “I hate you”, because love and hate will have similar embeddings, the score will be very high, when in fact they are complete opposites.
So we start from a pre-trained model, but then by training the model with some supervision from human labels on errors, the model learns that “I love you” and “I hate you” are, for this specific task, complete opposites. And I think that kind of sets COMET apart from all the metrics that were being proposed before, which fall either into the lexical category or into the embedding category.
Yeah, that’s great. And you also mentioned, just in passing, there was another kind of category of quality estimation that didn’t require a reference… Could you talk about that a little bit?
Yeah, so the idea is very similar to the idea of COMET. So the difference is that when you have access to a reference, which is the case of COMET, when you create the embeddings for the MT outputs, they will be perfectly aligned with the embeddings from the reference, because they are in the exact same language.
On quality estimation, you are comparing it directly to the source. So the embeddings will not align perfectly. And still, what happens is that during training, using human supervision, the model learns what is correct and what is incorrect only comparing the MT output directly with the source.
So quality estimation serves a different kind of application than the metrics, like BLEU, chrF and COMET, which is, usually, I want to know what the quality of specific sentences or translations is, given their source sentences. With COMET or the other metrics, you’re usually more interested in understanding the difference between models or MT systems. So you’re evaluating at some sort of test set level or evaluation set level, so that you can decide whether to go with MT model A, B, or C.
[12:08] And then quality estimation is basically for taking decisions on the fly, in real time, when I cannot wait for someone to make a reference or a post-edition, and decide, “Okay, can I trust this translation? If I don’t, should I throw it out? Is it better to do it from scratch? Or can I still give it to someone who can repurpose this and rephrase it?”
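The on-the-fly decision described above can be pictured as a simple routing rule on top of a quality estimation score. The sketch below is purely illustrative: it assumes the QE model outputs a score between 0 and 1, and the function name, thresholds and labels are hypothetical, not Unbabel’s actual pipeline.

```python
def route_translation(qe_score, publish_threshold=0.85, post_edit_threshold=0.5):
    """Pick an action for one MT output from its QE score (assumed to lie in [0, 1])."""
    if not 0.0 <= qe_score <= 1.0:
        raise ValueError("qe_score must be between 0 and 1")
    if qe_score >= publish_threshold:
        return "publish"      # trust the machine translation as is
    if qe_score >= post_edit_threshold:
        return "post-edit"    # hand it to a translator to fix
    return "retranslate"      # discard it and translate from scratch


for score in (0.93, 0.61, 0.20):
    print(score, route_translation(score))
```

In practice the thresholds would have to be tuned per language pair and content type against human judgments; the values here are placeholders.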
So they’re slightly different in their applications, but if we talk about trends… They have started, like Ricardo was hinting, to intersect a bit.
I would say that the metrics field, so evaluation on the metrics side, was stuck with BLEU for a long time. Quality estimation, on the other hand - I feel that there was more research and more innovation in that field. Actually, that was our motivation when we built COMET - we tried to replicate what was being done; the state of the art of what was being done in quality estimation - we tried to bring it to the metrics field. And now, the modeling approaches are very similar. But they were viewed as two completely different tasks for years.
So just to give an insight, a bit of context on what Ricardo said about the progress in quality estimation… So I did my PhD working on this kind of problem, and I finished in 2015. So I was working from 2012 until 2015 on problems around this… And the approaches back then were basically feature-based approaches, like classical machine learning… And with deep learning, and access to embeddings, and now large pre-trained models, this shifted very, very fast to these kinds of approaches, and the performance of these models, of these approaches, is also much better than when I first worked on this.
So these quality estimation models nowadays are very useful; you can actually do a lot of things with them, like I was saying, and - yeah, I just wanted to complement that… Because for me, it was – I was not working in the field, specifically on this problem, for about three years, I guess. When I came back to it, it was like “Whoa… Now it’s really up to everything”, you know…
Could you explain a little bit - so you mentioned how like in COMET or in these other models you might be comparing the embeddings of words… But words don’t always map like one-to-one between languages, and sometimes - I don’t know if you’re looking at sentences, or other things… But could you describe what are the main challenges looking forward that aren’t solved yet in terms of like next steps with quality estimation, and things that you’re looking at now that you see as open problems?
Yeah, you actually touched on a very nice point… I wouldn’t say that the words don’t align very well, but sometimes what we see is that the embeddings themselves for certain specific words are not discriminative enough. And we have seen some – for instance, if you take the sentence “This apple costs 50 cents” and translate it to Portuguese… I’m not gonna say it in Portuguese, but pretend that I’m speaking Portuguese… The perfect translation would also say 50 cents. But for some reason, the MT might have hallucinated and said that it’s 500 cents. So it’s basically changing the price of an apple, and this is a critical error in many scenarios.
But if you look at the embedding of 500 and the embedding of 50, they’re going to be very similar. And for the neural network that is trying to differentiate these two things, it’s going to be a very hard task, because there is not enough signal.
[16:10] You also see the same thing with some named entities. Recently, there has been some work, some progress in trying to look at quality estimation and metrics, and trying to figure out why they are not working for these kinds of very specific phenomena. Actually, yesterday we had a lot of presentations about challenge sets that try to test metrics on these specific phenomena.
So in WMT we have several competitions, what we call shared tasks, and inside the metrics shared task, where people compete to create better metrics, there was also a sub-task that we call the challenge sets sub-task, where people submit examples that are challenging for metrics. And then the participants from the metrics task have to score those examples, and then we give the scores back to the developers of the challenge sets for them to analyze. And a lot of people looked into this, and made some suggestions for future work on how to improve metrics for this. So if you guys are interested in this, take a look at the findings from the metrics task, because there are interesting findings, and pointers for future work in this area.
One of the problems of these model-based MT evaluation approaches is that, first, they are based on the data that the pre-trained models were trained on. So there’s everything there; there is bias, and it can be a limited amount or a lot of data… But all the idiosyncrasies of that data are encoded in the pre-trained models. Then, when you fine-tune them for the specific tasks that they need to work on, namely quality estimation and MT evaluation, they are also limited in data, in the sense that we have orders of magnitude less labeled data for this fine-tuning process. So this can have its biases too. Taking the example of apple - say that, for some reason, you’ve never seen Apple, the company, but only the fruit… So every time you see “Apple”, you translate that to the fruit, you know? And if the model translates that to the fruit, the evaluation model is gonna say “Yeah, it’s fine.” Because in the evaluation data that was used to train the model, you never saw, for some reason, the brand. And this is related to the named entity problem that Ricardo was talking about.
So I think we are taking the first step as a community to understand that now, and really poke at it, and see, “Okay, there is a hole here.” And now the next step is how to alleviate that problem. I don’t think it’s possible to completely solve it, but we for sure will try to alleviate this for these models now. And there’s a lot of complaints also, from people – not complaints, but you know, even us, when we are using different models, not only ours, we see that these models fall short sometimes… And this can be very bad in a commercial setting, or even in sensitive scenarios in which, if you have two cents and the model translates this to, I don’t know, 2 million - that’s not very nice. Right? You might have some legal implications with that.
So I don’t know, are there other open problems…? For me, one big problem is that – and this is also a trend that we see in the metrics and in the quality estimation tasks – bigger models have better predictive power. So what people are usually doing is just throwing more GPUs at it and training a bigger model, and this seems to be giving improvements as well. But the problem is that not every practitioner can actually use these models once they are trained, because they need bigger and bigger GPUs, which are costlier, even at inference time.
[20:10] We actually had a paper at EAMT, the European Association for Machine Translation conference, that was about making COMET smaller… The name of the model is COMETINHO, which is a diminutive of COMET. A very Portuguese way to say it. And it was also a first step towards that. But I think there is a lot to be done for all the other models, and also for COMET.
Yeah, definitely. I think COMETINHO was just the first step in that direction. There’s a lot of things that can be improved in the distillation of these models, even the evaluation models, like we did for COMETINHO. And not just for evaluation; we have been focusing this podcast a little bit on evaluation, but in machine translation you have the same problem. In machine translation, bigger models have been achieving impressive translation quality… But it’s very hard for everyone to develop those models, and it’s even harder for people to deploy those models.
We face this at Unbabel - we develop our own machine translation systems, and we have seen this trend; we get improvements if we keep scaling our MT models, but then we have difficulties serving them. And also, we know that not every company has the capacity to build such big models, like the ones big tech companies develop. So yeah, it’s not just on the evaluation side, but also on the machine translation side; it is something that people should look into - how to make these things smaller and easier to deploy without losing performance.
Yeah, and would you say – so on the model side specifically… José, you mentioned sort of models getting bigger and bigger… Some people might have seen nice GIFs of an encoder/decoder, and one language coming in, and one language coming out, and transformer models… But what are some things others are exploring, maybe yourselves, that are either different approaches, or you mentioned distillation and all these other things to make models smaller… But are there different architectures or techniques being explored? I think I saw one of your papers, something about kNN-MT, or something… I don’t know if you can speak to that, but…
Yeah, at this moment there is a poster on the usage of kNN-MT for the chat shared task… So this is something – I think this is broadly called dynamic adaptation, and one approach to that is kNN-MT, where, rather than fully fine-tuning one base model, like one of these large pre-trained models, you use a data retrieval approach: you build a data store that has relevant data for the use case that you’re trying to serve with machine translation, and then at decoding time, when you are assembling the translation using the translation probabilities of the model, you interpolate these probabilities with the probabilities of words or expressions contained in the data store. So this way, you avoid having to fully fine-tune the model for each use case that you have, and this is something that we started to research at Unbabel.
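The interpolation step at decoding time can be sketched as follows. This is a toy illustration under stated assumptions, not Unbabel’s implementation: it assumes the nearest neighbors have already been retrieved from the data store as (target token, distance) pairs, that distances are turned into probabilities with a softmax over negative distance, and that the mixing weight `lam` is fixed. All names and numbers are made up for the example.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (target_token, distance) pairs into a probability distribution.

    Closer neighbors get more weight; neighbors with the same token are summed.
    """
    weights = [math.exp(-distance / temperature) for _, distance in neighbors]
    total = sum(weights)
    probs = defaultdict(float)
    for (token, _), weight in zip(neighbors, weights):
        probs[token] += weight / total
    return dict(probs)


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the two next-token distributions: lam * p_knn + (1 - lam) * p_model."""
    vocab = set(p_model) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0) for t in vocab}


# The base model slightly prefers the fruit; the data store mostly holds the brand.
p_model = {"apple": 0.6, "Apple": 0.4}
neighbors = [("Apple", 0.1), ("Apple", 0.2), ("apple", 2.0)]
mixed = interpolate(p_model, knn_distribution(neighbors), lam=0.5)
print(max(mixed, key=mixed.get), mixed)
```

Only the data store changes per use case; the base model’s weights are untouched, which is why this avoids a full fine-tuning run for every domain.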
But I must say that this doesn’t solve the problem of the base model being big. We just avoid fine-tuning it completely. So there’s still the problem of, “Okay, how do I shrink or compress this model so that I can reliably and cheaply exploit it for translation?” And this is, like I said, distillation, quantization, and other compression techniques.
[23:55] Just to complement what José was saying about the k-nearest neighbor approach… Another very big advantage of this is that it’s very easy to combine with translation memories, which we know are widely used in the translation industry… And this is a seamless way to basically make the MT work with those translation memories… Because you can build this data store that will help the model translate the content accordingly. So just to add that also, which I believe is very important for the localization industry in general.
Great. Yeah. Well, we’ve talked a lot about challenges, I guess, which is fun to talk about at a research conference, for sure… What are some things just like generally about the machine translation industry, or Unbabel, or other things that make both of you sort of excited and optimistic about the future? What are some of those things that excite you? It doesn’t have to be in MT; things you’ve seen at this conference, or things that you’re following that give you some encouragement and excitement about the future of this space where we’re working?
Actually, I’m very passionate about evaluation in general. I think that shows in my work, as I mostly work on evaluation… I’ve been getting very excited about the progress that we have been making in evaluation. We have started a project to combine these quality estimation systems with the machine translation itself. So we started working on this, but I believe that you can work on this for the next few years, and there are a lot of things that we can improve there. Yeah, that gets me really excited. I think it’s a direction that is going to be really nice.
Yeah, this is the quality-aware decoding project. It builds on what we have been talking about, having these quality predictions about the hypothesis translations. The idea behind this project that Ricardo is talking about is: what if we bring the quality estimation or COMET inside the MT process? And then we can make the machine translation aware, or more aware, of its quality by having a signal from a different model. So this is what this project is about. We have a paper at [unintelligible 00:26:18.07] this year describing that… So yeah, this is pretty exciting.
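One simple way to picture bringing a quality signal into decoding is to rerank a list of candidate translations with a quality score. The sketch below is a hedged illustration of that general idea, not the method from the paper mentioned above. The `rerank` function, the `weight` parameter, the log-probabilities and the `toy_quality` scorer (which only checks that numbers are preserved) are all hypothetical stand-ins; a real system would plug in a trained QE model.

```python
import re


def toy_quality(source, translation):
    """Stand-in for a QE model: 1.0 if the numbers match the source, else 0.0."""
    same_numbers = re.findall(r"\d+", source) == re.findall(r"\d+", translation)
    return 1.0 if same_numbers else 0.0


def rerank(source, candidates, quality_fn, weight=1.0):
    """Return the candidate with the best model log-probability plus weighted quality.

    candidates: list of (translation, model_log_prob) pairs, e.g. from beam search.
    """
    def combined(item):
        translation, log_prob = item
        return log_prob + weight * quality_fn(source, translation)

    return max(candidates, key=combined)[0]


source = "This apple costs 50 cents"
candidates = [
    ("Esta maçã custa 500 cêntimos", -1.0),  # preferred by the MT model, wrong price
    ("Esta maçã custa 50 cêntimos", -1.2),   # slightly less likely, correct price
]
print(rerank(source, candidates, toy_quality))
```

With `weight=0` the choice falls back to the MT model’s own preference, so the weight controls how much the external quality signal is trusted.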
And I think in terms of more broad challenges, what I find interesting is that – I don’t believe that translation is solved. I think a few years ago some people claimed that there was human parity between MT systems, or some MT models and humans, and translators… But then it turned out that the actual translators that were used were not really professional translators. Like, I know English, but I’m not a native speaker, and I cannot translate everything. So I’m not a subject matter expert on different topics. So I cannot actually – if you give me some chemistry content to translate into English, from Portuguese, I cannot do it.
So I think what’s exciting is to see that the technology is allowing us to translate better and better, maybe compared to me as a non-native speaker when I’m translating some content… But still, there are a lot of challenges to actually translate very well very specific content, which requires very specific terminology, and a very specific way of actually building the sentences. And what is much better is actually the fluency that these machine translation models are giving nowadays. But what still remains a challenge is that sometimes the translations look very good, but they are not on point. So they are not adequate. They are talking about something slightly different, or completely different. So I think this is exciting. I mean, not everything is solved, but at the same time it’s encouraging in this sense.
Yeah, great. Well, as we close out here, where can people find out more about Unbabel? And specifically, maybe some of this research that’s going on. And also, you mentioned beforehand that Unbabel was possibly hiring as well - where can people find out about that?
Right. So we have our website, unbabel.com, and we have our Twitter handle, @unbabel. You can follow our news from there. We’ve just put up a research blog in which we are going to be writing about our research. This is possibly going to be in the links in your info box. I don’t know.
Yeah, we’ll put it in the show notes, for sure. And yeah, we are also hiring soon. Like, we are starting to accept applications for the next year, for research scientists at different levels and in different geographies. So Unbabel - we didn’t talk about it, but it was born in Portugal, in Lisbon, but now we have offices all around the world. We have offices on the West Coast in the US, on the East Coast, in London, and in some other places in Europe… And we are going to post this also, to give an email for contact for people who are interested in all the research that we’re doing, and other work.
We have open positions not only for research scientists, but also for engineers and other positions that are not technical.
Yeah. Well, thank you, José, thank you, Ricardo. I really appreciate you taking time. I know there’s a lot of good posters around to see, and all that, so thanks for taking time.