Article Review
The Workshop on Machine Translation (WMT) Chat Translation Shared Task, a competition to evaluate machine translation (MT) systems for bilingual customer support chats, was held last year for the third time. The task focuses on translating agent–customer dialogues across five language pairs: English↔German, English↔French, English↔Brazilian Portuguese, English↔Dutch, and English↔Korean. The study evaluates how well translation systems preserve meaning while handling the informal, fragmented, and context-dependent text typical of chat communication. Evaluation combines automatic neural and lexical metrics, error analysis by Large Language Models (LLMs) following the MQM (Multidimensional Quality Metrics) translation quality assessment framework, and assessment by human linguists.
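To make the automatic side of that setup concrete, here is a minimal scoring sketch using the sacrebleu package for the lexical metrics. The example sentences are invented, and the commented-out section shows one common way to obtain a neural COMET score (it requires the unbabel-comet package and a model download; the checkpoint name is one public example, not necessarily what the task organizers used).

```python
import sacrebleu

# Invented single-segment example; real evaluation runs over whole test sets.
hypotheses = ["Hello, how can I help you today?"]   # MT output
references = [["Hi, how may I help you today?"]]    # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")

# A neural score such as COMET needs the `unbabel-comet` package and a
# model download, along the lines of:
#   from comet import download_model, load_from_checkpoint
#   model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
#   out = model.predict([{"src": "...", "mt": hypotheses[0], "ref": references[0][0]}])
#   print(out.system_score)
```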
The task's results have been published, and they offer fascinating insights into the current strengths and weaknesses of AI translation systems. They show how far machine translation has advanced for real-time, conversational settings, but also uncover areas where professional human judgment remains vital.
- Advanced models, including fine-tuned large language models, scored over 90 out of 100 in human evaluations of individual, single-turn chat exchanges in several language pairs. This means that, sentence by sentence, AI can capture meaning accurately and fluently, covering common requests and responses in customer support scenarios.
However, when conversations extend over multiple turns, AI performance starts to drop. Human raters observed declines of 5–10 points at the conversation level, with problems such as tone inconsistency, loss of prior context, and mismatched registers. Errors included:
- Failure to maintain consistent gender or references across turns in highly inflected languages like German: e.g. a customer referred to a colleague as sie (she), yet the translation switched to er (he).
- Improperly switching between informal and formal registers: e.g. du and Sie in German, or Korean honorifics.
- Inconsistent terminology: e.g. POS (point of sale) rendered as terminal, app, and caixa in successive turns; a simple consistency check of the kind sketched after this list can flag such drift.
- Inappropriate changes in verb mood and tense mid-conversation: e.g. from the polite imperative Veuillez patienter (“please wait”) to a neutral indicative mood in French.
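Some of these problems can be caught mechanically. The sketch below is a toy consistency check, not the shared task's tooling: it flags turns where a glossary term appears in the source but none of its approved renderings appears in the translation. The glossary and the two sample turns are invented for illustration.

```python
def check_terminology(turns: list[dict], glossary: dict[str, set[str]]) -> list[tuple[int, str]]:
    """Flag turns where a source term is not rendered by any approved translation.

    `turns` holds dicts with "source" and "translation" strings; `glossary`
    maps a source-side term to the set of target renderings considered
    consistent. A plain substring test keeps the example simple.
    """
    issues = []
    for i, turn in enumerate(turns):
        for term, renderings in glossary.items():
            if term in turn["source"] and not any(r in turn["translation"] for r in renderings):
                issues.append((i, term))
    return issues

# Invented example mirroring the POS drift described above:
turns = [
    {"source": "The POS froze again.", "translation": "Das POS-Terminal ist wieder eingefroren."},
    {"source": "Try restarting the POS.", "translation": "Versuchen Sie, die App neu zu starten."},
]
print(check_terminology(turns, {"POS": {"POS-Terminal"}}))  # -> [(1, 'POS')]
```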
- Systems that analyzed previous turns, even just the prior few exchanges, saw gains of up to eight points on automatic quality metrics (COMET, chrF, BLEU). This demonstrates real progress in models recognizing continuity and reference across short dialogues. Furthermore, the leading system delivered more than 94% "perfect" translations in English–German conversations when evaluated with detailed error annotation. Simple exchanges and clear contexts benefit especially from this level of automation.
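The mechanics behind such a context window are straightforward to sketch. The function below is an illustration rather than any submitted system: it simply prepends the last few source/target turns to the prompt handed to a translation model. Here translate_fn is a hypothetical stand-in for whatever LLM or MT API is actually called.

```python
def build_context_prompt(history, new_source, src_lang="English", tgt_lang="German", window=3):
    """Build a translation prompt that includes the last `window` dialogue turns.

    `history` is a list of (source, translation) pairs from earlier in the
    chat; supplying this short context is what lets a model keep pronouns,
    register, and terminology consistent across turns.
    """
    lines = [f"Translate the final {src_lang} message into {tgt_lang}, keeping "
             "pronouns, register, and terminology consistent with the context."]
    for src, tgt in history[-window:]:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {new_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

# `translate_fn` is a hypothetical placeholder for a real LLM/MT API call.
def translate_with_context(history, new_source, translate_fn, **kwargs):
    return translate_fn(build_context_prompt(history, new_source, **kwargs))
```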
Nonetheless, even high-performing systems generated dozens of major and critical mistakes in nuanced situations, such as ambiguous messages or fragmented, informal language:
- In longer conversations, systems often forgot earlier context, lost track of who was speaking, or dropped implied information: e.g. when a customer wrote "I didn't understand your last question" in German, the system's translation reversed the roles and indicated that the agent had not understood.
- In some test sets, models omitted pronouns or connectors needed to keep the conversation coherent, producing grammatically correct but incomplete or misleading responses: e.g. dropping a "we" or "they" required for semantic continuity and disambiguation, leaving the sentence vague or incorrect.
- In other cases, facts not present in the original dialogue were hallucinated, or source text was copied verbatim rather than translated.
- One interesting point is the discrepancy between automatic and human evaluation. Automated metrics sometimes rated output more highly than direct human assessment did, particularly for subtleties like formality, politeness, and cultural appropriateness. This gap highlights why human linguists are needed to ensure a translation makes sense not just linguistically, but socially and professionally.
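In practice, this kind of gap is quantified by correlating metric scores with human judgments over the same segments. A minimal sketch with SciPy's Kendall tau follows; the paired scores are invented for illustration.

```python
from scipy.stats import kendalltau

# Invented paired scores for the same six translation segments:
metric_scores = [0.82, 0.79, 0.91, 0.75, 0.88, 0.93]  # e.g. a neural metric
human_scores  = [88, 70, 85, 62, 90, 71]               # e.g. human ratings, 0-100

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
# A low tau on segments involving politeness or register would reflect
# exactly the metric-human gap described above.
```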
In sum, translation systems were highly accurate on a sentence-by-sentence basis, but conversation-level quality (coherence, consistency of style, and cohesion) was notably lower. As conversations unfold, translation quality typically declines, especially in maintaining correct tone, formal register, and reference tracking across turns. This means that discourse-level phenomena (those that require understanding the whole interaction, not just isolated sentences) remain difficult for AI.
In conclusion, AI translation now delivers robust, rapid results for many everyday chat scenarios and can support productivity in multilingual communications. Yet, for extended conversations, brand-sensitive communication, or anything requiring cultural understanding or emotional tone, human translators still provide irreplaceable value.
High-quality translation today means combining AI efficiency for easily automated tasks with the contextual awareness and linguistic judgment of skilled professionals.