Article Review
A recent study by researchers at Tencent AI Lab in China assesses the machine translation capabilities of GPT-4, the latest version of the model behind ChatGPT, comparing it with its main commercial competitors, Google Translate and DeepL, as well as with its predecessor, GPT-3.5.
The study was based on three benchmark datasets. Three prompts were used to trigger translation in GPT-4, with the template “Please provide the [TARGET] translation for these sentences” (TP3) performing best. Automatic metrics (BLEU, ChrF, TER) and human annotation were both used for evaluation. The results were then compared to the corresponding Google Translate, DeepL, and Tencent TranSmart output to identify strengths and weaknesses across language pairs and domains.
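For readers unfamiliar with the automatic metrics mentioned above, they score a candidate translation against a reference by n-gram overlap (BLEU, ChrF) or edit distance (TER); in practice they are computed with a library such as sacreBLEU. As a rough illustration only, a simplified chrF-style score (character n-gram F-score; not the official implementation) can be sketched as:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with whitespace removed (a common chrF convention)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=4, beta=2.0):
    """Minimal chrF-style score: per-order char n-gram precision and recall,
    averaged, then combined into an F-beta score (recall-weighted, beta=2)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("the cat sat on the mat", "the cat sat on a mat"))
```

An identical hypothesis and reference score 1.0; completely disjoint strings score 0.0.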
According to the researchers, based on the tests, GPT-4 fares competitively on high-resource European language pairs but lags significantly behind on low-resource pairs (e.g. Romanian-English) and distant language pairs, especially those from different language families (e.g. Chinese-English).
Furthermore, when it comes to specialized domains and noisy data such as informal social media posts, GPT-4 is less robust – that is, it struggles with challenging, unexpected, and non-standard input such as typos, domain-specific terminology, informal language, incomplete sentences, or natural variation in phrasing.
The researchers claim that the transition to GPT-4 reduces hallucinations and mistranslations compared to earlier versions. Automated and human evaluations also show that GPT-4's translations rank higher than both Google Translate's and those of ChatGPT running GPT-3.5. These improvements, however, mainly occur in high-resource language pairs. Moreover, the translation of less frequent words and of short sentences remains challenging.
The researchers acknowledge that their study is limited in terms of sample size, potential result randomness, and a focus on only certain machine translation abilities, namely robust and multilingual translation.
*
One interesting aspect of the paper is that, to deal with the challenges posed by distant languages, the researchers explored “pivot prompting”, based on the well-known strategy of using a third pivot language to handle a translation in a less-common language pair. That is, rather than translating directly between two distant languages, the researchers asked GPT-4 to translate first into English, then into the final target language. This strategy, they argue, is advantageous in terms of both knowledge transfer – making it possible to transfer the knowledge of the high-resource pivot language to the low-resource target languages – and convenience for multilingual translation.
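The two-step structure of pivot prompting is simple to express in code. The sketch below is illustrative only: `llm_translate` is a hypothetical stand-in for an actual LLM call (in the study, a prompt such as "Please provide the [TARGET] translation for these sentences"), here replaced by a toy lookup table so the example is self-contained.

```python
def llm_translate(text, source, target):
    """Hypothetical stand-in for an LLM translation call. A real system
    would prompt the model; a toy lookup table substitutes for it here."""
    toy_model = {
        ("de", "en", "Guten Morgen"): "Good morning",
        ("en", "zh", "Good morning"): "早上好",
    }
    return toy_model[(source, target, text)]

def pivot_translate(text, source, target, pivot="en"):
    """Pivot prompting: translate source -> pivot, then pivot -> target,
    rather than translating directly between two distant languages."""
    intermediate = llm_translate(text, source, pivot)
    return llm_translate(intermediate, pivot, target)

print(pivot_translate("Guten Morgen", "de", "zh"))  # -> 早上好
```

Note that the final output depends entirely on the intermediate English rendering, which is precisely where the error-propagation and information-loss risks discussed below arise.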
It is worth pointing out that the small size of the researchers’ sample may be obscuring the drawbacks of this strategy. Indeed, while pivot prompting can enable translation between distant or low-resource languages by leveraging the strengths of high-resource languages, the use of a third “pivot” language in machine translation and LLMs is known to pose a number of potential issues. These include:
- Error propagation: any mistranslation or ambiguity introduced in the pivot language will be carried into the final target translation. This means a single error in the first translation step is likely to affect all subsequent outputs, making the final translation less accurate.
- Loss of information: some languages have grammatical features (such as gender, formality, or vocabulary distinctions) that may not exist in the pivot language. For example, English does not mark gender on nouns or adjectives, so distinctions present in languages like Spanish or German can be lost or forced into ambiguity when passing through English.
- Cultural and contextual misalignments: pivots often cannot fully capture culture-specific context, idioms, or connotations, so subtleties unique to the source language may be dropped in the pivot and never reach the final target.
- Longer translation process: including an extra step inherently increases complexity and processing time. Any change or correction in the original requires updating across the pivot and all target languages, making updates more cumbersome and error-prone.
- Semantic drift: meaning can subtly shift over two translation steps, so that, even if individual translations are acceptable, the combined outcome can diverge from the source.
In conclusion, while the translation capabilities of GPT-4 are a considerable improvement over those of its predecessor, particularly for high-resource languages, significant limitations persist for low-resource and domain-specific translation tasks. The researchers' "pivot prompting" strategy, which would in principle be particularly useful in multilingual contexts and when dealing with low-resource languages, nonetheless gives rise to well-known and serious problems in real-life translation workflows.
References cited in this article:
Currey, A., Heafield, K. (2019). Zero-Resource Neural Machine Translation with Monolingual Pivot Data. Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT 2019). Association for Computational Linguistics (ACL), Hong Kong. https://doi.org/10.18653/v1/D19-5610
Oh, S., Noh, K., Jung, W. (2025). A Single Model Ensemble Framework for Neural Machine Translation using Pivot Translation. arXiv preprint. https://arxiv.org/html/2502.01182v1#S5
Zou, L., Saeedi, A., Carl, M. (2022). Investigating the Impact of Different Pivot Languages on Translation Quality. Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas, Orlando, USA, September 12-16, 2022. Workshop 1: Empirical Translation Process Research. https://aclanthology.org/2022.amta-wetpr.3.pdf