GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

Article Review

A recent peer-reviewed study offers one of the most rigorous comparisons to date of machine translation engines, large language models (LLMs), and human translators across languages, domains, and expertise levels. Its findings highlight both the strengths and the persistent limitations of AI-based translation.

Researchers compared three sources of translation: GPT-4 (an advanced large language model), Seamless M4T (a recent conventional machine translation engine), and human translators (ranked as junior, mid-level, or senior professionals). The analysis covered three language pairs – Chinese-English, Russian-English, and Chinese-Hindi – and three specialized domains: news, technology, and biomedical. Using the Multidimensional Quality Metrics (MQM) framework, expert annotators reviewed 1,600 translations and assigned each error to a category (mistranslation, unnatural flow, named entities, grammar, spelling, etc.).
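The annotation step described above can be pictured as a simple tally of labeled errors per translation source. The Python sketch below is only an illustration under stated assumptions: the annotation rows, the segment counts, and the function names `tally_errors` and `errors_per_segment` are hypothetical and are not taken from the study, which may have aggregated its MQM annotations differently.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (source, segment_id, error_category).
# These rows are invented for illustration; they are not data from the study.
ANNOTATIONS = [
    ("GPT-4", 1, "mistranslation"),
    ("GPT-4", 2, "unnatural flow"),
    ("Seamless M4T", 1, "mistranslation"),
    ("Seamless M4T", 1, "named entity"),
    ("Seamless M4T", 2, "grammar"),
    ("Senior translator", 2, "spelling"),
]


def tally_errors(annotations):
    """Count annotated errors per source, broken down by category."""
    counts = defaultdict(Counter)
    for source, _segment_id, category in annotations:
        counts[source][category] += 1
    return counts


def errors_per_segment(annotations, segments_per_source):
    """Average number of annotated errors per translated segment."""
    totals = Counter(source for source, _, _ in annotations)
    return {
        source: totals[source] / n_segments
        for source, n_segments in segments_per_source.items()
    }


if __name__ == "__main__":
    # Assumed: each source translated the same two segments.
    segments = {"GPT-4": 2, "Seamless M4T": 2, "Senior translator": 2}
    counts = tally_errors(ANNOTATIONS)
    rates = errors_per_segment(ANNOTATIONS, segments)
    for source, rate in rates.items():
        print(source, dict(counts[source]), f"{rate:.2f} errors/segment")
```

Counting errors per segment rather than in total keeps the comparison fair when sources translate different numbers of segments. A weighted variant could add a severity field to each row, but any weights would be a further assumption.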

GPT-4 generally produced fewer errors than Seamless M4T and approached the quality of junior human translators, particularly for high-resource language pairs like Chinese-English and Russian-English. It was also found to be far less prone to inventing information or omitting content.

However, GPT-4 fell short on idiomatic phrasing and on handling unfamiliar terms and names.

Seamless M4T often outperformed GPT-4 in specific low-resource scenarios such as Chinese-Hindi, suggesting that large language models still struggle when training data is scarce.

Nonetheless, Seamless M4T exhibited more frequent mistranslations, named-entity errors, and unnatural flow than either GPT-4 or the human translators.

Mid-level and senior human translators consistently produced higher-quality, more fluent translations across all tasks.

This comparative assessment shows that while both machine translation and LLM-based engines are useful tools, their limitations – especially with nuanced language and domain-specific texts – underscore the need for human oversight in the translation workflow. The results suggest that machine translation is best viewed as a supporting tool for relatively simple projects and for increasing productivity, while human experts remain essential for any task requiring a high level of reliability.