Predicting AI Singularity Using Trends in Machine Translation

For the first time in history, Translated was able to quantify the speed at which we are approaching the singularity in artificial intelligence. The discovery was made possible by the analysis of a large amount of post-editing data collected over many years in a real translation production scenario.

Language translation was one of the first problems investigated by researchers in the domain of artificial intelligence. Yet it remains one of the most complex and challenging problems for a machine to perform at a human skill level. "That’s because language is the most natural thing for humans. Nonetheless, the data Translated collected clearly show that machines are not that far from closing the gap," said Translated’s CEO Marco Trombetti while showing a preview of our discovery in the field during the Association for Machine Translation in the Americas 2022 conference, where he was invited to present the opening keynote speech.

Many AI researchers even say that solving the language translation problem is equivalent to producing Artificial General Intelligence (AGI). The evidence Translated has provided on the narrowing gap between what expert human translators produce and what a properly optimized machine translation (MT) system can produce is therefore quite possibly the most compelling evidence of success at scale seen so far in the MT community and in the AI community in general.

The claim relies on data representing a concrete sample of the translation production demand. It consists of records of the time taken to edit over 2 billion MT suggestions by tens of thousands of professional translators worldwide working across multiple subject domains, ranging in good proportions from literature to technical translation and including fields in which MT is still struggling, such as speech transcription.

Time to Edit as a KPI for MT quality

Translated has continuously worked on monitoring progress in machine-translation quality, finally standardizing our methodology in 2011. Since then, we've been measuring the average Time to Edit (TTE) of a word, an indicator of the time required by the highest-performing professional translators to check and correct MT-suggested translations.

TTE starts when a translator begins translating a segment and keeps track of all the time spent on the task, ending when the translator finally marks the task as done. We consider TTE the best possible measure of translation quality, as there is no concrete way to define translation quality other than measuring the average time required to check and correct a translation in a real working scenario.

The TTE of a single segment is only a rough indicator, but when it is averaged over many segments, TTE converges to a stable value. Researchers in the machine translation field have not yet had the opportunity to work with such a large amount of data provided by an actual working scenario. For this reason, they have had to rely on estimates such as edit distance.
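As a rough illustration of how per-segment timings turn into an average TTE per word, the sketch below aggregates a small batch of segments. The record layout (`seconds`, `words`), the function name, and the sample timings are assumptions made for this example; they are not Matecat's actual schema or Translated's data.

```python
from dataclasses import dataclass


@dataclass
class EditedSegment:
    """One post-edited MT suggestion (hypothetical record layout)."""
    seconds: float  # time from opening the segment to marking it as done
    words: int      # number of words in the segment


def average_tte(segments):
    """Return the average Time to Edit in seconds per word.

    Dividing total time by total words weights each segment by its length,
    so very short segments do not dominate the average.
    """
    total_words = sum(s.words for s in segments)
    if total_words == 0:
        raise ValueError("no words to average over")
    return sum(s.seconds for s in segments) / total_words


# Made-up timings, for illustration only.
batch = [EditedSegment(30.0, 10), EditedSegment(12.0, 6), EditedSegment(6.0, 4)]
print(average_tte(batch))
```

With only three segments the result is dominated by individual translators and sentences; the point of collecting billions of segments is that this average stops moving.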

We prefer TTE over other metrics because, for example, a sentence with a single-character mismatch could score high in BLEU (Bilingual Evaluation Understudy) even though the translator spends a significant amount of time understanding and resolving the issue. Additionally, neither edit distance nor semantic difference measurements can be used as a consistent and accurate indication of MT quality in a production scenario. Editing effort is greatly influenced by content type, translator competence, and turnaround time expectations, none of which these methods take into account.
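The single-character case can be made concrete with a small sketch. It uses character-level edit distance (Levenshtein), not BLEU itself, to show the same failure mode: the two strings are almost identical by string similarity, yet the error reverses the instruction and would take a translator real time to notice and fix. The example sentences are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with the classic dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


reference = "Do not press the red button"
mt_output = "Do now press the red button"  # one character off, meaning changed

distance = levenshtein(reference, mt_output)
similarity = 1 - distance / max(len(reference), len(mt_output))
print(distance, round(similarity, 3))
```

A string-similarity score treats this as a near-perfect match, while the time a translator spends on it depends on how hard the error is to spot, which is what TTE captures.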

In 20+ years of business, Translated has gathered evidence that TTE is a much more reliable indicator of progress in MT quality than automated metrics like BLEU or COMET (Crosslingual Optimized Metric for Evaluation of Translation), as it represents a more accurate approximation of the cognitive effort required to correct a translation.

According to data collected across billions of segments, TTE has been regularly shrinking since Translated started monitoring it as an operational KPI.

When plotted on a graph, the TTE data show a surprisingly linear trend approaching a predictable point where MT will provide what could be called “a perfect translation.” The reference point is the one second per word that top professionals spend checking a translation produced by their colleagues that doesn’t require any editing. If the TTE trend line continues to decline at the same rate as in the previous several years, the TTE will fall to one second in a few years. The exact date when that will happen could vary somewhat, but the trend is clear.
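The extrapolation described here amounts to fitting a straight line to yearly TTE averages and solving for the point where it crosses one second. The sketch below shows that calculation with an ordinary least-squares fit. The data points are placeholders chosen to lie on a line, not Translated's measurements, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_point(target, slope, intercept):
    """Solve target = slope * x + intercept for x."""
    if slope == 0:
        raise ValueError("a flat trend never reaches the target")
    return (target - intercept) / slope


# Placeholder values, NOT real measurements:
# x = years since monitoring began, y = average TTE in seconds per word.
elapsed_years = [0, 2, 4, 6]
tte_seconds = [3.0, 2.6, 2.2, 1.8]

slope, intercept = fit_line(elapsed_years, tte_seconds)
print(crossing_point(1.0, slope, intercept))
```

A linear fit is the simplest model consistent with the trend described above; the crossing point is sensitive to the slope, which is why the exact date can vary.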

The singularity in machine translation will occur when the best-performing professional translators spend more time correcting a translation provided by their peers than one provided by machines.

Our initial hypothesis to explain the surprisingly continuous linear improvement we see in the TTE measurements is that, while language is an exponentially complex problem, we are addressing it with exponentially growing assets: computing power (doubling every two years), data availability (the number of words translated grows at a compound annual growth rate of 6.2%, according to Nimdzi Insights), and machine learning algorithm efficiency (a 44x reduction in the compute needed for training from 2012 to 2019, according to OpenAI).
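The three growth figures are quoted in different forms (a doubling period, an annual rate, and a total factor over a span of years). A short calculation puts them on a common scale. The helper names are hypothetical, and treating 2012 to 2019 as a seven-year span is an assumption of this sketch.

```python
import math


def annual_rate_from_doubling(years_to_double):
    """Annual growth rate implied by a given doubling period."""
    return 2 ** (1 / years_to_double) - 1


def doubling_time_from_rate(annual_rate):
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


def doubling_time_from_factor(factor, years):
    """Years needed to double, given a total improvement factor over `years`."""
    return years * math.log(2) / math.log(factor)


compute_rate = annual_rate_from_doubling(2)         # computing power
data_doubling = doubling_time_from_rate(0.062)      # words translated
algo_doubling = doubling_time_from_factor(44, 7)    # training efficiency

print(f"compute grows about {compute_rate:.1%} per year")
print(f"translated words double about every {data_doubling:.1f} years")
print(f"algorithmic efficiency doubles about every {algo_doubling * 12:.0f} months")
```

Under these assumptions, compute grows by roughly 41% per year, translated volume doubles in about 11.5 years, and algorithmic efficiency doubles in a little over 15 months, so the three assets grow exponentially but at very different speeds.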

Another surprising aspect is the smoothness of the progress. We expected drops in TTE with every introduction of a new major model, from statistical to seq2seq to the transformer and adaptive transformer. The impact has likely been diluted because translators were free to adopt the upgrade when they wanted.

About the Data and Process

Translated has collected over 2 billion edits on sentences actually translated in production work. These edits and corrections were made by 136,000 of the best-performing freelancers worldwide working with our CAT tool Matecat. We began working on this software as a research project funded by the European Union, developed by a consortium consisting of Translated, Fondazione Bruno Kessler (led by Marcello Federico), the University of Edinburgh (led by Philipp Koehn), and the Université du Maine (led by Holger Schwenk), and finally released as open-source software. The EU Commission included it among the projects with the highest potential for innovation funded by the Seventh Framework Program.

Translated relies on a proprietary AI-based technology called T-Rank to pick the best-performing professional translator for a given task. This system gathers work performance and qualification data of over 350,000 freelancers who have a work history with the company over the last two decades. The AI ranking system considers over 30 factors, including resume match, quality performance, on-time delivery record, availability, and expertise in work-specific subject areas.

Working in Matecat, translators check and correct translation suggestions provided by the MT of their choice. The data were initially collected using Google's statistical MT (2015-2016), then Google's neural MT, followed by ModernMT's adaptive neural MT, which was introduced in 2018 and soon became the preferred choice of almost all translators.

To refine the sample, we only considered the following:

  • Completed jobs, delivered at a high level of quality.
  • Sentences with MT suggestions that had no match from translation memories.
  • Jobs in which the target language has a vast amount of data available along with proven MT efficiency (English, French, German, Spanish, Italian, and Portuguese).

From the resulting pool of sentences, we removed the following:

  • Sentences that didn’t receive any edits, because they don’t provide information about TTE, and sentences that took more than 10 seconds per word to be edited, since they suggest interruptions and/or unusually high complexity. This refinement was required to enable TTE comparison across multiple years.
  • Locale adaptation work, i.e., translations between variations of a single language (e.g., en-GB to en-US), as it is not representative of the problem at hand.
  • Large customer jobs, as they leverage highly customized language models and translation memories, where TTE performance is far better than average.
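The inclusion and exclusion rules above can be read as a single predicate applied to each segment. The following is a minimal sketch under stated assumptions: the `Segment` fields, the flag names, and the way locale adaptation is detected (same base language on both sides) are all hypothetical, and the "high level of quality" check is assumed to be folded into the `job_completed` flag.

```python
from dataclasses import dataclass

# Target languages listed in the selection criteria.
SUPPORTED_TARGETS = {"en", "fr", "de", "es", "it", "pt"}
MAX_SECONDS_PER_WORD = 10.0


@dataclass
class Segment:
    """Hypothetical record for one MT suggestion handled by a translator."""
    source_lang: str
    target_lang: str
    seconds: float
    words: int
    edited: bool
    has_tm_match: bool
    job_completed: bool  # assumed to also mean "delivered at high quality"
    large_customer: bool


def base_language(locale):
    """Return the language part of a locale code, e.g. 'en' for 'en-US'."""
    return locale.split("-")[0].lower()


def keep(segment):
    """Apply the inclusion rules first, then the exclusion rules."""
    if not segment.job_completed or segment.has_tm_match:
        return False
    if base_language(segment.target_lang) not in SUPPORTED_TARGETS:
        return False
    if not segment.edited or segment.words == 0:
        return False
    if segment.seconds / segment.words > MAX_SECONDS_PER_WORD:
        return False  # likely an interruption or unusually complex text
    if base_language(segment.source_lang) == base_language(segment.target_lang):
        return False  # locale adaptation, not translation
    return not segment.large_customer


# Invented sample rows, for illustration only.
sample = [
    Segment("en-US", "it-IT", 12.0, 6, True, False, True, False),   # kept
    Segment("en-US", "it-IT", 0.0, 6, False, False, True, False),   # unedited
    Segment("en-US", "it-IT", 70.0, 6, True, False, True, False),   # too slow
    Segment("en-GB", "en-US", 12.0, 6, True, False, True, False),   # locale work
    Segment("en-US", "fr-FR", 12.0, 6, True, False, True, True),    # large customer
]
kept = [s for s in sample if keep(s)]
print(len(kept))
```

Keeping every rule as an explicit early return makes it easy to count how many segments each criterion removes, which matters when comparing samples across years.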

TTE is impacted by two main variables beyond MT: the evolution of the editing tool (Matecat) and the quality delivered by translators. The impact of the first is of a smaller order of magnitude than the typical TTE, and its low impact is also confirmed by the stability of the TTE of perfect translations. The second has an even smaller impact because the quality of the translations delivered, measured with Errors per Thousand (see below), has not changed significantly during the monitoring period. Taken together, these two factors have a negligible effect on the TTE improvement.

Impact on Translators and Industry

Our progress in machine translation is a collaborative achievement built on a perfect symbiosis between humans and machines.

Translated has always recognized and valued the contribution of translators. Ever since we started using MT, we've been paying freelancers for both the words they translated and those processed by the MT. This approach has resulted in an average increase of 25% in translator compensation.

To provide better quality translation in less time, Translated has always focused on removing redundant, productivity-hampering tasks from the translator’s workflow. We have developed AI-powered tools combined with highly responsive and adaptive neural machine translation (ModernMT), and given the professionals working with us access to these powerful assistive tools: the more corrective feedback given to the machine, the better the translation suggestion supplied to the translator on a continuing, dynamic basis, in a perfect symbiosis between human creativity and machine intelligence.

Machines won't ever replace humans: indeed, AI is already proving to be a valuable tool for translation professionals, helping them translate more content at high-quality levels.

Quality management is an important aspect of MT use at Translated. To measure the overall quality of an MT suggestion, Translated uses a measurement called Errors per Thousand (EPT) words. Currently, translations performed by MT regularly score at an EPT rate of around 50, meaning there are about 50 linguistic errors in a thousand translated words. After review by top translators, the EPT decreases to around 10 on average. An additional review by a second professional further reduces the EPT to 5.
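The EPT figure is a simple normalization of an error count by text length. A minimal sketch of the arithmetic follows; the function name and the sample counts are assumptions for illustration, and how reviewers classify and count errors is outside its scope.

```python
def errors_per_thousand(error_count, word_count):
    """Return linguistic errors normalized to 1,000 translated words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000 * error_count / word_count


# Invented review results for a 2,400-word job.
print(errors_per_thousand(120, 2400))  # raw MT output
print(errors_per_thousand(12, 2400))   # after review
```

Normalizing by word count is what allows jobs of very different sizes to be compared on the same scale.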

As the average quality of MT output continues to improve, as highlighted by the overall TTE trend, MT suggestions are becoming comparable with work produced by a top translator. Therefore, the double review process described above can reduce the EPT from 10 to 2 within the same budgetary constraints. Continuously improving MT allows more content to be translated at higher quality levels without increasing budgets.

Translated has noted that its clients are doubling down on the languages they localize into, as they see the increased ROI driven by making more multilingual content available. This increased momentum is directly connected to and enabled by the progress in AI-powered translation capabilities.

Based on the observed trend, we estimate that we will soon witness at least a tenfold increase in requests for professional translations and at least a hundredfold increase in demand for machine translation. This estimate is based on our observations of the growing translation demand of an increasingly global world and on the growing awareness of improving machine translation quality, which allows more content to be translated at lower cost. We see a future in which an increasing number of new global business opportunities will emerge.

"All of us understand that we are approaching the singularity in AI. For the first time, we have been able to quantify the speed at which we are progressing towards it."
Marco Trombetti – Translated CEO
