Time to Edit as a KPI for MT quality
Translated has continuously worked on monitoring progress in machine-translation quality, standardizing our methodology in 2011. Since then, we've been measuring the average Time to Edit (TTE) per word: an indicator of the time the highest-performing professional translators need to check and correct MT-suggested translations.
TTE starts when a translator begins translating a segment, tracks all the time spent on the task, and ends when the translator marks the task as done. We consider TTE the best possible measure of translation quality, as there is no concrete way to define translation quality other than measuring the average time required to check and correct a translation in a real working scenario.
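The measurement described above can be sketched in a few lines. This is an illustrative reconstruction, not Translated's code: the field names, timestamps, and the word-weighted averaging scheme are our assumptions.

```python
# Illustrative sketch of a per-word TTE computation (field names and the
# word-weighted averaging are assumptions, not Translated's actual pipeline).

def average_tte(segments: list[dict]) -> float:
    """Average editing seconds per word across many segments.

    Each segment records when the translator started it ("start_ts"),
    when it was marked done ("done_ts"), and its word count ("words").
    """
    total_seconds = sum(s["done_ts"] - s["start_ts"] for s in segments)
    total_words = sum(s["words"] for s in segments)
    return total_seconds / total_words

# Hypothetical segments: 30 s over 12 words, then 18 s over 9 words.
segments = [
    {"start_ts": 0.0, "done_ts": 30.0, "words": 12},
    {"start_ts": 40.0, "done_ts": 58.0, "words": 9},
]
print(round(average_tte(segments), 2))  # 48 s over 21 words ≈ 2.29 s/word
```

Weighting by word count (rather than averaging per-segment ratios) keeps a few very short segments from dominating the figure.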
TTE is an average: for a single segment it is noisy, but across many segments it converges to a stable value. Researchers in the machine-translation field have not yet had the opportunity to work with such a large volume of data from an actual working scenario, so they have had to rely on estimates such as edit distance.
We prefer TTE over other metrics because, for example, a sentence with a single character mismatch can score high in BLEU (Bilingual Evaluation Understudy) even though the translator spends a significant amount of time spotting and resolving the issue. Moreover, neither edit distance nor semantic-difference measurements can serve as a consistent and accurate indication of MT quality in a production scenario: perceived quality is greatly influenced by content type, translator competence, and turnaround-time expectations, none of which those methods take into account.
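The single-character point is easy to demonstrate with a string metric. Below, a minimal Levenshtein edit distance (no libraries, our own toy example sentences) reports a distance of 1 between two sentences with opposite meanings, even though fixing the error requires the translator to reread and understand the whole segment.

```python
# A one-character mismatch that inverts the meaning of a sentence:
# string-based metrics barely register it, but the cognitive effort to
# catch it is real. Minimal Levenshtein distance, standard DP formulation.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb)  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

reference = "the contract is now valid"
mt_output = "the contract is not valid"  # one character off, meaning inverted
print(levenshtein(reference, mt_output))  # 1
```

An edit-distance- or BLEU-style score would call these outputs nearly perfect; TTE would register the time the translator actually spent noticing and repairing the inversion.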
In 20+ years of business, Translated has gathered evidence that TTE is a much more reliable indicator of progress in MT quality than automated metrics like BLEU or COMET (Crosslingual Optimized Metric for Evaluation of Translation), as it represents a more accurate approximation of the cognitive effort required to correct a translation.
According to data collected across billions of segments, TTE has been regularly shrinking since Translated started monitoring it as an operational KPI.
When plotted on a graph, the TTE data show a surprisingly linear trend approaching a predictable point where MT will provide what could be called “a perfect translation.” Indeed, top professionals spend about one second per word checking a translation produced by a colleague that requires no editing. If the TTE trend line continues to decline at the rate of the past several years, TTE will reach one second in a few years. The exact date may vary somewhat, but the trend is clear.
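The extrapolation works as follows: fit a line to the yearly TTE averages and solve for the year at which it crosses one second. The sketch below uses made-up yearly values purely to show the mechanics; the real figures are Translated's internal data.

```python
# Fit a least-squares line to yearly TTE values and solve for TTE = 1 s.
# The years and TTE values below are HYPOTHETICAL placeholders, not
# Translated's measurements.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

years = [2015, 2017, 2019, 2021]  # hypothetical observation years
tte = [3.2, 2.6, 2.1, 1.6]        # hypothetical seconds per word

slope, intercept = fit_line(years, tte)
year_at_one_second = (1.0 - intercept) / slope
print(round(year_at_one_second, 1))  # ≈ 2023.2 under these toy numbers
```

With real data the crossing year shifts, but the procedure (and its sensitivity to the assumed linearity) is the same.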
The singularity in machine translation will occur when the best-performing professional translators spend more time correcting a translation provided by their peers than one provided by machines.
Our initial hypothesis to explain the surprisingly continuous linear improvement we see in the TTE measurements is that while language is an exponential, complex problem, we are addressing it with exponentially growing assets: computing power (doubling roughly every two years), data availability (the number of words translated grows at a compound annual growth rate of 6.2%, according to Nimdzi Insights), and machine-learning algorithmic efficiency (the compute needed for training fell 44x from 2012 to 2019, according to OpenAI).
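A back-of-the-envelope calculation makes the cited growth rates comparable on an annual basis. The figures are the source's; the annualization, and the simplifying assumption that the three factors compound independently, are ours.

```python
# Annualizing the three growth figures cited above (back-of-the-envelope;
# treating the factors as independent multiplicative terms is a
# simplification, not a claim from the source).

compute_annual = 2 ** (1 / 2)        # doubling every 2 years ≈ 1.41x/year

words_annual = 1.062                 # 6.2% CAGR (Nimdzi Insights)

algo_annual = 44 ** (1 / (2019 - 2012))  # 44x over 7 years ≈ 1.72x/year

combined = compute_annual * words_annual * algo_annual
print(round(combined, 2))  # the assets grow ≈ 2.6x per year, combined
```

Even if the combined figure is only a rough upper bound, it illustrates why steady exponential growth in the inputs could plausibly yield the steady decline observed in TTE.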
Another surprising aspect is the smoothness of the progress. We expected drops in TTE with each introduction of a major new model, from statistical MT to seq2seq to the transformer and the adaptive transformer. The impact was likely diluted because translators were free to adopt each upgrade whenever they wanted.