Time to Edit as a KPI for MT quality
Over the years, Translated has continually worked to evaluate and monitor machine translation quality. In 2011, we finally standardized our methodology and settled on a metric we call “Time to Edit,” the time required by the world’s highest-performing professional translators to check and correct MT-suggested translations. Since then, we've been tracking the average Time to Edit (TTE) per word in the source language.
Time to Edit is calculated as the total time a translator spends post-editing a segment of text, divided by the number of words making up that segment. We consider TTE the best measure of translation quality because quality has no concrete definition other than the average time required to check and correct a translation in a real working scenario.
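The definition above reduces to a simple ratio. A minimal sketch, assuming illustrative segment figures (the function name and numbers below are invented for illustration, not Translated's actual tooling):

```python
# Hypothetical sketch of the TTE definition:
# TTE = total post-editing time for a segment / number of source words.

def time_to_edit(edit_seconds: float, source_word_count: int) -> float:
    """Average post-editing time per source word, in seconds."""
    if source_word_count == 0:
        raise ValueError("segment has no source words")
    return edit_seconds / source_word_count

# A translator spends 12 seconds post-editing a 6-word segment:
print(time_to_edit(12.0, 6))  # 2.0 seconds per word
```

In practice this per-segment figure only becomes meaningful once averaged over very many segments, as the next paragraph notes.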
Averaged over many text segments, TTE gives an accurate estimate with low variance. Machine translation researchers have not yet had the opportunity to work with such large quantities of data collected in production settings, which is why they have had to rely on automated estimates such as edit distance: the number of editing operations required to make a correction.
By switching from automated estimates to measurements of human cognitive effort, we reassign the quality evaluation to those traditionally in charge of the task: professional translators.
In this way, for example, a sentence with a single-character mismatch that requires a significant amount of time for a professional translator to understand and correct would not receive an unduly high quality estimate, as it could under metrics like BLEU (Bilingual Evaluation Understudy).
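The contrast can be made concrete with a standard Levenshtein (edit) distance computation. The sentences below are invented examples: a one-character substitution that inverts the meaning of a sentence scores as a near-perfect match by edit distance, even though a reviewer may need to reread the whole sentence to catch it.

```python
# Standard Levenshtein distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn one string
# into another. Used here only to illustrate the limitation discussed above.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# "now" vs "not": distance 1, yet the meaning is inverted,
# and spotting the error takes real cognitive effort.
print(levenshtein("The offer is now valid", "The offer is not valid"))  # 1
```

An edit-distance-based metric would rate this output as almost flawless; TTE would instead capture the time the reviewer actually spent.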
Additionally, neither edit distance nor semantic-difference measurements can serve as a consistent and accurate indication of MT quality in a production scenario: quality there is greatly influenced by varying content types, translator competence, and turnaround-time expectations, none of which the aforementioned methods take into account.
In over 20 years of business, Translated has gathered evidence that TTE is a much more reliable indicator of progress in MT quality than automated metrics like BLEU or COMET (Crosslingual Optimized Metric for Evaluation of Translation), as it represents a more accurate approximation of the cognitive effort required to correct a translation.
According to data collected across billions of segments, TTE has been regularly shrinking since Translated started monitoring it as an operational KPI.
When plotted graphically, the TTE data show a surprisingly linear trend. If TTE continues to decline at the rate it has since 2011, it is projected to reach one second per word within the next several years, approaching a point where MT would provide what could be called "a perfect translation": equivalent to the time top professionals spend checking a colleague's translation that requires no editing at all. The exact date of this outcome may vary, but the trend is evident.
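The extrapolation described above amounts to fitting a straight line and solving for the year at which it crosses one second per word. A sketch with invented placeholder figures (the yearly TTE values below are not Translated's data, only an illustration of the method):

```python
# Illustrative only: extrapolating a linear TTE trend to the 1-second mark.
# The yearly figures are invented placeholders, not measured values.

years = [2011, 2015, 2019, 2022]
tte = [3.5, 3.0, 2.4, 2.0]  # hypothetical average seconds per source word

# Ordinary least-squares line fit, without external libraries.
n = len(years)
mx, my = sum(years) / n, sum(tte) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(years, tte))
         / sum((x - mx) ** 2 for x in years))
intercept = my - slope * mx

# Year at which the fitted line reaches TTE = 1 second per word:
year_at_one_second = (1.0 - intercept) / slope
print(round(year_at_one_second, 1))
```

With real production data the fit would be computed over billions of segments, but the projection logic is the same.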
Our initial hypothesis to explain the surprisingly consistent linearity of the trend is that each unit of progress towards closing the quality gap requires exponentially more resources than the previous unit, and that those resources are in fact being deployed at exponential rates: computing power (doubling roughly every two years), data availability (the number of words translated grows at a compound annual rate of 6.2%, according to Nimdzi Insights), and the efficiency of machine learning algorithms (a 44x reduction in the computation needed for training from 2012 to 2019, according to OpenAI).
Another surprising aspect of the trend is how smoothly it progresses. We expected TTE to drop with each introduction of a major new model, from statistical MT to RNN-based architectures to the Transformer and the Adaptive Transformer. Instead, the impact of each new model has likely been distributed over time, because translators were free to adopt the upgrades at their own pace.