For decades, machine translation has operated on a simple premise: translate words. This approach has led to significant advancements, with models capable of producing grammatically correct and often fluent translations. However, for enterprise localization managers and CTOs, a critical gap remains. Traditional machine translation, even at its best, often fails to capture the full meaning behind the words because it lacks a connection to the real world.
Moving beyond text-only translation
Text-only translation models are trained on vast datasets of parallel text, learning to associate words and phrases in one language with their counterparts in another. While effective, this process is inherently limited. These models operate in a vacuum, disconnected from the rich, multimodal context that humans use to understand language. They don’t see the product the text describes, understand the environment in which a conversation takes place, or grasp the cultural nuances that are often unwritten.
The challenge of ambiguity and context
This disconnect leads to a fundamental challenge: ambiguity. A word like “crane” can refer to a bird or a piece of construction equipment. Without real-world context, a translation model is simply making a statistical guess. For businesses, these guesses can have significant consequences, from confusing marketing copy to inaccurate technical documentation. The path to truly reliable translation lies in moving beyond text-only systems and embracing a new paradigm: grounded translation.
Reality connection: The foundation of grounded translation
To overcome the limitations of traditional machine translation, we must connect our models to the real world. This is the core principle of grounded translation, a field of research that aims to create AI systems that understand language in the context of its environment.
What is grounded translation?
Grounded translation is an approach to machine translation that connects language to a representation of the real world. This “grounding” can take many forms, from images and videos to structured data and sensor readings. By providing models with access to this additional information, we can help them to disambiguate language, understand context, and produce more accurate and reliable translations.
How grounding bridges the gap between language and reality
When a translation model is grounded, it is no longer just translating words; it is translating concepts. For example, if a model is translating a product description and has access to an image of the product, it can use that visual information to correctly translate terms that might otherwise be ambiguous. This connection to reality allows the model to build a deeper understanding of the source text, resulting in a translation that is not only grammatically correct but also conceptually accurate.
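As a toy illustration of that mechanism (the lexicon, image labels, and function below are invented for this sketch, not drawn from any real system), detected image labels can vote for the right sense of a polysemous term:

```python
# Toy sketch: image labels disambiguate a polysemous source term.
# The lexicon and label sets are invented; a real grounded model
# learns these associations rather than looking them up.

# English "crane" has two German translations, one per sense.
SENSES = {"crane": {"bird": "Kranich", "machine": "Kran"}}

# Which image labels count as visual evidence for each sense.
SENSE_EVIDENCE = {
    "crane": {
        "bird": {"feathers", "lake", "wings"},
        "machine": {"construction site", "steel", "hook"},
    }
}

def grounded_choice(term: str, image_labels: set) -> str:
    """Pick the sense whose visual evidence overlaps most with the image."""
    scores = {
        sense: len(SENSE_EVIDENCE[term][sense] & image_labels)
        for sense in SENSES[term]
    }
    best = max(scores, key=scores.get)
    return SENSES[term][best]

print(grounded_choice("crane", {"construction site", "hook", "workers"}))
# -> Kran
```

With no image, the model is back to the statistical guess described above; with even a few reliable visual labels, the ambiguity disappears.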
Contextual grounding: From theory to practice
The theory of grounded translation is compelling, but how do we put it into practice? The key lies in leveraging multimodal data and understanding the situated nature of language.
The role of multimodal data
Multimodal data is data that comes in multiple forms, such as text, images, audio, and video. By training translation models on this type of data, we can help them to learn the connections between words and the real-world objects and concepts they represent. For example, a model that is trained on a dataset of cooking videos with transcribed recipes will learn to associate the word “chop” with the visual action of chopping, leading to a more accurate translation of the word in different contexts.
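A minimal sketch of how such associations can emerge from aligned data (the clips and action labels below are invented for illustration): count how often each transcript word co-occurs with each visual action label.

```python
from collections import Counter, defaultdict

# Toy sketch: aligned (action label, transcript) pairs, as a multimodal
# cooking dataset might provide. Data and labels are invented.
aligned_clips = [
    ("chopping", "chop the onions finely"),
    ("chopping", "now chop the carrots"),
    ("stirring", "stir the sauce gently"),
    ("chopping", "chop everything into cubes"),
    ("stirring", "keep stirring until thick"),
]

# Count how often each word co-occurs with each visual action.
cooccur = defaultdict(Counter)
for action, transcript in aligned_clips:
    for word in transcript.lower().split():
        cooccur[word][action] += 1

def most_likely_action(word: str) -> str:
    """The visual action most strongly associated with a word."""
    return cooccur[word].most_common(1)[0][0]

print(most_likely_action("chop"))  # -> chopping
```

Real multimodal training learns dense cross-modal representations rather than a count table, but the underlying signal is the same: repeated alignment between a word and what it looks like in the world.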
Situated translation: Understanding the environment of language
Situated translation takes this a step further, considering the broader environment in which language is used. This can include everything from the physical location of the speaker to the time of day and the social context of the conversation. By providing models with access to this situational information, we can help them to understand the subtle nuances of language that are often lost in translation.
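As a small invented example, consider the formal/informal address distinction in German, which the English source text alone cannot resolve; situational metadata can. All field names and the decision rule here are assumptions made for the sketch:

```python
# Toy sketch: situational metadata steers a register choice that the
# source text alone cannot resolve. Field names are invented.
from dataclasses import dataclass

@dataclass
class Situation:
    relationship: str  # e.g. "customer", "friend"
    channel: str       # e.g. "support_chat", "group_chat"

# English "How can I help you?" has two German registers.
FORMAL = "Wie kann ich Ihnen helfen?"
INFORMAL = "Wie kann ich dir helfen?"

def situated_register(situation: Situation) -> str:
    """Choose formal address unless the situation is clearly informal."""
    informal = (
        situation.relationship == "friend"
        and situation.channel == "group_chat"
    )
    return INFORMAL if informal else FORMAL

print(situated_register(Situation("customer", "support_chat")))
# -> Wie kann ich Ihnen helfen?
```

A text-only model must guess at this choice from the words alone; a situated model can read it off the context.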
Implementation strategies: Building reality-aware AI
Building AI systems that truly understand and translate language in context requires new tools and approaches. At Translated, we are developing the technologies and workflows that turn this vision into reality for our customers.

With DVPS, one of the most ambitious initiatives in the field, we are leading a project that marks an important step forward for AI research in Europe. The name stands for Diversibus Viis Plurima Solvo, Latin for "By many diverse paths, I solve most things." DVPS is a Horizon Europe flagship project launched in 2025 with €29 million in funding.

DVPS pushes into multimodal foundation models (MMFMs): AI systems that learn not only from text, but also from vision, sensor data, and real-world interaction. Unlike today's systems, which learn from representations of the world via text, images, and video, these next-generation models are designed to learn across visual, auditory, linguistic, and sensory channels to gain a grounded understanding of the physical world. This multimodal approach enables them to interpret meaning in parallel, manage complexity, and adapt to real-world scenarios where today's single-modal AI often fails.
Translated is leading the consortium, setting the vision and coordinating the work of 20 partner organizations across 9 countries, including universities, research labs, and private companies.
We are developing an open-source toolkit to streamline the design, pre-training, fine-tuning, and modality expansion of MMFMs. It supports the reuse and composition of pre-trained models, reducing development time and cost while enhancing adaptability across modalities.
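The toolkit itself is not shown here, but the composition idea can be sketched in miniature (every name, and the stand-in two-dimensional "embeddings," are invented for illustration): register pre-trained modality encoders once, then compose whichever subset a given input requires.

```python
# Illustrative sketch of composing reusable modality encoders; all names
# and the toy "embeddings" are invented, not the actual DVPS toolkit.
from typing import Callable, Dict, List

ENCODERS: Dict[str, Callable[[str], List[float]]] = {}

def register(modality: str):
    """Register an encoder under a modality name for later reuse."""
    def wrap(fn):
        ENCODERS[modality] = fn
        return fn
    return wrap

@register("text")
def encode_text(x: str) -> List[float]:
    # Stand-in for a pre-trained text encoder.
    return [float(len(x)), 0.0]

@register("image")
def encode_image(x: str) -> List[float]:
    # Stand-in for a pre-trained vision encoder (x = image path).
    return [0.0, 1.0]

def fuse(inputs: Dict[str, str]) -> List[float]:
    """Compose whichever encoders the input modalities require."""
    vectors = [ENCODERS[m](v) for m, v in sorted(inputs.items())]
    return [sum(dims) for dims in zip(*vectors)]

print(fuse({"text": "chop the onions", "image": "board.jpg"}))
```

The design choice this sketch illustrates is the one described above: encoders are built and pre-trained once, then reused and recombined, so adding a modality means adding an encoder rather than retraining a monolithic model.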
In short, DVPS is not just another AI project—it’s a paradigm shift in how we design and apply AI. For Translated, it’s the natural continuation of a journey that started in 1999, from pioneering adaptive machine translation to Lara, our context-aware AI launched in 2024. DVPS expands that vision into multimodal, real-world AI.
Evaluation methods: Measuring what matters
As we move towards a new paradigm of grounded translation, we also need new ways of measuring success. Traditional automated metrics such as BLEU, which score surface n-gram overlap against a reference, are not sufficient for evaluating the quality of grounded translations.
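For reference, a minimal sketch of what such metrics actually measure: clipped n-gram precision, the core of BLEU-style scoring, reduced here to a single n-gram order with no brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision: the core of BLEU-style scoring,
    simplified to one n-gram order with no brevity penalty."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped matches
    total = sum(cand.values())
    return overlap / total if total else 0.0

print(ngram_precision("the crane flew away", "the crane flew away"))
# -> 1.0
```

The metric sees only surface overlap with a reference string; it has no way to check whether the translation matches the product image, the environment, or the situation the text describes.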
New metrics for grounded translation
To truly measure the quality of grounded translations, we need to move beyond simple text-based metrics and develop new evaluation methods that take into account the model’s understanding of the real world. This could include metrics that measure the model’s ability to correctly identify objects in an image, understand the spatial relationships between objects, or follow a set of instructions in a real-world environment.
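As one hypothetical instance of such a metric (the detected labels and the German term map below are invented for this sketch), an object-mention coverage score checks what fraction of the objects detected in the accompanying image are actually named in the translation:

```python
# Hypothetical grounded metric: what fraction of objects visible in the
# accompanying image are named in the translation? The detected labels
# and the German term map are invented for this sketch.
GERMAN_TERMS = {
    "hook": {"haken"},
    "cable": {"kabel", "seil"},
    "crane": {"kran"},
}

def object_coverage(translation: str, detected_objects: list) -> float:
    """Fraction of detected objects mentioned in the translated text.

    Returns 1.0 when nothing was detected (nothing to miss).
    """
    words = set(translation.lower().split())
    if not detected_objects:
        return 1.0
    mentioned = sum(
        1 for obj in detected_objects if GERMAN_TERMS[obj] & words
    )
    return mentioned / len(detected_objects)

score = object_coverage("der kran hebt den haken", ["crane", "hook", "cable"])
print(round(score, 2))  # -> 0.67
```

A production metric would need real object detection and proper terminology resources rather than a hand-written map, but the principle is the same: score the translation against the world, not only against a reference sentence.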
The human-in-the-loop: The ultimate test of quality
Ultimately, the best way to evaluate the quality of a translation is to have a human review it. This is especially true for grounded translations, where the nuances of meaning can be subtle and difficult to measure with automated metrics. By keeping a human in the loop, we can ensure that our grounded translation models are not only technically accurate but also produce translations that are natural, fluent, and culturally appropriate. This is the core of our human-AI symbiosis approach.
Conclusion: A future where translation understands reality
The journey towards grounded translation is just beginning, but the potential is clear. By connecting our translation models to the real world, we can create AI systems that are not only more accurate and reliable but also more helpful and intuitive to use. At Translated, we are committed to pushing the boundaries of what is possible in machine translation and building a future where language is no longer a barrier to understanding.