How large language models work in translation (without the hype)

A practical explanation of how LLM translation works — how models generate translations, where they fail, and what post-editors need to know.

Most translators interact with LLM translation output long before they understand how it's generated. That's worth correcting — not because knowing the architecture changes how you open a file or accept a segment, but because understanding how these models produce text directly shapes what errors to expect, how to prompt for better output, and when to trust the result versus when to spend more time reviewing. We've watched both excellent and poor LLM translations come through project pipelines, and the difference is almost always traceable to something mechanical: how the model was prompted, what context it had, or what its training distribution covered. The mechanics aren't mysterious. Once you see them clearly, the output starts to make considerably more sense.

What happens inside a language model when it translates

A large language model is, at its core, a system trained to predict the next token given everything that came before it. A token is roughly 0.75 words in English on average — though this ratio shifts significantly for agglutinative languages like Finnish or Turkish, where a single word often encodes what English expresses in three or four. The model was trained on an enormous volume of text across languages, genres, and domains, and learned statistical relationships between sequences of tokens.
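
To make the ratio concrete, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer. Which tokenizer applies depends on the model you call, and the Finnish sentence is our own illustrative example, so treat the exact counts as indicative rather than universal:

```python
# Minimal sketch of the token-to-word ratio using tiktoken.
# "cl100k_base" is one common encoding; your model may use another.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in [
    "The contract shall be governed by the laws of the State of New York.",
    "Sopimukseen sovelletaan New Yorkin osavaltion lakeja.",  # Finnish
]:
    tokens = enc.encode(text)
    words = text.split()
    print(f"{len(words)} words -> {len(tokens)} tokens "
          f"({len(words) / len(tokens):.2f} words per token)")
```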

When you give a language model a source sentence and ask it to translate, you're not triggering a lookup table or a separate encoding step in the way older translation systems operated. The model generates the target sequence token by token, each new token conditioned on the entire preceding context: the system prompt, the source text, and every token it has already produced in the target language. This is why context placement matters so much in LLM translation — the model reads and weights everything it has been given, not just the sentence immediately in front of it.
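
The mechanism is easier to picture as code. The sketch below is deliberately simplified: `next_token` is a stand-in for the model's forward pass and sampling step, replaying a canned French output rather than calling a real model. The loop structure is the point, since every token is predicted from the full context assembled so far:

```python
# Simplified decoding loop: each new token is predicted from the full
# preceding context (system prompt + source + tokens already produced).
from itertools import count

CANNED = ["Le", " contrat", " est", " régi", " par",
          " les lois", " de l'État de New York.", "<end>"]

def next_token(context: str, step: int) -> str:
    # A real model scores every vocabulary token against `context`
    # and picks one; this stub replays a fixed sequence instead.
    return CANNED[step]

system_prompt = "Translate from English to French.\n"
source = "The contract shall be governed by the laws of the State of New York.\n"
target = ""

for step in count():
    token = next_token(system_prompt + source + target, step)
    if token == "<end>":
        break
    target += token  # the new token becomes part of the next context

print(target)  # Le contrat est régi par les lois de l'État de New York.
```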

What this means practically: the model doesn't have a discrete "understanding" of meaning that it then renders in another language. It has deeply learned patterns about what typically follows what, across many languages simultaneously. When it translates a legal phrase like "the contract shall be governed by the laws of the State of New York," it doesn't know contract law. It knows that in its training data, this phrase appeared consistently with specific target-language equivalents, and that the register and syntactic structure of legal text produce certain predictable output patterns.

This is also why domain vocabulary is such a reliable predictor of quality. Terms that appeared frequently in bilingual context during training tend to be handled well. Terms that are specialized, new, or inconsistently represented — internal product codenames, emerging technical vocabulary, regulatory neologisms — produce fluent-sounding output that may or may not be accurate. The model doesn't know it's improvising. It generates with the same confidence either way.

How LLM translation works differently from dedicated MT engines

Traditional neural machine translation (NMT) engines — the kind behind DeepL or Google Translate — also use transformer architectures, but they were trained on parallel corpora: large datasets of aligned source and target sentence pairs. The training objective was to learn a direct source-to-target mapping, optimized specifically for sentence-level translation quality.

LLMs were trained primarily as next-token predictors on general text. Translation emerged as a capability because multilingual text was part of the training mix, not because the model was optimized for it as a primary task. This distinction produces meaningfully different behavior in practice.

Dedicated MT engines tend to be faster and more consistent on high-volume, general-domain content. When text falls within their training distribution — business correspondence, news articles, standard legal boilerplate — they perform reliably at scale. A well-tuned MT engine can process millions of words per hour at a fraction of the cost of an LLM call.

LLMs perform better on tasks that require reasoning about context, following explicit instructions, or handling material outside a specialized MT engine's distribution. If you give an LLM a system prompt that says "translate this technical manual from German to English, maintain formal register, use the following glossary terms exactly as specified, and leave all product codes in the source language," the model can act on those instructions coherently. A dedicated MT engine cannot.
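
As a sketch of what that setup looks like in practice, here is a minimal call using the OpenAI Python SDK (1.x). The model name, glossary pairs, and product code are placeholders to adapt, not recommendations:

```python
# Minimal sketch of instruction-following translation via the OpenAI
# Python SDK (>= 1.0). Model name and instructions are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = (
    "Translate this technical manual from German to English. "
    "Maintain formal register. Use the following glossary terms exactly "
    "as specified, and leave all product codes in the source language.\n"
    "Glossary: Drehmoment = torque; Anzugsreihenfolge = tightening sequence"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,        # favor deterministic, conservative output
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Drehmoment für Baugruppe TX-900: 25 Nm."},
    ],
)
print(response.choices[0].message.content)
```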

The tradeoff is cost and latency. LLM translation is slower and more expensive per word than dedicated MT by a considerable margin. This is why the most effective professional workflows we've seen layer both: use a fast MT engine for bulk pre-translation, then apply an LLM selectively for segments where precision, instruction-following, or specialized terminology matters most. Neither approach replaces the other in the current state of the technology.
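
A layered pipeline of this kind can be as simple as a per-segment router. The criteria below (glossary hits and a product-code pattern) are illustrative assumptions, not a prescription:

```python
# Sketch of the layered workflow: fast MT for bulk segments, LLM only
# where precision or instruction-following matters.
import re

PRODUCT_CODE = re.compile(r"\b[A-Z]{2,}-\d+\b")  # e.g. TX-900

def route(segment: str, glossary_terms: set) -> str:
    has_term = any(t.lower() in segment.lower() for t in glossary_terms)
    has_code = bool(PRODUCT_CODE.search(segment))
    # Glossary-sensitive or code-bearing segments take the slower,
    # instructable LLM path; everything else goes through bulk MT.
    return "llm" if (has_term or has_code) else "mt"

glossary = {"adverse event", "Drehmoment"}
print(route("Report any adverse event within 24 hours.", glossary))  # llm
print(route("Thank you for your purchase.", glossary))               # mt
```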

Context windows and document-level coherence

One genuine advantage LLMs offer over sentence-by-sentence NMT is the ability to hold longer context. A modern LLM can process an entire chapter within its context window, which means it can sustain consistency in how it renders recurring terms, characters, or concepts across many paragraphs — something MT engines, which process segment by segment, cannot do natively.

Document-level coherence matters most in literary and marketing translation, where tone and word choice across sections should feel unified, and in technical documentation where the same concept must appear identically every time it is referenced. We've encountered cases where segment-level MT produces three different translations of the same product feature name within a single user manual — because each segment was processed independently, without any shared context about what came before or after.

LLMs don't fully solve this problem. Context windows have limits, and very long documents need to be chunked, which reintroduces the consistency problem at chunk boundaries. But the improvement over segment-level processing is real when a document fits comfortably within the context window and the prompt is structured to include relevant background.
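
One common mitigation is to chunk on paragraph boundaries and carry the tail of the previous chunk forward as read-only context. A minimal sketch, with sizes as placeholders to tune against your model's context window:

```python
# Chunking with carried-over context: each chunk sees the tail of the
# previous one, which softens (but does not eliminate) consistency
# breaks at chunk boundaries. Sizes are placeholders.
def chunk_paragraphs(paragraphs, size=40, overlap=5):
    chunks = []
    for start in range(0, len(paragraphs), size):
        context = paragraphs[max(0, start - overlap):start]  # prior tail
        body = paragraphs[start:start + size]                # to translate
        chunks.append({"context": context, "translate": body})
    return chunks
```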

This is also where glossary integration becomes more than a consistency mechanism. When you prepend glossary terms to an LLM translation prompt, you're not just giving the model a lookup table. You're giving it evidence about what the correct output should look like for specific source strings, which shapes the probability distribution of every subsequent token. The effect is strongest when the glossary terms are positioned close to the segment being translated in the context window — not buried at the bottom of a long system prompt where they compete with everything else the model has been asked to remember.
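
In code, that positioning advice might look like the sketch below: filter the glossary down to the terms that actually occur in the segment and place them immediately before it, rather than at the top of a long prompt. The language pair and glossary entries are examples:

```python
# Sketch of glossary placement: only the relevant terms, positioned
# directly before the segment being translated.
def build_prompt(segment: str, glossary: dict) -> str:
    relevant = {s: t for s, t in glossary.items()
                if s.lower() in segment.lower()}
    lines = ["Translate the segment below from English to French."]
    if relevant:
        lines.append("Use these term equivalents exactly:")
        lines += [f"- {s} = {t}" for s, t in relevant.items()]
    lines.append(f"Segment: {segment}")
    return "\n".join(lines)

glossary = {"adverse event": "événement indésirable",
            "informed consent": "consentement éclairé"}
print(build_prompt("Any adverse event must be logged.", glossary))
```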

Terminology is where LLM output breaks down

Specialized terminology is where LLM translation fails most visibly and most predictably. The failure mode is specific: the model generates a translation that reads fluently and is grammatically correct, but uses the wrong term for a concept that has a precise, domain-specific equivalent.

A concrete example: a pharmaceutical team translates a clinical trial protocol. The source says "adverse event." The LLM might render this as "harmful incident" or "negative occurrence" in some language pairs — both comprehensible in plain language, neither acceptable in a regulatory context where "adverse event" maps to a defined ICH E2A equivalent in the target language. The error won't look like an error on first read.

The reason is that the preferred term is not always the most statistically probable output. The model learned from text where a concept appears in many forms, and without explicit instruction, it selects whatever form has the highest probability given the surrounding context. For general-audience content, that default behavior is usually fine. For regulatory submissions, it isn't.

Glossary enforcement matters here. When you specify "adverse event = événement indésirable" in your system prompt, you're constraining the model's output toward a specific target. Whether the model fully respects that constraint depends on prompt design, the specific model version, and how prominently the instruction is positioned. Testing with your domain-specific terminology before committing to a workflow is not optional.
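
A minimal version of that test: translate term-bearing samples with the constraint in place and check whether each required target term survives. Here `translate` is a placeholder for whatever model call your workflow uses:

```python
# Sketch of pre-commitment terminology testing. `translate` is a
# placeholder callable: translate(source_text, glossary) -> str.
def check_terms(translate, samples, glossary):
    failures = []
    for source in samples:
        output = translate(source, glossary)
        for src_term, tgt_term in glossary.items():
            if (src_term.lower() in source.lower()
                    and tgt_term.lower() not in output.lower()):
                failures.append((source, src_term, tgt_term, output))
    return failures  # empty list = every constrained term was respected
```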

This caveat works in both directions. For general content — a blog post, an internal memo, marketing copy for a broad consumer audience — terminology variance is usually acceptable, and a focused post-editing pass handles whatever adjustments are needed. The strict glossary enforcement overhead only pays off when the content genuinely requires it.

Where the model performs well and where it doesn't

LLMs handle these content types reliably:

- long-form prose where fluency and overall coherence matter more than term-level precision;
- text where style or register needs to be adapted across a document;
- languages with strong bilingual representation in training data, typically the major European languages, Japanese, Chinese, and Arabic;
- tasks that benefit from instruction-following, like "leave all proper nouns untranslated" or "use gender-neutral language throughout";
- shorter segments where surrounding context in the prompt helps the model resolve ambiguity.

LLMs perform poorly on:

- highly specialized technical or regulatory content where specific term equivalents are non-negotiable;
- languages underrepresented in training data, where the model falls back on approximate patterns;
- long repetitive documents with structured tables or formulaic legal boilerplate, where hallucination risk accumulates across repeated prompting;
- specialized notation — chemical nomenclature, code comments with embedded logic, engineering measurements in niche formats.

Post-editors reviewing LLM output should develop pattern recognition for these failure modes rather than applying a uniform review approach to every segment. Spending equal attention on a marketing paragraph and a clause in a contract is a poor use of review time. The contract clause is where fluent-but-wrong errors are most likely and most costly to miss.

What post-editors need to recognize about LLM output

Machine translation post-editing (MTPE) developed around the specific error patterns of dedicated NMT: omissions, word-order inversions, incorrect agreement, false friends. LLM output has overlapping but distinct error patterns that require a somewhat different review mindset.

The most significant difference is fluency. Dedicated MT sometimes produces output that is obviously wrong because it sounds unnatural — the sentence doesn't parse, or the word order is off in a way that immediately signals something is broken. LLM output is often fluent even when it's incorrect. A sentence that reads smoothly but uses the wrong term, or that captures the general meaning while missing a specific qualifier (the difference between "not more than" and "not less than" in a binding contract), requires careful reading to catch. It won't announce its own failure.

The second difference is hallucination. Unlike NMT systems, LLMs can generate content that wasn't in the source — particularly for long documents or when the model's context becomes overloaded with competing instructions. A post-editor working on the second half of a lengthy document should pay attention to additions, expansions, or paraphrases that weren't present in the source text. These are rare in NMT output but not in LLM output.
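
A crude automated screen can help triage that attention: flag segments whose translation runs much longer than typical for the language pair. The expected ratio and tolerance below are arbitrary starting points, and this catches gross expansion only, not subtle paraphrase, so it supplements review rather than replacing it:

```python
# Crude screen for additions: flag segments whose translation is far
# longer than expected. Thresholds are arbitrary starting points.
def flag_expansions(pairs, expected_ratio=1.1, tolerance=0.35):
    flagged = []
    for source, target in pairs:
        ratio = len(target.split()) / max(1, len(source.split()))
        if ratio > expected_ratio * (1 + tolerance):
            flagged.append((source, target, round(ratio, 2)))
    return flagged  # send these to the top of the review queue
```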

The third difference is prompt sensitivity. Because LLMs respond to instructions, the quality of the output you receive is partly a function of how the translation was set up upstream. A post-editor reviewing output from a well-structured workflow — with domain context, a proper system prompt, and a curated glossary — will see a meaningfully different error profile than one reviewing output from a bare "translate this" request sent to a chat interface.

We've found that post-editors who understand how the translation was generated — what prompt was used, what glossary was applied, what model version was selected — work more efficiently than those reviewing without that context. If you're building an LLM translation workflow for your team, documenting these parameters and sharing them with reviewers is part of making the process repeatable.
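
One lightweight way to document those parameters is a sidecar record written alongside every job. The field names here are our own convention, not a standard:

```python
# Sketch of a sidecar record documenting how a translation job was
# generated, so post-editors know what they are reviewing.
import json
from dataclasses import dataclass, asdict

@dataclass
class TranslationJobRecord:
    model: str
    system_prompt: str
    glossary_version: str
    source_lang: str
    target_lang: str

record = TranslationJobRecord(
    model="gpt-4o-mini",  # placeholder model name
    system_prompt="Translate DE->EN, formal register, glossary enforced.",
    glossary_version="medtech-glossary-v12",  # hypothetical identifier
    source_lang="de",
    target_lang="en",
)

with open("job_record.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```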

Build your quality process around what the model doesn't know

Understanding how LLM translation works has a direct operational application: your quality control process should be designed around the model's specific failure modes, not around a generic MT review checklist applied uniformly to every file.

Identify the terminology that matters most in your domain and test whether the model applies it correctly before committing to any workflow. Build glossaries that reflect actual domain usage — not just a list of preferred terms, but terms tested against the output the model actually produces with and without the constraint. Position your glossary and context instructions as close to the segment being translated as possible in the prompt structure.

Train your reviewers — or yourself — to read for fluent-but-wrong errors, not only for obvious mistranslations. And be clear-eyed about where the model's statistical priors run out: specialized regulated content, underrepresented languages, and formulaic repetitive documents all carry higher error risk than general prose, and the review time allocation should reflect that.

For a broader look at how AI tools are being integrated into professional translation workflows right now, we covered some of those shifts in our article on how AI translation tools are changing the way translators work in 2026. The architectural detail covered here helps explain why those changes are uneven across content types, language pairs, and domains — and why understanding how the model generates output is the most direct path to working with it effectively.
