How to measure translation quality: a practical guide for agencies
Translation quality metrics can be vague or genuinely useful. This guide covers how agencies can measure quality in ways that actually improve their workflows.

Most translation agencies will tell you they maintain high quality. Fewer can tell you exactly how they measure it. "Our translators are experienced" and "we always review before delivery" are not quality metrics — they're descriptions of inputs. Translation quality metrics measure outputs: whether the delivered translation meets a defined standard, how it falls short when it doesn't, and how failure rates change over time.
Getting this right matters for agencies because quality data drives real decisions: which translators to use for which projects, where your review process is catching errors and where it isn't, how to respond to a client complaint with something other than an apology. Without measurement, you're relying on instinct. With it, you have something to work from.
What you're actually measuring when you measure quality
Translation quality is not one thing. It's a cluster of properties that can fail independently: a translation can be accurate but unnatural, fluent but terminologically inconsistent, stylistically appropriate but missing content. Any metric that collapses all of this into a single score obscures more than it reveals.
The properties that matter most in professional translation — and that a quality framework should cover separately — are accuracy, fluency, terminology consistency, and style. Accuracy is whether the target text conveys the meaning of the source, without omissions, additions, or factual errors. This is the baseline. Fluency is whether the target text reads naturally in the target language, or reads like a translation. Terminology consistency is whether approved terms for this client or domain are used uniformly throughout. Style is whether the translation matches the register and tone specified for this content type.
These properties are worth measuring separately because they have different causes and different fixes. Accuracy errors often point to translator domain knowledge gaps or TM issues. Terminology errors often point to glossary gaps. Fluency problems may indicate that a translator is working outside their strongest language direction. Style issues often point to insufficient briefing or no documented style guide.
A translation that fails on accuracy fails regardless of how well it reads. A translation that fails on terminology consistency might read fine but still cause problems in a regulated context where approved terminology has compliance implications. Treating these as separate things, rather than collapsing them into a single "quality" judgment, is where useful measurement starts.
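To make that concrete, here's a minimal sketch of what recording the dimensions separately could look like, assuming each property is scored 0 to 1; the field names and the 0.9 threshold are illustrative, not part of any standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualityAssessment:
    """One assessment record, with each property scored separately."""
    accuracy: float      # meaning transfer: omissions, additions, factual errors
    fluency: float       # how naturally the target text reads
    terminology: float   # adherence to the approved glossary
    style: float         # register and tone against the brief
    notes: Optional[str] = None

    def failing_dimensions(self, threshold: float = 0.9) -> list[str]:
        """Report which dimensions fail, instead of one collapsed score."""
        scores = {
            "accuracy": self.accuracy,
            "fluency": self.fluency,
            "terminology": self.terminology,
            "style": self.style,
        }
        return [name for name, score in scores.items() if score < threshold]
```

A record like this preserves exactly what a single score throws away: a file can fail on fluency alone while accuracy holds, and each outcome points at a different fix.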
Error-based frameworks: MQM and what it actually gives you
The most widely used error-based quality framework in professional translation is the Multidimensional Quality Metrics (MQM) framework. MQM organizes translation errors into a hierarchical taxonomy covering accuracy, fluency, terminology, style, and locale conventions, with severity levels — critical, major, minor — that affect how errors are weighted in scoring.
The appeal of MQM is specificity. When two reviewers apply the same MQM typology to the same translation, they should produce comparable results. Not identical — reviewer judgment will always vary — but close enough to track trends over time and across translators. That comparability is what makes it useful for quality management rather than one-off evaluation.
A practical MQM-based approach: count errors by type and severity in a sample of the translation, apply severity weights (critical errors weighted more than minor ones), and calculate a score against the total word count. GALA (the Globalization and Localization Association) publishes guidance on MQM implementation that's worth reading if you're setting up a formal framework from scratch.
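As a rough illustration of that calculation, here's a sketch in Python. The severity weights (1/5/10) and the penalty-per-100-words normalization are assumptions for the example; real MQM implementations calibrate both to their own content and scale.

```python
# Illustrative severity weights -- MQM implementations calibrate their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors: list[dict], word_count: int) -> float:
    """errors: [{"category": "terminology", "severity": "major"}, ...]
    Returns a 0-100 score: 100 minus weighted penalty points per 100 words."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return max(0.0, 100.0 - (penalty / word_count) * 100)

sample_errors = [
    {"category": "accuracy", "severity": "major"},
    {"category": "terminology", "severity": "minor"},
    {"category": "terminology", "severity": "minor"},
]
print(mqm_style_score(sample_errors, word_count=1500))  # ~99.53
```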
One limitation worth being honest about: MQM works best when reviewers are trained on it. An untrained reviewer applying MQM categories inconsistently produces data that looks systematic but isn't. Training your reviewers on the typology isn't optional if you want the scores to mean something.
Sampling strategy: you can't review everything, and shouldn't try
For high-volume workflows, reviewing every translated segment isn't practical. The QA process has to be designed around representative sampling that gives you confidence about the whole while reviewing a manageable portion.
The right sample size depends on stakes and volume. For high-stakes content — legal, medical, regulatory — reviewing more than a statistical sample is appropriate. For high-volume, lower-stakes content, a well-designed random sample can give you reliable quality signal without reviewing everything.
Some approaches that hold up: stratified sampling by section or document type catches errors that cluster in specific content areas. Sampling from the beginning, middle, and end of a document catches consistency problems that develop over long translations. Sampling segments flagged by automated QA checks — missing numbers, untranslated strings, glossary mismatches — prioritizes the segments most likely to have issues rather than reviewing randomly.
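Here's a sketch of how those approaches might combine, assuming each segment is a dict with an "id" and a "position" expressed as a 0-to-1 fraction of the document; the 10% rate and the three strata are illustrative.

```python
import random

def sample_segments(segments, flagged_ids, rate=0.1, seed=42):
    """Mix flagged-first sampling with stratified random sampling.
    Flagged segments (from automated QA) are always reviewed; the rest
    is drawn from the start, middle, and end so that consistency drift
    over a long document shows up in the sample."""
    rng = random.Random(seed)
    sample = [s for s in segments if s["id"] in flagged_ids]
    strata = [
        [s for s in segments if s["position"] < 0.33],          # start
        [s for s in segments if 0.33 <= s["position"] < 0.66],  # middle
        [s for s in segments if s["position"] >= 0.66],         # end
    ]
    for stratum in strata:
        k = max(1, int(len(stratum) * rate))
        sample.extend(rng.sample(stratum, min(k, len(stratum))))
    # Dedupe: a flagged segment may also land in the random draw.
    return list({s["id"]: s for s in sample}.values())
```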
The goal isn't to catch every error before delivery — that's an impossible standard for any practical QA process. The goal is to reliably detect when quality is below threshold and to generate data that points at where problems are occurring.
Automated QA checks: what they catch and what they miss
Automated QA tools check translations for a defined set of structural and consistency issues without requiring human review of every segment. They're fast and consistent: unlike two human reviewers with different sensitivities, a tool produces the same result on the same file every time, and running automated QA adds minutes to a project rather than hours.
What automated QA reliably catches: number consistency (a number in the source that doesn't appear in the target), missing or extra punctuation at segment boundaries, untranslated segments, glossary term mismatches, tag errors in files with formatting markup, and obvious omissions where target length differs dramatically from source.
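Here's a minimal sketch of three of those checks at segment level; the glossary shape (approved source term mapped to its required target rendering) is an assumption for the example, and a production tool covers many more cases.

```python
import re

def qa_check_segment(source: str, target: str, glossary: dict[str, str]) -> list[str]:
    """Run three structural checks on one segment; return the issues found."""
    issues = []
    # Number consistency: every number in the source should appear in the
    # target (real tools also normalize decimal separators across locales).
    for n in re.findall(r"\d+(?:[.,]\d+)?", source):
        if n not in target:
            issues.append(f"number '{n}' missing from target")
    # Untranslated segment: target identical to source.
    if source.strip() and source.strip() == target.strip():
        issues.append("segment appears untranslated")
    # Glossary mismatch: approved term in source, required rendering absent.
    for src_term, tgt_term in glossary.items():
        if src_term.lower() in source.lower() and tgt_term.lower() not in target.lower():
            issues.append(f"glossary term '{src_term}' not rendered as '{tgt_term}'")
    return issues
```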
What automated QA misses: anything that requires understanding meaning. A sentence that is grammatically correct but conveys the opposite of the source will pass automated QA. A terminology error where the translator used a plausible but non-approved synonym will often pass unless the QA tool has a complete glossary to check against. Register mismatches pass entirely. These require human review.
Automated QA is most valuable as a first pass that removes obvious errors and flags high-probability problems before a human reviewer touches the file. Running automated QA and calling the translation reviewed is not a quality process. Running automated QA to direct human review toward the segments that most need it is.
Building a QA report that's actually useful
A QA report that says "quality: good" tells you nothing. A QA report that lists error types, severity counts, score, and the segments where errors occurred gives you something to act on.
The elements worth including: error count by category and severity, score calculated against a defined threshold (what's acceptable for delivery for this content type?), a list of flagged segments with the error type noted, and whether the report came from automated tools, human review, or both.
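In machine-readable form, such a report might look like the following; the field names and structure are illustrative, not a standard format.

```python
# One possible shape for a QA report record. It mirrors the elements above:
# counts by category and severity, a score against a defined threshold,
# flagged segments, and the provenance of the review.
qa_report = {
    "project_id": "2024-0312",   # illustrative
    "score": 97.8,
    "threshold": 95.0,
    "passed": True,
    "error_counts": {
        "terminology": {"minor": 2, "major": 1, "critical": 0},
        "accuracy":    {"minor": 1, "major": 0, "critical": 0},
    },
    "flagged_segments": [
        {"segment_id": 41, "category": "terminology", "severity": "major"},
    ],
    "review_method": "automated + human sample",
}
```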
This level of detail matters because it's what lets you trace quality issues back to their causes. If terminology errors cluster in a specific document section, that points to a glossary gap. If accuracy errors appear consistently in one translator's files, that's a training or assignment signal. If quality scores are declining on a specific content type, that's a workflow issue to investigate.
The QA report also matters in client conversations. When a client queries a translation, having documented QA process data is a different conversation than defending quality without evidence. It doesn't eliminate disputes, but it gives you something concrete to discuss.
If you work with Smartcat bilingual DOCX files and run AI translation jobs, SnapIntel includes a QA report and quality rating as part of the standard workflow output — covering AI translation results in a format that feeds directly into your review process. The SnapIntel docs cover what the QA report includes.
Tracking quality over time: where the data becomes useful
Single-point quality measurement tells you whether a specific translation is acceptable. Tracking quality over time tells you whether your workflow is improving, holding steady, or declining — and that distinction is what allows you to manage quality proactively rather than reactively.
What's worth tracking at minimum: quality scores per translator across projects, error category distribution over time (are terminology errors increasing relative to fluency errors?), revision rates after delivery (how often are clients requesting changes?), and any correlation between project characteristics and quality outcomes — does quality drop on rush projects, on specific content types, or with particular language pairs?
This doesn't require a sophisticated analytics system. A spreadsheet with project date, translator, content type, quality score, error breakdown, and revision count gives you the pattern over months. After six months of consistent tracking, you'll have real information about which translators perform reliably in which domains, where your QA process is catching errors and where it isn't, and which project types are most likely to generate client revisions. That information drives better assignment decisions and better conversations with clients about what quality expectations are realistic.
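Even at spreadsheet scale, the aggregation is simple. A sketch, assuming rows shaped like the tracking columns above:

```python
from collections import defaultdict
from statistics import mean

# Rows as exported from the tracking spreadsheet (values illustrative).
projects = [
    {"translator": "A", "content_type": "technical", "score": 96.2, "revisions": 0},
    {"translator": "A", "content_type": "legal",     "score": 91.5, "revisions": 2},
    {"translator": "B", "content_type": "technical", "score": 98.1, "revisions": 0},
]

# Mean score per translator per content type: the pattern behind
# assignment decisions.
by_pair = defaultdict(list)
for p in projects:
    by_pair[(p["translator"], p["content_type"])].append(p["score"])

for (translator, domain), scores in sorted(by_pair.items()):
    print(f"{translator} / {domain}: mean {mean(scores):.1f} over {len(scores)} project(s)")
```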
Setting thresholds: what counts as acceptable delivery
Quality scoring is only useful if you've defined what score is acceptable for delivery. Without a threshold, a quality score is just a number.
Thresholds should vary by content type and stakes. A marketing email has a different acceptable error rate than a regulatory submission or a legal contract. Some agencies use a single threshold across all content, which is simpler. Content-type-specific thresholds are more work to set up but produce more meaningful data — and they're more defensible when a client challenges a delivery.
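As a sketch of what content-type-specific thresholds look like in code, with placeholder values; as the next paragraph argues, real values should come from your own historical data.

```python
# Placeholder thresholds -- derive real values from historical outcomes.
DELIVERY_THRESHOLDS = {
    "regulatory": 99.0,
    "legal":      98.0,
    "technical":  96.0,
    "marketing":  94.0,
}

def meets_threshold(score: float, content_type: str) -> bool:
    """Fail closed: an unknown content type gets the strictest threshold."""
    threshold = DELIVERY_THRESHOLDS.get(content_type, max(DELIVERY_THRESHOLDS.values()))
    return score >= threshold
```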
Setting thresholds requires looking at historical data: what quality scores were associated with translations that generated client revision requests, and what scores were associated with translations that passed without issue? If you don't have that data yet, industry frameworks like MQM provide reference points. GALA and CSA Research have both published benchmarking data on quality score thresholds across content types.
Once thresholds are set, they need to be enforced consistently. A threshold that gets waived on rush projects because there isn't time for proper QA isn't actually a threshold — it's a suggestion. Tracking how often thresholds are bypassed, and what the client outcome is when they are, gives you data on whether the QA process is functioning as designed or being worked around.
Connecting measurement to workflow decisions
The real payoff from translation quality metrics is that they inform decisions that improve the workflow — not that they produce reports. If data shows that one translator consistently scores below threshold on technical content but above it on legal content, the assignment decision changes. If automated QA catches a spike in terminology errors at the start of a new project type, that signals a glossary gap to fill before the next similar project.
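As one illustration of turning that data into an assignment decision, this sketch reuses history rows shaped like the tracking example earlier; the helper and its ranking logic are assumptions, not a prescribed method.

```python
from collections import defaultdict
from statistics import mean

def assignment_candidates(history, content_type, threshold):
    """Translators whose mean score on this content type clears the
    threshold, ranked best first (history rows as in the tracking sketch)."""
    scores = defaultdict(list)
    for row in history:
        if row["content_type"] == content_type:
            scores[row["translator"]].append(row["score"])
    qualified = {t: mean(s) for t, s in scores.items() if mean(s) >= threshold}
    return sorted(qualified, key=qualified.get, reverse=True)
```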
Quality measurement is how a QA process gets better over time rather than staying static. Without measurement, you're running the same process regardless of outcomes. With it, you have the information to adjust — which translators work on which content, where glossary maintenance is most needed, which QA steps are generating useful signals versus which are generating noise.
For more on building the underlying QA process that quality metrics feed into, our complete guide to translation quality assurance covers the full workflow in detail.