
Translation memory (TM)

DocTranslater stores exact translation pairs in SQLite (~/.cache/doctranslate/cache.v2.db, table _translationcache). Translation memory extends this with:

| Layer | Behavior |
| --- | --- |
| L1 | Legacy exact match on full source string + translator fingerprint (unchanged default). |
| L1b | Exact match on a normalized source key (whitespace / punctuation / placeholder-safe normalization). |
| L2 | Fuzzy match via RapidFuzz WRatio, gated by --tm-fuzzy-min-score and safety checks. |
| L3 | Semantic similarity using optional embeddings (install optional extras; see below). |
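To make the L1b idea concrete, here is a minimal sketch of a normalized source key. The function name `normalized_key`, the placeholder pattern, and the punctuation mapping are all assumptions for illustration; the rules DocTranslater actually applies may differ.

```python
import re
import unicodedata

# Hypothetical placeholder syntax such as "{v1}" or "<b0>"; the real pattern may differ.
_PLACEHOLDER = re.compile(r"(\{[^{}]*\}|<[^<>]*>)")
# Illustrative subset of typographic variants folded to ASCII equivalents.
_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalized_key(source: str) -> str:
    """Return a lookup key that ignores whitespace and typographic differences."""
    # With one capturing group, re.split puts the placeholders at odd indices.
    parts = _PLACEHOLDER.split(source)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # placeholder: keep exactly as written
        else:
            text = unicodedata.normalize("NFKC", part).translate(_PUNCT)
            out.append(re.sub(r"\s+", " ", text))  # collapse whitespace runs
    return "".join(out).strip()
```

Two segments that differ only in spacing or quote style map to the same key, while a changed word or an added comma still produces a different key, so the lookup stays deterministic.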

Glossary rows are authoritative: if a glossary source term appears in the segment, a TM candidate is rejected unless the candidate translation contains the required target substring.
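The glossary rule above can be sketched as a small guard function. The name `glossary_allows` and the case-sensitive substring matching are assumptions made for this example, not a description of the actual implementation.

```python
def glossary_allows(source: str, candidate: str, glossary: list[tuple[str, str]]) -> bool:
    """Return False if a TM candidate omits a required glossary target.

    `glossary` is a list of (source_term, required_target) pairs.
    Matching here is plain case-sensitive substring search (an assumption).
    """
    for term, required in glossary:
        if term in source and required not in candidate:
            return False  # glossary term present, required target missing
    return True
```

For example, with the pair ("cache", "Zwischenspeicher"), a candidate for "Clear the cache" is accepted only if it contains "Zwischenspeicher"; segments that do not mention "cache" are unaffected.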

CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| --tm-mode | off | off: TM DB disabled (SQLite exact cache only). exact: L1 + L1b. fuzzy: adds L2. semantic: adds L3 when dependencies are available. |
| --tm-scope | document | document: reuse only rows tagged for the current input file. project: same project id, document, or global pool. global: any row with matching engine fingerprint. |
| --tm-min-segment-chars | 12 | Minimum source length for fuzzy / semantic reuse. |
| --tm-fuzzy-min-score | 92 | RapidFuzz score cutoff (0–100). |
| --tm-semantic-min-similarity | 0.90 | Cosine similarity floor for semantic hits. |
| --tm-project-id | (empty) | Optional scope label when using --tm-scope=project. |
| --tm-embedding-model | sentence-transformers/all-MiniLM-L6-v2 | Model id for --tm-mode=semantic. |
| --tm-import | | NDJSON file merged into TM before the run (idempotent upserts). |
| --tm-export | | After each successful PDF, write TM NDJSON for the active translator fingerprint. If the path is a directory, writes <input-stem>.tm.ndjson inside it. |
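As an illustration of how --tm-min-segment-chars and --tm-fuzzy-min-score interact, the sketch below gates a fuzzy lookup on both values. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for RapidFuzz WRatio so that it runs without extra dependencies; the scores it produces are not the same as WRatio scores, and the function name `fuzzy_hit` is hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_hit(source, tm_rows, min_score=92.0, min_chars=12):
    """Return the best (score, translation) at or above the cutoff, or None.

    `tm_rows` is an iterable of (tm_source, tm_target) pairs.
    SequenceMatcher is a stand-in for RapidFuzz WRatio (assumption).
    """
    if len(source) < min_chars:
        return None  # too short for fuzzy reuse
    best = None
    for tm_source, tm_target in tm_rows:
        score = 100.0 * SequenceMatcher(None, source, tm_source).ratio()
        if score >= min_score and (best is None or score > best[0]):
            best = (score, tm_target)
    return best
```

Raising `min_score` trades hit rate for safety: a segment that differs from a stored source by one character can pass at 92 but be rejected at 99.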

--ignore-cache still bypasses all cache reads and writes (legacy + TM).

Fingerprint / invalidation

TM rows and the legacy cache key include a sorted JSON blob of translator-affecting parameters (model, prompt, router provider fingerprints, etc.). A digest of glossary (source, target) pairs is also stored as tm_glossary_signature so changing glossaries does not silently reuse incompatible translations.
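A minimal sketch of the two invalidation keys, assuming the fingerprint is JSON serialized with sorted keys and the glossary digest is a SHA-256 over order-independent pairs. The function names and the choice of SHA-256 are assumptions; only the sorted JSON blob and the existence of a glossary digest come from the description above.

```python
import hashlib
import json


def translator_fingerprint(params: dict) -> str:
    """Serialize translator-affecting parameters with sorted keys.

    Sorting makes the blob independent of dict insertion order.
    """
    return json.dumps(params, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


def glossary_signature(pairs) -> str:
    """Digest of (source, target) pairs, independent of input order.

    SHA-256 is an assumption; the actual digest algorithm is not documented here.
    """
    canonical = json.dumps(sorted(pairs), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Changing the model, the prompt, or any glossary pair changes one of the two values, so rows written under the old configuration no longer match.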

After automatic term extraction finalizes a glossary, the translator cache context is refreshed so that TM safety checks use the same glossary as the prompts.

Optional semantic mode (L3)

Install optional dependencies (large download; CPU OK for small models):

uv sync --group dev --extra full --extra tm_semantic

If sentence-transformers / torch are missing, --tm-mode=semantic behaves like fuzzy (L3 is skipped when the backend is unavailable).
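The fallback can be pictured as a small mode-resolution step. This is a sketch only: `effective_tm_mode` and its `has_module` parameter are hypothetical names, and the real check may be done differently.

```python
import importlib.util


def effective_tm_mode(requested: str, has_module=None) -> str:
    """Downgrade 'semantic' to 'fuzzy' when the embedding backend is missing.

    `has_module` is injectable for testing; by default it asks importlib
    whether a top-level module can be found.
    """
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if requested != "semantic":
        return requested  # other modes are unaffected
    needed = ("sentence_transformers", "torch")
    return "semantic" if all(has_module(m) for m in needed) else "fuzzy"
```

Calling `effective_tm_mode("semantic")` on a machine without the optional extras would return "fuzzy", which mirrors the behavior described above.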

Legacy cache.v1.db import

If you still have ~/.cache/doctranslate/cache.v1.db, run doctranslate tm migrate-v1-cache once to import its rows into the TM table. The import is recorded under the marker legacy_cache_v1_import in _tmmigration. The active database remains cache.v2.db.

Quality vs cost

  • off (legacy exact cache only): safest; same behavior as before TM columns existed.
  • exact: better hit rate for whitespace / typographic variants; still deterministic.
  • fuzzy: fewer LLM calls on long/repeated documents; small risk of false positives — raise --tm-fuzzy-min-score if needed.
  • semantic: best for paraphrases; highest dependency and CPU cost; tune --tm-semantic-min-similarity upward for safety.
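For the semantic threshold, it may help to see what the similarity floor compares. The sketch below computes cosine similarity between two embedding vectors in plain Python and applies a floor like --tm-semantic-min-similarity; the function names are hypothetical, and a real run would obtain the vectors from the configured embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_hit(similarity, floor=0.90):
    """Accept a candidate only when similarity reaches the floor."""
    return similarity >= floor
```

Raising the floor toward 1.0 admits only near-identical meanings, which reduces false positives at the cost of fewer reused translations.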