Local translation

Translate PDFs using models on your machine or on a LAN-hosted OpenAI-compatible server, without calling hosted APIs (OpenAI, Anthropic, etc.).

DocTranslater’s PDF pipeline still expects an LLM that can follow structured prompts and, where needed, JSON output (paragraph translation and automatic term extraction). This is not the same workflow as classic sentence-MT engines (e.g. Marian via CTranslate2); support for that may be added later.

Quick start (Ollama)

  1. Install Ollama and pull a model, for example:
ollama pull qwen2.5:7b
  2. Run DocTranslater:
doctranslate translate input.pdf \
  --translator local \
  --local-backend ollama \
  --local-model qwen2.5:7b \
  --lang-in en --lang-out zh \
  -o ./out
  3. Validate configuration and reachability (no PDF required):
doctranslate config validate --translator local \
  --local-backend ollama --local-model qwen2.5:7b
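
To sanity-check the model itself before involving DocTranslater, a one-off prompt through Ollama is enough (assumes a default local Ollama install and the model pulled above):

ollama run qwen2.5:7b "Translate to Chinese: good morning"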

Optional translation memory flags (--tm-mode, …) apply the same way in local mode as for the router; see Translation memory.

Backends

Preset            | What it uses                                                                      | When to use
ollama            | Native Ollama HTTP API via LiteLLM                                                | Easiest desktop setup; CPU or modest GPU
vllm              | OpenAI-compatible /v1 on your server (default base http://127.0.0.1:8000/v1)     | High throughput on NVIDIA GPUs
llama-cpp         | Same as OpenAI-compatible; point --local-base-url at llama-cpp-python server     | Offline GGUF, Apple Silicon / CPU friendly
openai-compatible | Explicit OpenAI-compatible gateway                                                | Custom local or LAN URL

Note: glm-ocr and similar vision / OCR models are for document reading / OCR experiments, not for replacing the main paragraph translation model. Use instruct / chat models suited to translation and JSON.
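
The openai-compatible preset takes the same flags as the other backends; only the base URL and model name change. A sketch, assuming the preset name is passed to --local-backend as shown for the other backends, with the gateway address and model name as placeholders for your own setup:

doctranslate translate input.pdf \
  --translator local \
  --local-backend openai-compatible \
  --local-base-url http://192.168.1.50:8000/v1 \
  --local-model my-gateway-model \
  --lang-in en --lang-out zh \
  -o ./out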

OpenAI-compatible server (vLLM, llama-cpp-python)

Start your server (examples only; see upstream docs for flags):

# vLLM (example)
vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
# llama-cpp-python server (example)
python3 -m llama_cpp.server --model /path/to/model.gguf --host 0.0.0.0 --port 8080
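
Before pointing DocTranslater at the server, you can confirm it is reachable and see the exact model names it exposes; both vLLM and llama-cpp-python serve the standard OpenAI-compatible model listing endpoint (ports as in the examples above):

curl http://127.0.0.1:8000/v1/models   # vLLM
curl http://127.0.0.1:8080/v1/models   # llama-cpp-python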

Then run DocTranslater with a base URL that includes /v1 (or omit /v1; DocTranslater normalizes it):

doctranslate translate input.pdf \
  --translator local \
  --local-backend vllm \
  --local-base-url http://127.0.0.1:8000 \
  --local-model Qwen/Qwen2.5-7B-Instruct \
  --lang-in en --lang-out de \
  -o ./out

Many local servers accept a dummy API key; DocTranslater sends EMPTY when none is configured.

Configuration file (TOML)

You can set the same knobs under [doctranslate] or under [doctranslate.local]:

[doctranslate]
translator = "local"
local_backend = "ollama"
local_model = "qwen2.5:7b"
local_timeout_seconds = 120
local_translation_batch_tokens = 256
local_translation_batch_paragraphs = 4
local_term_batch_tokens = 400
local_term_batch_paragraphs = 8

[doctranslate.local]
# Alternative nested form (overrides duplicate keys in [doctranslate] when both set)
term_model = "qwen2.5:3b"

Then:

doctranslate -c doctranslate.toml translate input.pdf -o ./out

CLI flags override TOML values when both are present.
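
For example, to keep the TOML above but try a different model for a single run (the model name here is only an illustration):

doctranslate -c doctranslate.toml translate input.pdf \
  --local-model qwen2.5:14b \
  -o ./out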

Batching and performance

The IL translator batches paragraphs before each LLM call. Defaults match the previous hard-coded behavior:

  • Translation: flush when estimated tokens > 200 or paragraphs > 5
  • Term extraction: > 600 tokens or > 12 paragraphs

Tune with:

  • --local-translation-batch-tokens / --local-translation-batch-paragraphs
  • --local-term-batch-tokens / --local-term-batch-paragraphs

Smaller models or tight GPU memory: lower batch tokens and disable automatic glossary extraction (--no-auto-extract-glossary) if JSON term extraction is flaky.
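
A conservative invocation for a small model, combining the batching flags above with glossary extraction disabled (the values are a starting point, not a recommendation):

doctranslate translate input.pdf \
  --translator local \
  --local-backend ollama \
  --local-model qwen2.5:7b \
  --local-translation-batch-tokens 128 \
  --local-translation-batch-paragraphs 2 \
  --no-auto-extract-glossary \
  --lang-in en --lang-out zh \
  -o ./out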

Hardware hints

  • CPU-only: prefer smaller quantized instruct models (e.g. 3B–7B class); reduce batch sizes and concurrency (--qps, --pool-max-workers).
  • Apple Silicon: Ollama or llama-cpp with Metal; don't count on vLLM as the primary path on macOS.
  • NVIDIA + throughput: run vLLM (or another OpenAI-compatible server) on a workstation or LAN host; increase --pool-max-workers cautiously to match server capacity (see the example below).
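
As a sketch of the LAN scenario in the last bullet, assuming a vLLM host at 192.168.1.50 (placeholder address) serving Qwen/Qwen2.5-7B-Instruct:

doctranslate translate input.pdf \
  --translator local \
  --local-backend vllm \
  --local-base-url http://192.168.1.50:8000/v1 \
  --local-model Qwen/Qwen2.5-7B-Instruct \
  --pool-max-workers 4 \
  --lang-in en --lang-out zh \
  -o ./out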

--local-context-window is stored for documentation / future tuning; the pipeline still uses tiktoken-style estimates for batching.

Troubleshooting

Symptom                                    | What to check
LocalPreflightError / cannot reach Ollama  | ollama serve running; --local-base-url matches your Ollama host
Model not found                            | ollama pull <model>; GET /api/tags lists the exact name (including tags)
JSON / term extraction failures            | Smaller model struggling with response_format; disable auto glossary or use a larger term model (--local-term-model)
OOM or slow first request                  | Normal cold start; reduce batch tokens; smaller quant; server-side max context
Wrong cache hits after changing model      | Cache keys include provider id + model + base URL; use --ignore-cache when comparing models
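
For the Ollama rows, the tag listing mentioned above can be queried directly (assuming Ollama's default port 11434):

curl http://127.0.0.1:11434/api/tags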

Benchmarks (manual)

Use the helper script (requires a real local server and sample PDF):

uv run python scripts/bench_local_translation.py \
  --pdf examples/ci/test.pdf \
  --lang-in en --lang-out zh \
  --backend ollama --model qwen2.5:7b

See also