PDF pipeline and stages
The public PDF path is driven from doctranslate.format.pdf.high_level. TRANSLATE_STAGES lists the stages in execution order, each with a rough cost weight.
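As an illustration of how an ordered list of weighted stages can be used, the sketch below turns per-stage weights into an overall progress fraction. It is a sketch under assumptions, not the project's code: EXAMPLE_STAGES is a hypothetical stand-in for TRANSLATE_STAGES, its (name, weight) tuple shape is assumed, the weights are invented, and using them for progress reporting is itself an assumption.

```python
from typing import List, Tuple

# Hypothetical stand-in for TRANSLATE_STAGES: (stage name, rough cost weight).
# The stage names appear in this document; the weights are invented for illustration.
EXAMPLE_STAGES: List[Tuple[str, float]] = [
    ("ILCreater", 10.0),
    ("LayoutParser", 15.0),
    ("ILTranslator", 50.0),
    ("Typesetting", 25.0),
]


def progress_after(stages: List[Tuple[str, float]], finished_count: int) -> float:
    """Return overall progress in [0, 1] once the first `finished_count` stages are done."""
    total = sum(weight for _, weight in stages)
    if total == 0:
        return 0.0
    done = sum(weight for _, weight in stages[:finished_count])
    return done / total
```

With these made-up weights, finishing the first two stages accounts for a quarter of the total cost, so a progress bar driven this way advances unevenly per stage, which is the point of weighting.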
Stage order (canonical)
From TRANSLATE_STAGES in doctranslate/format/pdf/high_level.py:
- ILCreater — Parse PDF and build intermediate representation (IR/IL).
- DetectScannedFile — Scanned-page detection (SSIM); records per-page scores in shared context when OCR routing is enabled.
- OCRRouting — Optional RapidOCR injection into pdf_character before layout (see --ocr-mode in Configuration).
- LayoutParser — Page layout (YOLO-based layout model is injected/configured externally).
- TableParser — Tables.
- ParagraphFinder — Paragraph grouping.
- StylesAndFormulas — Styles and formula-like text.
- AutomaticTermExtractor — Glossary term extraction (LLM, often JSON-shaped).
- ILTranslator — Paragraph translation (LLM batches).
- Typesetting — Reflow into geometry.
- FontMapper — Fonts.
- PDFCreater — Drawing instructions.
- SUBSET_FONT_STAGE_NAME — Font subsetting.
- SAVE_PDF_STAGE_NAME — Write PDF.
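A regression check for the documented order could look roughly like the sketch below. It is an assumption-laden illustration: DOCUMENTED_ORDER is copied by hand from the list above rather than imported from high_level.py, the last two entries are the constant names and not their actual string values, and order_violations is a hypothetical helper that does not exist in the project.

```python
from typing import List, Sequence, Tuple

# Stage names copied from the list above, in documented order.
# The last two entries are constant names, not their real string values.
DOCUMENTED_ORDER: List[str] = [
    "ILCreater",
    "DetectScannedFile",
    "OCRRouting",
    "LayoutParser",
    "TableParser",
    "ParagraphFinder",
    "StylesAndFormulas",
    "AutomaticTermExtractor",
    "ILTranslator",
    "Typesetting",
    "FontMapper",
    "PDFCreater",
    "SUBSET_FONT_STAGE_NAME",
    "SAVE_PDF_STAGE_NAME",
]


def order_violations(
    candidate: Sequence[str],
    reference: Sequence[str] = DOCUMENTED_ORDER,
) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where a precedes b in `reference` but follows it in `candidate`.

    Stages missing from `candidate` are ignored, so optional stages can be omitted.
    """
    position = {name: index for index, name in enumerate(candidate)}
    present = [name for name in reference if name in position]
    violations = []
    for i, first in enumerate(present):
        for second in present[i + 1:]:
            if position[first] > position[second]:
                violations.append((first, second))
    return violations
```

Running such a check in a test makes an accidental reordering, for example moving OCRRouting after LayoutParser, fail loudly instead of silently changing pipeline behavior.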
Agent guidelines
- Prefer minimal, stage-local changes; understand callers in high_level.py before reordering stages.
- Translation is synchronous on translator instances from the PDF pipeline’s perspective; async appears around progress/threading (see Async Translation API).
- Metadata and post-save fixes (add_metadata, fix_cmap, etc.) are part of the output contract; do not strip them without tests.
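The split between synchronous translators and an async outer layer can be pictured with the generic pattern below. This is only a sketch of the general idea, not the project's implementation: EchoTranslator and translate_all are hypothetical names, and the use of asyncio.to_thread (Python 3.9+) is an assumption about one reasonable way to wrap blocking calls.

```python
import asyncio
from typing import List


class EchoTranslator:
    """Hypothetical synchronous translator standing in for a real translator instance."""

    def translate(self, text: str) -> str:
        # A real translator would make a blocking LLM or API call here.
        return text.upper()


async def translate_all(translator: EchoTranslator, paragraphs: List[str]) -> List[str]:
    """Run the blocking translate() calls in worker threads and await them together."""
    tasks = [asyncio.to_thread(translator.translate, paragraph) for paragraph in paragraphs]
    # gather() returns results in the same order as the input paragraphs.
    return list(await asyncio.gather(*tasks))
```

A caller would drive this with asyncio.run(translate_all(EchoTranslator(), paragraphs)). The translator itself stays synchronous, so stage code that calls translate() directly does not need to know about the event loop.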
Related files
- IL creation: document_il/frontend/il_creater.py
- Translation midend: document_il/midend/il_translator.py, il_translator_llm_only.py
- Backend writer: document_il/backend/pdf_creater.py