Skip to content

Serverless containers

This page describes how to run DocTranslater on managed container / serverless job platforms (Cloud Run, ECS Fargate, App Runner, Modal, Runpod, etc.). It is not about edge runtimes (Workers, Lambda@Edge).

For image contents and build targets, see Docker overview and Docker image profiles. For the optional HTTP API, see HTTP API.

Primary reference

Google Cloud Run is the recommended primary reference for OSS docs: single-container deploy, env-based config, and a good match for the runtime-api image. Step-by-step guide: Deploy on Cloud Run.

When jobs exceed request timeouts or need heavier isolation, graduate to ECS Fargate workers or Modal functions—see Deploy on Fargate and App Runner and the platform notes below.

Platform comparison

Platform Fit Native / large image Cold start Long jobs Ephemeral disk Notes
Google Cloud Run Strong Supported; use warm image or startup warmup Mitigate with min instances + warm cache Bounded by request timeout; use workers for very long PDFs Ephemeral; use object storage for durable artifacts Primary OSS reference
AWS ECS Fargate Strong Full control over task CPU/RAM Task cold start; scale with service + queue No hard platform request cap like Cloud Run HTTP Ephemeral task storage; EFS optional Best for worker queues
Modal Strong (workers) Custom images / volumes Good primitives for ML-style cold start Fits async job pattern Ephemeral; mount volumes or fetch artifacts Great burst worker; less universal than Cloud Run for “drop-in API”
AWS App Runner Possible Similar to Fargate service Autoscaling service Weaker for long-running batch Ephemeral Prefer for light API or short jobs
Runpod Possible GPU-oriented Pod startup Long runs OK Ephemeral / volume options Use when GPU or pod-style execution fits
AWS Lambda (container) Poor 250 MB unzipped / 10 GB image limits; heavy ONNX stack is awkward Cold start + size Hard max duration (minutes) Ephemeral Appendix only—see Lambda caveat

Fit vs repository characteristics

  • Native-heavy stack (ONNXRuntime, OpenCV, PyMuPDF, optional Hyperscan): prefer Linux amd64/arm64 images from this repo’s Dockerfile; avoid exotic libc unless you rebuild.
  • Cold-start sensitivity: layout ONNX and font caches dominate; use runtime-*-warm targets, DOCTRANSLATE_API_WARMUP_ON_STARTUP=eager, or POST /v1/assets/warmup before traffic; see HTTP API – Production notes.
  • Large images: acceptable on Cloud Run, Fargate, Modal, Runpod; Lambda container images are the outlier.
  • Model / asset warmup: outbound HTTPS or offline asset bundles; warm build stages need network at image build time. Runtime warmups: see Docker – Quick start (assets warmup and warm image targets).
  • Long-running jobs: HTTP API jobs run inside the container process; platform request timeouts do not cancel background tasks on all platforms—still cap risk with DOCTRANSLATE_API_JOB_TIMEOUT_SECONDS and worker architectures.
  • Local disk: treat as scratch only; persist outputs to object storage (S3, GCS, R2) for production.

Mode A — HTTP service (small / medium jobs)

  • Image: runtime-api (doctranslate serve, port 8000).
  • Scale replicas horizontally; keep one Uvicorn worker per container (see HTTP API).
  • Tune DOCTRANSLATE_API_MAX_CONCURRENT_JOBS (default 2) to match memory.

Mode B — Batch / worker (long or heavy jobs)

  • Image: runtime-cpu or runtime-vision (no FastAPI unless you add a thin sidecar).
  • Pull work from an external queue (SQS, Pub/Sub, Celery broker, Modal .map, etc.).
  • Write inputs/outputs to object storage; use local disk only for temp IL/PDF work.
  • Control plane: small runtime-api (or serverless API gateway + minimal service) for /v1/inspect, /v1/config/validate, health, optional job acceptance that enqueues work only.
  • Data plane: Fargate / Modal / Runpod workers run doctranslate translate or embed doctranslate.api.async_translate. For the reference HTTP API, you can run doctranslate worker (ARQ) beside doctranslate serve with shared DOCTRANSLATE_API_DATA_ROOT — see HTTP API workers.
flowchart LR
  client[Client] --> apiService[ApiService]
  apiService --> jobQueue[ExternalQueue]
  jobQueue --> workerPool[WorkerPool]
  workerPool --> objectStore[ObjectStorage]
  apiService --> objectStore

Runtime envelopes (starting points)

Tune per document size and OCR flags; these are documentation defaults, not hard limits.

Profile vCPU Memory Startup budget Timeout / duration
HTTP service 1–2 2–4 GiB 20–45 s cold; 10–20 s with warm cache Align platform HTTP timeout with largest expected upload + poll; set DOCTRANSLATE_API_JOB_TIMEOUT_SECONDS for safety
CPU worker 2–4 4–8 GiB 45–120 s for heavy cold pulls Task / job runner limit (hours on Fargate)
Vision / OCR worker 4+ 8–16 GiB 60–120 s Same as worker

Filesystem: ephemeral; mount a volume for ~/.cache/doctranslate when the platform supports it (Cloud Run volumes, EFS on Fargate).

Cache / models: persist HOME/.cache/doctranslate (default user doctranslater, UID 1000) or ship offline assets (pack-offline / restore-offline; see Verification).

OSS surface (what this repo ships)

Operational checklist

  • Autoscaling: scale replicas; avoid many Uvicorn workers per replica for memory-heavy ONNX + PDF work.
  • Concurrency: lower DOCTRANSLATE_API_MAX_CONCURRENT_JOBS when OOM risk is high; combine with max instances / max tasks.
  • Isolation: separate inspect/config traffic from translate workers when possible.
  • Observability: ship container stdout/stderr to your platform logs; include job_id from API responses in client logs. Enable Prometheus (/metrics) and optional OTLP tracing via DOCTRANSLATE_* variables — see Observability.
  • Startup / assets: use /v1/health/ready with DOCTRANSLATE_API_REQUIRE_ASSETS_READY=true when you require a warmed cache before serving (see HTTP API).
  • Multi-instance HTTP API: the in-process JobManager is per replica; clients must poll the same instance that accepted 202 or you must add an external job store—documented in HTTP API.

AWS Lambda container images (caveat)

Lambda’s maximum execution time (minutes) and packaging constraints conflict with long PDF pipelines and large native dependencies. DocTranslater does not target Lambda as a primary deployment.

If you experiment anyway: use the smallest possible extra set, skip LLM paths (e.g. skip_translation smoke only), aggressive memory, and accept frequent cold starts. Prefer Fargate or Cloud Run for real workloads.

Rollout phases (contributors)

PR-sized sequencing (mirrors the serverless deployment plan):

Phase Scope Done when
1 Overview docs Serverless containers merged with platform matrix + architecture
2 Primary reference Deploy on Cloud Run + Cloud Run sample YAML
3 Secondary platforms Fargate / App Runner, Modal / Runpod, ECS sample YAML
4 Runtime profiles Docker, Docker profiles, HTTP API serverless sections + runtime reference
5 CI Docker workflow runs API boot + health and CLI skip-translation fixture smoke
6 Verify mkdocs build --strict passes; cross-links and nav updated

Risks: large cold-start downloads; OOM from high concurrency; multi-replica job polling without sticky sessions or external queue — mitigations are documented above and in the HTTP API page.