Benchmarks¶

Performance numbers for udoc against the alternatives, per format and per task. This page is a stub: the methodology and table layouts are in place, and cells marked TBD will be filled in as results are collected.

Methodology¶

Benchmarks run on two reference machines so that the numbers reflect real-world hardware variation rather than a single CPU's quirks:

Host	CPU	Memory	OS
`linux-x64`	(TBD) AMD64 desktop	TBD	Linux 6.8 (Ubuntu 24.04)
`mac-arm64`	Apple M1	TBD	macOS (latest available)

Each benchmark run reports:

Median wall time over N iterations (N to be set per benchmark).
Throughput (pages/sec or MB/sec, depending on the workload).
Peak resident set size during the run.
Any output-quality metric that applies (character accuracy, SSIM, etc.).
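As a rough illustration of how the first and third items could be collected for an external tool, here is a minimal sketch using only the Python standard library. It is not the project's harness: the function name `time_command` and the default iteration count are assumptions made for this example, and the `resource` module is available on Unix-like hosts only.

```python
import resource
import statistics
import subprocess
import sys
import time


def time_command(argv, iterations=5):
    """Run argv `iterations` times; return (median_seconds, peak_rss_bytes).

    Hypothetical helper for illustration, not part of udoc or its harness.
    """
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    # ru_maxrss is the high-water mark over every child this process has
    # waited for so far, so measure one tool per harness process.
    # Units differ by OS: kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    scale = 1 if sys.platform == "darwin" else 1024
    return statistics.median(timings), peak * scale
```

A call such as `time_command(["pdftotext", "paper.pdf", "-"], iterations=10)` would time poppler's extractor writing to stdout; `paper.pdf` is a placeholder path. Throughput then follows by dividing the input size or page count by the median.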

Benchmarks are not strictly apples-to-apples — tools differ in what they extract, in default DPI, in rendering profile, and in how strictly they reject bad input. Differences are noted per tool. Where a tool has a knob that materially changes the answer (e.g. pdftotext -layout), both modes are reported.

Corpus¶

A standing benchmark corpus lives in (TBD path / link). It mixes:

Born-digital PDFs — research papers (LaTeX), enterprise reports (Word print), marketing collateral (InDesign).
Scanned PDFs — government records, legal exhibits, pre-digital archives. Used for OCR-related benchmarks, not for raw text-extraction comparisons.
Hybrid PDFs — body pages digital, inserts scanned. The realistic input that exercises per-page detection.
Mixed Office documents — DOCX, XLSX, PPTX, plus the legacy binary formats (DOC, XLS, PPT) and OpenDocument equivalents.
Stress cases — pathological inputs (deeply nested tables, thousand-page reports, gigabyte-scale PDFs) used to exercise resource limits and memory budgeting rather than throughput.

Per-document provenance and a license review are documented alongside the corpus. Documents that cannot be redistributed are referenced by source and replayed locally.

PDF: text extraction¶

Comparisons against the common open-source PDF text extractors:

Tool	Source	Notes
`udoc`	this project	Default reading-order tier auto-selected.
`pdftotext`	poppler	Both default and `-layout` mode reported separately.
`pdfium`	Chromium	Reference for "what the browser sees."
`mupdf`	Artifex	`mutool draw -F txt` / `mutool convert -F txt`.
`pdfminer`	python	Pure-Python baseline for the Python ecosystem.
`pymupdf`	wrapper on mupdf	Python convenience baseline; same engine as `mupdf`.

Character accuracy¶

Accuracy is measured against a hand-curated ground-truth subset of the corpus (TBD count). Metric: edit distance per 1k characters between extracted text and the reference, after Unicode NFC normalisation and whitespace collapsing.
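The metric above can be sketched in a few lines of Python. This is an illustrative reading, not the harness code: it assumes plain Levenshtein distance (the page does not name the edit-distance variant), assumes the reference length is the denominator, and trims leading and trailing whitespace as part of collapsing. All function names are invented for the example.

```python
import re
import unicodedata


def normalise(text):
    """NFC-normalise, then collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a, b):
    """Edit distance (insert / delete / substitute) with a two-row table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def errors_per_1k(extracted, reference):
    """Edits per 1,000 reference characters (assumed denominator)."""
    ref = normalise(reference)
    ext = normalise(extracted)
    if not ref:
        raise ValueError("reference text is empty after normalisation")
    return 1000.0 * levenshtein(ext, ref) / len(ref)
```

Lower is better under this reading: a score of 0 means the extracted text matches the reference exactly after normalisation. The quadratic-time distance is fine for page-sized strings but would need a faster implementation for whole documents.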

Document class	udoc	pdftotext (default)	pdftotext (-layout)	pdfium	mupdf	pdfminer	pymupdf
Research papers	TBD	TBD	TBD	TBD	TBD	TBD	TBD
Reports (Word PDF)	TBD	TBD	TBD	TBD	TBD	TBD	TBD
Marketing / InDesign	TBD	TBD	TBD	TBD	TBD	TBD	TBD
Multi-column	TBD	TBD	TBD	TBD	TBD	TBD	TBD
Tables-heavy	TBD	TBD	TBD	TBD	TBD	TBD	TBD

Throughput¶

Throughput is measured as MB/sec of input PDF processed, summed over the corpus. Single-threaded; the parallel-throughput numbers are reported separately under Batch processing.
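One way to read "summed over the corpus" is total input bytes divided by total wall time, which weights large documents more heavily than a mean of per-document rates would. The sketch below assumes that reading and assumes 1 MB = 1,000,000 bytes; neither is confirmed by this page, and the function name is made up for the example.

```python
def corpus_throughput_mb_per_sec(samples):
    """samples: iterable of (input_bytes, wall_seconds), one pair per document.

    Assumes the aggregate is sum(bytes) / sum(seconds) and 1 MB = 10**6 bytes.
    """
    total_bytes = 0
    total_seconds = 0.0
    for input_bytes, wall_seconds in samples:
        total_bytes += input_bytes
        total_seconds += wall_seconds
    if total_seconds <= 0:
        raise ValueError("no elapsed time recorded")
    return (total_bytes / 1_000_000) / total_seconds
```

For two documents of 1 MB in 1 s and 9 MB in 3 s, this gives 10 MB / 4 s = 2.5 MB/sec, whereas averaging the per-document rates (1 and 3 MB/sec) would give 2.0.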

Host	udoc	pdftotext	pdfium	mupdf	pdfminer	pymupdf
`linux-x64`	TBD	TBD	TBD	TBD	TBD	TBD
`mac-arm64`	TBD	TBD	TBD	TBD	TBD	TBD

Reading order¶

Reading-order quality is measured separately because edit distance on text extracted in the wrong order can still score well. Metric: sequence alignment error against the ground-truth reading order on a multi-column subset of the corpus.

A per-tier breakdown is reported alongside, since udoc's reading-order pipeline is a four-tier cascade (see PDF format guide) and the tier selected for a document matters for interpreting the result.
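The page does not pin down the exact alignment score, so the following is only one plausible sketch of a sequence alignment error: it treats each text block as an ID, finds the longest common subsequence (LCS) between the extracted order and the ground-truth order, and reports the fraction of ground-truth blocks left outside that alignment. The block IDs and function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def reading_order_error(predicted, truth):
    """Fraction of ground-truth blocks outside the longest in-order alignment."""
    if not truth:
        raise ValueError("ground-truth order is empty")
    return 1.0 - lcs_length(predicted, truth) / len(truth)
```

For a two-column page with ground truth `["L1", "L2", "R1", "R2"]`, an extractor that reads straight across the columns and emits `["L1", "R1", "L2", "R2"]` scores 0.25, while a correct ordering scores 0.0.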

PDF: rendering¶

Comparisons against the common open-source PDF rasterisers:

Tool	Source	Notes
`udoc`	this project	Both `viewer` and `ocr` profiles reported.
`poppler`	`pdftoppm`	Default Splash backend (the cairo backend is exposed through `pdftocairo` instead).
`pdfium`	Chromium	The "what the browser sees" reference.
`mupdf`	Artifex	`mutool draw -r <dpi> -o page-%d.png`.

SSIM¶

Structural similarity index against the chosen reference at a given DPI. The reference rasteriser is selected per document class (typically mupdf for born-digital, since it tracks the PDF spec most strictly).

Document class	DPI	udoc (viewer)	udoc (ocr)	poppler	pdfium	mupdf (ref)
Research papers	150	TBD	TBD	TBD	TBD	1.000
Research papers	300	TBD	TBD	TBD	TBD	1.000
Reports	150	TBD	TBD	TBD	TBD	1.000
Marketing	150	TBD	TBD	TBD	TBD	1.000
Forms (AcroForm)	150	TBD	TBD	TBD	TBD	1.000
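For readers unfamiliar with the score in the table above, here is a deliberately simplified SSIM sketch in pure Python. It computes a single window over the whole image, whereas SSIM as usually applied averages the same formula over small local windows, so it shows the shape of the calculation rather than the harness implementation. Inputs are assumed to be flat sequences of 8-bit grayscale values from two renders of the same page at the same size.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((p - mean_x) * (p - mean_x) for p in x) / n
    var_y = sum((q - mean_y) * (q - mean_y) for q in y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / n
    # Standard stabilising constants: K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs score 1, which is why the reference column is fixed at 1.000; an inverted image has negative covariance and scores below zero.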

Render speed¶

Pages per second at a fixed DPI, single-threaded.

Host	DPI	udoc (viewer)	udoc (ocr)	poppler	pdfium	mupdf
`linux-x64`	150	TBD	TBD	TBD	TBD	TBD
`linux-x64`	300	TBD	TBD	TBD	TBD	TBD
`mac-arm64`	150	TBD	TBD	TBD	TBD	TBD
`mac-arm64`	300	TBD	TBD	TBD	TBD	TBD

PDF: tables¶

Table detection is heuristic across all of these tools; metrics have to be careful about what counts as a "correct" table. The metric here is per-cell content match against a ground-truth table set.

Tool	Source	Notes
`udoc`	this project	Built-in lattice + text-edge strategies merged.
`pdfplumber`	python wrapper	The pdfminer.six-based table extractor used in the Python ecosystem.
`tabula`	java	The table-extraction reference for many ETL pipelines.
`camelot`	python	Both `lattice` and `stream` flavours.

Table style	udoc	pdfplumber	tabula	camelot (lattice)	camelot (stream)
Ruled, simple	TBD	TBD	TBD	TBD	TBD
Ruled, merged	TBD	TBD	TBD	TBD	TBD
Unruled, columns	TBD	TBD	TBD	TBD	TBD
Rotated headers	TBD	TBD	TBD	TBD	TBD
Multi-page	TBD	TBD	TBD	TBD	TBD
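The per-cell content match behind these scores could be computed along the following lines. This is a sketch under assumptions the page does not confirm: tables are row-major lists of cell strings, cells are compared at the same (row, column) position after whitespace collapsing, and the score is the fraction of ground-truth cells reproduced. The function name is invented for the example.

```python
def cell_match_score(extracted, truth):
    """Fraction of ground-truth cells reproduced at the same (row, column)."""
    def clean(cell):
        # Collapse internal whitespace so line wraps inside a cell do not count.
        return " ".join(str(cell).split())

    total = 0
    matched = 0
    for r, truth_row in enumerate(truth):
        for c, truth_cell in enumerate(truth_row):
            total += 1
            in_bounds = r < len(extracted) and c < len(extracted[r])
            if in_bounds and clean(extracted[r][c]) == clean(truth_cell):
                matched += 1
    if total == 0:
        raise ValueError("ground-truth table has no cells")
    return matched / total
```

Note that this version only measures recall: spurious extra rows or columns in the extracted table are not penalised, and a single shifted column zeroes out every cell to its right. A production metric would likely need to handle both.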

DOCX / XLSX / PPTX¶

For the modern OOXML stack, the comparison set is the language-native libraries:

Format	Reference tools
DOCX	`python-docx`, `mammoth` (DOCX-to-HTML), `docx2txt`, `pandoc`
XLSX	`openpyxl`, `xlsx2csv`, `libreoffice --headless --convert-to csv`
PPTX	`python-pptx`, `pptx2text`, `libreoffice --headless --convert-to txt`

Per-format throughput, character accuracy, and table-quality numbers will follow the same per-host, per-document-class shape as the PDF tables above.

Legacy binary Office (DOC / XLS / PPT)¶

The legacy formats are where most modern tooling either falls back on a LibreOffice subprocess or fails outright. Comparisons:

Tool	Notes
`udoc`	In-tree Rust parser per format.
`antiword` (DOC)	The classic stdout-only DOC reader. No table support.
`catdoc` (DOC / XLS / PPT)	Single-binary suite. Older Unicode handling.
`libreoffice --headless --convert-to`	The general-purpose subprocess fallback. Heavy startup cost.
`python-oletools`	Forensic-grade access; not optimised for throughput.

The interesting axes here are throughput (LibreOffice startup cost is a dominant factor for short documents), character accuracy (codepage decoding correctness), and structural recovery on fast-saved DOC files.

Hooks and OCR¶

OCR throughput is a function of the OCR engine, not udoc — but the per-page overhead udoc adds (page render, base64, JSONL, sequence-number bookkeeping) is worth measuring on its own. The microbenchmark:

A no-op OCR hook that returns empty text immediately.
Measures udoc-side overhead per page.
Reported for viewer and ocr render profiles at 150 / 300 DPI.
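To make the idea of a no-op hook concrete, here is a minimal sketch of a process that reads one JSON object per line and replies immediately with empty text. The field names `seq` and `text`, and the use of stdin/stdout as the transport, are hypothetical placeholders: this page mentions JSONL and sequence numbers but does not describe the real hook schema, so consult the hook documentation before relying on any of it.

```python
import json
import sys


def handle_request(line):
    """Answer one JSONL request with an empty OCR result.

    "seq" and "text" are hypothetical field names used for illustration;
    they are not taken from udoc's actual hook protocol.
    """
    request = json.loads(line)
    response = {"seq": request.get("seq"), "text": ""}
    return json.dumps(response)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Reply to each non-blank input line, flushing after every page."""
    for line in stdin:
        if line.strip():
            stdout.write(handle_request(line) + "\n")
            stdout.flush()  # reply per page so the caller is never left waiting


# To run this as a standalone hook script, call main() here.
```

Because the hook never decodes the image payload or does any recognition work, whatever per-page time remains is attributable to the caller's side of the exchange.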

End-to-end OCR throughput numbers are reported for the example hooks shipped under examples/hooks/:

Hook	OCR engine	Notes
`tesseract-hook`	Tesseract 5	CPU baseline.
`glm-ocr-hook`	GLM-OCR	GPU; numbers reported on a CUDA-equipped host.
`deepseek-ocr-hook`	DeepSeek-OCR	GPU; same caveat.

Font engine (`udoc-font`)¶

Per-task microbenchmarks against FreeType, the reference font engine for the open-source ecosystem:

Task	udoc-font	FreeType	Notes
TrueType glyph outline (cold cache)	TBD	TBD	Synthetic load: 10k random glyphs from a representative TTF.
TrueType glyph outline (warm cache)	TBD	TBD	Same load, second pass.
CFF glyph outline	TBD	TBD	OTF font fixture.
Type 1 outline	TBD	TBD	Legacy PostScript fixture (PDF-embedded).
ToUnicode CMap parse	TBD	n/a	Specific to PDF; no FreeType comparison.
`cmap` table lookup	TBD	TBD	1k character codes against a representative font.
Auto-hinter (single glyph)	TBD	TBD	Software auto-hinter for unhinted CFF / Type 1.

Image decoders (`udoc-image`)¶

Per-format throughput against the most commonly used decoder for each codec. The PDF use case is what motivates this crate, so the benchmark inputs are representative PDF-embedded images rather than standalone files.

Codec	udoc-image	Reference	Notes
CCITT	TBD	`libtiff` / Group4	Single-bit fax-style scans common in older PDFs.
JBIG2	TBD	`jbig2dec`	Compressed scans common in archive PDFs.
JPEG	TBD	`libjpeg-turbo`	The default colour-image codec.
JPEG 2000	TBD	`OpenJPEG`	Used in high-quality archive PDFs and some scan workflows.

Per-codec metrics: decode throughput (MB/sec), peak RSS during decode, and a small correctness suite (bit-exact against the reference for JPEG / CCITT; SSIM against the reference for lossy decoders).

Batch processing¶

End-to-end throughput for a representative ingest workload: N documents, mixed formats, processed by udoc.Corpus.parallel(...) on each host. Reported as documents/sec and MB/sec, with peak RSS across the worker pool.

Host	Workers	Mode	docs/sec	MB/sec	peak RSS
`linux-x64`	1	inline	TBD	TBD	TBD
`linux-x64`	4	thread	TBD	TBD	TBD
`linux-x64`	4	process	TBD	TBD	TBD
`linux-x64`	16	process	TBD	TBD	TBD
`mac-arm64`	1	inline	TBD	TBD	TBD
`mac-arm64`	4	thread	TBD	TBD	TBD
`mac-arm64`	8	process	TBD	TBD	TBD

The interesting cross-cuts:

Where the GIL still bites (thread mode trends).
The crossover point between thread and process mode.
The Config(memory_budget=...) setting's effect on peak RSS at high worker counts.
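The thread-versus-process comparison can be illustrated with a generic stand-in built on `concurrent.futures`. This sketch does not call `udoc.Corpus.parallel(...)`, whose signature is not shown on this page; `parse` is a hypothetical callable taking one path, and the mode names simply mirror the table above.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def measure_docs_per_sec(parse, paths, workers, mode):
    """Run parse(path) over paths in the given mode; return documents/sec.

    `parse` is a hypothetical stand-in for whatever does the real work.
    """
    pools = {"thread": ThreadPoolExecutor, "process": ProcessPoolExecutor}
    start = time.perf_counter()
    if mode == "inline":
        for path in paths:
            parse(path)
    else:
        with pools[mode](max_workers=workers) as pool:
            # list() forces every result, so the timer covers all the work.
            list(pool.map(parse, paths))
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed
```

In process mode `parse` must be picklable, which in practice means a module-level function. Running the same workload at increasing worker counts in both modes and plotting documents/sec against workers is one way to locate the crossover point.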

Reproducing¶

The benchmark harness lives at (TBD scripts/bench/). It wraps each tool in a uniform driver and emits CSV output suitable for Pandas / DuckDB analysis. To reproduce on a new host:

# (TBD harness invocation)
scripts/bench/run.sh --corpus path/to/corpus --out results/
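Once the harness has written its CSV, a first pass over the results needs nothing beyond the standard library. The sketch below assumes columns named `host`, `tool`, and `wall_seconds`; these names are invented for illustration, since the harness output format is not yet specified, and the sample rows are made-up values rather than measurements.

```python
import csv
import io
import statistics
from collections import defaultdict


def median_by_tool(csv_file):
    """Group rows by (host, tool) and return the median wall time per group.

    Column names are hypothetical; adjust them to the real harness output.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[(row["host"], row["tool"])].append(float(row["wall_seconds"]))
    return {key: statistics.median(values) for key, values in groups.items()}


# Made-up rows to show the expected shape; not benchmark results.
sample = io.StringIO(
    "host,tool,wall_seconds\n"
    "linux-x64,udoc,1.0\n"
    "linux-x64,udoc,3.0\n"
    "linux-x64,udoc,2.0\n"
)
medians = median_by_tool(sample)
```

With a real results file, pass an open file handle instead of the `StringIO` object.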

Pull requests adding tools to the comparison set are welcome. Keep the comparison as close to apples-to-apples as the tools allow, document any tool-specific flags in the harness, and call out where a tool's defaults are materially different from udoc's.

Caveats¶

Numbers are wall-clock medians on otherwise idle hosts. CI-host numbers are not reported because they vary too much.
udoc's defaults are tuned for correctness over speed; disabling overlays via Config.layers materially improves throughput for callers that only need text. The "udoc default" column is the conservative one; a "udoc text-only" column is added where it changes the picture.
Memory measurements use peak RSS as observed by the OS. RSS can understate actual cost when the kernel reclaims clean pages under memory pressure; the benchmark harness pins workloads to a single NUMA node on linux-x64 to keep that consistent.
Quality metrics are noisy on small samples. Confidence intervals are reported alongside the medians where the sample size makes them meaningful.
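The page does not say how the confidence intervals are computed. One common choice for a median is the percentile bootstrap, sketched here as an assumption rather than a description of the harness; the function name, resample count, and fixed seed are all illustrative.

```python
import random
import statistics


def bootstrap_median_ci(samples, confidence=0.95, resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the median of samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)  # fixed seed keeps reruns reproducible
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    low = medians[int(tail * resamples)]
    high = medians[min(resamples - 1, int((1.0 - tail) * resamples))]
    return low, high
```

With only a handful of iterations the interval will be wide, which is the point of the caveat above: it makes visible when two medians are too close to distinguish.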