Documents are weird

What's actually in the file

The thing on screen (the paginated report, the slide deck, the spreadsheet) bears little resemblance to what sits on disk. Each format encodes the visual artifact in a way that serves the producer (a layout engine, a word processor, a printer driver) and frustrates anyone trying to recover meaning afterward.

PDF. A PDF page is a sequence of drawing instructions: "place this glyph at coordinate (x, y) in this font at this size." Paragraphs, lines, columns, and reading order are not represented. A producer is free to emit the right column before the left, interleave footnotes with body text, or duplicate a glyph across overlapping clip regions. Recovering the text in human reading order means reconstructing structure the file never contained. This is a heuristic process that works well on born-digital output from LaTeX, Word, or InDesign, and degrades on producers that cut corners.
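As a minimal sketch of the reconstruction problem described above, the following groups positioned glyphs into lines by baseline and then orders each line left to right. The `Glyph` type, the `glyphs_to_lines` helper, and the tolerance value are illustrative assumptions, not udoc's API, and the approach is deliberately naive: it ignores columns, rotation, and word spacing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical positioned glyph; y grows downward in this sketch."""
    char: str
    x: float
    y: float


def glyphs_to_lines(glyphs, y_tolerance=2.0):
    """Group glyphs into lines by baseline, then sort each line by x."""
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])
    return [
        "".join(g.char for g in sorted(members, key=lambda g: g.x))
        for _, members in lines
    ]


# Glyphs deliberately emitted out of reading order, as a producer is free to do.
page = [
    Glyph("b", 10, 20), Glyph("H", 10, 10), Glyph("y", 16, 20),
    Glyph("i", 16, 10), Glyph("e", 22, 20),
]
print(glyphs_to_lines(page))
```

Even this toy version has to pick a tolerance, which is the sense in which reading order is a heuristic rather than something read out of the file.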

DOCX. Word encodes paragraphs, runs, and properties in XML, but properties cascade from section to paragraph to run with overrides at every level. Tables nest inside cells inside tables. Revision marks coexist with the underlying text; deletions persist in the file unless explicitly rejected. The structure is recoverable in principle, though walking the cascade correctly requires care that producers routinely skip.
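The cascade can be pictured as a layered lookup in which the most specific level wins. The sketch below uses only the three levels named above and made-up property names; a real DOCX resolver also has to consult styles and document defaults, which this example leaves out.

```python
from collections import ChainMap


def effective_properties(section, paragraph, run):
    """Resolve properties so that run overrides paragraph overrides section."""
    # ChainMap searches its maps left to right, so the most specific goes first.
    return dict(ChainMap(run, paragraph, section))


# Hypothetical property dictionaries for one run of text.
section = {"font": "Calibri", "size": 11, "bold": False}
paragraph = {"size": 14}
run = {"bold": True}

props = effective_properties(section, paragraph, run)
print(props["font"], props["size"], props["bold"])
```

Skipping any one level silently yields the wrong size or weight, which is the kind of error the paragraph above warns about.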

XLSX. A spreadsheet is a sparse cell array with formatting. What a human calls "the sales table" is a contiguous rectangle that happened to be used as a table: possibly with merged header cells, a bold totals row, or three blank rows separating two unrelated data sets. Where one table ends and another begins is a judgment call the file does not record.
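To illustrate why table boundaries are a judgment call, here is a small sketch that splits a sparse cell map into row blocks wherever enough blank rows intervene. The cell representation and the `max_blank_rows` threshold are assumptions made for the example; changing the threshold changes the answer, and the file offers no ground truth either way.

```python
def split_row_blocks(cells, max_blank_rows=0):
    """Split a sparse {(row, col): value} map into (first_row, last_row) blocks.

    Two used rows stay in the same block when at most `max_blank_rows`
    empty rows sit between them.
    """
    used_rows = sorted({row for (row, _col) in cells})
    blocks = []
    for row in used_rows:
        if blocks and row - blocks[-1][-1] - 1 <= max_blank_rows:
            blocks[-1].append(row)
        else:
            blocks.append([row])
    return [(block[0], block[-1]) for block in blocks]


# Rows 1-3 hold one data set, rows 7-8 another, with three blank rows between.
cells = {
    (1, 1): "Region", (1, 2): "Sales",
    (2, 1): "North", (2, 2): 120,
    (3, 1): "South", (3, 2): 95,
    (7, 1): "Headcount", (8, 1): 42,
}
print(split_row_blocks(cells))
print(split_row_blocks(cells, max_blank_rows=3))
```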

Legacy binary formats. .doc, .xls, and .ppt are CFB / OLE2 compound files from the 1990s. DOC stores text in a piece-table structure that supports fast-save, leaving uncommitted edits alongside committed text. XLS interleaves cell values, formulas, formatting, and chart definitions in a single record stream. PPT references slides through a PersistDirectory that survived multiple file revisions intact. These formats still fill government archives, legal discovery pipelines, insurance back-offices, and decades of corporate share drives. Most modern tooling has quietly stopped supporting them.
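A file extension is a weak signal for which family a file belongs to, so a common first step is to look at the leading bytes. The sketch below checks the well-known CFB, ZIP (the container used by OOXML), and PDF signatures; the function name and return values are illustrative, and it only inspects the very start of the buffer.

```python
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file (.doc, .xls, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                      # zip local file header (OOXML)
PDF_MAGIC = b"%PDF-"                           # PDF header


def sniff_container(head: bytes) -> str:
    """Classify a file by its leading bytes; returns a short label."""
    if head.startswith(CFB_MAGIC):
        return "cfb"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


print(sniff_container(CFB_MAGIC + b"\x00" * 8))
print(sniff_container(b"%PDF-1.7\n"))
```

A CFB or ZIP match identifies only the container; telling DOC from XLS, or DOCX from XLSX, requires looking inside it.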

Scanned PDFs. A scanned document is a sequence of page-sized images wrapped in PDF structure. There is no glyph data to extract. The only path to text is rendering the page and handing it to an OCR engine. No toolkit can bypass that step. The question is whether the toolkit makes OCR straightforward to attach.

Hybrid documents. A 200-page legal exhibit often consists of 180 pages of digitally generated text plus 20 inserted scans of signed forms. Without per-page detection, you either OCR nothing and lose the forms, or OCR everything and waste time re-processing pages you already had clean text for.

Why heuristics

Where the spec or the file is silent, a parser has to choose. udoc exposes those choices as tiered APIs rather than concealing them behind a single "right answer."

Reading order. The PDF backend runs a four-tier cascade for reading-order reconstruction: content-stream order on coherent producers, X-Y cut for multi-column layouts, region-projection for complex spreads, and a layout-model override via hook for the hardest cases. Each tier is documented. The tier selected for a given page surfaces in diagnostics, and you can constrain or override the choice from the config.
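For readers unfamiliar with the second tier, the following is a textbook recursive X-Y cut over text-block bounding boxes, written as a sketch rather than as udoc's implementation. The `Box` type, the `min_gap` threshold, and the choice to try column cuts before row cuts are all assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text block; y grows downward."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _split(boxes, lo, hi, min_gap):
    """Partition boxes wherever a gap wider than min_gap opens along one axis."""
    ordered = sorted(boxes, key=lo)
    groups, reach = [[ordered[0]]], hi(ordered[0])
    for box in ordered[1:]:
        if lo(box) - reach > min_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, hi(box))
    return groups


def xy_cut(boxes, min_gap=10.0):
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    columns = _split(boxes, lambda b: b.x0, lambda b: b.x1, min_gap)
    if len(columns) > 1:
        return [b for col in columns for b in xy_cut(col, min_gap)]
    rows = _split(boxes, lambda b: b.y0, lambda b: b.y1, min_gap)
    if len(rows) > 1:
        return [b for row in rows for b in xy_cut(row, min_gap)]
    # No whitespace cut is possible: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b.y0, b.x0))


# A full-width title above two columns, emitted in scrambled order.
blocks = [
    Box("R1", 110, 40, 200, 60), Box("L1", 0, 40, 90, 60),
    Box("Title", 0, 0, 200, 20),
    Box("R2", 110, 70, 200, 90), Box("L2", 0, 70, 90, 90),
]
print([b.text for b in xy_cut(blocks)])
```

The full-width title prevents a column cut at the top level, so the first cut is horizontal; the two columns only separate one level down. Layouts where no clean whitespace cut exists are where this method stops helping.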

Tables. PDF table detection combines a ruled-lattice strategy with a text-edge column strategy and merges the results. Both can be fooled by dense unruled tables, rotated headers, or producers that emit cell content out of grid order. When they fail, a layout model attached via the hook protocol can take over. The toolkit keeps the process transparent by surfacing what it did rather than silently improvising.
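The intuition behind a text-edge column strategy can be shown with a short sketch: left edges that recur across most text lines are candidate column boundaries. This is a generic illustration under assumed inputs (one list of left-edge x positions per line) and assumed thresholds, not a description of how udoc's detector works internally.

```python
from collections import Counter


def column_edges(rows, tolerance=2.0, min_support=0.6):
    """Return x positions that recur as a left edge in most rows.

    Edges are snapped to tolerance-sized buckets, which is crude: two
    values on either side of a bucket boundary will not be merged.
    """
    counts = Counter()
    for xs in rows:
        # Count each bucket at most once per row.
        for bucket in {round(x / tolerance) for x in xs}:
            counts[bucket] += 1
    needed = min_support * len(rows)
    return sorted(b * tolerance for b, n in counts.items() if n >= needed)


# Three table rows with aligned columns, plus one unrelated line of text.
rows = [
    [10.0, 120.3, 240.1],
    [10.2, 119.8, 240.0],
    [10.1, 120.0, 239.9],
    [10.0, 60.0],
]
print(column_edges(rows))
```

A dense unruled table with ragged alignment lowers the support for every edge, which is one way such a strategy gets fooled.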

Document recovery. Real producers ship malformed xref tables, incorrect stream lengths, missing ToUnicode CMaps, broken central directories in OOXML zips, fast-saved DOC piece tables, and dozens of other specification violations. udoc's approach is lenient parsing over strict failure: try the spec, recover when it doesn't hold, and emit a typed warning instead of an exception. A partial extraction with StreamLengthMismatch warnings is more useful than an exception that aborts the pipeline.
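The try-the-spec-then-recover pattern can be sketched with a stream whose declared length is wrong. The `ParseWarning` type and `read_stream` function are hypothetical names for this example; only the StreamLengthMismatch kind comes from the text above. Scanning for the end marker is itself a heuristic, since binary payloads can contain the same bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseWarning:
    """Hypothetical typed warning carried alongside the recovered data."""
    kind: str
    detail: str


MARKER = b"endstream"


def read_stream(data: bytes, declared_length: int):
    """Return (payload, warnings), trusting the declared length only if it checks out."""
    warnings = []
    tail = data[declared_length:].lstrip(b"\r\n")
    if tail.startswith(MARKER):
        return data[:declared_length], warnings  # the spec path held
    # Recovery path: scan for the end marker instead of raising.
    end = data.find(MARKER)
    payload = data if end == -1 else data[:end].rstrip(b"\r\n")
    warnings.append(ParseWarning(
        "StreamLengthMismatch",
        f"declared {declared_length}, recovered {len(payload)}",
    ))
    return payload, warnings


payload, warnings = read_stream(b"hello world\nendstream", declared_length=5)
print(payload, [w.kind for w in warnings])
```

The caller receives usable bytes in both cases and can decide from the warning list how much to trust them.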

The principle: every place the toolkit had to guess, the guess is visible. Filter on Diagnostics.kind in your pipeline to sort the documents the parser was confident about from those that warrant manual review before trusting the text.
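A triage step along these lines might look like the sketch below. The `Diagnostic` dataclass, the shape of the `extractions` mapping, and the set of kinds that trigger review are assumptions for illustration; consult the library guide for the actual diagnostics types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical stand-in for one diagnostics entry."""
    kind: str
    page: int


# Which kinds warrant a manual look is a pipeline decision, assumed here.
REVIEW_KINDS = frozenset({"StreamLengthMismatch", "LikelyScanned"})


def triage(extractions):
    """Split document names into (confident, needs_review) by diagnostic kind."""
    confident, needs_review = [], []
    for name, diagnostics in extractions.items():
        flagged = any(d.kind in REVIEW_KINDS for d in diagnostics)
        (needs_review if flagged else confident).append(name)
    return confident, needs_review


extractions = {
    "report.pdf": [],
    "exhibit.pdf": [Diagnostic("LikelyScanned", 181)],
    "invoice.pdf": [Diagnostic("StreamLengthMismatch", 2)],
}
confident, needs_review = triage(extractions)
print(confident, needs_review)
```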

Why OCR

A frequent question: if udoc is a full document toolkit, why does it not include OCR? Because OCR is not a parser; it is a model that reconstructs text from pixels. No parser can substitute for it. The relevant question is whether the parser knows when to invoke it.

udoc's approach:

Automatic scan detection. Pages with one large image, fewer than five text spans, and no extractable glyph data are flagged as LikelyScanned on the diagnostics sink. The OCR hook fires only on those pages by default.
OCR as a hook, not a built-in. Tesseract, GLM-OCR, DeepSeek-OCR, Textract, Document AI, Azure Form Recognizer — the right engine depends on the document, the language, the hardware, the budget, and the data-egress policy. udoc does not ship one. The hook protocol lets you wire whichever engine you need, and that choice remains stable as udoc evolves.
Per-page granularity. The detector runs per page, not per document. OCR fires on scanned inserts and skips the digitally generated body.

A scanned page processed without an OCR hook returns empty text together with a LikelyScanned warning, rather than an empty result that looks like a clean extraction. That is the diagnostic contract, not a bug.
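The three detection criteria listed above translate into a short per-page predicate. In this sketch the `PageStats` fields and the 80% coverage threshold for "large" are assumptions; the text does not say how udoc defines a large image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page summary used only for this sketch."""
    page_area: float
    image_areas: tuple
    text_spans: int
    glyph_count: int


def likely_scanned(page, min_coverage=0.8):
    """One large image, fewer than five text spans, and no glyph data."""
    large = [a for a in page.image_areas if a >= min_coverage * page.page_area]
    return len(large) == 1 and page.text_spans < 5 and page.glyph_count == 0


def pages_needing_ocr(pages):
    """Return the indices of pages an OCR hook should receive."""
    return [i for i, page in enumerate(pages) if likely_scanned(page)]


pages = [
    PageStats(1000.0, (), 40, 1800),         # digital body page
    PageStats(1000.0, (1000.0,), 0, 0),      # full-page scan
    PageStats(1000.0, (50.0,), 35, 1500),    # digital page with a small logo
]
print(pages_needing_ocr(pages))
```

Because the predicate runs per page, a hybrid document sends only its scanned inserts to OCR.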

Why hooks

Hooks are udoc's extension mechanism. They exist because the operations you might perform on a document are open-ended in a way that the document model itself is not.

The toolkit converges every format onto one document model. The shape of the data is fixed: blocks, inlines, tables, presentation overlays. But the content of those blocks can be enriched by any number of external systems: OCR engines, layout detectors, NER models, classifiers, table reconcilers, language detectors, redaction filters. Embedding any of these in core would be a bet on one model family in a space that turns over every six months.

Hooks turn that liability into a pipeline:

udoc parse -> [OCR hooks] -> [layout hooks] -> [annotate hooks] -> Document

Each phase is optional. Anything that reads JSON line by line on stdin and writes JSON line by line on stdout can plug in. The hooks chapter documents the protocol; the examples directory provides working hooks for Tesseract, GLM-OCR, DeepSeek-OCR, DocLayout-YOLO, NER, and a cloud-OCR template adaptable to any provider.

The hook process is long-lived: udoc spawns it once per extraction and reuses it across every page, so model setup amortizes across the document. For async backends (cloud OCR with poll-for-result APIs), the hook owns the polling and udoc waits.
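The overall shape of a long-lived JSONL hook is sketched below: load the model once, then answer one request line with one response line until input ends. The request and response fields (`page`, `text`) and the `load_model` stand-in are invented for illustration; the hooks chapter defines the real protocol.

```python
import io
import json


def load_model():
    """Stand-in for expensive one-time setup; returns a request handler."""
    def handle(request):
        # A real hook would run OCR or layout detection here.
        return {"page": request.get("page"), "text": ""}
    return handle


def serve(inp, out, handle):
    """Answer each JSON line on `inp` with one JSON line on `out`."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        out.write(json.dumps(handle(json.loads(line))) + "\n")
        out.flush()  # the parent process is waiting on each response


# Demonstration with in-memory streams; a real hook would pass
# sys.stdin and sys.stdout instead.
requests = io.StringIO('{"page": 1}\n\n{"page": 2}\n')
responses = io.StringIO()
serve(requests, responses, load_model())
print(responses.getvalue(), end="")
```

Flushing after every line matters in practice: with buffered output, the parent can block waiting for a response that is still sitting in the hook's buffer.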

Why the toolkit philosophy

The design tenets that follow from the above are catalogued in Architecture. In short: udoc recovers and warns rather than failing a parse outright, exposes heuristic layers as tiered APIs so the caller can select a confidence level, and surfaces recoverable issues as typed diagnostics instead of stderr noise. Per-page work is deferred until the caller requests it, and every parser is in-tree.

The thread running through all of this is honesty about what documents are. Some answers are heuristic, some inputs need OCR, and a pipeline built on top is more robust when it reads the diagnostics than when it assumes every extraction returned a clean answer.

Where to next

Overview — install, highlights, and quick examples for each surface.
Library guide — the document model, configuration, diagnostics, escape hatches.
Hooks chapter — the JSONL protocol with worked Python hooks.
PDF rendering & OCR — the page rasterizer and the hook-driven OCR wiring.
Per-format guides — the quirks specific to each backend and the diagnostics they emit.
Architecture — the document model, design tenets, performance notes.