Architecture¶
A short tour of how udoc is put together. You do not need this to use the library; read it when you want to understand a behaviour, contribute a recommendation, or build something on top.
Design tenets¶
- Lenient parsing beats strict failure. Real documents lie. We try the spec, recover when it doesn't hold, and warn instead of abort. A partial extraction with known caveats beats an exception.
- Tiered APIs over false certainty. Where layout analysis is hard, expose a cascade and let the caller pick the tier they trust. PDF text has text()/text_lines()/raw_spans(); reading order runs four tiers internally. We don't hide failure behind one method.
- Diagnostics are a feature. Recoverable issues flow through DiagnosticsSink as typed warnings. Pipelines filter on kind. Silent recovery is its own kind of bug.
- Page-oriented, deferred work. Backends operate on pages and defer expensive work until a caller asks for it. Extractor::open(path) doesn't interpret a content stream until you call page_text(i).
- Unsafe stays isolated. Workspace-wide #![deny(unsafe_code)], one audited exception in udoc_pdf::io::mmap_impl. New unsafe requires a deliberate, reviewable change.
- Vertical ownership. Parsers live in-tree. No subprocesses to system tools. When a document does something unusual, the fix lands in this codebase.
- Permissive licence. Dual MIT / Apache-2.0.
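The page-oriented, deferred-work tenet can be illustrated with a small self-contained model. This is a sketch, not udoc's implementation: LazyExtractor, its constructor argument, and the interpret_calls counter are hypothetical names; only the open-then-page_text(i) shape comes from the tenet above.

```python
class LazyExtractor:
    """Hypothetical model: opening only records page sources."""

    def __init__(self, page_streams: list[bytes]) -> None:
        self._streams = page_streams   # raw, uninterpreted page content
        self.interpret_calls = 0       # counts expensive work actually done

    def page_count(self) -> int:
        # Cheap: needs no content-stream interpretation.
        return len(self._streams)

    def page_text(self, index: int) -> str:
        # The expensive step happens here, on demand, for one page only.
        self.interpret_calls += 1
        return self._streams[index].decode("latin-1").strip()


ex = LazyExtractor([b" page one ", b" page two ", b" page three "])
pages = ex.page_count()     # no page has been interpreted yet
second = ex.page_text(1)    # interprets exactly one page
```

Asking for a single page out of three does one unit of work, which is the property that keeps large documents cheap to open.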
The 5-layer document model¶
Backends converge on a single Document shape:
import udoc

doc = udoc.extract("paper.pdf")
doc.content        # list[Block]: paragraphs, headings, tables, lists, images
doc.metadata       # title, author, page_count, created, ...
doc.presentation   # bounding boxes, fonts, colours (overlay)
doc.relationships  # links, footnotes, bookmarks (overlay)
doc.interactions   # form fields, comments, tracked changes (overlay)
doc.images         # shared image store referenced by Block::Image
doc.content — content spine. The text-bearing tree. Block
nodes (paragraphs, headings, lists, tables, image references, code
blocks, page breaks, sections, shapes) hold Inline children (text
spans, bold, italic, links, code, footnote refs, soft and hard
breaks, inline images). Each node carries a typed node_id into the
document's arena. Always present. This is what udoc -t walks for
tables, what the markdown emitter renders, and what block.text
returns for plain text.
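To make the Block / Inline nesting concrete, here is a minimal self-contained model of how plain text could fall out of the tree. The class layout, the kind strings, and the plain() helper are assumptions made for illustration; the real types live in udoc-core and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Inline:
    kind: str                                   # e.g. "text", "bold", "link"
    text: str = ""
    children: list["Inline"] = field(default_factory=list)

    def plain(self) -> str:
        # Depth-first concatenation: own text, then nested inlines.
        return self.text + "".join(c.plain() for c in self.children)


@dataclass
class Block:
    node_id: int                                # stand-in for the typed arena id
    kind: str                                   # e.g. "paragraph", "heading"
    inlines: list[Inline] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(i.plain() for i in self.inlines)


para = Block(
    node_id=7,
    kind="paragraph",
    inlines=[
        Inline("text", "Real documents "),
        Inline("bold", children=[Inline("text", "lie")]),
        Inline("text", "."),
    ],
)
```

Styling lives in the inline kinds, so dropping them while concatenating is enough to get the plain-text view.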
doc.metadata — document facts. Title, author, subject, creator,
producer, creation and modification dates, page count, plus a
properties map for format-specific extended fields (Dublin Core
from OOXML / ODF cores, PDF Info dictionary entries, OOXML extended
properties from app.xml). Always present, even when all fields
are None for documents that didn't carry metadata.
doc.presentation — geometry overlay. Where things live on the
page and how they look. Bounding boxes per block, font name + size +
styling per text span, fill and stroke colours, paint paths (for PDF
rendering), page geometry (rotation, media box, crop box, page size).
Optional. Disable via Config(presentation=False) if you only need
text — the spine is unaffected. The PDF renderer reads from this
overlay, and downstream layout models consume the bounding boxes for
region-of-interest crops.
doc.relationships — link overlay. The connections between
content. Footnote and endnote definitions paired with their inline
references, hyperlinks (URL targets and anchor ranges), bookmark
targets, table-of-contents entries, cross-references between blocks.
Each entry references a node_id from the content spine and resolves
to its target. PDF link annotations, DOCX / ODF cross-refs, and
Markdown link reference definitions all flatten into this overlay.
Optional.
doc.interactions — actionable overlay. Things a viewer can act
on. Form fields (text, checkbox, radio, select, signature), comments
threaded by author, tracked changes — insertions, deletions, and
formatting revisions stamped with author and timestamp. PDF AcroForm
fields and DOCX revision marks live here. Documents without any of
these features carry an empty Interactions; the overlay is
independently optional via Config(interactions=False).
doc.images — shared image store. Each Block::Image and
Inline::InlineImage carries an ImageRef index into doc.images;
the actual bitmap bytes plus metadata (width, height, MIME type,
original filter chain) live once in this Vec. An image referenced
N times is stored once — important for slide decks with repeated
logos and DOCX with repeated header images. The store is part of
the spine, not an overlay; if you want extraction to skip image
bytes entirely, set Config(images=False) and the references in
the content tree become empty placeholders.
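The store-once behaviour can be sketched as an interning table. The text above does not say how udoc recognises a repeated image, so keying on a SHA-256 digest of the bytes is an assumption here, and ImageStore and intern() are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ImageStore:
    """Hypothetical store: one entry per distinct payload, refs are indices."""

    images: list[bytes] = field(default_factory=list)
    _by_digest: dict[str, int] = field(default_factory=dict)

    def intern(self, data: bytes) -> int:
        # Assumed dedup key: a content hash of the image bytes.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._by_digest:
            self._by_digest[digest] = len(self.images)
            self.images.append(data)
        return self._by_digest[digest]


store = ImageStore()
logo = b"logo bytes"
# A slide deck that shows the same logo on the first and third slide.
refs = [store.intern(logo), store.intern(b"chart bytes"), store.intern(logo)]
```

Both logo references resolve to the same index, so the bytes are held once regardless of how many blocks point at them.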
Overlays are independently toggleable on the Config. Disabling one
skips the work that produces it; the spine (text, tables, image
references) is unaffected.
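A rough model of what "skips the work that produces it" means in practice. SketchConfig and build() are stand-ins written for this example, not udoc's Config; only the presentation and interactions flag names are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    presentation: bool = True
    interactions: bool = True


def build(cfg: SketchConfig) -> dict[str, object]:
    work_done = ["spine"]                        # the spine is always built
    doc: dict[str, object] = {
        "content": ["paragraph"],
        "presentation": None,
        "interactions": None,
    }
    if cfg.presentation:                         # geometry pass runs only on request
        work_done.append("presentation")
        doc["presentation"] = {"bboxes": []}
    if cfg.interactions:                         # forms / comments pass likewise
        work_done.append("interactions")
        doc["interactions"] = {"form_fields": []}
    doc["work_done"] = work_done
    return doc


text_only = build(SketchConfig(presentation=False))
```

The content list is identical with or without the overlay; only the optional pass is skipped.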
The backend trait¶
All format-specific code lives behind one trait:
trait FormatBackend {
    type Page<'a>: PageExtractor where Self: 'a;
    fn page_count(&self) -> usize;
    fn page(&mut self, index: usize) -> Result<Self::Page<'_>>;
    fn metadata(&self) -> &DocumentMetadata;
    // ...
}

trait PageExtractor {
    fn text(&mut self) -> Result<String>;
    fn text_lines(&mut self) -> Result<Vec<TextLine>>;
    fn raw_spans(&mut self) -> Result<Vec<TextSpan>>;
    fn tables(&mut self) -> Result<Vec<Table>>;
    fn images(&mut self) -> Result<Vec<PageImage>>;
}
Each backend (udoc-pdf, udoc-docx, udoc-xlsx, ...) implements
these. The facade dispatches to the right backend based on format
detection, calls into the trait, and converts the result into the
unified Document model.
The macro define_internal_backend! wires a backend into the facade
in about seven lines per format; the conversion from format-specific
types to the core Document lives in each backend's convert.rs.
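How the facade detects formats is not described here. As an illustration only, a first pass could look at well-known file signatures before picking a backend. The signatures below are standard; the function name, the table, and the order of checks are assumptions.

```python
from typing import Optional

# Well-known leading bytes. ZIP and CFB are containers, so a real detector
# would still have to look inside to tell DOCX from XLSX, or DOC from XLS.
SIGNATURES: list[tuple[bytes, str]] = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "cfb"),   # legacy Office container
    (b"PK\x03\x04", "zip"),                          # OOXML and ODF packages
]


def detect_container(head: bytes) -> Optional[str]:
    """Return a coarse container name from the first bytes, or None."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None


pdf_kind = detect_container(b"%PDF-1.7\n")
zip_kind = detect_container(b"PK\x03\x04\x14\x00")
unknown = detect_container(b"plain text")
```

Formats without a signature, such as Markdown, would need a different signal (for example the file extension), which this sketch leaves out.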
The crates¶
| Crate | Role |
|---|---|
| udoc | The facade. Public API, CLI binary, format detection, conversion glue. |
| udoc-py | Python bindings (PyO3) over the same engine. |
| udoc-core | Format-agnostic types: Document, Block, Inline, NodeId, TextSpan, Table, PageImage, Error, DiagnosticsSink, FormatBackend, PageExtractor. |
| udoc-containers | Shared parsers: ZIP (OOXML / ODF), namespace-aware XML, CFB / OLE2 (legacy Office), OPC (OOXML packages). |
| udoc-pdf | PDF parser. Layered: io → parse → object → font → content → text → document. |
| udoc-font | Font engine. TrueType, CFF, Type 1, hinting, cmaps, ToUnicode. |
| udoc-image | Image decoders. CCITT, JBIG2, JPEG, JPEG 2000. |
| udoc-render | PDF page rasteriser. Auto-hinter, font cache, glyph compositor. |
| udoc-docx | DOCX backend (ZIP + XML). |
| udoc-xlsx | XLSX backend. Typed cells, shared strings, number-format mini-language. |
| udoc-pptx | PPTX backend. Shape tree, slide layouts, notes slides. |
| udoc-doc | Legacy DOC backend (CFB + FIB + piece table). |
| udoc-xls | Legacy XLS backend (BIFF8). |
| udoc-ppt | Legacy PPT backend (CFB + PowerPoint records + PersistDirectory). |
| udoc-odf | ODF backend covering ODT / ODS / ODP from a single crate. |
| udoc-rtf | RTF parser. Control words, groups, codepage decoding. |
| udoc-markdown | Markdown parser. CommonMark + GFM tables. |
Backend internals¶
Each backend has its own internal layering. The format guides cover the per-backend architecture: see PDF, DOCX, XLSX, PPTX, DOC, XLS, PPT, ODF, RTF, and Markdown.
Performance notes¶
- Zero-copy lexer. Tokens borrow from the input buffer; only the object parser converts to owned PdfObject types. Most objects are never dereferenced, so the lexer is the hot path and the parser is the cold path. See crates/udoc-pdf/src/parse/lexer.rs.
- Hash-DoS-resistant maps on attacker-controlled keys. ahash on the PDF object resolver, font cmap tables, ToUnicode lookups, and ZIP central directory. The SipHash default in std is DoS-resistant but slow; ahash is faster and still DoS-resistant. The audit-and-swap cut wall time roughly 15% on a 200-document Archive.org sample.
- Per-page (font, glyph) decode cache. 256-entry approximate-LRU in crates/udoc-pdf/src/content/decode_cache.rs. Caches decoded Unicode strings by (font_obj_ref, packed_code). Per-page rather than per-document, so multi-thousand-page reports do not balloon the cache. Covers Latin + CJK paragraphs at ~8 KB per page.
- Pre-sized hot-path vectors. Glyph count is known up front from the code length; Vec::with_capacity(byte_count / code_len) for bbox and advance vectors avoids incremental realloc per span.
- Per-page move-semantics. The raw-span → PositionedSpan emit loop consumes spans by value rather than cloning. Dropped 8 clones per span (text, font name, char advances, etc.) on a typical page.
- Stream filter buffer pool. Thread-local buffer pool in udoc_pdf::object::stream reduces large allocations the kernel sees, sidestepping a mm_struct rwsem contention point at high thread counts.
- Memory budget. Config::memory_budget is a soft per-process RSS cap that triggers between-document cache resets. Use it when ingesting 10K+ documents in one process to bound peak heap.
- Reading-order tier 1 fast path. If the content stream is already laid out in reading order (the case for most LaTeX, Word, and InDesign output), the geometric reordering pipeline is skipped. Coherence is detected with a Y-monotonicity check inside detected column regions; threshold 0.75. See the PDF format guide.
Diagnostics¶
Recoverable issues during extraction are not exceptions. They flow
through a DiagnosticsSink trait that callers can attach. The default
sink drops warnings; the CLI's default sink prints them on stderr;
batch workers typically attach a CollectingDiagnostics and aggregate
results.
A warning carries:
- a structured kind (an enum, not a string).
- a level (Warning or Info).
- a context (page_index, obj_ref).
- an optional byte offset.
- a human message.
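The fields above map naturally onto a small record type plus a collecting sink. The sketch below is a Python model written for illustration: Diagnostic, CollectingSink, emit(), and of_kind() are hypothetical names, the obj_ref shape is assumed, and the kind names are borrowed from this page's list of common kinds.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    STREAM_LENGTH_MISMATCH = "StreamLengthMismatch"
    TO_UNICODE_MISSING = "ToUnicodeMissing"
    TIER_SELECTION = "TierSelection"


class Level(Enum):
    WARNING = "Warning"
    INFO = "Info"


@dataclass(frozen=True)
class Diagnostic:
    kind: Kind
    level: Level
    message: str
    page_index: Optional[int] = None
    obj_ref: Optional[tuple[int, int]] = None    # assumed: (object number, generation)
    byte_offset: Optional[int] = None


@dataclass
class CollectingSink:
    items: list[Diagnostic] = field(default_factory=list)

    def emit(self, diag: Diagnostic) -> None:
        self.items.append(diag)

    def of_kind(self, kind: Kind) -> list[Diagnostic]:
        # Filter on the enum, never on the message text.
        return [d for d in self.items if d.kind is kind]


sink = CollectingSink()
sink.emit(Diagnostic(Kind.TIER_SELECTION, Level.INFO, "tier 1 selected", page_index=0))
sink.emit(Diagnostic(Kind.TO_UNICODE_MISSING, Level.WARNING, "font F1 has no ToUnicode", page_index=0))
sink.emit(Diagnostic(Kind.TO_UNICODE_MISSING, Level.WARNING, "font F3 has no ToUnicode", page_index=2))

pages_with_bad_fonts = sorted({d.page_index for d in sink.of_kind(Kind.TO_UNICODE_MISSING)})
```

Because kind is an enum, a batch job can aggregate by it across thousands of documents without parsing message strings.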
Common kinds you will see in the wild:
| kind | When |
|---|---|
| StreamLengthMismatch | PDF /Length is wrong; recovered by scanning for endstream. |
| ToUnicodeMissing | Font has no ToUnicode CMap; encoding-table fallback used. |
| MalformedXref | An xref entry was malformed; skipped, parse continued. |
| UnsupportedFilter | Stream uses a filter we do not implement; stream dropped. |
| DocFastSaveFallback | Word 95 fast-save piece-table fragment; text may be empty. |
| TierSelection | Reading-order tier picked for this page (Info, not Warning). |
| HookTimeout | A hook exceeded its per-request timeout; page un-augmented. |
Agents and CI pipelines filter on kind. Humans read the message.
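As a worked example of one kind, the StreamLengthMismatch recovery (trust /Length only if endstream follows it, otherwise scan) could look roughly like this. It is a simplified sketch: read_stream and its parameters are hypothetical, warnings are plain strings rather than typed records, returning empty bytes when no endstream exists is an assumption, and a naive scan can be fooled by binary data that happens to contain the keyword.

```python
def read_stream(buf: bytes, data_start: int, declared_len: int, warnings: list[str]) -> bytes:
    """Return stream bytes, trusting /Length only when `endstream` follows it."""
    marker = b"endstream"
    end = data_start + declared_len
    # An end-of-line may sit between the data and the keyword.
    tail = buf[end:end + 2 + len(marker)].lstrip(b"\r\n")
    if tail.startswith(marker):
        return buf[data_start:end]

    # /Length did not land on `endstream`: recover by scanning forward.
    found = buf.find(marker, data_start)
    if found == -1:
        warnings.append("StreamLengthMismatch: no endstream found")
        return b""                                  # assumed fallback
    warnings.append("StreamLengthMismatch: /Length wrong, recovered by scanning")
    data = buf[data_start:found]
    for eol in (b"\r\n", b"\n", b"\r"):             # drop the single EOL before the keyword
        if data.endswith(eol):
            data = data[: -len(eol)]
            break
    return data


pdf = b"<< /Length 3 >>\nstream\nhello world\nendstream\nendobj\n"
start = pdf.index(b"stream\n") + len(b"stream\n")
warnings: list[str] = []
data = read_stream(pdf, start, 3, warnings)         # declared length is too short
```

The caller still gets the full payload, and the warning records that the declared length could not be trusted.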