Skip to content

Python API reference

The complete public surface of the udoc Python module. Mirrors python/udoc/__init__.pyi; this page is the prose companion to those stubs. For higher-level walkthroughs with worked examples, see the Library guide.

The module is a PyO3 cdylib re-export: import udoc brings every symbol into the udoc namespace. The cdylib itself is reachable as udoc.udoc for callers that want to bypass the wrapper.

Top-level functions

udoc.extract

def extract(
    path: PathLike,
    *,
    pages: PagesArg = None,
    password: str | None = None,
    format: Format | str | None = None,
    max_file_size: int | None = None,
    config: Config | None = None,
    on_warning: Callable[[Warning], None] | None = None,
) -> Document

One-shot extraction from a file path. Returns a fully-materialised Document.

Parameter Meaning
path Path to the file. str, pathlib.Path, or any os.PathLike[str].
pages Page selector. int, range, sequence of ints, or a string spec like "1,3,5-10".
password Decryption password. PDF only; ignored for other formats.
format Force a backend instead of magic-byte detection. Accepts a Format or its str name.
max_file_size Per-call override for the file-size limit. Falls back to config.limits.max_file_size.
config Full Config bag. Wins over the per-call shortcuts above on conflict.
on_warning Live callback for Warning records. Called on the extraction thread.

Raises one of the exception types on failure. Always returns a Document even when the result is empty (e.g. a scanned PDF without an OCR hook attached); inspect doc.warnings for the recoverable issues that surfaced during extraction.

udoc.extract_bytes

def extract_bytes(
    data: bytes,
    *,
    pages: PagesArg = None,
    password: str | None = None,
    format: Format | str | None = None,
    max_file_size: int | None = None,
    config: Config | None = None,
    on_warning: Callable[[Warning], None] | None = None,
) -> Document

Same shape as extract but takes bytes already in memory. Pass format= whenever the bytes lack a reliable magic signature (e.g. RTF without the {\rtf1 prefix, OOXML zips with unusual ordering).

udoc.stream

def stream(
    path: PathLike,
    *,
    pages: PagesArg = None,
    password: str | None = None,
    format: Format | str | None = None,
    max_file_size: int | None = None,
    config: Config | None = None,
    on_warning: Callable[[Warning], None] | None = None,
) -> ExtractionContext

Open a streaming extractor. The returned ExtractionContext is a context manager that defers per-page work until requested. Use this for documents whose content does not need to fit in memory at once.

with udoc.stream("big.pdf") as ext:
    for i in range(len(ext)):
        text = ext.page_text(i)

ExtractionContext does not materialise a Document; it is the streaming counterpart that hands out per-page text, lines, spans, tables, and images on demand.

Document

class Document

The materialised result of extract or extract_bytes. Iterators (pages, blocks, tables, images, text_chunks) are materialised lazily on first call and then cached, so calling them twice does not redo work.

Property Type Notes
metadata DocumentMetadata Always present, even when every field is None.
format Format | None None if format detection was inconclusive at this layer.
source Path | None None when extracted from extract_bytes.
warnings list[Warning] Populated when Config(collect_diagnostics=True), default off for extract.
is_encrypted bool True when the source declared encryption (decrypted or not).
Method Notes
pages() -> Iterator[Page] Iterates Page objects in document order.
blocks() -> Iterator[Block] Walks the content spine recursively, yielding every Block.
tables() -> Iterator[Table] Yields every Table in document order.
images() -> Iterator[Image] Yields every Image (block-level and inline).
text() -> str Plain text reconstruction of the whole document.
text_chunks(*, by, size) -> Iterator[Chunk] Yields Chunk records sized for downstream embedding / LLM input.
to_markdown(*, with_anchors) Serialise to Markdown. with_anchors=True adds heading anchors.
to_dict() -> dict The full document model as a Python dict, ready for JSON.
to_json(*, pretty) Same as json.dumps(doc.to_dict()); pretty=True indents.
render_page(index, *, dpi=150) PNG bytes for one page. Raises UnsupportedOperationError on non-renderable formats.

len(doc) and doc[i] walk pages. for page in doc: is equivalent to for page in doc.pages():.

text_chunks

def text_chunks(
    self,
    *,
    by: ChunkBy = "heading",
    size: int = 2000,
) -> Iterator[Chunk]

ChunkBy is one of "page", "heading", "section", "size", "semantic". size is a soft target (characters) — chunks are emitted at the natural boundary closest to the target, so they tend to vary by ±20%. Each yielded Chunk carries a ChunkSource so downstream indexers can record provenance back to the page and originating block.

Page

class Page

A page of the document. For formats with no first-class page concept (DOCX, Markdown, RTF), a Page aggregates the entire content spine into a single page; treat the count as informational in those cases.

Property Type Notes
index int Zero-based page index.
blocks Sequence[Block] The page's content spine.
Method Notes
text() -> str Reading-order text reconstruction for the page.
text_lines() -> Sequence Line-broken text + per-line baseline / direction info.
raw_spans() -> Sequence Positioned spans in content-stream order (PDF specifics; see Library guide / Raw spans).
tables() -> Sequence[Table] Tables on this page, detected per the format's strategy.
images() -> Sequence[Image] Images placed on this page.
render(dpi=150) -> bytes PNG bytes for this page. Same renderer as Document.render_page.

Block

class Block

A block-level element. The kind discriminant selects which fields are meaningful:

Kind Meaningful fields
paragraph text, spans
heading text, spans, level (1–6; clamp values outside)
list list_kind, list_start, items (sequence-of-sequence-of-blocks)
table table (Table handle)
code_block text, language
image image_index, alt_text
page_break (no extra fields)
thematic_break (no extra fields)
section section_role, children
shape shape_kind, children, alt_text

block.node_id is the typed handle into the document's arena. Overlay payloads (presentation, relationships, interactions) key off this id. block.text returns the recursive text reconstruction for any block kind. Misspelled attributes raise AttributeError with a kind-aware hint.

Block is __match_args__-equipped, so structural matching works:

match block:
    case Block(kind="heading", level=l, text=t):
        index.add_section(level=l, title=t)
    case Block(kind="paragraph", text=t):
        index.add_paragraph(t)

Inline

class Inline

An inline span within a block. The kind discriminant selects which fields are meaningful:

Kind Meaningful fields
text text, bold, italic, underline, strikethrough, superscript, subscript
code text
link url, content (sequence of child Inline)
footnote_ref label
inline_image image_index, alt_text
soft_break (no extra fields — collapse to a space when reflowing)
line_break (no extra fields — emit a hard newline)

Style booleans are False (not None) when the variant does not carry styling, so inline.bold is always safe to read.

Table

class Table
class TableRow
class TableCell

Table primitives. A Table carries shape metadata plus the row vector; rows hold cells; cells hold content blocks (cells can contain anything — paragraphs, lists, nested tables).

Table properties:

Field Notes
rows Sequence of TableRow.
num_columns Logical column count, after merge resolution.
header_row_count How many leading rows are headers.
has_header_row header_row_count > 0.
may_continue_from_previous True when the table likely begins mid-flow on a page break.
may_continue_to_next True when the table likely continues onto the next page.

TableCell properties: text, content (sequence of Block), col_span, row_span, value (typed string for spreadsheet cells with a normalised representation; None for plain text cells).

Table.to_pandas() materialises the table as a pandas DataFrame. The pandas integration lives under udoc.integrations.pandas and is imported lazily, so importing udoc does not pull in pandas.

Image

class Image
Field Notes
node_id Arena handle.
asset_index Index into the document-wide image asset store (deduplicated).
width, height Pixel dimensions.
bits_per_component Bit depth.
filter The codec the bytes are stored in (flate, dct, ccitt, jbig2, jpx, ...).
data Raw bytes. Decoded if the source format embedded a recognised codec.
alt_text Optional alt text from the source.
bbox Optional BoundingBox for placement.

Multiple Image objects can share an asset_index when the same underlying bitmap is referenced from many places (slide deck logos, DOCX repeating header images). The bytes live once in the asset store; Image records carry per-placement metadata.

DocumentMetadata

class DocumentMetadata
Field Notes
title Document title.
author Primary author.
subject Subject (PDF Info dict, OOXML core).
creator Application that authored the document.
producer Application that wrote the file (often a print driver).
creation_date ISO 8601 string when known.
modification_date ISO 8601 string when known.
page_count Logical page count. 0 for paged formats with no pages discovered.
properties dict[str, str] of format-specific extended fields.

Common keys in properties:

  • dc:creator, dc:subject, dc:description — Dublin Core entries (OOXML core, ODF meta).
  • dcterms:created, dcterms:modified — typed Dublin Core dates.
  • pdf:Producer, pdf:Creator, pdf:Trapped — PDF Info dict entries that did not fit the structured fields above.
  • app:Application, app:AppVersion, app:Company, app:Pages, app:Words, app:Characters — OOXML extended properties from app.xml.

The properties dict is open-ended. Format-specific guides list the keys their backend writes.

Warning

class Warning

A typed diagnostic record from the extraction. Fields:

Field Notes
kind Stable enum string (e.g. "StreamLengthMismatch"). Filter on this.
level "info" or "warning".
message Human-readable description. Do not parse.
offset Byte offset in the source when known.
page_index Zero-based page when known.
detail Optional supplementary string (e.g. obj reference, font name).

Common kind values are listed in the Architecture page. New kinds are added as backends grow; treat unknown kinds as info-level unless level says otherwise.

The shadowing of the Python builtin Warning is intentional — udoc.Warning mirrors the Rust shape. If you import warnings generically alongside Python's, alias one of them: from udoc import Warning as UdocWarning.

Format

class Format

Enum-style pyclass. Each variant is a class-level constant:

udoc.Format.Pdf, Format.Docx, Format.Xlsx, Format.Pptx,
Format.Doc, Format.Xls, Format.Ppt,
Format.Odt, Format.Ods, Format.Odp,
Format.Rtf, Format.Md

Capability accessors (read-only properties):

Accessor True when
can_render The backend implements page rasterisation (PDF only at present).
has_tables The backend produces Block::Table from native structures.
has_pages The backend has a first-class page concept (PDF, PPTX, XLSX, legacy PPT/XLS, ODP, ODS).

Format.from_str("pdf") parses the canonical lowercase name. str(format) returns the same canonical name for round-tripping.

BoundingBox

class BoundingBox

Axis-aligned rectangle in PDF user space (origin lower-left, y-axis points up). Fields: x_min, y_min, x_max, y_max, plus computed properties width, height, area.

Supports (x, y) in bbox for point-in-rect tests. PDF page coordinates run in points (1/72 inch).

Chunk

class Chunk
class ChunkSource

A chunk of text plus its provenance. Chunk.text is the chunk body; Chunk.source is a ChunkSource carrying:

Field Notes
page Originating page index (when the chunk is page-local).
block_ids The node_ids the chunk was assembled from.
bbox Optional BoundingBox covering the chunk's source region.

The provenance is what makes Document.text_chunks useful for RAG indexing — every chunk carries enough information to highlight the source region in a viewer or recover the originating block from Document.blocks().

Config

class Config

The top-level configuration bag passed to extract, extract_bytes, and stream. Frozen dataclass-like — fields are read-only; mutate via dataclasses.replace(cfg, ...) or by constructing a new Config.

Field Type Default
limits Limits Conservative resource caps.
hooks Hooks No hooks attached.
assets AssetConfig Images and fonts both extracted.
layers LayerConfig All overlays enabled.
rendering RenderConfig 150 DPI, viewer profile.
strict_fonts bool False. When True, font fallbacks raise instead of warn.
memory_budget int \| None Soft per-process RSS cap. Triggers between-document cache resets.
format str \| None Force a backend (lowercase name).
password str \| None PDF decryption password.
collect_diagnostics bool When True, populates Document.warnings. Off for extract (the default on_warning=None already drops); on for the named presets that need warnings later.

Named presets

Preset Tuned for
Config.default() Interactive use, balanced defaults.
Config.agent() LLM agent loops — collects diagnostics, keeps overlays on.
Config.batch() Bulk ingest — disables expensive overlays, raises limits.
Config.ocr() Hybrid scanned-document pipelines — pre-wires OCR-friendly render profile and detection.

Limits

class Limits

Resource limits enforced during extraction. All fields default to sensible caps and are overridden by passing keyword arguments.

Field Default unit Purpose
max_file_size bytes Reject inputs larger than this.
max_pages count Cap per-document page count.
max_nesting_depth levels Recursion cap on Section/Shape/list nesting.
max_table_rows count Per-table row cap.
max_cells_per_row count Per-row cell cap.
max_text_length chars Per-string cap on extracted text fragments.
max_styles count Cap on distinct style records (defends against style-table bombs).
max_style_depth levels Cap on style cascade depth.
max_images count Per-document image cap.
max_decompressed_size bytes Cap on cumulative decompressed bytes (zip / flate bombs).
max_warnings count or None Cap on emitted diagnostics. None disables the cap.
memory_budget bytes or None Soft per-process RSS cap; cache reset trigger.

Hitting a limit raises LimitExceededError.

Hooks

class Hooks
Field Type Notes
ocr str \| None Path or $PATH name of an OCR hook executable.
layout str \| None Layout-detection hook executable.
annotate str \| None Annotation hook executable.
timeout int \| None Per-request timeout in seconds. Default 60 s when omitted.

A hook is a long-lived subprocess that follows the hooks protocol. The hook executable receives one JSON request per line on stdin and writes one JSON response per line on stdout.

AssetConfig

class AssetConfig
Field Default Notes
images True When False, image bytes are not loaded; references in the content tree become empty placeholders.
fonts True When False, font assets are not collected (saves memory on PDF).
strict_fonts False When True, missing ToUnicode CMaps raise instead of warn.

LayerConfig

class LayerConfig
Field Default Notes
presentation True Geometry / fonts / colours overlay.
relationships True Footnotes / links / bookmarks overlay.
interactions True Form fields / comments / tracked-changes overlay.

Disabling an overlay skips the work that produces it; the content spine is unaffected.

RenderConfig

class RenderConfig
Field Default Notes
dpi 150 Render resolution. 300 for OCR-quality output.
profile "visual" One of "visual", "ocr_friendly". See PDF rendering & OCR.

ExtractionContext

class ExtractionContext

Returned by udoc.stream. A context manager that holds a streaming extractor open for the duration of a with block and surfaces per-page accessors.

Method Notes
__enter__, __exit__ Acquire / release the underlying file handle.
close() Manual close. __exit__ calls this.
__len__(), page_count() Total page count. len(ctx) is the idiomatic call.
__iter__() Iterates per-page reading-order text strings.
page_text(i) Reading-order text for one page.
page_lines(i) List of (text, baseline_y, is_rtl) tuples.
page_spans(i) List of (text, x, y, w, h) positioned spans.
page_tables(i) List of tables, each represented as list[list[str]].
page_images(i) List of dicts with image metadata + bytes for one page.

The streaming view trades the unified document model for memory locality. Convert to a full Document with udoc.extract(...) if you need block-level structure, overlays, or the richer chunking.

Corpus

class Corpus

A lazy iterable of Document instances built from a directory or a sequence of paths. Constructed once; iterated many times. Designed for batch ingest pipelines.

udoc.Corpus("docs/")
udoc.Corpus(["a.pdf", "b.docx", "c.pptx"], config="batch")

config= accepts a Config instance or one of the preset names "default", "agent", "batch", "ocr".

Method / property Notes
__iter__() -> Iterator[Document \| Failed] Yields one record per file; failures surface as Failed markers rather than raising.
count() -> int Eager count of files. I/O for directory sources.
__len__() Always raises TypeError with a hint to call count().
filter(pred) Returns a new Corpus with pred(doc) applied lazily.
with_config(cfg) Returns a new Corpus using a different config.
parallel(n_workers, *, mode) Fans out across n_workers. mode="process" uses subprocess workers; mode="thread" uses a thread pool (best for I/O-bound input).
text(*, join="\n\n") Eager concatenation of every document's text.
tables(), images(), chunks(), metadata(), warnings() Yield Sourced[T] records with origin-file provenance attached.
render_pages(indices, *, dpi=150) Yields Sourced[bytes] of rendered PNGs for the given page indices in each document.
list() Materialise all documents into a list.
to_jsonl(path) Stream-write Document.to_dict() for each file as JSONL; returns the count written.

Sourced

class Sourced(Generic[T])

Provenance wrapper around a value extracted from a corpus iteration. Carries path, value, page (optional), and block_id (optional). Returned by Corpus.tables(), .images(), .chunks(), and friends so the caller never loses track of which file (and which place in the file) a value came from.

Failed

class Failed

Per-document failure marker yielded during corpus iteration. Carries path and the originating error (UdocError subclass). Iteration does not abort on a single failure — the caller decides how to handle the marker.

Async iteration (udoc.asyncio)

The udoc.asyncio module is a small wrapper that wraps the blocking extractor in a thread pool so corpus iteration plays nicely with asyncio consumers. See python/udoc/asyncio.py for the surface.

Exceptions

All exceptions inherit from udoc.UdocError, which inherits from Exception. Catch the base class to handle every udoc-originated failure uniformly; catch the specific subclass when the caller should react differently per cause.

Class Raised when
UdocError Base class; never raised directly.
ExtractionError Backend-internal failure that does not fit a more specific class.
UnsupportedFormatError The bytes do not match a known backend (or format= is wrong).
UnsupportedOperationError The backend cannot do what was asked (e.g. render_page on DOCX).
PasswordRequiredError Document is encrypted and no password was provided.
WrongPasswordError The password did not unlock the document.
LimitExceededError A Limits cap was hit during extraction.
HookError An OCR/layout/annotate hook failed fatally.
IoError The underlying I/O operation failed.
ParseError The bytes are malformed below the level of structured recovery.
InvalidDocumentError The document parses but its structure is not coherent.
EncryptedDocumentError Encryption decoding failed or is not supported.

Every exception carries the same code attribute the CLI uses on exit — agents can match e.code rather than parsing the message. The full code list lives in CLI reference / Exit codes.

Type aliases

Alias Definition
PathLike Union[str, os.PathLike[str], pathlib.Path]
PagesArg Union[int, range, Sequence[int], str, None]
ChunkBy Literal["page", "heading", "section", "size", "semantic"]
InlineKind Literal["text", "code", "link", "footnote_ref", "inline_image", "soft_break", "line_break"]
BlockKind Literal["paragraph", "heading", "list", "table", "code_block", "image", "page_break", "thematic_break", "section", "shape"]
CorpusSource Union[PathLike, Sequence[PathLike]]
CorpusMode Literal["process", "thread"]