Rust library

The udoc workspace publishes a set of crates that share one document model. The udoc crate is the facade: it dispatches to the right backend based on format detection and emits the unified Document. The per-format backends are also independently usable when you want a typed handle that does not pay the conversion cost into the unified model.

Status

udoc is not published on crates.io during the alpha period. To use the Rust API today, add it as a git dependency pinned to a tag:

[dependencies]
udoc = { git = "https://github.com/newelh/udoc", tag = "v0.1.0-alpha.1" }

The Rust API is alpha; expect to bump the pinned tag frequently. Per-crate publishing to crates.io lands at beta, once the public API has stabilised across at least one external integration.

Facade surface

// Calls are fully qualified below, so no `use` line is needed.

let doc = udoc::extract("paper.pdf")?;
println!("{:?}", doc.metadata.title);
for block in &doc.content {
    println!("{}", block.text());
}
# Ok::<(), udoc::Error>(())

The Rust facade mirrors the Python shape:

| Python | Rust |
| --- | --- |
| udoc.extract(path) | udoc::extract(path) -> Result<Document> |
| udoc.extract_bytes(b) | udoc::extract_bytes(&bytes) -> Result<Document> |
| udoc.stream(path) | udoc::Extractor::open(path) -> Result<Extractor> |
| udoc.Config(...) | udoc::Config::new() (builder) |
| udoc.extract(p, on_warning=) | udoc::extract_with(path, cfg.diagnostics(sink)) |

Document is udoc_core::document::Document. Iterate by direct field access (doc.content, doc.metadata, doc.images). The Python wrapper hides these fields behind iterator methods so that it can materialise the spine and overlays lazily.

Per-backend access

Each format backend ships as a separate crate. Reach for them when you want format-specific structure or want to skip the conversion step into Document.

| Crate | Role |
| --- | --- |
| udoc | The facade. Public API, format detection, conversion glue, CLI binary. |
| udoc-py | Python bindings (PyO3) over the same engine. |
| udoc-core | Format-agnostic types: Document, Block, Inline, NodeId, TextSpan, Table, PageImage, Error, DiagnosticsSink, FormatBackend, PageExtractor. |
| udoc-containers | Shared parsers: ZIP (OOXML / ODF), namespace-aware XML, CFB / OLE2 (legacy Office), OPC packages. |
| udoc-pdf | PDF parser. Layered: io → parse → object → font → content → text → document. |
| udoc-font | Font engine. TrueType, CFF, Type 1, hinting, cmaps, ToUnicode. |
| udoc-image | Image decoders. CCITT, JBIG2, JPEG, JPEG 2000. |
| udoc-render | PDF page rasteriser. Auto-hinter, font cache, glyph compositor. |
| udoc-docx | DOCX backend (ZIP + XML). |
| udoc-xlsx | XLSX backend. Typed cells, shared strings, number-format mini-language. |
| udoc-pptx | PPTX backend. Shape tree, slide layouts, notes slides. |
| udoc-doc | Legacy DOC backend (CFB + FIB + piece table). |
| udoc-xls | Legacy XLS backend (BIFF8). |
| udoc-ppt | Legacy PPT backend (CFB + PowerPoint records + PersistDirectory). |
| udoc-odf | ODF backend covering ODT / ODS / ODP from a single crate. |
| udoc-rtf | RTF parser. Control words, groups, codepage decoding. |
| udoc-markdown | Markdown parser. CommonMark + GFM tables. |

The traits that backends implement:

trait FormatBackend {
    type Page<'a>: PageExtractor where Self: 'a;
    fn page_count(&self) -> usize;
    fn page(&mut self, index: usize) -> Result<Self::Page<'_>>;
    fn metadata(&self) -> &DocumentMetadata;
}

trait PageExtractor {
    fn text(&mut self) -> Result<String>;
    fn text_lines(&mut self) -> Result<Vec<TextLine>>;
    fn raw_spans(&mut self) -> Result<Vec<TextSpan>>;
    fn tables(&mut self) -> Result<Vec<Table>>;
    fn images(&mut self) -> Result<Vec<PageImage>>;
}
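
To make the shape of these two traits concrete, here is a minimal sketch of a toy backend that serves pages from strings held in memory. Everything except the two trait definitions is an assumption made for illustration: MemoryBackend, MemoryPage, all_text, and the stand-in Error, DocumentMetadata, TextLine, TextSpan, Table and PageImage types are hypothetical and much simpler than whatever udoc-core actually defines.

```rust
// Stand-in types so the sketch compiles on its own. These are assumptions;
// the real definitions live in udoc-core and carry more fields and variants.
#[derive(Debug)]
pub enum Error {
    PageOutOfRange(usize),
}
pub type Result<T> = std::result::Result<T, Error>;

#[derive(Debug, Default)]
pub struct DocumentMetadata {
    pub title: Option<String>,
}
#[derive(Debug)]
pub struct TextLine {
    pub text: String,
}
#[derive(Debug)]
pub struct TextSpan;
#[derive(Debug)]
pub struct Table;
#[derive(Debug)]
pub struct PageImage;

// The two traits, copied from the listing above.
pub trait FormatBackend {
    type Page<'a>: PageExtractor where Self: 'a;
    fn page_count(&self) -> usize;
    fn page(&mut self, index: usize) -> Result<Self::Page<'_>>;
    fn metadata(&self) -> &DocumentMetadata;
}

pub trait PageExtractor {
    fn text(&mut self) -> Result<String>;
    fn text_lines(&mut self) -> Result<Vec<TextLine>>;
    fn raw_spans(&mut self) -> Result<Vec<TextSpan>>;
    fn tables(&mut self) -> Result<Vec<Table>>;
    fn images(&mut self) -> Result<Vec<PageImage>>;
}

/// Hypothetical backend: each page is a string held in memory.
pub struct MemoryBackend {
    pages: Vec<String>,
    metadata: DocumentMetadata,
}

/// Page handle that borrows its text from the backend.
pub struct MemoryPage<'a> {
    text: &'a str,
}

impl MemoryBackend {
    pub fn new(pages: Vec<String>) -> Self {
        Self { pages, metadata: DocumentMetadata::default() }
    }
}

impl FormatBackend for MemoryBackend {
    type Page<'a> = MemoryPage<'a> where Self: 'a;

    fn page_count(&self) -> usize {
        self.pages.len()
    }

    fn page(&mut self, index: usize) -> Result<Self::Page<'_>> {
        let text = self
            .pages
            .get(index)
            .map(String::as_str)
            .ok_or(Error::PageOutOfRange(index))?;
        Ok(MemoryPage { text })
    }

    fn metadata(&self) -> &DocumentMetadata {
        &self.metadata
    }
}

impl PageExtractor for MemoryPage<'_> {
    fn text(&mut self) -> Result<String> {
        Ok(self.text.to_string())
    }
    fn text_lines(&mut self) -> Result<Vec<TextLine>> {
        Ok(self.text.lines().map(|l| TextLine { text: l.to_string() }).collect())
    }
    // Plain strings carry no positioned spans, tables, or images.
    fn raw_spans(&mut self) -> Result<Vec<TextSpan>> { Ok(Vec::new()) }
    fn tables(&mut self) -> Result<Vec<Table>> { Ok(Vec::new()) }
    fn images(&mut self) -> Result<Vec<PageImage>> { Ok(Vec::new()) }
}

/// Generic driver: concatenates page text from any backend.
pub fn all_text<B: FormatBackend>(backend: &mut B) -> Result<String> {
    let mut out = String::new();
    for i in 0..backend.page_count() {
        let mut page = backend.page(i)?;
        out.push_str(&page.text()?);
        out.push('\n');
    }
    Ok(out)
}
```

The generic associated type lets a page handle borrow from its backend, which is why page takes &mut self and returns Self::Page<'_>: only one page can be live at a time, and it must be dropped before the next one is requested.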

Per-backend extensions live behind each crate's own types. PDF's extra surface includes raw object access:

let mut doc = udoc_pdf::Document::open("paper.pdf")?;
let mut page = doc.page(0)?;
for span in page.raw_spans()? {
    println!("({:.1}, {:.1}) {}", span.x, span.y, span.text);
}
# Ok::<(), udoc_pdf::Error>(())

The PDF backend's internal types (PdfObject, PdfDictionary, PdfStream, Lexer, ObjectResolver) are public, for cases where you need parser-level access.

Configuration

use udoc::{Config, Format, LayerConfig};

let cfg = Config::new()
    .format(Format::Pdf)
    .password("secret")
    .pages("1,3,5-10")?
    .layers(LayerConfig::content_only());

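// Read the file into memory; any in-memory byte source works here.
let bytes = std::fs::read("paper.pdf")?;
let doc = udoc::extract_bytes_with(&bytes, cfg)?;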
# Ok::<(), Box<dyn std::error::Error>>(())
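
The pages option takes a compact spec string, and the trailing ? shows that parsing it can fail. As an illustration of what a spec such as "1,3,5-10" can encode, here is a minimal standalone sketch of a parser. It is not udoc's implementation: parse_pages and parse_page are hypothetical names, and the rules (1-based page numbers, inclusive ranges, duplicates collapsed) are assumptions that the library's own parser may not share.

```rust
/// Parse a page spec such as "1,3,5-10" into a sorted, de-duplicated list
/// of page numbers. Assumes 1-based pages and inclusive ranges; udoc's own
/// parser may apply different rules.
pub fn parse_pages(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        // A part is either a single page ("3") or an inclusive range ("5-10").
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range: {part}"));
        }
        pages.extend(start..=end);
    }
    pages.sort_unstable();
    pages.dedup();
    Ok(pages)
}

/// Parse one 1-based page number, rejecting zero and non-numeric input.
fn parse_page(s: &str) -> Result<u32, String> {
    match s.trim().parse::<u32>() {
        Ok(0) | Err(_) => Err(format!("invalid page number: {s:?}")),
        Ok(n) => Ok(n),
    }
}
```

Validating the spec when the Config is built, rather than during extraction, means a typo surfaces before any file is opened.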

Named presets:

Config::default()    // interactive defaults
Config::agent()      // collects diagnostics, keeps overlays on
Config::batch()      // disables expensive overlays, raises limits
Config::ocr()        // OCR-friendly render profile + scan detection

Diagnostics

Recoverable issues flow through DiagnosticsSink. The default sink discards them; the CLI's default sink prints them to stderr; batch workers typically attach a CollectingDiagnostics and aggregate the results afterwards:

use std::sync::Arc;
use udoc::diagnostics::{CollectingDiagnostics, DiagnosticsSink};

let diag = Arc::new(CollectingDiagnostics::new());
let cfg = udoc::Config::new().diagnostics(diag.clone());
let _doc = udoc::extract_with("paper.pdf", cfg)?;

for w in diag.warnings() {
    eprintln!("[{}] {}: {}", w.level, w.kind, w.message);
}
# Ok::<(), udoc::Error>(())

For live emission, implement DiagnosticsSink directly:

struct LogSink;
impl udoc::diagnostics::DiagnosticsSink for LogSink {
    fn warning(&self, w: udoc::diagnostics::Warning) {
        log::warn!("{}: {}", w.kind, w.message);
    }
}
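
Because warning takes &self, a sink that accumulates state needs interior mutability. The following sketch shows one way a collecting sink could be built, assuming a mutex-guarded vector. The trait is re-declared locally so that the block compiles on its own. Collecting, fake_extract, the two-field Warning, and the Send + Sync bound are all assumptions for illustration, not udoc's actual definitions.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for udoc's warning type; the real one has more fields
/// (a level, for example).
#[derive(Debug, Clone, PartialEq)]
pub struct Warning {
    pub kind: String,
    pub message: String,
}

/// Local copy of the trait shape used above. The Send + Sync bound is an
/// assumption so the sink can be shared across worker threads.
pub trait DiagnosticsSink: Send + Sync {
    fn warning(&self, w: Warning);
}

/// Hypothetical collecting sink: buffers warnings behind a mutex.
#[derive(Default)]
pub struct Collecting {
    warnings: Mutex<Vec<Warning>>,
}

impl Collecting {
    /// Snapshot of everything recorded so far.
    pub fn warnings(&self) -> Vec<Warning> {
        self.warnings.lock().unwrap().clone()
    }
}

impl DiagnosticsSink for Collecting {
    fn warning(&self, w: Warning) {
        self.warnings.lock().unwrap().push(w);
    }
}

/// Stand-in for an extraction routine that reports one recoverable issue.
pub fn fake_extract(sink: Arc<dyn DiagnosticsSink>) {
    sink.warning(Warning {
        kind: "font".into(),
        message: "missing ToUnicode map; falling back to encoding".into(),
    });
}
```

The caller keeps one Arc clone and hands another to the extraction call, which is the same pattern as diag.clone() in the example above: the extractor writes through its handle and the caller reads the buffer afterwards.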

Errors

udoc::Error carries a context chain. The top-level message names the operation that failed; each chained entry names the underlying cause, down to the root error.

Error: parsing object at offset 12345
  caused by: reading token
  caused by: I/O error: unexpected end of file

Error::code() returns a stable code matching the CLI exit codes; agents and pipelines match on the code rather than parsing prose.
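
As a rough illustration of how a context chain and a stable code can sit on one error type, here is a self-contained sketch. It does not reproduce udoc::Error: the Code enum, its variants, the context method and is_retryable are hypothetical names chosen for the example, and the real codes and their mapping to CLI exit codes are not shown in this document.

```rust
use std::fmt;

/// Hypothetical stable codes; udoc's actual codes are not listed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    Io,
    Parse,
}

/// Stand-in error: a top-level message plus a chain of causes.
#[derive(Debug)]
pub struct Error {
    code: Code,
    message: String,
    causes: Vec<String>,
}

impl Error {
    pub fn new(code: Code, message: impl Into<String>) -> Self {
        Self { code, message: message.into(), causes: Vec::new() }
    }

    /// Wrap this error in a new top-level message, pushing the previous
    /// message to the front of the cause chain. The code is preserved.
    pub fn context(mut self, message: impl Into<String>) -> Self {
        let inner = std::mem::replace(&mut self.message, message.into());
        self.causes.insert(0, inner);
        self
    }

    pub fn code(&self) -> Code {
        self.code
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Error: {}", self.message)?;
        for cause in &self.causes {
            write!(f, "\n  caused by: {cause}")?;
        }
        Ok(())
    }
}

impl std::error::Error for Error {}

/// Callers branch on the code, never on the rendered text.
pub fn is_retryable(err: &Error) -> bool {
    matches!(err.code(), Code::Io)
}
```

Built as Error::new(Code::Io, "I/O error: unexpected end of file").context("reading token").context("parsing object at offset 12345"), this stand-in renders the same three-line chain shown above while code() still reports the root category.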

Generating rustdoc

cargo doc --workspace --no-deps --open

When udoc lands on crates.io, the per-crate docs will live at docs.rs/udoc and the equivalent paths for each backend.

See also