Skip to content

Format guides

udoc parses twelve formats end to end:

Family Modern Legacy binary
Word processing DOCX DOC (Word 97-2003)
Spreadsheet XLSX XLS (Excel 97-2003)
Presentation PPTX PPT (PowerPoint 97-2003)
OpenDocument ODT / ODS / ODP
PDF PDF
Lightweight RTF, Markdown

Every backend produces the same Document shape (see the library guide). The pages here cover what is specific to each format: capabilities, escape hatches, edge cases, and known limitations.

Pick a guide

Format When you'd reach here
PDF Anything PDF: encryption, fonts, reading order, table detection, OCR triggering, rendering.
DOCX Modern Word documents (.docx).
XLSX Modern Excel workbooks. Typed cells, formulas-as-text, multi-sheet.
PPTX Modern PowerPoint decks. Shape trees, speaker notes.
DOC (legacy) .doc binaries from Word 97-2003. Piece tables, fast-save fallbacks.
XLS (legacy) .xls BIFF8 workbooks (Excel 97-2003).
PPT (legacy) .ppt binaries from PowerPoint 97-2003. PersistDirectory walking.
ODF LibreOffice / OpenOffice formats. ODT, ODS, ODP share one backend.
RTF Rich Text Format. Codepage decoding, Unicode escapes.
Markdown CommonMark + a useful GFM subset.

Cross-cutting topics

These live in their own pages because they are not specific to any one format:

  • Font engine — how udoc parses TrueType / CFF / Type 1 fonts, ToUnicode resolution, encoding fallback chains, the bundled fallback faces.
  • Image decoders — CCITT, JBIG2, JPEG, JPEG 2000. Used by PDF (always) and by the OOXML / ODF backends for embedded images.
  • PDF rendering & OCR — when to use udoc render, resolution choices, autodetecting scanned PDFs that need OCR, wiring layout-detection hooks for hard reading-order cases.

Capabilities at a glance

What every backend does today. Empty cells mean "not implemented for this format yet" — sometimes because the format does not have the concept (Markdown has no pagination), sometimes because the work is deferred to a later release.

Format Text Tables Images Metadata Encrypt Render
PDF
DOCX
XLSX
PPTX
DOC
XLS
PPT
ODT
ODS
ODP
RTF
Markdown

Encrypted documents in formats marked blank fail with a structured PasswordRequired error rather than partial output. Page rendering for non-PDF formats is not currently supported — the format model carries the geometry, but the rasterisation pipeline is PDF-only. If you need rendering for a specific format, please open a feature request.

What "format" means in udoc

Format detection runs at the facade layer, not the backend. It looks at magic bytes first (%PDF-, PK\x03\x04, \xD0\xCF\x11\xE0, {\rtf1, etc.), inspects the OPC content-types entry inside ZIP containers to distinguish DOCX from XLSX from PPTX, and only falls back to file extension when bytes are inconclusive.

Pass format= (Python) or --input-format (CLI) to override. The typed Format enum is part of the public API; agents can pin a format when they have out-of-band knowledge.

Where to start

If you are evaluating udoc for a specific format, the per-format page is the right entry point. If you are doing PDF work, also read PDF rendering & OCR — it covers the table-detection / reading-order / column-detection nuances that matter for analytical pipelines.