RTF (Rich Text Format)¶
RTF predates the Web. It first shipped with Word for Windows 1.0 in 1989, has accumulated decades of producer-specific extensions, and remains in heavy use as an interchange format for legacy office tooling, government records systems, and email-attachment workflows that have never moved off it. udoc parses control words, groups, codepage-encoded text, and the embedded picture format natively.
Why this format is interesting¶
RTF is plain text. A document is an outermost group {...} containing
control words \rtf1, \ansi, \b, etc., nested groups, and
literal text. The format's age is its complexity: the syntax has
been extended and re-extended for thirty years, and most extensions
were added by Microsoft Word for one version and then never removed.
What you parse against is not "the RTF spec" but "the union of every
RTF dialect Word has ever emitted".
Two structural choices that shape the parser:
- Control words have global scope but group-level effects. A
\b(bold-on) inside a group affects only the runs in that group; when the group closes, formatting reverts. This means the parser maintains a stack of formatting states pushed at each{and popped at each}. Get the bracket count wrong by one and the styling for the rest of the document is silently inverted. - Mixed text encoding. RTF text can come through as 7-bit
ASCII literals,
\'XX8-bit codepage escapes,\uN?Unicode escapes (where?is a fallback character for non-Unicode readers), and raw bytes inside binary\binblocks. A single paragraph may use all four. udoc's lexer handles the multiplexing so the higher layers see uniform text runs.
What you get¶
- Paragraph text with formatting (bold, italic, underline, font, size, colour).
- Tables (
\trowd...\rowgroups). - Lists.
- Embedded images decoded from
\pictgroups (PNG, JPEG, EMF, WMF). - Document metadata via the
\infogroup (title, author, subject, creation date, etc.). - Codepage-aware character decoding via
\ansicpg,\u,\'XX. 20+ CJK codepages (Shift-JIS, GB18030, Big5, EUC-KR, etc.) via the sharedudoc_core::codepage::CodepageDecoder.
What you do not get¶
- Drawing-object rendering. RTF can carry vector drawings via
\shp/\shpinstcontainers; the drawing data is preserved in the document model but rasterising it is not currently supported. - OLE-embedded objects (
\objemb). The reference is surfaced; OLE payload decoding is not currently supported. - Field codes evaluated dynamically (
{\field {\fldinst HYPERLINK "..."} {\fldrslt "label"}}). The cached result text comes through; the field is not re-evaluated. (Design decision — udoc reports what is there, not what would be there at view time.) - Form fields with runtime state.
If you need any of the items marked "not currently supported", please open a feature request.
Format nuances worth knowing¶
Codepage governs 8-bit byte escapes¶
The \ansicpgN control word at the document start declares the
codepage for \'XX 8-bit byte escapes. A document that says
\ansicpg932 decodes \'82\'A0 as Japanese (Shift-JIS). udoc honours
the declaration and routes 8-bit escapes through the matching encoder.
If the document does not declare a codepage, udoc falls back to
CP-1252 (Windows Latin-1) — the default Word has used since the early
1990s. Documents from Word for Mac may declare \mac instead, in
which case Mac Roman applies.
Mixed-codepage documents¶
Some documents (rare but real, especially exports from translation
tools) carry text fragments in multiple codepages within one file.
The codepage in effect at any given moment is the most recent
\ansicpg declared inside the current group scope. udoc tracks the
codepage stack alongside the formatting stack so each text fragment
decodes correctly even when the codepage changes mid-document.
Unicode escape pairing¶
The \uN? escape carries a 16-bit signed Unicode codepoint
followed by a fallback character (the ?) that non-Unicode-aware
readers should display instead. udoc reads the Unicode code point and
discards the fallback. For surrogate pairs (codepoints > U+FFFF), RTF
emits two consecutive \u escapes; udoc pairs them per UTF-16
surrogate rules.
A common producer bug: emitting a single low surrogate without a preceding high surrogate, or vice versa. udoc treats orphan surrogates as U+FFFD and warns.
\bin directives carry binary data with explicit length¶
\binN declares that the next N bytes are raw binary, not RTF
syntax. The parser must skip those bytes literally without trying
to interpret control words inside them. udoc validates the declared
length against the actual bytes available before the next group
delimiter and warns if they disagree (a producer that miscounted, or
a truncated file).
Group depth bounds¶
Pathological RTF documents can declare arbitrary nesting depth and exhaust the parser's stack. udoc caps recursion at a configurable limit (default safe single-digit number) and rejects documents that exceed it with a clear error.
Layers within udoc-rtf¶
lexer control-word + group-marker + literal-text tokenizer
state formatting + codepage stack pushed at { and popped at }
codepage codepage-aware byte decoder (re-exports udoc_core::codepage)
parser token stream -> structured groups + paragraphs + table rows
table \trowd...\row group reconstruction into Table blocks
image \pict + \shp + \objemb decoders (PNG/JPEG/EMF/WMF)
convert RTF nodes -> unified Document model
document public API
state is the layer that makes nested groups work. Every {
pushes a copy of the current formatting state plus the active
codepage; every } pops. Get the bracket balance wrong by one and
trailing text inherits the wrong styling — it is silent and easy
to miss, so the lexer asserts on } underflow.
Failure modes¶
- Unbalanced braces. udoc's lexer detects mismatched
{/}counts and warns; depending on where the imbalance falls, the document may extract with the wrong formatting state for trailing text. - Codepage / declaration mismatch. Bytes do not decode under the declared codepage; udoc emits replacement characters and warns.
\binlength disagreement. Declared length does not match the actual binary block length.- Group depth past the limit. Document rejected with a clear error.
Diagnostics¶
kind |
When |
|---|---|
BraceImbalance |
The brace counter reaches zero before EOF, or stays nonzero at EOF. |
CodepageDecodeFailed |
A \'XX byte sequence does not decode under the active codepage. |
OrphanSurrogate |
A \uN value is a low surrogate without a preceding high surrogate (or vice versa). |
BinLengthMismatch |
A \binN declaration's length disagrees with the data available. |
GroupDepthLimit |
Nesting depth exceeded the configured maximum. |
Escape hatches¶
```rust,no_run use udoc_rtf::RtfDocument; use udoc_core::backend::{FormatBackend, PageExtractor};
let mut doc = RtfDocument::open("notes.rtf")?; for i in 0..doc.page_count() { let mut page = doc.page(i)?; println!("{}", page.text()?); }
Ok::<(), udoc_core::error::Error>(())¶
```
See also¶
- The font engine page covers the
CodepageDecodershared with the legacy DOC and XLS backends.