Comments (66)
- forgotpwd16 Is indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (Probably also explains the big speed boost.) A diff of the two tools' output on N4950 (the C++ standard working draft) — zpdf on the left, mutool on the right:

    74910,74912c187768,187779
    < [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
    < corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954 \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    < std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
    ---
    > [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
    > corresponding to a wide string, but you don't want to alter the locale for cout, you can write something like:
    >
    > § D.27.2
    > 1954
    >
    > © ISO/IEC
    > N4950
    >
    > wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    > std::string mbstring = myconv.to_bytes(L"Hello\n");
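The `\050`, `\051`, and `\134` noise in the left column is PDF literal-string escape syntax (octal character codes) left undecoded by the extractor. A minimal sketch of the decoding step such a tool would need, assuming the escape rules of ISO 32000-1 §7.3.4.2 (the function name and stray-backslash policy here are my own):

```python
# Decode the escapes of a PDF literal string: \n \r \t \b \f \( \) \\
# plus 1-3 octal digits, so \050 -> "(", \051 -> ")", \134 -> "\".
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
           "(": "(", ")": ")", "\\": "\\"}

def decode_pdf_literal(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] != "\\":
            out.append(s[i])
            i += 1
            continue
        i += 1                      # skip the backslash
        if i >= len(s):
            break                   # trailing backslash: drop it
        c = s[i]
        if c in _SIMPLE:
            out.append(_SIMPLE[c])
            i += 1
        elif c in "01234567":       # up to three octal digits
            j = i
            while j < len(s) and j - i < 3 and s[j] in "01234567":
                j += 1
            out.append(chr(int(s[i:j], 8)))
            i = j
        else:
            out.append(c)           # unknown escape: keep the char as-is
            i += 1
    return "".join(out)

print(decode_pdf_literal(r'myconv.to_bytes\050L"Hello\134n"\051;'))
# myconv.to_bytes(L"Hello\n");
```

Applied to the diff above, this turns zpdf's `to_bytes\050L"Hello\134n"\051` into mutool's `to_bytes(L"Hello\n")`.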
- manmal Is there a possibility to hook in OCR for text blocks flattened into an image, maybe with some callback? That's my biggest gripe when dealing with PDFs.
- lulzx I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads: ~41K pages/sec peak throughput. Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters. ~5,000 lines, no dependencies, compiles in <2s.

  Why it's fast:
  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)

  What it handles:
  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
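zpdf itself is Zig, but the first of those choices — map the whole file and search it in place rather than issuing read() calls — is easy to sketch in Python with the standard `mmap` module. Locating the final `startxref` offset near the end of the file is the usual first step of xref parsing; the function name and the fake PDF tail below are invented for illustration:

```python
import mmap
import os
import tempfile

def find_startxref(path: str) -> int:
    """Memory-map a PDF and return the byte offset after the last
    'startxref' keyword — no read() syscalls, no data copied."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.rfind(b"startxref")   # search the mapping in place
            if pos < 0:
                raise ValueError("no startxref keyword found")
            # the offset of the last xref section follows the keyword
            tail = mm[pos + len(b"startxref"):pos + 32].split()
            return int(tail[0])

# Minimal fake PDF tail to exercise the function:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
    f.write(b"%PDF-1.7\n...objects...\nstartxref\n1234\n%%EOF\n")
    name = f.name
print(find_startxref(name))  # 1234
os.remove(name)
```

A real extractor would follow that offset to an xref table or stream, then walk any /Prev chain for incremental updates — but the zero-copy access pattern is the same.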
- xvilka Test it on major PDF corpora [1]

  [1] https://github.com/pdf-association/pdf-corpora
- amkharg26 Impressive performance gains! 5x faster than MuPDF is significant, especially for applications processing large volumes of PDFs. Zig's memory safety without garbage-collection overhead makes it ideal for this kind of performance-critical work.

  I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document-analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.

  Would be interesting to see benchmarks on different PDF types: academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.
- fainpul These vibe-coded tests are terrible: https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...
- mpeg Very nice. It'd be good to see a feature comparison: when I use MuPDF it's not really just about speed, but about the level of support for all kinds of obscure PDF features, and a good level of accuracy in the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.

  The licensing is a huge blocker for using MuPDF in non-OSS tools, so it's very nice to see this is MIT. Python bindings would be good too.
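The "built-in algorithms" mpeg mentions are layout heuristics on top of raw extraction. A toy sketch of the simplest one — assigning positioned text fragments to the columns of a two-column page by x coordinate — with the fragment tuples and midpoint rule invented here for illustration (real engines such as MuPDF's structured-text device are far more involved):

```python
def split_columns(fragments, page_width):
    """fragments: list of (x, y, text) in page coordinates.
    Returns (left_column, right_column), each in top-to-bottom
    reading order (smaller y first)."""
    mid = page_width / 2
    left = sorted((f for f in fragments if f[0] < mid), key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= mid), key=lambda f: f[1])
    return ([t for _, _, t in left], [t for _, _, t in right])

# A fake two-column US Letter page (612 pt wide):
frags = [(50, 100, "Left A"), (320, 100, "Right A"),
         (50, 120, "Left B"), (320, 120, "Right B")]
print(split_columns(frags, 612))
# (['Left A', 'Left B'], ['Right A', 'Right B'])
```

Naive concatenation in stream order would interleave the columns ("Left A Right A Left B Right B"), which is exactly the kind of accuracy gap a feature comparison against MuPDF would expose.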
- agentifysh Excellent stuff. What makes Zig so fast?
- odie5533Now we just need Python bindings so I can use it in my trash language of choice.
- pm2222 What's a format that's free and easy to parse and render? Build one, please.
- littlestymaar
  - First commit: 3 hours ago.
  - Commit messages: LLM-generated.
  - README: LLM-generated.

  I'm not convinced that projects vibe-coded over an evening deserve the HN front page…

  Edit: and of course the author's blog is also full of AI slop… 2026 hasn't even started and I already hate it.
- nulloremptyTomorrow's headlinesfpdfjpdfcpdfcpppdfbfpdfppdf...opdf