Comments (66)
- forgotpwd16 Is indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (Probably also explains the big speed boost.) A diff of the two tools' output on N4950 (the C++ standard working draft) — zpdf on the left, mutool on the right:

    74910,74912c187768,187779
    < [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
    < corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954 \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    < std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
    ---
    > [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
    > corresponding to a wide string, but you don't want to alter the locale for cout, you can write something like:
    >
    > § D.27.2
    > 1954
    >
    > © ISO/IEC
    > N4950
    >
    > wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    > std::string mbstring = myconv.to_bytes(L"Hello\n");
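The `\050`, `\051`, and `\134` noise in the left column is PDF literal-string escape syntax (octal character codes) left undecoded by the extractor. A minimal sketch of the decoding step such a tool would need, assuming the escape rules of ISO 32000-1 §7.3.4.2 (the function name and stray-backslash policy here are my own):

```python
# Decode the escapes of a PDF literal string: \n \r \t \b \f \( \) \\
# plus 1-3 octal digits, so \050 -> "(", \051 -> ")", \134 -> "\".
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
           "(": "(", ")": ")", "\\": "\\"}

def decode_pdf_literal(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] != "\\":
            out.append(s[i])
            i += 1
            continue
        i += 1                      # skip the backslash
        if i >= len(s):
            break                   # trailing backslash: drop it
        c = s[i]
        if c in _SIMPLE:
            out.append(_SIMPLE[c])
            i += 1
        elif c in "01234567":       # up to three octal digits
            j = i
            while j < len(s) and j - i < 3 and s[j] in "01234567":
                j += 1
            out.append(chr(int(s[i:j], 8)))
            i = j
        else:
            out.append(c)           # unknown escape: keep the char as-is
            i += 1
    return "".join(out)

print(decode_pdf_literal(r'myconv.to_bytes\050L"Hello\134n"\051;'))
# myconv.to_bytes(L"Hello\n");
```

Applied to the diff above, this turns zpdf's `to_bytes\050L"Hello\134n"\051` into mutool's `to_bytes(L"Hello\n")`.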
- manmal Is there a possibility to hook in OCR for text blocks flattened into an image, maybe with some callback? That's my biggest gripe when dealing with PDFs.
- lulzx I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads: ~41K pages/sec peak throughput. Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters. ~5,000 lines, no dependencies, compiles in <2s.

  Why it's fast:
  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)

  What it handles:
  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
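zpdf itself is Zig, but the first of those choices — map the whole file and search it in place rather than issuing read() calls — is easy to sketch in Python with the standard `mmap` module. Locating the final `startxref` offset near the end of the file is the usual first step of xref parsing; the function name and the fake PDF tail below are invented for illustration:

```python
import mmap
import os
import tempfile

def find_startxref(path: str) -> int:
    """Memory-map a PDF and return the byte offset after the last
    'startxref' keyword — no read() syscalls, no data copied."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.rfind(b"startxref")   # search the mapping in place
            if pos < 0:
                raise ValueError("no startxref keyword found")
            # the offset of the last xref section follows the keyword
            tail = mm[pos + len(b"startxref"):pos + 32].split()
            return int(tail[0])

# Minimal fake PDF tail to exercise the function:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
    f.write(b"%PDF-1.7\n...objects...\nstartxref\n1234\n%%EOF\n")
    name = f.name
print(find_startxref(name))  # 1234
os.remove(name)
```

A real extractor would follow that offset to an xref table or stream, then walk any /Prev chain for incremental updates — but the zero-copy access pattern is the same.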
- xvilka Test it on major PDF corpora [1]

  [1] https://github.com/pdf-association/pdf-corpora
- amkharg26 Impressive performance gains! 5x faster than MuPDF is significant, especially for applications processing large volumes of PDFs. Zig's memory safety without garbage-collection overhead makes it ideal for this kind of performance-critical work.

  I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document-analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.

  Would be interesting to see benchmarks on different PDF types: academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.
- fainpul These vibe-coded tests are terrible: https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...
- mpeg Very nice. It'd be good to see a feature comparison: when I use MuPDF it's not really just about speed, but about the level of support for all kinds of obscure PDF features, and a good level of accuracy in the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.

  The licensing is a huge blocker for using MuPDF in non-OSS tools, so it's very nice to see this is MIT. Python bindings would be good too.
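The "built-in algorithms" mpeg mentions are layout heuristics on top of raw extraction. A toy sketch of the simplest one — assigning positioned text fragments to the columns of a two-column page by x coordinate — with the fragment tuples and midpoint rule invented here for illustration (real engines such as MuPDF's structured-text device are far more involved):

```python
def split_columns(fragments, page_width):
    """fragments: list of (x, y, text) in page coordinates.
    Returns (left_column, right_column), each in top-to-bottom
    reading order (smaller y first)."""
    mid = page_width / 2
    left = sorted((f for f in fragments if f[0] < mid), key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= mid), key=lambda f: f[1])
    return ([t for _, _, t in left], [t for _, _, t in right])

# A fake two-column US Letter page (612 pt wide):
frags = [(50, 100, "Left A"), (320, 100, "Right A"),
         (50, 120, "Left B"), (320, 120, "Right B")]
print(split_columns(frags, 612))
# (['Left A', 'Left B'], ['Right A', 'Right B'])
```

Naive concatenation in stream order would interleave the columns ("Left A Right A Left B Right B"), which is exactly the kind of accuracy gap a feature comparison against MuPDF would expose.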
- agentifysh Excellent stuff. What makes Zig so fast?
- odie5533Now we just need Python bindings so I can use it in my trash language of choice.
- pm2222 What's a format that's free and easy to parse and render? Build one, please.
- littlestymaar
  - First commit: 3 hours ago.
  - Commit messages: LLM-generated.
  - README: LLM-generated.

  I'm not convinced that projects vibe-coded over an evening deserve the HN front page…

  Edit: and of course the author's blog is also full of AI slop… 2026 hasn't even started and I already hate it.
- nulloremptyTomorrow's headlinesfpdfjpdfcpdfcpppdfbfpdfppdf...opdf