Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

ses425500000

Comments (34)

bonoboTP
Using LLMs for OCR is super risky: just as they can fix OCR mistakes, they can inadvertently "fix" correct text too and hallucinate instead. It's that Xerox bug on steroids, where scanned pages would get their digits swapped for other digits... I'd want to see some proper hallucination analysis.
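One rough way to start on that kind of analysis, as a minimal sketch only: diff the raw OCR text against the LLM-corrected text and flag every edit that touches a digit, so a human can review those spans. The function name `digit_changes` is hypothetical and not part of the posted pipeline, and whitespace tokenization is an assumption that will not suit every script.

```python
import difflib
import re


def digit_changes(raw_ocr: str, llm_output: str) -> list[tuple[str, str]]:
    """Return (before, after) spans where an edit touched a digit-bearing token.

    Hypothetical helper: it cannot tell a good fix from a hallucination,
    it only narrows down what a reviewer needs to look at.
    """
    raw_tokens = raw_ocr.split()
    new_tokens = llm_output.split()
    matcher = difflib.SequenceMatcher(a=raw_tokens, b=new_tokens, autojunk=False)

    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before = " ".join(raw_tokens[i1:i2])
        after = " ".join(new_tokens[j1:j2])
        # Keep only edits where either side contains a digit.
        if re.search(r"\d", before) or re.search(r"\d", after):
            flagged.append((before, after))
    return flagged


if __name__ == "__main__":
    raw = "Total revenue was 1,684 in 2O19"
    fixed = "Total revenue was 1,634 in 2019"
    for before, after in digit_changes(raw, fixed):
        print(f"{before!r} -> {after!r}")
```

In this toy input, both the plausible fix ("2O19" to "2019") and the silent digit swap ("1,684" to "1,634") get flagged; telling them apart still needs the source image or a second OCR engine.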
themanmaran
> Never change the original language of any text. Keep Korean in Korean, Japanese in Japanese, and English in English.
I love the double prompting to keep GPT from translating the text. I've definitely had this problem before, and spent ages trying to prompt it into not randomly translating the text.
novaRom
> Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.
The whole pipeline is not open source.
sandreas
How does this compare against marker [1]?
[1] https://github.com/VikParuchuri/marker
8thcross
So you are saying I can feed in my last 10 years of exam question papers and get predictions on what we will get this year?
constantinum
For the more curious: there is also Unstract, an open-source pipeline. It lets you plug in your own AI stack, e.g. open-source LLMs, vector DBs, OCR parsers, etc.
https://github.com/Zipstack/unstract
aghilmort
Super great work. Do you convert math formulas to LaTeX, and how are those and other symbolic (not necessarily Unicode) characters handled?
GPerson
Did you ethically acquire permission to train on the data set?
liangzhe88
Curious if there are plans to update this. Seems interesting.
jlcases
This is a valuable contribution. The quality of ML models heavily depends on the quality of training data, and extracting structured information from unstructured documents (like PDFs) is a critical bottleneck.
A key challenge after OCR is organizing the extracted data into a coherent knowledge structure. We've seen significant improvements in downstream ML tasks when the extracted data is organized using a hierarchical, MECE (Mutually Exclusive, Collectively Exhaustive) framework. This ensures that relationships between entities (tables, diagrams, text) are explicitly captured.
Does your pipeline include capabilities for semantic structuring of the extracted content beyond basic layout analysis? That seems like the next frontier for maximizing the value of OCR data in ML training.