Hey HN, we’re Soami, David, and Achyut, co-founders of Extract. Extract parses documents into structured data (text, tables, and figures). Teams use it for RAG, feeding LLMs, and populating databases and forms. Today we’re launching our first OCR model, now used in Extract.

You can try some examples or upload your own (no signup required) to test it out: https://extract.page/demo

We built Extract out of YouLearn, where we were processing 70M+ pages and slow parsing was the bottleneck. We started with a purely algorithmic pipeline that pulled native text straight from the document and only ran OCR on pages that needed it. It was cheap and fast, but once we put it in front of our Extract customers and their hardest documents, it hit an accuracy ceiling.

We wanted to keep the speed and cost while improving accuracy, so we trained our own VLM for the cases that broke. It also provides element-level bounding boxes, so each result points back to its exact place on the page. The new model took one customer from 71% to 92% text accuracy in under a week, at the same speed and cost. We can do this because our synthetic data generation pipeline recreates the messy, real-world documents the model gets wrong, so we can retrain on those exact cases without hand-labeling data.

To see how this holds up against other providers, we benchmarked Extract against AWS Textract, Extend, Reducto, LlamaParse, and Unstructured on 130 human-labeled pages from difficult real-world documents. Extract is #1 on text accuracy (81.9%) and word-overlap F1 (84.5%), second on grounded accuracy, and competitive on layout IoU, while running at least 2x faster than every other parser we tested.

Here are the benchmarks: https://extract.page/bench

Extract is $3 per 1,000 pages, about 5x cheaper than AWS Textract (with layout + table enabled). To see how it performs on your own docs, feel free to send us a few and we’ll run a benchmark on them.
We’ll get back to you with the results within a few days of receiving the docs: https://cal.com/team/youlearnai/extract-intro

Thanks for reading! This is the first version of our model, and we’re shipping further improvements for handwritten, multilingual, and table-heavy documents. We know there are documents it won’t handle well yet. If you have one, we’d love to see it.
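
If you want to sanity-check any parser on your own pages, here is a minimal Python sketch of two of the metrics mentioned above: word-overlap F1 and bounding-box IoU. It is an illustration under stated assumptions, not the scoring code behind the benchmark. The function names, the whitespace tokenization, and the `(x0, y0, x1, y1)` box format are all hypothetical choices made for this example.

```python
from collections import Counter


def word_overlap_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between parser output and a human-labeled reference.

    Assumption: lowercase whitespace tokenization; a real benchmark may
    normalize punctuation or Unicode differently.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match, one empty as a total miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def box_iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlapping region, clamped at zero.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0
```

Averaging `word_overlap_f1` over pages gives a rough text score, and `box_iou` between a predicted element box and its labeled box gives a rough layout score. Numbers from a sketch like this will not match a published benchmark exactly, since tokenization and box-matching rules differ.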