From PDFs to AI-ready structured data: a deep dive (2024)

By Ines Montani · Explosion · Blog · Dec 2, 2024 · 16 minute read

Tags: spaCy, Prodigy, Layout Analysis, Optical Character Recognition (OCR), Object Detection, Span Categorization, Named Entity Recognition, LLMs, Annotation, NLP Strategy, Computer Vision

PDFs are ubiquitous in industry and daily life. Paper is scanned, documents are sent and received as PDF, and they’re often kept as the archival copy. Unfortunately, processing PDFs is hard. In this blog post, I’ll present a new modular workflow for converting PDFs and similar documents to structured data and show how to build end-to-end document understanding and information extraction pipelines for industry use cases.

With more powerful Vision Language Models (VLMs), it’s finally become viable to complete many end-to-end tasks using PDFs as inputs, like question answering or more classic information extraction. This makes it tempting to consider PDF processing “solved” and treat PDF documents like yet another data type. I’ve even heard from people now converting plain text to PDFs because their AI-powered tool of choice was designed for PDFs. (Note: Don’t do this!)

When working with data, you typically want to operate from a “source of truth” with a structure you can rely on and develop against. This is a big reason why we use relational databases. The problem is, saying “I have the data in a PDF” is about as meaningful as saying “I have it on my computer” – it can mean anything. It may be plain text, scanned photos of text with varying image quality, or a combination of both. The layout properties and images embedded in the document may be extremely relevant, or they may not. All of these things fundamentally change the approach required to extract useful information. Machine learning rarely happens in a vacuum. There’s always an end goal: a product feature or a business question you want to answer.

So I believe it’s crucial to get your data out of PDFs as early as possible. If you’re dealing with text, it shouldn’t matter whether it came from a PDF, a Word document or a database. All of these formats are used interchangeably to store the same information.

If you use PDFs as the “source of truth” for machine learning, you end up with a monolithic and operationally complex approach. For example, to sort PDFs into different categories, the model has to do many things at once: process the document, find text, extract it where necessary, embed it all, and predict a classification label. In the case of Retrieval-Augmented Generation (RAG), it additionally has to parse the question, find the relevant document, find the relevant slice of that document and formulate a response. If we remove the PDF format and its intricacies from the equation, the task suddenly becomes fairly straightforward: text classification, with optional layout features.
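To make the decomposition concrete, here is a minimal sketch in Python. It assumes the text has already been extracted by an upstream step. The `ExtractedDocument` type, the `classify` function and the keyword rules are all hypothetical stand-ins for a real extraction step and a trained text classifier, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedDocument:
    """Plain text plus optional layout features, independent of the source format."""
    text: str
    source_format: str  # e.g. "pdf", "docx", "database"; kept for provenance only
    headings: list[str] = field(default_factory=list)


# Hypothetical keyword rules standing in for a trained text classifier
KEYWORDS = {
    "invoice": ("invoice", "amount due", "payment terms"),
    "contract": ("agreement", "party", "hereby"),
}


def classify(doc: ExtractedDocument) -> str:
    """Pick the label whose keywords occur most often in the text and headings."""
    haystack = " ".join([doc.text, *doc.headings]).lower()
    scores = {
        label: sum(haystack.count(word) for word in words)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a catch-all label if no keyword matched at all
    return best if scores[best] > 0 else "other"


# The same content arriving from two different sources
from_pdf = ExtractedDocument("Invoice 42. Amount due: $100.", "pdf", ["Invoice"])
from_db = ExtractedDocument("Invoice 42. Amount due: $100.", "database", ["Invoice"])
print(classify(from_pdf), classify(from_db))
```

The classifier never looks at `source_format`, so the same text gets the same label whether it came from a PDF or a database. The headings show one way optional layout features could feed into the prediction.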

At their core, many NLP systems consist of relatively flat classifications. You can shove them all into a single prompt, or you can decompose them into smaller pieces that you can work on independently. A lot of classification tasks are actually very straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.

— A practical guide to human-in-the-loop distillation: Making problems easier

These are all considerations that went into developing some of our own workflows for handling PDFs, specifically in the context of Natural Language Processing (NLP) and large-scale information extraction. It’s been one of the bigger missing pieces for smooth, end-to-end NLP in industry and will hopefully be useful for teams working with various input formats, including PDFs, Word documents and scans.

Are you working with PDFs or similar documents and have an interesting use case to share? Or do you need help with implementing a similar workflow in your organization? Feel free to get in touch!

A practical implementation with spaCy and Docling

Docling is developed by a team at IBM Research, who have also trained their own layout analysis and table recognition models. It takes a pipeline approach, combining modules for file parsing, layout analysis, Optical Character Recognition (OCR), table structure recognition and postprocessing to generate a unified, structured format. This makes it a great complement to spaCy, which is designed around the structured Doc object, a container for linguistic annotations that always map back into the original document.
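The idea of annotations that map back into the original text can be sketched with plain character offsets. This is an illustration of the principle only, not how spaCy or Docling implement it. `LayoutSpan` and `build_document` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutSpan:
    label: str  # e.g. "title", "text", "table"
    start: int  # character offset where the section begins
    end: int    # character offset just past the section's last character


def build_document(sections: list[tuple[str, str]], sep: str = "\n\n"):
    """Join (label, text) sections into one string and record each section's offsets."""
    parts = []
    spans = []
    offset = 0
    for i, (label, text) in enumerate(sections):
        if i > 0:
            # Account for the separator inserted between sections
            parts.append(sep)
            offset += len(sep)
        parts.append(text)
        spans.append(LayoutSpan(label, offset, offset + len(text)))
        offset += len(text)
    return "".join(parts), spans


text, spans = build_document([
    ("title", "Annual report"),
    ("text", "Revenue grew in 2024."),
])
for span in spans:
    # Slicing with the stored offsets recovers the original section text
    print(span.label, repr(text[span.start:span.end]))
```

Because every span stores offsets into a single text, downstream components can work on the plain string and still recover which layout section a prediction came from.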

Figure: The Docling architecture for PDF processing (Auer et al., 2024)

spacy-layout extends spaCy with document processing capabilities for PDFs, Word documents and other formats, and outputs clean, text-based data in a structured format. Document and section layout features are accessible via a layout extension attribute and can be serialized in an efficient binary format.

import spacy
from spacy_layout import spaCyLayout

nlp =...
