When a scan becomes a searchable PDF

June 17, 2026

Let's say that you signed a 5-page work contract. You have the paper document but no electronic version. Now you want to find a particular clause in the contract. How can you find it quickly? If you scan it with FairScan, you now get a PDF whose text you can search directly. You can also copy any text from it, for example the paragraph you were looking for, and paste it into another application. This is the result of a text recognition system, often referred to as OCR (Optical Character Recognition).

An invisible text layer

When you scan a document, what you get is basically an image of the document. It's often stored in a PDF file, and your PDF viewer displays this image, which contains the pixels of a signature, the pixels of a logo, or the pixels that make up the characters of a word. But to your PDF viewer, these are just the pixels of an image, and you can't search for text.

OCR is about identifying which pixels belong to text and reconstructing the text information they contain: which words and where they are located. The PDF file can then contain both the image and an invisible text layer. That's the recipe for a useful illusion:

You still see the document exactly as it is in the image

When you search for text, your PDF viewer highlights an area where you see a word that matches your search

When you point at an area of the image where you see a word, the cursor lets you select the word and copy it

Now you know the trick! And getting searchable PDFs like that out of FairScan is what I wanted to do.
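
As a rough illustration of what an invisible text layer looks like inside a PDF page, here is a minimal Python sketch. It relies on the PDF text rendering mode 3 (`3 Tr`), which draws text with neither fill nor stroke, so the text stays searchable and selectable without being visible. The function names and the font resource name `F1` are assumptions made for this example. This is not FairScan's actual code.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_word(text: str, x: float, y: float, font_size: float,
                   font_name: str = "F1") -> str:
    """Return page content operators that place `text` at (x, y), invisibly.

    `font_name` is a hypothetical resource name: the page must declare a
    font under that name for the operators to be valid.
    """
    return "\n".join([
        "BT",                                # begin a text object
        f"/{font_name} {font_size:.2f} Tf",  # select the font and its size
        "3 Tr",                              # rendering mode 3: invisible text
        f"1 0 0 1 {x:.2f} {y:.2f} Tm",       # text matrix: move to (x, y)
        f"({escape_pdf_string(text)}) Tj",   # show the string
        "ET",                                # end the text object
    ])


if __name__ == "__main__":
    print(invisible_word("clause (3)", 72, 700.5, 12))
```

The image of the page is drawn as usual, and one such block per recognized word (or line) is placed over the matching area of the image.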

Why now?

That looks nice, so why didn't FairScan have it before? Because FairScan wasn't ready for it: I wouldn't be comfortable if FairScan were giving you wrong text half of the time. To get good results from an OCR engine, you need to provide it with a readable image. If an image is hard for a human to read, it's certainly also hard for an OCR engine.

Over the past year, I worked on different parts of FairScan's automatic processing to transform photos of documents into PDFs that are easy for humans to read:

document detection

perspective correction

shadow reduction

brightness and contrast enhancement

All this work on image processing helped FairScan produce clean PDFs and can now also contribute to making text recognition effective.

Implementing OCR in FairScan

Developing an OCR engine is quite hard, and it was definitely out of the question for FairScan. I use Tesseract, an open source project that was started about 40 years ago at HP.

Tesseract can be integrated easily into an Android app and covers about 120 languages. To avoid drastically increasing the app's download size, language-specific data is not included in the app: you have to download it separately for each language from FairScan's settings screen.

Like all the other parts of FairScan's scanning process, OCR runs on your device: your data is not sent to any server to be processed. This is unlike many mobile apps that rely on cloud processing to perform the same task. On a mobile device, OCR typically takes a few seconds per page. That's significantly more than the other processing steps in FairScan, but still fast enough to remain practical for everyday scanning.
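
To give an idea of what an OCR engine hands back, here is a small Python sketch that reads the TSV output that Tesseract's command-line tool can produce: one row per element, with a bounding box and a confidence value. FairScan calls Tesseract from inside an Android app, so this only illustrates the shape of the data, not how the app is wired. The `Word` class, the function name and the confidence threshold are assumptions made for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int        # pixels from the left edge of the image
    top: int         # pixels from the top edge of the image
    width: int
    height: int
    confidence: float


def parse_tesseract_tsv(tsv: str, min_confidence: float = 0.0) -> list[Word]:
    """Extract word boxes from Tesseract TSV output.

    Rows with level 5 describe single words; other levels describe pages,
    blocks, paragraphs and lines, and are skipped here.
    """
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        confidence = float(row["conf"])
        if not text or confidence < min_confidence:
            continue  # drop empty or low-confidence results
        words.append(Word(text, int(row["left"]), int(row["top"]),
                          int(row["width"]), int(row["height"]), confidence))
    return words
```

Each resulting word, with its position, is what ends up in the invisible text layer of the PDF.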

The challenge of PDF fonts

Making Tesseract run inside FairScan wasn't particularly difficult. Generating PDFs that work reliably across different writing systems turned out to be much harder.

PDF files need fonts that cover the characters used in the file. Which font should FairScan use to generate PDFs? Tesseract can read many languages, such as English, Russian or Chinese, so we need a font that covers the Latin alphabet, the Cyrillic alphabet and Chinese characters. Fonts can be huge, and that could impact both the app download size and the size of the generated PDFs. The irony is that, for invisible text, users never see the characters produced by the font. But PDF still needs a font for the text.

My research led me to the Adobe NotDef font: it renders all Unicode characters (covering all writing systems) the same way. As it's a tiny file, it looked like a good solution to avoid file size issues. However, because this font covers so many characters, the library I use to generate PDFs takes a couple of seconds to load it on my phone. I didn't manage to solve that and looked for another solution.

In fact, Tesseract itself can render PDFs. It faces the exact same problem and solves it with a kind of fake font. I couldn't use that directly because FairScan needs direct control over the PDF, for example to include information about the physical dimensions of the document. But I took a lot of inspiration from how Tesseract generates PDF files. It requires hand-crafting the PDF, which I had never done before, but it works: it produces PDFs for all kinds of characters very quickly and with a tiny font.
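
To show what hand-crafting a PDF involves, here is a minimal Python sketch that assembles a one-page file byte by byte: numbered objects, a cross-reference table with their byte offsets, and a trailer. The page size is derived from physical dimensions in millimetres. As a loud simplification, it uses the built-in Helvetica font, which only works for Latin text, whereas the approach described above relies on a tiny fake font to cover all writing systems. The function name `build_pdf` and the text position are assumptions made for this example.

```python
def build_pdf(width_mm: float, height_mm: float, text: str) -> bytes:
    """Hand-assemble a one-page PDF holding a single invisible line of text."""
    mm_to_pt = 72 / 25.4  # PDF units are points: 72 per inch, 25.4 mm per inch
    w, h = width_mm * mm_to_pt, height_mm * mm_to_pt

    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # "3 Tr" makes the text invisible; latin-1 is enough for this
    # Helvetica-based simplification and fails on other scripts.
    content = (f"BT /F1 12 Tf 3 Tr 72 {h - 72:.2f} Td ({escaped}) Tj ET"
               .encode("latin-1"))

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {w:.2f} {h:.2f}] "
         f"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
         ).encode("ascii"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object in the file
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


if __name__ == "__main__":
    with open("invisible.pdf", "wb") as f:
        f.write(build_pdf(210, 297, "Clause 7 (termination)"))
```

Opening the resulting A4-sized file shows a blank page, but searching for "Clause" in a PDF viewer should find the hidden text. A real scan would also draw the page image before the text.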

Keeping it simple

The technical solution I worked on...
