OCR — making a TIFF-derived PDF searchable

[Figure: OCR adds an invisible text layer behind the image, so the page looks the same but is now searchable. Left: an image-only PDF, where the scan is stored as a picture and search returns nothing. Right: the searchable PDF after OCR (scan plus invisible text behind it), where a search for "invoice" finds a match on a sample invoice.]

Most TIFFs are scans — pictures of pages, with no real text inside. The PDF that TIFF2PDF produces is the same: visible content but no machine-readable text. Searches return nothing; copy-paste returns nothing. Adding OCR (Optical Character Recognition) layers actual text behind the image, making the PDF searchable while looking identical.

What OCR does

OCR analyzes the pixels of each page, identifies regions of text, segments them into characters, and recognizes each character. The output is text + position information: "the word 'invoice' starts at x=380, y=890, with these character bounding boxes".
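To make "text + position information" concrete, here is a minimal sketch of what one recognized word might look like as a data structure. The names `OcrWord`, `Box`, and `find_word` are hypothetical, chosen for illustration; real OCR engines each have their own output format. The coordinates reuse the "invoice" example above, and the box sizes are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x0, y0, x1, y1) of a rectangle on the page; an assumed convention for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class OcrWord:
    """One recognized word plus where it sits on the page."""
    text: str
    box: Box
    # Optional per-character boxes, in reading order.
    char_boxes: List[Box] = field(default_factory=list)


def find_word(words: List[OcrWord], query: str) -> List[OcrWord]:
    """Case-insensitive exact match, roughly what a search box needs."""
    q = query.lower()
    return [w for w in words if w.text.lower() == q]


# Illustrative values only: the word "Invoice" starting at x=380, y=890.
words = [
    OcrWord("Invoice", (380, 890, 450, 905)),
    OcrWord("#12345", (455, 890, 520, 905)),
]
hits = find_word(words, "invoice")
```

Because every word carries its own box, a viewer can both find the text and highlight the right spot on the scanned image.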

For a "searchable PDF", the text is added to each page as invisible text objects — drawn with 3 Tr (text rendering mode 3, "invisible"), positioned at the same coordinates as the visible glyphs. The PDF reader's search and copy-paste tools find this hidden text; the visual rendering shows only the original image.
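As a rough illustration of what such an invisible text object looks like, the sketch below builds the content-stream operators for one word. The operators (`BT`, `Tf`, `Tr`, `Tm`, `Tj`, `ET`) are standard PDF; the function names and the font resource name `F1` are assumptions for this example. It handles plain ASCII text only and skips font selection and width scaling, which a real OCR tool also has to deal with.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def fmt(value: float) -> str:
    """Format a number compactly: 380.0 becomes '380', 12.5 stays '12.5'."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Return content-stream operators that place `text` invisibly at (x, y).

    `font_name` must match a font in the page's resources; "F1" is a
    placeholder assumed for this sketch.
    """
    return "\n".join([
        "BT",                                   # begin text object
        f"/{font_name} {fmt(font_size)} Tf",    # select font and size
        "3 Tr",                                 # rendering mode 3: invisible
        f"1 0 0 1 {fmt(x)} {fmt(y)} Tm",        # text matrix: move to (x, y)
        f"({escape_pdf_string(text)}) Tj",      # show the string
        "ET",                                   # end text object
    ])


ops = invisible_text_ops("Invoice #12345", 380, 890, 12)
```

The page's existing image-drawing operators stay untouched; the text object is simply added to the same page, so the scan is what you see and the text is what search finds.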

TIFF2PDF doesn't do OCR

OCR is a heavyweight operation: a typical engine takes 2–10 seconds per page for English text on a modern CPU, and longer for non-Latin scripts or complex layouts. It is also a different problem domain (machine learning on images) from PDF assembly (file structure manipulation).
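To get a feel for what 2–10 seconds per page means for a batch, here is a small back-of-the-envelope estimator. The function name and the `workers` parameter are assumptions for this sketch, and dividing by `workers` assumes ideal parallel scaling, which real machines rarely reach.

```python
from typing import Tuple


def ocr_time_range_minutes(pages: int,
                           sec_per_page_low: float = 2.0,
                           sec_per_page_high: float = 10.0,
                           workers: int = 1) -> Tuple[float, float]:
    """Estimate (best, worst) wall-clock minutes to OCR `pages` pages.

    Defaults use the 2-10 seconds per page range quoted for English text.
    `workers` models pages processed in parallel (idealized assumption).
    """
    if pages < 0:
        raise ValueError("pages must not be negative")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    best = pages * sec_per_page_low / workers / 60
    worst = pages * sec_per_page_high / workers / 60
    return best, worst


single = ocr_time_range_minutes(600)               # one page at a time
parallel = ocr_time_range_minutes(600, workers=4)  # four pages at a time
```

For a 600-page batch this gives roughly 20 to 100 minutes on a single worker, which is why OCR is usually run as its own step rather than inside the conversion.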

TIFF2PDF produces image-only PDFs. The OCR step is separate.

Adding OCR after the fact

Most full-featured PDF tools offer a "Recognize Text" or "Make Searchable" feature: open the image-only PDF, run the command, save. Adobe Acrobat, ABBYY FineReader, and similar desktop tools all do this with a single click. The output is a PDF that looks identical to the input but has searchable, copy-pasteable text behind every visible word.

The same tools usually expose a few useful options:

Workflow: TIFF2PDF to assemble the image-only PDF, then run "Recognize Text" in your PDF tool to add the searchable layer.
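If you would rather script the second step than click through a desktop tool, one option is OCRmyPDF, an open-source command-line program. That is a swapped-in tool, not the "Recognize Text" command described above. The sketch below only builds and runs the command; it assumes `ocrmypdf` is installed and on your PATH, and the helper names and file names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(input_pdf: str, output_pdf: str,
                      language: str = "eng", deskew: bool = False) -> List[str]:
    """Build an OCRmyPDF command line for an image-only PDF."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        cmd.append("--deskew")  # straighten slightly rotated scans first
    cmd += [input_pdf, output_pdf]
    return cmd


def add_text_layer(input_pdf: str, output_pdf: str, language: str = "eng") -> None:
    """Run OCRmyPDF; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)


# Hypothetical file names: the image-only PDF goes in, the searchable copy comes out.
command = build_ocr_command("scans.pdf", "scans-searchable.pdf", deskew=True)
```

Writing to a new output file keeps the original image-only PDF around in case the OCR result needs to be redone with different settings.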

Quality of OCR

OCR accuracy depends on input quality:

TIFF mode affects OCR quality

Bilevel TIFFs (CCITT G4 from office scanners) are typically the best input for OCR:

Grayscale and color TIFFs work too but produce slightly lower accuracy on the same content because anti-aliasing introduces ambiguous pixels. The trade-off is preservation of detail (signatures, stamps, colored highlighting) at the cost of a few percentage points in OCR accuracy.

Non-Latin scripts

The major OCR tools support 100+ languages. For Cyrillic, Greek, Arabic, Hebrew, CJK (Chinese, Japanese, Korean), and many others, the OCR dialog usually has a language dropdown — pick one or several (most tools allow concatenated languages for mixed content) before starting recognition. Some tools download additional language packs on demand the first time you select them.
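In command-line engines, the equivalent of ticking several languages is usually a combined language string. Tesseract, for example, joins its language codes with `+` (such as `eng+rus`). The helper below is a minimal sketch of that convention; the function name is hypothetical, and other tools may expect a different format.

```python
from typing import Iterable, List


def combine_languages(codes: Iterable[str]) -> str:
    """Join language codes Tesseract-style, e.g. ["eng", "rus"] -> "eng+rus".

    Blank entries are skipped and duplicates dropped, keeping first-seen order.
    """
    seen: List[str] = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


# English plus Russian for a mixed-language invoice.
mixed = combine_languages(["eng", "rus", "eng"])
```

Listing only the languages that actually appear in the document is usually the safer choice, since every extra language gives the engine more ways to misread a character.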

Cloud OCR for higher accuracy

Desktop OCR is the standard option for bulk in-house work. Cloud-based services from major providers typically offer higher accuracy, especially on complex layouts (multi-column documents, forms, tables). They cost on the order of a fraction of a cent per page and can return structured output (tables, forms, key-value pairs). For bulk archival of 1000+ pages where accuracy matters, cloud OCR is often worth the cost.