OCR — making a TIFF-derived PDF searchable
Most TIFFs are scans — pictures of pages, with no real text inside. The PDF that TIFF2PDF produces is the same: visible content but no machine-readable text. Searches return nothing; copy-paste returns nothing. Adding OCR (Optical Character Recognition) layers actual text behind the image, making the PDF searchable while looking identical.
What OCR does
OCR analyzes the pixels of each page, identifies regions of text, segments them into characters, and recognizes each character. The output is text + position information: "the word 'invoice' starts at x=380, y=890, with these character bounding boxes".
For a "searchable PDF", the text is added to each page as invisible text objects — drawn with 3 Tr (text rendering mode 3, "invisible"), positioned at the same coordinates as the visible glyphs. The PDF reader's search and copy-paste tools find this hidden text; the visual rendering shows only the original image.
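The invisible-text mechanism described above can be sketched in a few lines. This is a minimal illustration, not how any particular OCR tool writes its output: the `Word` class, the font resource name `F1`, and the choice to place the baseline at the bottom of the bounding box are all assumptions made for the example. It converts a pixel bounding box (origin at the top-left of the scan) into PDF points (origin at the bottom-left of the page) and emits the content-stream operators, including `3 Tr`.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its pixel bounding box (hypothetical structure)."""
    text: str
    x: int       # left edge in pixels, measured from the left of the scan
    y: int       # top edge in pixels, measured from the top of the scan
    width: int   # box width in pixels
    height: int  # box height in pixels


def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses, which are special in PDF literal strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(word: Word, dpi: int, page_height_px: int, font: str = "F1") -> str:
    """Return content-stream operators that draw `word` as invisible text.

    Assumes ASCII text and a font resource named `font` already on the page.
    """
    scale = 72.0 / dpi  # PDF user space uses 72 units per inch
    x_pt = word.x * scale
    # Flip the vertical axis: PDF's origin is the bottom-left corner.
    y_pt = (page_height_px - (word.y + word.height)) * scale
    size_pt = word.height * scale  # rough font size: the box height
    return (
        "BT\n"
        "3 Tr\n"  # text rendering mode 3: neither fill nor stroke (invisible)
        f"/{font} {size_pt:.2f} Tf\n"
        f"{x_pt:.2f} {y_pt:.2f} Td\n"
        f"({escape_pdf_string(word.text)}) Tj\n"
        "ET"
    )


# The word from the example above, on a 300 DPI letter-size scan (3300 px tall).
print(invisible_text_ops(Word("invoice", 380, 890, 210, 50), dpi=300, page_height_px=3300))
```

Real OCR tools typically also stretch the text horizontally so the hidden word spans the same width as the visible one; that step is left out here to keep the sketch short.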
TIFF2PDF doesn't do OCR
OCR is a heavyweight operation: a typical engine takes 2–10 seconds per page for English text on a modern CPU, and longer for non-Latin scripts or complex layouts. It also belongs to a different problem domain: OCR is machine learning on images, while PDF assembly is file-structure manipulation.
TIFF2PDF produces image-only PDFs. The OCR step is separate.
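To get a feel for what the 2–10 seconds per page figure means for a whole batch, the arithmetic can be sketched as below. The function name and the `workers` parameter are illustrative, and the model assumes pages are processed independently with perfect parallel speed-up, which real tools only approximate.

```python
def ocr_batch_seconds(pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Rough wall-clock estimate for OCR'ing a batch of pages.

    Assumes every page takes the same time and that work divides evenly
    across `workers` parallel processes (an idealization).
    """
    if pages < 0 or seconds_per_page < 0 or workers < 1:
        raise ValueError("pages and seconds_per_page must be >= 0, workers >= 1")
    return pages * seconds_per_page / workers


# A 500-page batch at the low and high end of the per-page range.
low = ocr_batch_seconds(500, 2.0)
high = ocr_batch_seconds(500, 10.0)
print(f"Single process: {low / 60:.0f} to {high / 60:.0f} minutes")
print(f"Four processes: {low / 4 / 60:.0f} to {high / 4 / 60:.0f} minutes")
```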
Adding OCR after the fact
Most full-featured PDF editors offer a "Recognize Text" or "Make Searchable" feature: open the image-only PDF, run the command, save. Adobe Acrobat, ABBYY FineReader, and similar desktop tools all do this in a single step. The output is a PDF that looks identical to the input but has searchable, copy-pasteable text behind every visible word.
The same tools usually expose a few useful options:
- Skip pages that already have text: avoids double-OCR'ing partially-OCR'd inputs.
- Force OCR: re-OCR even pages that already have text — useful when the existing layer is poor quality.
- Deskew: detect and correct rotation up to a few degrees. Common with hand-fed scanners that don't load paper perfectly straight.
- Clean / despeckle: pre-process the image to remove specks and ink bleed before OCR. Improves accuracy on noisy scans by 1–3 percentage points.
- Remove background: convert near-white backgrounds to pure white. Helps for old or yellowed paper.
- Auto-rotate pages: detect and rotate pages that are sideways or upside-down. Useful for scans where the operator fed pages inconsistently.
Workflow: TIFF2PDF to assemble the image-only PDF, then run "Recognize Text" in your PDF tool to add the searchable layer.
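For scripted workflows, the same step can be driven from the command line. The sketch below assumes the open-source OCRmyPDF tool, which this page does not otherwise name; it is one possible choice, not the required one. The flags shown (`--skip-text`, `--force-ocr`, `--deskew`, `--clean`, `--rotate-pages`) correspond to the options listed above, but check them against the version you have installed. The helper only builds the argument list; it does not run anything.

```python
from typing import List


def build_ocr_command(
    input_pdf: str,
    output_pdf: str,
    *,
    skip_text: bool = False,
    force_ocr: bool = False,
    deskew: bool = False,
    clean: bool = False,
    rotate_pages: bool = False,
) -> List[str]:
    """Build an argument list for OCRmyPDF (assumed to be installed as `ocrmypdf`)."""
    if skip_text and force_ocr:
        # Skipping pages with text and forcing OCR on them contradict each other.
        raise ValueError("skip_text and force_ocr are mutually exclusive")
    cmd = ["ocrmypdf"]
    if skip_text:
        cmd.append("--skip-text")     # leave pages that already have text alone
    if force_ocr:
        cmd.append("--force-ocr")     # re-OCR even pages that already have text
    if deskew:
        cmd.append("--deskew")        # straighten slightly rotated scans
    if clean:
        cmd.append("--clean")         # despeckle before recognition
    if rotate_pages:
        cmd.append("--rotate-pages")  # fix sideways or upside-down pages
    cmd += [input_pdf, output_pdf]
    return cmd


# Hypothetical file names: the image-only PDF from TIFF2PDF in, a searchable PDF out.
print(" ".join(build_ocr_command("scan.pdf", "scan_ocr.pdf", skip_text=True, deskew=True)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the tool is installed.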
Quality of OCR
OCR accuracy depends on input quality:
- Printed text at 300+ DPI, clean scan: 99%+ accuracy. Free local OCR engines read typeset books, contracts, and articles correctly, with only rare errors such as occasional letter substitutions.
- Printed text at 200 DPI: 95–98% accuracy. Some fine-detail letters get confused (i/l, c/e, 1/I).
- Printed text below 150 DPI: 85–95%. Letter strokes become too thin in pixel terms (roughly below 2.5 pixels per stroke) for reliable recognition.
- Typewriter or older photocopy: 90–98%. Inconsistent character forms, but engines trained on noisy data handle them well.
- Handwriting: 30–80% with general-purpose OCR. Specialized handwritten-text-recognition (HTR) models do better but usually charge a per-page fee.
- Mixed languages or fonts: accuracy drops. Specify the language(s) in the OCR settings to help.
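Percentages like these are easier to judge when converted into errors per page. The sketch below does that arithmetic under two loudly simplifying assumptions: the accuracy figures are per character, and errors are independent of each other. The 2,000 characters per page and the six-letter word length are illustrative values, not measurements.

```python
def expected_char_errors(chars_per_page: int, char_accuracy: float) -> float:
    """Expected number of wrong characters on a page at a given character accuracy."""
    return chars_per_page * (1.0 - char_accuracy)


def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is right, assuming independent errors."""
    return char_accuracy ** word_length


# Conservative ends of the ranges quoted above (illustrative inputs).
for label, accuracy in [("300+ DPI", 0.99), ("200 DPI", 0.95), ("below 150 DPI", 0.85)]:
    errors = expected_char_errors(2000, accuracy)
    searchable = word_accuracy(accuracy, 6)
    print(f"{label}: ~{errors:.0f} wrong characters per page, "
          f"{searchable:.0%} of six-letter words fully correct")
```

The second figure matters for search: a word is only found if every character in it was recognized correctly, so a modest drop in character accuracy costs noticeably more at the word level.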
TIFF mode affects OCR quality
Bilevel TIFFs (CCITT G4 from office scanners) are typically the best input for OCR:
- Text strokes have well-defined edges from the scanner's threshold.
- No JPEG artifacts to confuse the recognizer.
- File size is minimal, so processing is fast.
Grayscale and color TIFFs work too but produce slightly lower accuracy on the same content because anti-aliasing introduces ambiguous pixels. The trade-off is preservation of detail (signatures, stamps, colored highlighting) at the cost of a few percentage points in OCR accuracy.
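The "ambiguous pixels" point can be illustrated with a single row of pixels across one text stroke. This is a toy sketch: the fixed threshold of 128 and the 64–192 "ambiguous" band are arbitrary values chosen for the example, and real scanners and OCR engines generally use adaptive thresholding rather than a fixed cut-off.

```python
from typing import List


def to_bilevel(gray_row: List[int], threshold: int = 128) -> List[int]:
    """Map 8-bit grayscale values to 1 (ink) or 0 (paper) with a fixed threshold."""
    return [1 if value < threshold else 0 for value in gray_row]


def ambiguous_pixels(gray_row: List[int], low: int = 64, high: int = 192) -> int:
    """Count mid-gray pixels that are neither clearly ink nor clearly paper."""
    return sum(1 for value in gray_row if low <= value <= high)


# One row crossing a stroke: white paper, a soft anti-aliased edge, black ink, and back.
row = [255, 250, 170, 90, 5, 0, 10, 120, 200, 255]
print(to_bilevel(row))        # every pixel is forced to ink or paper
print(ambiguous_pixels(row))  # edge pixels a recognizer would have to interpret
```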
Non-Latin scripts
The major OCR tools support 100+ languages. For Cyrillic, Greek, Arabic, Hebrew, CJK (Chinese, Japanese, Korean), and many other scripts, the OCR dialog usually has a language dropdown. Pick one or several languages before starting recognition; most tools allow combining languages for mixed content. Some tools download additional language packs on demand the first time you select them.
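In command-line tools built on the Tesseract engine, combining languages is typically done by joining language codes with `+` (for example `eng+rus`). The sketch below assumes such a tool; the small lookup table covers only a handful of the scripts mentioned above, and the helper name is made up for the example.

```python
from typing import Dict, List

# A few Tesseract language codes (a partial table, for illustration only).
TESSERACT_CODES: Dict[str, str] = {
    "English": "eng",
    "Russian": "rus",
    "Greek": "ell",
    "Arabic": "ara",
    "Hebrew": "heb",
    "Chinese (Simplified)": "chi_sim",
    "Japanese": "jpn",
    "Korean": "kor",
}


def language_argument(languages: List[str]) -> str:
    """Join language names into a Tesseract-style argument such as 'eng+rus'."""
    if not languages:
        raise ValueError("at least one language is required")
    unknown = [name for name in languages if name not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"no code in the table for: {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[name] for name in languages)


print(language_argument(["English", "Russian"]))
```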
Cloud OCR for higher accuracy
Desktop OCR is the standard option for bulk in-house work. Cloud-based services from major providers typically offer higher accuracy, especially for complex layouts (multi-column documents, forms, tables). They cost a fraction of a cent per page and can produce structured output (tables, forms, key-value pairs). For bulk archival of 1,000+ pages where accuracy matters, cloud OCR is often worth the cost.
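The cost side is simple arithmetic. The sketch below uses a hypothetical price of 0.15 cents per page purely as a placeholder for "a fraction of a cent"; actual pricing varies by provider, feature tier, and volume, so substitute the current rate from your provider.

```python
def cloud_ocr_cost_usd(pages: int, cents_per_page: float) -> float:
    """Total cost in US dollars for a batch at a flat per-page price."""
    if pages < 0 or cents_per_page < 0:
        raise ValueError("pages and cents_per_page must be non-negative")
    return pages * cents_per_page / 100.0


# Hypothetical price, not a quote from any provider.
HYPOTHETICAL_CENTS_PER_PAGE = 0.15

for pages in (1_000, 10_000, 100_000):
    cost = cloud_ocr_cost_usd(pages, HYPOTHETICAL_CENTS_PER_PAGE)
    print(f"{pages:>7} pages: ${cost:,.2f}")
```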