What does OCR actually do?

It looks at each page as an image and identifies the shapes of letters, then writes the recognised text into the PDF as a hidden layer over the original page image. Selection rectangles, copy/paste, and Ctrl/Cmd+F search all start working — without changing how the page looks.

Is my PDF uploaded to a server?

No. The PDF stays in your browser. Tesseract.js runs the OCR engine locally as WebAssembly. The first time you use it, the library downloads about 12 MB of public English language data from the Tesseract project's CDN. That download is data for the OCR engine coming in, not your file going out.

Why does OCR take so long?

OCR is genuinely heavy work: Tesseract has to examine every glyph on every page. A typical page takes 5–15 seconds, depending on your device, so a 20-page document can take 2–5 minutes. We keep the engine warm across pages so that multi-page jobs pay the startup cost only once.
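As a rough illustration of the arithmetic behind those figures, the sketch below scales the quoted 5–15 seconds per page by the page count. The function name `estimateOcrTime` and its return shape are hypothetical, chosen for this example only; the estimate leaves out the one-time first-run download and will vary with your device.

```typescript
// Per-page timing range quoted in this FAQ, in seconds.
const SECONDS_PER_PAGE = { min: 5, max: 15 };

interface OcrEstimate {
  minSeconds: number;
  maxSeconds: number;
}

// Hypothetical helper: multiplies the per-page range by the page count.
function estimateOcrTime(pageCount: number): OcrEstimate {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    minSeconds: pageCount * SECONDS_PER_PAGE.min,
    maxSeconds: pageCount * SECONDS_PER_PAGE.max,
  };
}

// A 20-page document: 100 to 300 seconds, i.e. roughly 2 to 5 minutes.
const estimate = estimateOcrTime(20);
console.log(estimate.minSeconds, estimate.maxSeconds); // 100 300
```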

Which languages are supported?

English only at launch. Multilingual OCR (French, German, Spanish, etc.) is on the v2 roadmap — each language adds another ~12 MB download.

How accurate is the recognised text?

On clean scans of printed text (200+ DPI, well-lit, straight), accuracy is typically 95%+. Accuracy drops on low-resolution scans, skewed pages, faded ink, handwriting, decorative fonts, and pages with heavy background patterns. Handwriting in particular is not reliably recognised.

Will the output file be bigger?

Usually a bit smaller. We re-encode each page as a JPEG at 85% quality and embed a thin text layer. Large PDFs whose page images are stored losslessly can shrink substantially; already-compressed scans stay about the same.

What's the file size limit?

50 MB per PDF. Long PDFs are also limited by your device's memory — Tesseract's peak memory usage runs to several hundred MB during recognition.
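A size check of this kind can be sketched as follows. This is a minimal illustration, not the tool's actual code: the name `checkPdfSize`, the message wording, and the choice of binary megabytes (1 MB = 1024 × 1024 bytes) are all assumptions.

```typescript
// 50 MB limit from this FAQ, assuming binary megabytes (an assumption).
const BYTES_PER_MB = 1024 * 1024;
const MAX_PDF_BYTES = 50 * BYTES_PER_MB;

interface SizeCheck {
  ok: boolean;
  message: string;
}

// Hypothetical helper: rejects files above the limit before any OCR starts.
function checkPdfSize(sizeInBytes: number): SizeCheck {
  if (sizeInBytes > MAX_PDF_BYTES) {
    const sizeMb = (sizeInBytes / BYTES_PER_MB).toFixed(1);
    return { ok: false, message: `PDF is ${sizeMb} MB; the limit is 50 MB.` };
  }
  return { ok: true, message: "PDF is within the 50 MB limit." };
}

console.log(checkPdfSize(60 * BYTES_PER_MB).message);
// PDF is 60.0 MB; the limit is 50 MB.
```

In a browser, the byte count would typically come from the `size` property of the selected `File` object. A check like this cannot predict memory pressure during recognition, which depends on the device.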

My PDF already has text — should I still run OCR?

No. If the text is already selectable in Adobe Reader or your browser's PDF viewer, OCR won't improve it and may slightly degrade it (the original text layer is replaced). OCR is only for scanned or image-based PDFs.

EDIT TOOL

OCR PDF

Make a scanned PDF searchable.

Pick a scanned or image-based PDF (max 50 MB)

Heads up — first run downloads ~12 MB. The OCR engine and English language data load from the Tesseract project’s public CDN the first time. After that it’s cached. Expect about 5–15 seconds per page on a modern device. English only at launch.

Run optical character recognition on a scanned or image-based PDF. We render each page, recognise the text with Tesseract, and write the original pages back with an invisible text layer on top — so the PDF looks identical but text is now selectable and searchable. English only at launch. Runs entirely in your browser, but the first run downloads about 12 MB of language data from the Tesseract project's public CDN (one-time, cached).
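One detail of placing an invisible text layer is converting coordinates: OCR reports word boxes in pixels of the rendered page image, measured from the top-left corner, while PDF pages are measured in points (1/72 inch by default) from the bottom-left corner. The sketch below shows that conversion under stated assumptions: the box field names, the function name `pixelBoxToPdfRect`, and the 200 DPI render resolution are illustrative, not taken from this tool's code.

```typescript
// Word box in rendered-image pixels, origin at the top-left corner.
interface PixelBox {
  x0: number; // left
  y0: number; // top
  x1: number; // right
  y1: number; // bottom
}

// Rectangle in PDF points (1/72 inch), origin at the bottom-left corner.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: maps an OCR word box onto the PDF page.
function pixelBoxToPdfRect(
  box: PixelBox,
  renderDpi: number,
  pageHeightPt: number
): PdfRect {
  // Convert pixels to points: 72 points per inch, renderDpi pixels per inch.
  const toPoints = (px: number): number => (px * 72) / renderDpi;
  return {
    x: toPoints(box.x0),
    // Flip the y axis: the bottom edge of the box becomes the rect's y.
    y: pageHeightPt - toPoints(box.y1),
    width: toPoints(box.x1 - box.x0),
    height: toPoints(box.y1 - box.y0),
  };
}

// A word on a US Letter page (792 pt tall) rendered at an assumed 200 DPI.
const rect = pixelBoxToPdfRect({ x0: 200, y0: 200, x1: 600, y1: 250 }, 200, 792);
console.log(rect); // { x: 72, y: 702, width: 144, height: 18 }
```

Text drawn at that rectangle in an invisible rendering mode lines up with the word in the page image, which is what lets selection and search highlight the right spot.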

How it works

How OCR PDF works

Upload your PDF
Drop in a scanned or image-based PDF. The clearer and straighter the scan, the better the OCR result.
Wait for OCR
Tesseract.js processes each page (5–15 seconds per page). The first run downloads ~12 MB of language data; subsequent runs use the cache.
Download the searchable PDF
Pages look identical to the original, but text is now selectable, copyable, and searchable.
