PDF File Format Explained: A Beginner's Guide

The PDF is everywhere. It's the format your boss emails contracts in, the format governments accept for forms, the format your bank uses for statements, the format your phone produces when you "Save as PDF" from any document. For something so ubiquitous, surprisingly few people know what's actually inside one, why it works the way it does, or what makes it different from a Word document or an image.

This guide is the friendly version: enough technical depth to understand what a PDF really is, but no specification-document jargon. By the end, you'll know what's inside a typical PDF, why some PDFs let you select text while others don't, and why certain operations on PDFs (like editing) are harder than they should be.

A short history

In 1991, Adobe co-founder John Warnock wrote an internal paper called "The Camelot Project." The premise was simple: documents created on any computer, in any application, should be viewable on any other computer, in identical form, forever. At the time, sharing documents between Mac and Windows was painful. Fonts didn't transfer. Layouts shifted. Print output looked different from screen output.

The solution Adobe shipped two years later was the Portable Document Format — PDF 1.0. The pitch: a single file format that captures the visual result of a document, including embedded fonts, exact layout, and printable instructions. Open it on any device, it looks the same.

Creating PDFs was initially a paid affair (you needed Acrobat to produce files). But Adobe released the specification publicly, which meant anyone could write tools that read or wrote PDFs. That openness drove adoption faster than Adobe could have alone. Today, PDF is an ISO standard (ISO 32000, first published in 2008) and tens of thousands of tools work with the format.

What's inside a PDF

A PDF file is structured like a small database. Internally, every PDF contains:

Header — declares the PDF version (%PDF-1.7, etc.)
Body — a series of objects representing pages, fonts, images, content streams
Cross-reference table — a directory that says "object 47 is at byte offset 18,294"
Trailer — points to the start of the cross-reference table

Each object can be:

A page — describes one page's size, content, and resources
A content stream — a sequence of drawing commands (like "draw the letter A at this position in this font")
A font — full or subset font data
An image — embedded raster image
A form (XObject) — a reusable drawing object
A metadata dictionary — author, title, creation date
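
To make the four parts and a few of the object types above concrete, here is a minimal sketch that assembles a one-page PDF by hand. The function name build_minimal_pdf, the page size, and the use of the built-in Helvetica font (referenced by name rather than embedded) are assumptions made for illustration. Real files usually compress their streams and carry font data, and the text here is assumed to be plain ASCII without parentheses or backslashes.

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a one-page PDF: header, body objects, xref table, trailer."""
    # Content stream: drawing commands that place the text on the page.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",                      # 1: root
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",              # 2: page tree
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",  # 3: page
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),  # 4
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",  # 5: font
    ]

    out = bytearray(b"%PDF-1.4\n")  # header
    offsets = []
    for number, body in enumerate(objects, start=1):  # body
        offsets.append(len(out))  # remember where each object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    xref_offset = len(out)  # cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset  # trailer points at xref
    return bytes(out)


pdf_bytes = build_minimal_pdf()
print(pdf_bytes.decode("latin-1"))
```

Writing the returned bytes to a file with a .pdf extension should give something most viewers will open, though a sketch this small skips many things real producers do.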

When you open a PDF, the viewer reads the trailer, jumps to the cross-reference table, looks up the root object (the document catalog), and follows it to the page tree before it starts rendering. The structure is designed for fast random access: you can open a 1000-page PDF and jump to page 500 without reading the previous 499.
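
The lookup sequence described above can be sketched in a few lines. This is a simplified reader that assumes a classic plain-text cross-reference table; many newer files store the table as a compressed stream, which this does not handle. The function names and the tiny sample file are made up for the example.

```python
import re


def find_startxref(data: bytes) -> int:
    """Read the offset of the cross-reference table from the end of the file."""
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data[-1024:])
    if match is None:
        raise ValueError("no startxref marker found near the end of the file")
    return int(match.group(1))


def read_xref_table(data: bytes, xref_offset: int) -> dict[int, int]:
    """Parse a plain-text xref table into {object number: byte offset}."""
    lines = data[xref_offset:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("expected the 'xref' keyword at this offset")
    offsets = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        first, count = map(int, lines[i].split())  # subsection header
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets


def object_at(data: bytes, offsets: dict[int, int], number: int) -> bytes:
    """Jump straight to one object without scanning the ones before it."""
    start = offsets[number]
    end = data.index(b"endobj", start) + len(b"endobj")
    return data[start:end]


# A tiny two-object file, built here only so the example runs on its own.
sample = bytearray(b"%PDF-1.4\n")
starts = []
for number, body in enumerate(
    [b"<< /Type /Catalog /Pages 2 0 R >>", b"<< /Type /Pages /Kids [] /Count 0 >>"],
    start=1,
):
    starts.append(len(sample))
    sample += b"%d 0 obj\n%s\nendobj\n" % (number, body)
table_at = len(sample)
sample += b"xref\n0 3\n0000000000 65535 f \n"
sample += b"".join(b"%010d 00000 n \n" % s for s in starts)
sample += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % table_at
sample = bytes(sample)

offsets = read_xref_table(sample, find_startxref(sample))
print(object_at(sample, offsets, 2).decode("ascii"))
```

The point of the design shows up in object_at: once the table is loaded, reaching any object is a dictionary lookup plus a seek, regardless of how large the file is.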

Why text in PDFs is sometimes selectable and sometimes not

This trips people up constantly. You try to copy text from a PDF and either nothing happens or you get a rectangle highlight. What's going on?

Text-based PDFs store text as character codes plus drawing instructions: "place the character 'H' at position (72, 720) in 12-point Times." You can copy, search, and resize this text without quality loss because it's stored as text.

Image-based PDFs store pages as images. Each page is essentially one big picture (typically a scan, often JPEG-compressed) embedded in the PDF. There is no text as such, only pixels arranged to look like text. You can't copy, search, or extract this text without first running it through OCR (Optical Character Recognition).
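
The difference is visible in the page's drawing commands. The sketch below is a rough illustration, assuming an uncompressed content stream and only the simplest text operator (Tj); the two sample streams and the helper name shown_text are invented for the example. Real streams are usually compressed, often use other text operators, and store character codes whose meaning depends on the font.

```python
import re

# Matches a literal string followed by the Tj ("show text") operator.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_text(content_stream: bytes) -> list[str]:
    """Return the strings a content stream draws with Tj, if any."""
    return [raw.decode("latin-1") for raw in SHOW_TEXT.findall(content_stream)]


# A text-based page: select a font, move to a position, show a string.
text_page = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
# An image-based page: scale to the page size and paint one image.
scanned_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

print(shown_text(text_page))     # ['Hello']
print(shown_text(scanned_page))  # []
```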

Scanners and many phone scanner apps produce image-based PDFs by default. Word, Google Docs, LaTeX, and similar tools produce text-based PDFs. A "Save as PDF" from a browser or word processor almost always gives you text-based output.

Tip

Quick test: open a PDF and try to select a word. If the cursor moves through the text letter by letter, it's text-based. If selection grabs whole rectangles, it's image-based and you'd need OCR to extract the text.

Why PDFs preserve formatting (when other formats don't)

A Word document describes content abstractly: "this is a paragraph in Body Text style." The visual result depends on which fonts your computer has, which version of Word you have, and your screen size.

A PDF describes content concretely: "this page has the character 'A' rendered at position (72, 720) in the embedded Helvetica font at 12 points." The visual result is locked. Because the fonts and positions travel with the file, any conforming viewer should produce the same output.

This is why PDFs are the format of choice for legal contracts, government forms, and printed materials — the recipient sees exactly what you sent. It's also why PDFs are harder to edit than Word documents: there's no "paragraph" concept underneath, just placed characters.

The PDF/A standard (long-term archival)

PDF/A is a stricter subset of PDF specifically designed for archival storage. The key constraints:

All fonts must be embedded (no relying on external font availability)
No dependence on external content (everything needed to render the pages must live inside the file)
No encryption or password protection (anyone in the future must be able to open the file)
Standardized color profiles must be embedded
No JavaScript (which could behave differently in future viewers)

Government archives, libraries, and legal repositories often require PDF/A specifically because it can be opened reliably decades from now. Most modern PDF tools have a "Save as PDF/A" or "PDF/A-compliant" export option.

PDF/X (print production) and other variants

A few other named variants exist:

PDF/X — for prepress and commercial printing; constraints on color profiles, fonts, and bleed marks
PDF/E — for engineering drawings; supports 3D models, technical annotations
PDF/UA — for accessibility; requires proper tagging, alternate text, reading order

These exist because the base PDF spec is permissive — it lets you do many things. The variants narrow that to what's appropriate for specific use cases.

Common PDF operations and what they're really doing

Merging

Combining the page trees of two or more PDFs into one. No re-rendering, no quality loss.

Splitting

Extracting some pages into a new PDF, leaving the rest behind. Like merging, structural — no quality loss.

Compressing

Re-encoding embedded images at lower quality, downsampling their resolution, subsetting fonts, and discarding unused objects. The image steps are lossy; the font and object cleanup is not.
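
"Discarding unused objects" amounts to a reachability check: start at the root, follow references, and drop whatever was never visited. The sketch below models objects as a plain dictionary of object numbers and the numbers they reference; that model and the sample data are assumptions for illustration, not how any particular tool represents a file.

```python
def reachable_objects(references: dict[int, set[int]], root: int) -> set[int]:
    """Collect every object number that can be reached from the root."""
    seen: set[int] = set()
    pending = [root]
    while pending:
        number = pending.pop()
        if number in seen:
            continue
        seen.add(number)
        pending.extend(references.get(number, set()) - seen)
    return seen


references = {
    1: {2},      # catalog -> page tree
    2: {3},      # page tree -> page
    3: {4, 5},   # page -> content stream and font
    4: set(),
    5: set(),
    6: set(),    # an image left behind after its page was deleted
    7: {6},      # the deleted page that still points at that image
}

kept = reachable_objects(references, root=1)
unused = set(references) - kept
print(sorted(unused))  # [6, 7]
```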

Converting to JPG

Rasterizing each page at a chosen DPI. Loses text selectability: you end up with images of pages.
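
The DPI choice translates directly into pixel dimensions, because PDF page sizes are measured in points and one point is 1/72 of an inch. A small worked example, assuming a US Letter page (612 x 792 points); the helper name raster_size is made up here:

```python
def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 in)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


for dpi in (72, 150, 300):
    width_px, height_px = raster_size(612, 792, dpi)
    print(f"{dpi} DPI -> {width_px} x {height_px} px")

# 150 DPI gives 1275 x 1650 pixels; 300 DPI gives 2550 x 3300.
```

Doubling the DPI doubles both dimensions, so the pixel count (and roughly the uncompressed image size) grows fourfold.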

Extracting text

Reads the text content from a text-based PDF (or runs OCR on an image-based one) and outputs plain text.

Rotating, reordering, deleting pages

All structural — modifying the page tree without touching the underlying content streams. Fast and lossless.
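
Why the structural operations (merging, splitting, rotating, reordering) are lossless is easier to see in a toy model. The sketch below treats a document as a Python list of pages, with a rotate field standing in for the page's /Rotate entry. The Page class and the three helpers are hypothetical, and a real tool also has to carry over fonts, images, and other resources each page refers to.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Page:
    content: bytes   # stands in for the page's content stream
    rotate: int = 0  # models the /Rotate entry, in degrees


def merge(*documents: list[Page]) -> list[Page]:
    """Concatenate page lists; the pages themselves are reused as-is."""
    return [page for document in documents for page in document]


def split(document: list[Page], start: int, stop: int):
    """Return (extracted pages, remaining pages)."""
    return document[start:stop], document[:start] + document[stop:]


def rotate(document: list[Page], index: int, degrees: int) -> list[Page]:
    """Change one page's rotation attribute; its content is not redrawn."""
    page = document[index]
    turned = replace(page, rotate=(page.rotate + degrees) % 360)
    return document[:index] + [turned] + document[index + 1:]


doc_a = [Page(b"stream A1"), Page(b"stream A2")]
doc_b = [Page(b"stream B1")]

merged = merge(doc_a, doc_b)
taken, rest = split(merged, 1, 3)
turned = rotate(merged, 0, 90)

print(len(merged), len(taken), len(rest))               # 3 2 1
print(turned[0].rotate, turned[0].content is doc_a[0].content)  # 90 True
```

In every case the content bytes are passed along untouched, which is why these operations are fast and cost nothing in quality.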

Adding page numbers or watermarks

Drawing new content on top of existing pages. The new content becomes part of the page after saving.

Why PDFs are hard to edit

A common frustration: "I just want to change one word in this PDF." Why is it so much harder than editing a Word doc?

Because PDFs don't store paragraphs, they store positioned characters. Changing one word means shifting every subsequent character's position, possibly reflowing the entire page, recalculating line breaks, and potentially affecting subsequent pages. There's no "paragraph" object to ask politely to reflow.

PDF editors (Adobe Acrobat Pro, PDF Expert, etc.) work around this by guessing the document's structure — grouping characters into words, lines, and blocks based on position — then modifying the underlying content streams. It works, but it's brittle. Complex layouts often break.
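
The guessing step can be illustrated with a small sketch: given individually positioned characters, group them into lines by vertical position and into words by horizontal gaps. The Glyph class, the tolerance values, and the place helper are all assumptions chosen for the example; real editors use far more elaborate heuristics, and this is not a description of how any named product works.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float


def group_into_lines(glyphs: list[Glyph], y_tolerance: float = 2.0) -> list[list[Glyph]]:
    """Bucket glyphs with similar baselines, top of page first."""
    lines: list[list[Glyph]] = []
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= y_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_text(line: list[Glyph], gap_factor: float = 0.3) -> str:
    """Insert a space wherever the gap between glyphs looks like a word break."""
    text = line[0].char
    for previous, current in zip(line, line[1:]):
        gap = current.x - (previous.x + previous.width)
        if gap > gap_factor * previous.width:
            text += " "
        text += current.char
    return text


def place(text: str, x: float, y: float, width: float = 6.0) -> list[Glyph]:
    """Test helper: lay out characters left to right; spaces leave a gap."""
    glyphs = []
    for char in text:
        if char != " ":
            glyphs.append(Glyph(char, x, y, width))
        x += width
    return glyphs


# Glyphs arrive in no particular order, as they may in a content stream.
glyphs = place("ok", 72, 706) + place("Hi there", 72, 720)
for line in group_into_lines(glyphs):
    print(line_to_text(line))
```

Notice how much depends on the two thresholds. Tighten or loosen them and words merge or split, which is a small-scale version of why complex layouts break.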

For real editing, the standard advice is: edit at the source. Open the original Word/Pages/InDesign file and re-export the PDF.

Encryption and permissions

PDFs can be password-protected. There are two passwords:

Open password — required to open the file at all
Permissions password — restricts actions (printing, copying, modifying) without restricting viewing

Modern PDF encryption (AES-256) is strong if the password is strong. Older PDF encryption (40-bit RC4) is trivially breakable.

It's worth knowing that permissions are advisory — they're enforced by polite PDF viewers but ignored by tools that don't care. If you mark a PDF as "no copying allowed," most browsers and Adobe Reader honor that, but a quick command-line tool can extract the text anyway. For sensitive content, encryption (the open password) is the meaningful protection; permissions are speed bumps.
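
Under the hood, the permissions are a single integer of bit flags stored in the file's encryption dictionary, and it is up to each program to check them. The sketch below decodes four of the flags, using the bit positions defined for the standard password-based scheme; treat the table as something to verify against the specification, and the function name as made up for this example.

```python
# Bit values for four permission flags (bits 3 to 6, counting from 1).
PERMISSION_BITS = {
    "print": 1 << 2,
    "modify": 1 << 3,
    "copy": 1 << 4,
    "annotate": 1 << 5,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Report which actions the permissions integer allows."""
    return {name: bool(p & bit) for name, bit in PERMISSION_BITS.items()}


print(decode_permissions(-44))
# {'print': True, 'modify': False, 'copy': True, 'annotate': False}
print(decode_permissions(-3904))
# {'print': False, 'modify': False, 'copy': False, 'annotate': False}
```

Nothing in the file enforces these answers. A viewer that wants to be polite calls something like decode_permissions and disables the matching menu items; a tool that doesn't care simply never looks.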

File structure quirks

A few oddities worth knowing:

PDFs can grow without re-saving the whole file. Every save can append an incremental update to the end, rather than rewriting the whole file. After many edits, the file can balloon. The fix is "Save As" to force a full rewrite.
PDFs can contain attachments. Other files (spreadsheets, images, other PDFs) can be embedded inside a PDF. Most viewers show these in a sidebar.
PDFs can contain JavaScript. Mostly used for form validation, occasionally for animations. Disabled by default in many viewers for security reasons.
PDFs can have forms. Interactive fillable forms, dropdowns, checkboxes — all part of the spec. Form data lives separately from the underlying page content.
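
Incremental updates leave a visible trace: each appended revision ends with its own startxref pointer and %%EOF marker. The sketch below counts them as a rough signal. It is a heuristic under stated assumptions: the sample bytes are stand-ins with placeholder offsets rather than a real file, files optimized for web viewing often carry two markers after a single save, and binary data inside a file could in principle contain the same pattern.

```python
import re

# Each saved revision ends with "startxref <offset> %%EOF".
REVISION_END = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def revision_offsets(pdf_bytes: bytes) -> list[int]:
    """Return the xref offset recorded at the end of each revision."""
    return [int(offset) for offset in REVISION_END.findall(pdf_bytes)]


# Stand-in data: the offsets are placeholders, not real table positions.
original = b"%PDF-1.7\n(objects)\nxref\n(table)\nstartxref\n22\n%%EOF\n"
edited = original + b"(changed objects)\nxref\n(table)\nstartxref\n75\n%%EOF\n"

print(len(revision_offsets(original)))  # 1
print(len(revision_offsets(edited)))    # 2
```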

Warning

PDFs with embedded JavaScript have historically been a vector for malware. Modern PDF viewers run JavaScript in a sandbox, and many restrict or disable it by default. Still, be cautious opening PDFs from unknown sources, especially if they prompt you to enable scripting.

Why PDF won

Ask "why didn't another format win?" and the answer comes down to a few specific decisions:

The spec was published openly. Anyone could build a PDF reader or writer. Word's .doc format wasn't openly documented for decades.
It locked the visual result. Other portable formats (HTML, RTF) reflow on different devices. PDF doesn't.
It became free to view. The first Acrobat Reader was a paid product, but Adobe soon made it a free download.
Print fidelity was excellent. PDFs print the way they appear on screen, which was a big deal in the 1990s.
It became an ISO standard. Governments and institutions could mandate PDF without locking themselves to a single vendor.

Newer formats (EPUB, ODF, even HTML) work better in specific scenarios — EPUB reflows nicely for e-readers, HTML works natively on the web. But for "I need to send you a document that looks exactly like this when you open it on any device, today or in twenty years," PDF still has no real rival.

FAQ