Published April 15, 2026 · 8 min read

What a PDF really is

A PDF is not a source layout file. It is mostly a final page-description format made of objects, drawing streams, fonts, images, and sometimes a text layer that is only partly reconstructible.

By Alessandra Maldini, Lead Technical Writer

When people open a PDF, they often feel like they are looking at a normal document: text, headings, paragraphs, tables, images.

That is true for the human eye.

But underneath, a PDF is not a source layout file like Word, Google Docs, or InDesign. It is not a format primarily designed for re-editing a document. It is mostly a format designed to describe reliably what a page should display.

That difference explains almost everything:

why some PDFs are easy to modify
why others feel locked
why text can be visible without being cleanly extractable
why a scan can look like a normal PDF while being structurally very different

A PDF is first a page-description format

The right mental model is simple:

a source document stores editing logic
a PDF mostly stores rendering logic

In other words, a source document still knows things like:

this is a paragraph
this is a heading style
this is a list
this sentence can reflow if the page changes

A PDF mostly tries to guarantee that the page will render correctly, in the right place, with the right shapes, images, fonts, and dimensions.

So it prioritizes visual fidelity much more than editability.

How a PDF is structured

Under the hood, a PDF is a collection of objects linked together.

It typically contains:

page objects
resource dictionaries
content streams
font objects
images
annotations
forms
metadata
a cross-reference table that records where each object starts in the file, so a reader can find objects quickly
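To make "a collection of objects" a little more concrete, here is a minimal sketch that scans a tiny hand-written fragment in PDF syntax for indirect object headers of the form `N G obj`. The sample bytes and the `list_objects` helper are assumptions made for this example. A real parser would follow the cross-reference table instead of scanning with a regular expression, and would also handle compressed object streams.

```python
import re

# A tiny, hand-written fragment in PDF syntax (illustrative, not a complete valid file).
SAMPLE = b"""%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
"""

# "N G obj ... endobj" delimits one indirect object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")

def list_objects(data: bytes):
    """Return (object number, generation, /Type or None) for each indirect object."""
    found = []
    for match in OBJ_RE.finditer(data):
        number, generation, body = match.groups()
        type_match = TYPE_RE.search(body)
        kind = type_match.group(1).decode("ascii") if type_match else None
        found.append((int(number), int(generation), kind))
    return found

print(list_objects(SAMPLE))
```

Notice that the objects point at each other with references such as `2 0 R`. The page object does not contain its text. It only points at a content stream that describes how to draw it.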

In practice, a PDF page does not always say "here is a paragraph."

Much more often, it says something closer to:

use this font
move to this coordinate
draw this glyph
move the text cursor
show another sequence
draw this image here

So a PDF is very good at describing how to paint a page, but much worse at describing how to intelligently re-edit its content.
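The drawing steps listed above map closely onto real content stream operators: `Tf` selects a font, `Td` moves the text position, and `Tj` shows a string. The sketch below tokenizes a small hand-written stream and groups each operator with its operands. The font name `/F1`, the coordinates, and the `parse_operations` helper are assumptions for this example, and the tokenizer is deliberately simplified (no nested parentheses, hex strings, or arrays).

```python
import re

# Hand-written content stream fragment; the font name /F1 and the coordinates are made up.
STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
40 0 Td
(world) Tj
ET
"""

# Alternatives, in order: literal string, name, number, operator.
TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z*']+"
)

def parse_operations(stream: str):
    """Group operands with the operator that follows them (PDF uses postfix notation)."""
    operations, operands = [], []
    for token in TOKEN_RE.findall(stream):
        if token[0] in "(/+-." or token[0].isdigit():
            operands.append(token)   # strings, names, and numbers are operands
        else:
            operations.append((token, operands))
            operands = []
    return operations

for operator, operands in parse_operations(STREAM):
    print(operator, operands)
```

Nothing in this stream says "paragraph" or even "sentence". The two words are separate strings, and the space between them exists only as a coordinate move.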

Visible text is not always "real text"

This is probably the most important point.

When you see a word in a PDF, several very different realities are possible.

Case 1: the PDF contains true digital text

In the best case, the file contains a genuine usable text layer.

That usually allows you to:

select words
search the document
copy and paste coherent text
target the content for editing or translation

These are the easiest PDFs to work with.

Case 2: the PDF displays text, but reconstruction is weak

Here, the word is visible on screen, but the link between what is displayed and the real characters is degraded.

The PDF may then:

display the word perfectly
but return bad copy-paste output
or produce messy extraction
or lose spaces, reading order, or some characters

Visually, everything looks normal. Structurally, it is much weaker.

Case 3: the text was converted to outlines

In some exports, letters are no longer stored as text, but as vector shapes.

To a human, this still looks like text.

To software, those are no longer characters. They are paths.

At that point, the page may be perfectly sharp while still being very poor for search, extraction, or targeted rewriting.

Case 4: the text is really an image

In a paper scan, a fax, a photocopy, or a camera capture, the entire page may just be an image.

The "text" only exists inside the pixels.

Software cannot really read the document until an OCR step adds a usable text layer.

Why some fonts are reconstructible and others are not

When people say that a font or a PDF is "reconstructible," it helps to be precise.

In practice, what matters is usually not reconstructing the full source font file. What matters is reconstructing the link between the glyphs being drawn and the real characters they represent.

That is where many PDFs become difficult.

Embedded fonts

A PDF can embed the font directly in the file. That is great for faithful rendering.

But displaying a font correctly is not enough to make the text cleanly extractable.

The rendering engine may know how to draw the intended glyph without the document providing a clean mapping between that glyph and its real Unicode character.

Subset fonts

Very often, the PDF does not embed the entire font. It only embeds the subset of glyphs actually used in the document.

That is commonly called a subset font.

Visually, this is efficient: the file is lighter and the appearance stays faithful.

But reconstruction can become harder:

glyph names may be partial
only a minimal internal encoding may remain
the mapping may be highly document-specific
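Subset fonts are usually recognizable by name: the PDF specification has the font name start with a tag of six uppercase letters followed by a plus sign, as in `ABCDEF+Helvetica`. The small sketch below detects that pattern. The `split_subset_name` helper and the sample names are assumptions for this example.

```python
import re

# Six uppercase letters, a plus sign, then the underlying font name.
SUBSET_RE = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_name(base_font: str):
    """Return (tag, family) for a subset-style name, or (None, name) otherwise."""
    match = SUBSET_RE.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font

for name in ["ABCDEF+Helvetica", "Times-Roman"]:
    print(name, "->", split_subset_name(name))
```

The tag only signals that the font was subset. It says nothing about whether the mapping back to real characters is any good.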

Mapping tables and `ToUnicode`

The key factor is often whether the PDF contains a good mapping between internal character codes and real Unicode characters.

When that mapping exists and is correct, extraction is usually much better.

When it is missing, incomplete, or poorly generated, the PDF may:

display correctly
but extract bad text
confuse characters
break ligatures
lose accents
emit incoherent symbols

In other words: a PDF can display the right thing without clearly explaining what it is displaying.
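Here is a minimal sketch of what such a mapping looks like and what happens when an entry is missing. It parses a hand-written excerpt in the style of a `ToUnicode` CMap, where each pair maps an internal code to a UTF-16BE value. The codes, the targets, and the `parse_bfchar` and `decode` helpers are assumptions for this example. Only `bfchar` entries are handled, and a real CMap may also contain `bfrange` sections.

```python
import re

# Hand-written excerpt in the style of a ToUnicode CMap; codes and targets are made up.
CMAP = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <FB01>
endbfchar
"""

PAIR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap: str) -> dict:
    """Map each internal character code to the Unicode string it stands for."""
    mapping = {}
    for code_hex, target_hex in PAIR_RE.findall(cmap):
        # Targets are written as UTF-16BE code units in hexadecimal.
        target = bytes.fromhex(target_hex).decode("utf-16-be")
        mapping[int(code_hex, 16)] = target
    return mapping

def decode(codes, mapping):
    """Decode codes, marking unmapped ones with U+FFFD instead of guessing."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

mapping = parse_bfchar(CMAP)
print(decode([1, 2], mapping))      # both codes are mapped
print(decode([1, 2, 4], mapping))   # code 4 has no entry
```

Code 3 in this sample maps to the single ligature character U+FB01 rather than to the two letters "f" and "i". That is one way ligatures end up looking broken after extraction even though the page renders correctly.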

Text converted to outlines

When text has been converted to outlines, reconstruction becomes even harder.

If letters have become vector outlines, there may be no text left to reconstruct at all. The only options are geometric inference or an OCR-like recovery path.

Why editing a PDF is harder than people think

Many people imagine that editing a PDF means "opening a frozen document."

In reality, the problem is often much less semantic and much more geometric.

A PDF may preserve:

glyphs positioned one by one
separated text fragments
imperfect reading order
blocks with no paragraph structure
spacing that is only implicit

It does not always clearly preserve:

the original paragraphs
the logical styles
the relationships between columns
reflow behavior
the higher-level editorial structure

So a visible paragraph may actually be, under the hood, a long series of tiny drawing instructions placed at exact coordinates.

That is enough for rendering.

It is much weaker for clean editing.
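To see why this is a geometric problem, consider a minimal sketch that rebuilds lines of text from positioned fragments. It groups fragments by baseline, sorts them left to right, and inserts a space whenever the horizontal gap is large enough. The `Fragment` type, the sample coordinates, and both thresholds are assumptions for this example. Real extractors use more careful heuristics based on font metrics.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # advance width, in points

def rebuild_lines(fragments, y_tolerance=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then join them left to right.

    A space is inserted when the horizontal gap exceeds space_gap points.
    Both thresholds are arbitrary values chosen for this example.
    """
    lines = []  # each entry: [baseline_y, [fragments]]
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > space_gap else "") + cur.text
        result.append(text)
    return result

fragments = [
    Fragment("world", 112.0, 720.0, 28.0),
    Fragment("Hel", 72.0, 720.0, 16.0),
    Fragment("lo", 88.0, 720.0, 10.0),
    Fragment("Second line", 72.0, 706.0, 60.0),
]
print(rebuild_lines(fragments))
```

Even in this toy case, the word "Hello" arrives in two pieces and out of order, and the space before "world" has to be inferred. Multi-column layouts, rotated text, and tight kerning make the same guesswork much less reliable.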

Why two PDFs that look the same can behave so differently

This follows directly from everything above.

Several files can display exactly the same sentence in exactly the same place:

one with a genuine usable text layer
another with badly mapped glyphs
another with outlined text
another with nothing but a page image

To a user, these four pages "look the same."

To an editing engine, they are four completely different situations.
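The four situations can be told apart with a rough triage step. The sketch below classifies a page from a handful of counts. The `PageSummary` fields are hypothetical values that an inspection pass might produce, and the rules are a crude heuristic written for this example, not how any particular tool decides.

```python
from dataclasses import dataclass

@dataclass
class PageSummary:
    # Hypothetical per-page counts that an inspection step might produce.
    text_ops: int           # text-showing operators such as Tj / TJ
    mapped_fonts: int       # fonts on the page that carry a ToUnicode map
    total_fonts: int        # all fonts used on the page
    path_ops: int           # path construction and painting operators
    full_page_images: int   # images covering roughly the whole page

def classify(page: PageSummary) -> str:
    """Crude triage; a missing ToUnicode map is treated as a warning sign only."""
    if page.text_ops > 0:
        if page.total_fonts and page.mapped_fonts == page.total_fonts:
            return "digital text"
        return "text with weak mapping"
    if page.full_page_images > 0:
        return "image only"
    if page.path_ops > 0:
        return "outlined text or vector art"
    return "empty or unknown"

pages = {
    "export": PageSummary(120, 2, 2, 10, 0),
    "weak": PageSummary(120, 1, 2, 10, 0),
    "outlined": PageSummary(0, 0, 0, 5400, 0),
    "scan": PageSummary(0, 0, 0, 0, 1),
}
for name, summary in pages.items():
    print(name, "->", classify(summary))
```

A font without a `ToUnicode` map can still extract well when it uses a standard encoding, so the "weak mapping" label here is a prompt to look closer rather than a verdict.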

That is why a PDF can be:

easy to search
hard to copy
possible to translate
but frustrating to correct character by character

or even impossible to edit without an extra recovery step.

Where OCR fits in

OCR does not magically recreate the original source document.

It does not turn a scan back into a clean Word file or a perfect InDesign layout.

Its role is narrower and more practical: add a usable text layer to a document that lacks one or has only a partial one.

That is essential for:

paper scans
faxes
digitized archives
image-only documents

So OCR mainly improves machine readability. It does not restore every editorial structure that was lost during export or scanning.

What this changes for editing tools

In practice, the right workflow depends on the real kind of PDF you have.

If the PDF already has a good text layer

AI editing is often the best option.

This is the case AI Edit is suited for, when you need to:

fix a sentence
rewrite a passage
update a name, date, or clause
translate content

If you mainly want to add something visual

Then the goal is no longer to reconstruct text, but to place something on the page.

That is where Manual Edit is useful for:

a signature
an annotation
a visual highlight
a drawn marker

If the document is mostly image-only

Then the first job is to make it readable by software.

That is where PDF OCR comes in, adding an invisible searchable text layer without changing the visible appearance of the document.
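One common way to make a text layer invisible is PDF's text rendering mode 3, set with `3 Tr`, which neither fills nor strokes the glyphs. The sketch below builds a content stream fragment that places recognized words at their positions in that mode. It is an illustration of the general technique, not a description of how any specific OCR product works. The `invisible_text_stream` helper, the font name `/F1`, and the word coordinates are assumptions, and the example assumes plain ASCII text with a simple font encoding.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def invisible_text_stream(words, font="/F1", size=10):
    """Build a content stream fragment that places each word invisibly.

    `words` is a list of (text, x, y) tuples, for example from an OCR engine.
    """
    lines = ["BT", "3 Tr", f"{font} {size} Tf"]  # 3 Tr = invisible text
    for text, x, y in words:
        # Tm sets an absolute text position, so each word sits over its pixels.
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)

# Made-up OCR results: (recognized text, x, y) in points.
ocr_words = [("Invoice", 72, 720), ("(draft)", 130.5, 720)]
print(invisible_text_stream(ocr_words))
```

The scanned image stays exactly as it was. The added operators only give search, selection, and copy something to work with, which is why the quality of the result depends entirely on how accurate the recognized words and positions are.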

The mental model worth keeping

A PDF is not "a frozen Word file."

It is a final page representation.

The more that representation preserves a clean text layer, good mappings, and usable structure, the more recoverable the document is.

The more it loses that information in favor of images, outlines, or opaque encodings, the harder editing becomes.

So the short version is:

PDF is excellent at preserving appearance
it is inconsistent at preserving editable structure
displaying correctly does not mean extracting correctly
seeing text does not mean there is usable text underneath
OCR helps recover a text layer, but it does not resurrect the entire source document

That is exactly why some PDFs are easy to work with, while others require OCR, manual adjustments, or a much more careful editing strategy.