Text in PDF documents

Volume 11, Issue 62; 26 Sep 2008; last modified 08 Oct 2010

You can see the words on the page, so you know they must be in there, right? Well, sorta.

We navigate our whole lives using words. Change and improve the words and I believe we can change and improve life.

My attention was drawn recently to Text Content in PDF Files by Jim King. If you've ever tried to extract text from a PDF file, you'll appreciate this clear and concise description of why it didn't work. Or at least, didn't work as well as you'd hoped.
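
To make the difficulty concrete, here is a minimal sketch in Python of the guesswork an extractor has to do. It assumes a simplified model in which a page is just a bag of text runs, each drawn at a coordinate, with no explicit spaces or line breaks and no guaranteed order. The `Run` type, the `extract_lines` function, and both thresholds are hypothetical, invented for illustration; this is not the API of any real PDF library, and real extractors have to cope with far more (fonts, encodings, rotation, columns).

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical positioned text run: a string drawn at (x, y) with a known width."""
    x: float
    y: float
    width: float
    text: str


def extract_lines(runs, line_tolerance=2.0, space_gap=1.5):
    """Guess reading order from geometry alone.

    Runs whose baselines are within line_tolerance are treated as one line;
    a horizontal gap wider than space_gap is treated as a word space.
    Both thresholds are arbitrary assumptions.
    """
    lines = []  # list of (baseline_y, [runs on that line])
    # PDF's y axis grows upward, so a larger y is higher on the page.
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= line_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))

    result = []
    for _, line_runs in lines:
        line_runs.sort(key=lambda r: r.x)  # left to right within a line
        pieces = [line_runs[0].text]
        for prev, cur in zip(line_runs, line_runs[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_gap:
                pieces.append(" ")  # the space was never in the file; we infer it
            pieces.append(cur.text)
        result.append("".join(pieces))
    return result


# Runs arrive in drawing order, which need not be reading order.
runs = [
    Run(x=130.0, y=700.0, width=28.0, text="world"),
    Run(x=72.0, y=686.0, width=40.0, text="Second"),
    Run(x=72.0, y=700.0, width=24.0, text="Hel"),
    Run(x=96.0, y=700.0, width=30.0, text="lo,"),
    Run(x=116.0, y=686.0, width=20.0, text="line"),
]

for line in extract_lines(runs):
    print(line)
```

Even in this toy case the words, the spaces, and the line order are all inferred rather than read. Put two columns side by side, or tighten the letter spacing, and the same heuristics start to produce nonsense.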

I appreciate the art of the typeset page. I have a letterpress and several drawers of type in my basement. I also appreciate that sometimes information and its presentation are inextricably linked. But that is not often the case.

As a consumer, I would almost always be happier if you simply published your information in reasonably well-structured (X)HTML with a little CSS styling to establish whatever look and feel you want. (No Flash, no Silverlight, no PDF, just the content, please.) If you've got your information in some more structured XML, I'd probably like access to that too, but publishing arbitrary XML on the web seems not to have taken off.

The reason is simple: PDF gives all the formatting control to the author. I'm the reader and I want the control. I want the freedom to flow the text differently to fit on my wide or narrow screen or handheld device, or convert it into an audio format, or do something else that will make the information more useful to me.

I think the assertion that “enough auxiliary information has been defined for PDF files such that a well written program can extract the text content from PDF pages” only tells half the story. It's clear from my experience with PDF files in the wild that there are a lot of PDF generators out there that aren't writing enough auxiliary information.

King concludes:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

I'll grant that the virtue of PDF is that it allows me to publish page images in a more compressed and flexible format than bitmaps; it's just not a virtue that holds very much appeal to me. I don't think I'd go so far as to say that it's worth the price of obscured content, but what really worries me is the implication that you might put content into PDF intending to get the text back out.

I understand the convenience of PDF for publishing page images. I sincerely wish that there was a free, fully conformant XSL FO processor. I think that'd be a big boon for a lot of people.

But for the love of whatever you hold dear, please don't use PDF as an archival format for your content! You want to hold onto originals with richer, more accessible structure. Ideally, you want them to be stored in a widely available XML format. But you knew I was going to say that.

Comments

I guess this is why Adobe is exploring XML as the foundation for the next version of PDF (codenamed Mars).

PDF files can include structure information which makes the exported content better. You can reflow the text into a single column the width of the screen, too, but pictures are lost while you are reading it. So, PDF has more capability than you are mentioning.

If the author created tagged PDF (as can be done with structured FrameMaker), then PDF text can be exported (or read aloud by screen readers) without including extraneous text like headers and footers. For any of this to work really well, the author must provide the information while authoring.

PDF is certainly not intended as an archival format. Couldn't agree with you more strongly!

It looks pretty. That sums up the Adobe product. No more.

I agree, PDF is NOT meant to be an archival format. But I suppose that by PDF you mean Traditional PDF, because there are many ways to create a valid PDF file, and not all valid PDF files are usable in every context. As Lowagie explains in his book, iText in Action, there are several types of PDF:

Traditional PDF: is a read-only paginated document format that can contain all kinds of multimedia, links, bookmarks, and so forth; but it doesn’t know anything about text structure

Tagged PDF: is a stylized use of PDF; it defines a set of standard structure types and attributes that allow page content to be extracted and reused for other purposes

Linearized PDF: is organized in a special way to enable efficient incremental access, thus enhancing the viewing performance. Its primary goal is to display the first page as quickly as possible.

PDFs preserving native editing capabilities: Adobe Illustrator was one of the ancestors of PDF. In Adobe Illustrator, you have the option to save files as a PDF file. If you open such a file in Illustrator, you can continue editing, just like with the native AI format. Note that these PDF files aren’t suited for general, online distribution: they’re larger than the traditional PDFs because they contain a lot of application-specific data.

PDF types that became an ISO standard: including PDF/X, PDF/A and XMP, PDF/E

So, I do agree that Traditional PDF is not the way to go when it comes to archiving.

Hybrid PDF support in OpenOffice 3.0 is already doing the PDF+XML (ODF) thing.

http://www.oooninja.com/2008/06/pdf-import-hybrid-odf-pdfs-extension-30.html

And PDF/A support seems pretty common now (largely because of the needs of the legal community for retrieval from archives).