We navigate our whole lives using words. Change and improve the words and I believe we can change and improve life.
My attention was drawn recently to Text Content in PDF Files by Jim King. If you've ever tried to extract text from a PDF file, you'll appreciate this clear and concise description of why it didn't work. Or at least, didn't work as well as you'd hoped.
I appreciate the art of the typeset page. I have a letterpress and several drawers of type in my basement. I also appreciate that sometimes information and its presentation are inextricably linked. But that is not often the case.
As a consumer, I would almost always be happier if you simply published your information in reasonably well-structured (X)HTML with a little CSS styling to establish whatever look and feel you want. (No Flash, no Silverlight, no PDF; just the content, please.) If you've got your information in some more structured XML, I'd probably like access to that too, but publishing arbitrary XML on the web seems not to have taken off.
The reason is simple: PDF gives all the formatting control to the author. I'm the reader and I want the control. I want the freedom to flow the text differently to fit on my wide or narrow screen or handheld device, or convert it into an audio format, or do something else that will make the information more useful to me.
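As a rough illustration of what that control buys a reader, here is a minimal Python sketch that re-wraps the same paragraph for a wide and a narrow display. It assumes the text is available as plain characters rather than positioned glyphs on a page; the `reflow` helper and the sample string are hypothetical, made up for this example.

```python
import textwrap


def reflow(paragraph: str, width: int) -> str:
    """Re-wrap a paragraph to `width` columns, ignoring the original line breaks."""
    # Collapse all existing whitespace, including the author's hard line
    # breaks, so that only the words themselves remain.
    words = paragraph.split()
    # Let the reader's chosen width, not the author's, decide where lines end.
    return textwrap.fill(" ".join(words), width=width)


# Hypothetical sample text with line breaks fixed by the "author".
sample = (
    "I want the freedom to flow the text\n"
    "differently to fit on my wide or narrow\n"
    "screen or handheld device."
)

if __name__ == "__main__":
    for width in (72, 30):
        print(reflow(sample, width))
        print("-" * width)
```

The same words come out either way; only the line breaks change. That is trivial when the source is structured text and much harder when the source is a fixed page image.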
I think the assertion that “enough auxiliary information has been defined for PDF files such that a well written program can extract the text content from PDF pages” only tells half the story. It's clear from my experience with PDF files in the wild that there are a lot of PDF generators out there that aren't writing enough auxiliary information.
King's conclusion is that “The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.”
I'll grant that the virtue of PDF is that it allows me to publish page images in a more compressed and flexible format than bitmaps; it's just not a virtue that holds very much appeal for me. I don't think I'd go so far as to say that it's worth the price of obscured content, but what really worries me is the implication that you might put content into PDF intending to get the text back out.
I understand the convenience of PDF for publishing page images. I sincerely wish that there were a free, fully conformant XSL FO processor. I think that'd be a big boon for a lot of people.
But for the love of whatever you hold dear, please don't use PDF as an archival format for your content! You want to hold onto originals with richer, more accessible structure. Ideally, you want them to be stored in a widely available XML format. But you knew I was going to say that.