webarch.pdf
Thoughts on producing quality printed output; specifically, a nice printed version of Architecture of the World Wide Web. [Update: added a pointer to the Recommendation PDF.]
[WebArch is now a Recommendation. I've written a short essay about that which includes pointers to a slightly modified stylesheet and to the resulting PDF.]
I read a lot of specifications. Most of the time, I read them online. I know a few folks who assiduously avoid paper altogether, but I am not one of those people. For detailed review of a spec, I print it out and read it with a red pen in hand.
This brings me to an obvious point, one I hardly need to make in this crowd: web browsers suck at printing. Never mind the fact that some browsers do a better job than others; they all suck. And CSS is never going to fix it. Did you hear me? CSS is never going to fix it.
There are lots of programs that can produce more or less nice-looking pages. TeX is an historical favorite, as is troff. More modern tools include various desktop publishing packages. In the XML world, the obvious tool is XSL, the Extensible Stylesheet Language (the formatting language, not the transformation language).
It's important to realize, however, that XSL is an incomplete answer. You see, XSL is a constraint language. In XSL, you can specify how large the pages are, how many columns they have, the sizes of fonts, and myriad other parameters. What you don't specify directly is where the page breaks occur, which words get hyphenated, or exactly where any of the actual marks wind up on paper.
The XSL Formatting Objects (FO) document is input to a formatter, a composition tool that renders marks on paper, typically these days in the form of a PDF file. Producing quality printed output is devilishly hard. Of all the various sorts of software systems I've encountered, a formatter is hands down the hardest to implement well.
There are several commercial formatters out there that do an adequate job. There are also a few free formatters that do a somewhat less adequate job. I desperately wish the quality of the free formatters would improve, but see the previous paragraph.
So where does all this lead? For a start, it leads to Architecture of the World Wide Web. As one of the editors of that document, and as a long time participant in the design of XSL, I really wanted to be able to render it on paper in a reasonably professional looking form with XSL.
To that end, I crafted an XSLT Stylesheet that would transform the XHTML of the specification into XSL FO so that I could produce PDF with xep. Herewith a few notes on that process.
Formatting the WebArch document
The WebArch document is authored in a dialect of XHTML. I say dialect because although its original sources are valid XHTML, they aren't quite the same XHTML that gets presented in the final specification. A series of transformations is applied to the sources. In order to produce the PDF, I decided to start with the transformed XHTML version of the specification, the document you view, not the original sources.
In principle, transforming to XSL Formatting Objects is as straightforward as any other transformation. Starting with XHTML, you can see that most of the block structures are going to get transformed to fo:blocks and most of the inline structures are going to get transformed to fo:inlines.
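To make that concrete, here is a minimal sketch of the block-to-fo:block and inline-to-fo:inline mapping. It is not taken from the WebArch stylesheet; the choice of html:p and html:em, the html prefix, and the spacing and font properties are assumptions made purely for illustration.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html">

  <!-- A block-level XHTML element becomes an fo:block.
       The spacing values are illustrative assumptions. -->
  <xsl:template match="html:p">
    <fo:block space-before="0.5em" space-after="0.5em">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- An inline XHTML element becomes an fo:inline. -->
  <xsl:template match="html:em">
    <fo:inline font-style="italic">
      <xsl:apply-templates/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>
```

A real stylesheet needs one such template for every element the source document uses, plus the preamble described next.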
The tricky part is that FO documents have a fair bit of preamble at the front. The preamble is where you tell the formatter the size and shape of each page; you have to create a template, called a “master”, for each kind of page that will appear in your document. If you've never thought about composition in these terms, it may be a little hard to get your head around it.
Setting up the page masters
If you have a book nearby, pick it up and flip through it. (For the pedantic, I mean a book written in a language presented left-to-right and top-to-bottom. Books with other orientations or writing directions will likely show analogous variation, though that is by no means true for all languages in all writing directions.) While every page is probably different, odds are good that you will be able to find four different page layouts in the body of the book. First, left-hand (even numbered or “verso”) pages probably differ from right-hand (odd numbered or “recto”) pages. Look at the headers and footers; they are often mirror images of each other with, for example, page numbers in the outer corners of each page. Close inspection will probably also reveal that the margin on the “binding edge” of the page is a little wider than the margin on the other side. In many books, the first page of each chapter or section is different from both the left- and right-hand pages, perhaps having different or absent running headers or footers. The fourth layout style is for blank pages, if there are any. It is common for all chapters to begin on an odd page, so if a chapter ends on an odd page, then a blank “even” page is inserted to force the next chapter to also begin on an odd page. Like the first page of a chapter, the blank page is often distinguished from other even pages by different or absent headers and footers. This is also the page that is sometimes annotated “This page intentionally left blank”.
Each of these page layouts is defined by a “master” with a specific name. After all the individual page masters have been created, you have to create a page sequence master. In XSL FO terms, a document consists of one or more page sequences. Each page sequence has a master that is a collection of individual page masters. For the WebArch document, there's only one page sequence, but in a book there might be different sequences for front matter, body, and back matter.
For WebArch, the page sequence master defines a master for the first page, for odd pages, for even pages, and for blank pages. (No, there won't actually be any blank pages, but I defined the master anyway. It doesn't do any harm.)
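A set of page masters along these lines might look roughly like the sketch below. This is not the markup from the WebArch stylesheet: the master names, region names, US Letter page size, and margin values are all hypothetical, chosen only to show the shape of the thing.

```xml
<fo:layout-master-set xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- First page: a right-hand page, so the wider (binding) margin is on the left.
       All names and dimensions here are illustrative assumptions. -->
  <fo:simple-page-master master-name="first"
      page-width="8.5in" page-height="11in"
      margin-top="0.5in" margin-bottom="0.5in"
      margin-left="1.25in" margin-right="1in">
    <fo:region-body margin-top="0.75in" margin-bottom="0.75in"/>
    <fo:region-before region-name="header-first" extent="0.5in"/>
    <fo:region-after region-name="footer-first" extent="0.5in"/>
  </fo:simple-page-master>

  <!-- Odd (recto) pages: binding edge on the left. -->
  <fo:simple-page-master master-name="odd"
      page-width="8.5in" page-height="11in"
      margin-top="0.5in" margin-bottom="0.5in"
      margin-left="1.25in" margin-right="1in">
    <fo:region-body margin-top="0.75in" margin-bottom="0.75in"/>
    <fo:region-before region-name="header-odd" extent="0.5in"/>
    <fo:region-after region-name="footer-odd" extent="0.5in"/>
  </fo:simple-page-master>

  <!-- Even (verso) pages: mirror image, binding edge on the right. -->
  <fo:simple-page-master master-name="even"
      page-width="8.5in" page-height="11in"
      margin-top="0.5in" margin-bottom="0.5in"
      margin-left="1in" margin-right="1.25in">
    <fo:region-body margin-top="0.75in" margin-bottom="0.75in"/>
    <fo:region-before region-name="header-even" extent="0.5in"/>
    <fo:region-after region-name="footer-even" extent="0.5in"/>
  </fo:simple-page-master>

  <!-- Blank pages: same geometry as even pages, but separately named regions
       so their headers and footers can be left empty. -->
  <fo:simple-page-master master-name="blank"
      page-width="8.5in" page-height="11in"
      margin-top="0.5in" margin-bottom="0.5in"
      margin-left="1in" margin-right="1.25in">
    <fo:region-body margin-top="0.75in" margin-bottom="0.75in"/>
    <fo:region-before region-name="header-blank" extent="0.5in"/>
    <fo:region-after region-name="footer-blank" extent="0.5in"/>
  </fo:simple-page-master>

  <!-- The page sequence master: the formatter uses the first alternative
       whose conditions match the page being generated. -->
  <fo:page-sequence-master master-name="body">
    <fo:repeatable-page-master-alternatives>
      <fo:conditional-page-master-reference master-reference="blank"
          blank-or-not-blank="blank"/>
      <fo:conditional-page-master-reference master-reference="first"
          page-position="first"/>
      <fo:conditional-page-master-reference master-reference="odd"
          odd-or-even="odd"/>
      <fo:conditional-page-master-reference master-reference="even"
          odd-or-even="even"/>
    </fo:repeatable-page-master-alternatives>
  </fo:page-sequence-master>

</fo:layout-master-set>
```

The alternatives are tried in order, which is why the more specific conditions (blank, first) come before the general odd and even cases.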
Setting up the headers and footers
We're now almost ready to start generating FO markup for the document content, but there's one more little hurdle. Every FO page has five regions: the main body region in the center where the document goes, and four more regions around the edges for top, bottom, left, and right material. The top and bottom regions are used for headers and footers. If you look at the stylesheet, you'll see that each of these regions in each page master has a name. As soon as we've started a page sequence, we'll refer to these regions by name and fill in their content. The formatter will use this “static content” in the appropriate place on each page. The content is static in the sense that content from the document doesn't “flow” into it. It can change on a per-page basis, as we'll see.
Without going into a lot of detail here, if you look in the stylesheet, you'll see that I use tables to format the running headers and footers, placing the page numbers, for example, on the left side of left pages and the right side of right pages. Some masters, like the first page, have empty headers and/or empty footers.
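In outline, filling named regions with static content looks something like the following sketch. It assumes a page sequence master named body and footer regions named footer-even and footer-odd, all of which are hypothetical names. For brevity it positions the page number with text-align rather than with the tables that the real stylesheet uses.

```xml
<fo:page-sequence xmlns:fo="http://www.w3.org/1999/XSL/Format"
                  master-reference="body">

  <!-- Footer for left-hand pages: page number on the left (outer) side.
       The flow-name must match a region-name declared in a page master. -->
  <fo:static-content flow-name="footer-even">
    <fo:block text-align="start"><fo:page-number/></fo:block>
  </fo:static-content>

  <!-- Footer for right-hand pages: page number on the right (outer) side. -->
  <fo:static-content flow-name="footer-odd">
    <fo:block text-align="end"><fo:page-number/></fo:block>
  </fo:static-content>

  <!-- Regions that get no static content (a first-page header, say)
       simply stay empty. -->

  <!-- The document content flows into the body region, which is named
       xsl-region-body by default. -->
  <fo:flow flow-name="xsl-region-body">
    <fo:block>Document content goes here.</fo:block>
  </fo:flow>

</fo:page-sequence>
```

All of the fo:static-content elements have to come before the fo:flow.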
At this point we can “apply templates” on the body and our FO document will come out.
Also of note
Two other parts of the stylesheet are perhaps notable: PDF bookmarks and the use of markers. Bookmarks will be a standard feature of XSL 1.1, but for the moment, I'm relying on a xep extension. Markers are more interesting.
Markers provide a mechanism for adjusting the running headers and footers as you progress through a document. Think of the way that headers and footers change as you flip through a dictionary: markers let you do that.
For WebArch, I decided to put the current first- or second-level section in the footer of each page. That way you can tell just where you are. It may prove to be more distracting than useful, but I figured there'd be no way to tell without trying it.
Markers are easy to use. Whenever you output content that should appear in a header or footer, you output an fo:marker. In the static content for the appropriate header or footer, you use fo:retrieve-marker. The formatter will replace the fo:retrieve-marker with the appropriate fo:marker.
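Here is a small sketch of the two halves working together. The marker class name section.title, the region name footer-odd, and the master name body are hypothetical, and the retrieve-position and retrieve-boundary values are just one reasonable choice, not necessarily the one the stylesheet makes.

```xml
<fo:page-sequence xmlns:fo="http://www.w3.org/1999/XSL/Format"
                  master-reference="body">

  <!-- In the footer, ask for whichever marker of this class applies to the page. -->
  <fo:static-content flow-name="footer-odd">
    <fo:block text-align="end">
      <fo:retrieve-marker retrieve-class-name="section.title"
                          retrieve-position="first-including-carryover"
                          retrieve-boundary="page-sequence"/>
    </fo:block>
  </fo:static-content>

  <fo:flow flow-name="xsl-region-body">
    <!-- In the flow, each section heading carries a marker with its title.
         An fo:marker must come first among its parent's children. -->
    <fo:block font-weight="bold">
      <fo:marker marker-class-name="section.title">Section title</fo:marker>
      Section title
    </fo:block>
    <fo:block>Section text.</fo:block>
  </fo:flow>

</fo:page-sequence>
```

The marker's content is not rendered where it appears in the flow; it only shows up where it is retrieved.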
                        
Careful inspection of the stylesheet will reveal several places where I've taken care to avoid an obvious compositional faux pas like leaving a single list item on the top or bottom of a page or allowing a page break to occur immediately after a section title.
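The usual tool for this kind of thing is the family of keep properties. The sketch below is not copied from the stylesheet; it shows one plausible way to keep a title with the text that follows it and to avoid stranding the first or last item of a list on a page by itself.

```xml
<fo:flow xmlns:fo="http://www.w3.org/1999/XSL/Format"
         flow-name="xsl-region-body">

  <!-- Never allow a page break immediately after a section title. -->
  <fo:block font-weight="bold" keep-with-next.within-page="always">Section title</fo:block>
  <fo:block>First paragraph of the section.</fo:block>

  <fo:list-block>
    <!-- Keep the first item with the second, so it can't sit alone
         at the bottom of a page. -->
    <fo:list-item keep-with-next.within-page="always">
      <fo:list-item-label end-indent="label-end()"><fo:block>1.</fo:block></fo:list-item-label>
      <fo:list-item-body start-indent="body-start()"><fo:block>First item</fo:block></fo:list-item-body>
    </fo:list-item>
    <fo:list-item>
      <fo:list-item-label end-indent="label-end()"><fo:block>2.</fo:block></fo:list-item-label>
      <fo:list-item-body start-indent="body-start()"><fo:block>Middle item</fo:block></fo:list-item-body>
    </fo:list-item>
    <!-- Keep the last item with the one before it, so it can't sit alone
         at the top of a page. -->
    <fo:list-item keep-with-previous.within-page="always">
      <fo:list-item-label end-indent="label-end()"><fo:block>3.</fo:block></fo:list-item-label>
      <fo:list-item-body start-indent="body-start()"><fo:block>Last item</fo:block></fo:list-item-body>
    </fo:list-item>
  </fo:list-block>

</fo:flow>
```

These are constraints on the formatter, not commands: how well they are honored depends on the formatter you use.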
Readers with design skill could certainly improve the presentation.
Getting the bits
If you want to play with it, you can get both the document and the stylesheet from the W3C site. I've also got a local copy of the stylesheet.
Note that it's designed to format exactly the WebArch document; it is not a general-purpose HTML stylesheet. But you might be able to turn it into one by adding the appropriate templates. I've only provided templates for exactly the XHTML elements used in WebArch.
If you want to see the results on A4 paper, you can simply set the paper.type parameter to “A4”.
The page master markup is a simplified copy of the markup from the DocBook stylesheets. I've preserved many, but in the interest of simplicity, not all of the parameters.
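For readers who haven't seen how a parameter like paper.type typically drives the page geometry, here is a rough sketch. Only the parameter name paper.type comes from the stylesheet; the default value and the variable names page.width and page.height are assumptions for illustration.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Can be overridden when the transformation is run.
       The default value here is an assumption. -->
  <xsl:param name="paper.type" select="'USletter'"/>

  <!-- Hypothetical variables derived from the paper type. -->
  <xsl:variable name="page.width">
    <xsl:choose>
      <xsl:when test="$paper.type = 'A4'">210mm</xsl:when>
      <xsl:otherwise>8.5in</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <xsl:variable name="page.height">
    <xsl:choose>
      <xsl:when test="$paper.type = 'A4'">297mm</xsl:when>
      <xsl:otherwise>11in</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

</xsl:stylesheet>
```

The page masters would then use these values through attribute value templates, for example page-width="{$page.width}".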
Share and enjoy.
Comments
Many, many thank-yous for the lucid comments about fully-automated typography. I have tried to make similar points before (e.g. <http://cavlec.yarinareth.net/archives/2002/11/27/typesetters-are-not-machines/>), but you did it better. If you could expand that bit into an essay and get it published somewhere a lot of techies could read it and I could cite it, I would consider that a personal favor!
Because, yipes, I have had people coming to me wanting to "convert XML to PDF" (by which, of course, they meant beautifully-typeset pages), and I have wanted to wring their necks.
Speaking of quality printed output and formatting hardship. Take a look at the sidebar box in the pdf of this article. It's wide enough to start covering a bit of the letters in the 2nd column.
Great post Norm! The bit about odd and even numbered pages came as quite a revelation ;-)
How portable is the XSL? Or rather how standardised are the W3C pages? - I've been meaning to get certain other specs in hard copy for ages, but the default print CSS seems either to eat trees or be illegible.
Re. annotations - maybe run an Annotea server locally, with bookmarklets to aid posting? (Given FireFox has all that RDF underneath, there may even be a more direct way of supporting this, maybe already done...)
Next mission (should you choose to accept it): decent print version of RFCs on A4. This seems like a Holy Grail - I've heard it requested loads of times, never seen it done at all well.
Next mission++ : WebArch in RDF/XML. You know you want to ;-)
People interested in learning more about advanced use of XSL-FO (and XSLT!) stylesheets should check out Norm's XSLT stylesheets for converting DocBook to XSL-FO at http://docbook.sourceforge.net/projects/xsl/.
Re: Sidebar. Doesn't Amaya do annotation? It works offline, lets you save them, and the little web browser seems to do almost everything you describe. Granted, her UI sucks, though.
I do hope that element is retrieve-marker, or someone is going to have to Pay The Price.
http://www.w3.org/2001/tag/webarch/html2fo.xsl gives me a 403 Forbidden (although your local copy link works). Unless I missed it, you haven't got a link to the generated PDF?
"Did you hear me? CSS is never going to fix it."
That sounds like a challenge to me :)
If the specification you are trying to print is written in XHTML, why not try printing it to PDF using Prince?
Prince supports headers/footers and duplex printing, with no XSLT transform required and no XSL-FO, just CSS all the way.
Thanks for writing this up, Norm. But it's a bit uneven. We get links to the XSL and XSLT specs, as if we needed help finding those, but then you fly by "... so that I could produce PDF with xep" as if xep were a household word. You seem to mostly use open source tools, so I expected a bit of explanation. I spent a few hours trying to get FOP (xml.apache.org/fop/) to work. Is there any hope? Is there any competition for xep?
GREAT!!
I was wondering for months why nobody had wrote an XSLT stylesheet for XHTML to FO. Wish somebody try to finish it to cover the full standard.
BTW passivetex-1.25-2 had some problems to get a finished pdf.
As usual, thanks Norm, you are great.
I was wondering for months why nobody had wrote an XSLT stylesheet for XHTML to FO. Wish somebody try to finish it to cover the full standard.
Antenna House have had one for ages (years?) It might be interesting to compare that with Norm's tuned one on this particular document....
http://www.google.com/search?hl=en&q=xhtml2fo
Here is another excellent article that was very useful to me when trying to devise a mostly XHTML - to - FO transformer stylesheet: HTML to Formatting Objects (FO) conversion guide, by Doug Tidwell.
I found this tool (http://www.re.be/css2xslfo/) on the net to convert XHTML with CSS to PDF. Are there any other easy ways to print an XHTML document to paper?
Wonderful article. You should find out if you can get it printed an official essay somewhere! And @Boda: Do you know of any other tools by which you can convert XML to PDF files?