Thoughts on producing quality printed output; specifically, a nice printed version of Architecture of the World Wide Web. [Update: added a pointer to the Recommendation PDF.]
[WebArch is now a Recommendation. I've written a short essay about that which includes pointers to a slightly modified stylesheet and to the resulting PDF.]
I read a lot of specifications. Most of the time, I read them online. I know a few folks who assiduously avoid paper altogether, but I am not one of those people. For detailed review of a spec, I print it out and read it with a red pen in hand.
This brings me to an obvious point, one I hardly need to make in this crowd: web browsers suck at printing. Never mind the fact that some browsers do a better job than others, they all suck. And CSS is never going to fix it. Did you hear me? CSS is never going to fix it. There are lots of programs that can produce more or less nice-looking pages. TeX is an historical favorite, as is troff. More modern tools include various desktop publishing packages. In the XML world, the obvious tool is XSL, the Extensible Style Language, not the Transformation language.
It's important to realize, however, that XSL is an incomplete answer. You see, XSL is a constraint language. In XSL, you can specify how large the pages are, how many columns they have, the sizes of fonts, and myriad other parameters. What you don't specify directly is where the page breaks necessarily occur, or which words get hyphenated, or where exactly any of the actual marks are going to wind up on paper.
The XSL Formatting Objects (FO) document is input to a formatter, a composition tool that renders marks on paper, typically these days in the form of a PDF file. Producing quality printed output is devilishly hard. Of all the various sorts of software systems I've encountered, a formatter is hands down the hardest to implement well.
There are several commercial formatters out there that do an adequate job. There are also a few free formatters that do a somewhat less adequate job. I desperately wish the quality of the free formatters would improve, but see the previous paragraph.
So where does all this lead? For a start, it leads to Architecture of the World Wide Web. As one of the editors of that document, and as a long-time participant in the design of XSL, I really wanted to be able to render it on paper in a reasonably professional-looking form with XSL.
To that end, I crafted an XSLT Stylesheet that would transform the XHTML of the specification into XSL FO so that I could produce PDF with xep. Herewith a few notes on that process.
Formatting the WebArch document
The WebArch document is authored in a dialect of XHTML. I say dialect because although its original sources are valid XHTML, they aren't quite the same XHTML that gets presented in the final specification. A series of transformations is applied to the sources. In order to produce the PDF, I decided to start with the transformed XHTML version of the specification, the document you view, not the original sources.
In principle, transforming to XSL Formatting Objects is as straightforward as any other transformation. Starting with XHTML, you can see that most of the block structures are going to get transformed to <fo:block>s and most of the inline structures are going to get transformed to <fo:inline>s.
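To make that mapping concrete, here is a minimal sketch of the kind of templates involved. It is not taken from the actual stylesheet: the `h` prefix, the choice of `p` and `em` as examples, and the spacing and font properties are all assumptions made for illustration.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">

  <!-- A block-level XHTML element becomes an fo:block.
       The spacing values are illustrative only. -->
  <xsl:template match="h:p">
    <fo:block space-before="0.5em" space-after="0.5em">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- An inline XHTML element becomes an fo:inline. -->
  <xsl:template match="h:em">
    <fo:inline font-style="italic">
      <xsl:apply-templates/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>
```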
The tricky part is that FO documents have a fair bit of preamble at the front. The preamble is where you tell the formatter the size and shape of each page; you have to create a template, called a “master”, for each kind of page that will appear in your document. If you've never thought about composition in these terms, it may be a little hard to get your head around it.
Setting up the page masters
If you have a book nearby, pick it up and flip through it. While every page is probably different, odds are good that you will be able to find four different page layouts in the body of the book. First, left-hand (even numbered or “verso”) pages probably differ from right-hand (odd numbered or “recto”) pages. Look at the headers and footers: they are often mirror images of each other with, for example, page numbers in the outer corners of each page. Close inspection will probably also reveal that the margin on the “binding edge” of the page is a little wider than the margin on the other side. In many books, the first page of each chapter or section is different from both the left- and right-hand pages, perhaps having different or absent running headers or footers. The fourth layout style is for blank pages, if there are any. It is common for all chapters to begin on an odd page, so if a chapter ends on an odd page, a blank “even” page is inserted to force the next chapter to also begin on an odd page. Like the first page of a chapter, the blank page is often distinguished from other even pages by different or absent headers and footers. This is also the page that is sometimes annotated “This page intentionally left blank”.
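In FO terms, the mirrored margins might look something like the following sketch. The master names (`body-odd`, `body-even`) and all of the dimensions are assumptions for illustration, not values copied from the stylesheet.

```xml
<fo:layout-master-set xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Right-hand (odd, recto) page: the binding edge is on the left,
       so the left margin is the wider one. -->
  <fo:simple-page-master master-name="body-odd"
                         page-width="8.5in" page-height="11in"
                         margin-top="0.75in" margin-bottom="0.75in"
                         margin-left="1.25in" margin-right="1in">
    <fo:region-body/>
  </fo:simple-page-master>

  <!-- Left-hand (even, verso) page: the mirror image, with the wider
       margin on the right. -->
  <fo:simple-page-master master-name="body-even"
                         page-width="8.5in" page-height="11in"
                         margin-top="0.75in" margin-bottom="0.75in"
                         margin-left="1in" margin-right="1.25in">
    <fo:region-body/>
  </fo:simple-page-master>

</fo:layout-master-set>
```

Masters for the first page and for blank pages would follow the same pattern, differing mostly in which header and footer regions they define.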
Each of these page layouts is defined by a “master” with a specific name. After all the individual page masters have been created, you have to create a page sequence master. In XSL FO terms, a document consists of one or more page sequences. Each page sequence has a master that is a collection of individual page masters. For the WebArch document, there's only one page sequence, but in a book there might be different sequences for front matter, body, and back matter.
For WebArch, the page sequence master defines a master for the first page, for odd pages, for even pages, and for blank pages.
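A page sequence master along those lines could be sketched as follows. The master names are hypothetical and the individual `fo:simple-page-master` definitions they refer to are omitted; the real stylesheet may well organize this differently.

```xml
<fo:layout-master-set xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- The four fo:simple-page-master definitions (body-first, body-odd,
       body-even, body-blank) are assumed to appear here. -->

  <fo:page-sequence-master master-name="body">
    <fo:repeatable-page-master-alternatives>
      <!-- Alternatives are tried in order and the first match is used,
           so the more specific conditions come first. -->
      <fo:conditional-page-master-reference master-reference="body-blank"
                                            blank-or-not-blank="blank"/>
      <fo:conditional-page-master-reference master-reference="body-first"
                                            page-position="first"/>
      <fo:conditional-page-master-reference master-reference="body-odd"
                                            odd-or-even="odd"/>
      <fo:conditional-page-master-reference master-reference="body-even"
                                            odd-or-even="even"/>
    </fo:repeatable-page-master-alternatives>
  </fo:page-sequence-master>

</fo:layout-master-set>
```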
Setting up the headers and footers
We're now almost ready to start generating FO markup for the document content, but there's one more little hurdle. Every FO page has five regions, the main body region in the center where the document goes, and four more regions around the edges for top, bottom, left, and right material. The top and bottom regions are used for headers and footers. If you look at the stylesheet, you'll see that each of these regions in each page master has a name. As soon as we've started a page sequence, we'll refer to these regions by name and fill in their content. The formatter will use this “static content” in the appropriate place on each page. The content is static in the sense that content from the document doesn't “flow” into it. It can change on a per-page basis, as we'll see.
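As a rough illustration of named regions, a page master might be declared like this. The region names (`header-odd`, `footer-odd`), the master name, and the extents are assumptions; regions that are given no `region-name` fall back to defaults such as `xsl-region-body` and `xsl-region-before`.

```xml
<fo:simple-page-master xmlns:fo="http://www.w3.org/1999/XSL/Format"
                       master-name="body-odd"
                       page-width="8.5in" page-height="11in"
                       margin-top="0.5in" margin-bottom="0.5in"
                       margin-left="1.25in" margin-right="1in">
  <!-- The body margins leave room for the header and footer regions. -->
  <fo:region-body margin-top="0.5in" margin-bottom="0.5in"/>
  <!-- Top and bottom regions, named so static content can target them. -->
  <fo:region-before region-name="header-odd" extent="0.4in"/>
  <fo:region-after region-name="footer-odd" extent="0.4in"/>
  <!-- Left and right regions, unused in this sketch. -->
  <fo:region-start extent="0in"/>
  <fo:region-end extent="0in"/>
</fo:simple-page-master>
```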
Without going into a lot of detail here, if you look in the stylesheet, you'll see that I use tables to format the running headers and footers, placing the page numbers, for example, on the left side of left pages and the right side of right pages. Some masters, like the first page, have empty headers and/or empty footers.
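A table-based footer for left-hand pages might look roughly like this. It is a sketch, not the markup from the stylesheet: the region name `footer-even`, the column proportions, and the choice of the document title as the second cell are all assumptions.

```xml
<fo:static-content xmlns:fo="http://www.w3.org/1999/XSL/Format"
                   flow-name="footer-even">
  <fo:table table-layout="fixed" width="100%">
    <fo:table-column column-width="proportional-column-width(1)"/>
    <fo:table-column column-width="proportional-column-width(4)"/>
    <fo:table-body>
      <fo:table-row>
        <!-- On a left-hand page the page number sits at the outer
             (left) edge. -->
        <fo:table-cell>
          <fo:block text-align="start"><fo:page-number/></fo:block>
        </fo:table-cell>
        <fo:table-cell>
          <fo:block text-align="end">Architecture of the World Wide Web</fo:block>
        </fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:static-content>
```

The footer for right-hand pages would swap the two cells so that the page number lands on the right.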
At this point we can “apply templates” on the body and our FO document will come out.
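Put together, the skeleton of that step might be sketched like this. The match pattern, the `body` master name, and the `footer-odd` region name are assumptions; `xsl-region-body` is the default name of the body region when none is given.

```xml
<xsl:template match="h:body"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:fo="http://www.w3.org/1999/XSL/Format"
              xmlns:h="http://www.w3.org/1999/xhtml">
  <fo:page-sequence master-reference="body">

    <!-- Static content comes first and is matched to a region by name. -->
    <fo:static-content flow-name="footer-odd">
      <fo:block text-align="end"><fo:page-number/></fo:block>
    </fo:static-content>

    <!-- The flow carries the document itself into the body region. -->
    <fo:flow flow-name="xsl-region-body">
      <xsl:apply-templates/>
    </fo:flow>

  </fo:page-sequence>
</xsl:template>
```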
Also of note
Two other parts of the stylesheet are perhaps notable: PDF bookmarks and the use of markers. Bookmarks will be a standard feature of XSL 1.1, but for the moment, I'm relying on a xep extension. Markers are more interesting.
Markers provide a mechanism for adjusting the running headers and footers as you progress through a document. Think of the way that headers and footers change as you flip through a dictionary: markers let you do that.
For WebArch, I decided to put the current first- or second-level section in the footer of each page. That way you can tell just where you are. It may prove to be more distracting than useful, but I figured there'd be no way to tell without trying it.
Markers are easy to use. Whenever you output content that should appear in a header or footer, you output an <fo:marker> for it. In the static content for the appropriate header or footer, you use <fo:retrieve-marker>. The formatter will replace the <fo:retrieve-marker> with the content of the appropriate <fo:marker>.
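Here is a minimal sketch of the two halves working together. The marker class name `section-title`, the `h:h2` match, the named template, and the font properties are hypothetical. The actual stylesheet may use different names and retrieval settings.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">

  <!-- In the flow: each section title also records itself as a marker.
       fo:marker has to come before any other content of its parent. -->
  <xsl:template match="h:h2">
    <fo:block font-size="14pt" font-weight="bold">
      <fo:marker marker-class-name="section-title">
        <xsl:value-of select="."/>
      </fo:marker>
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- In the static content: pull in whichever marker applies to the
       page being composed. -->
  <xsl:template name="footer-content">
    <fo:block text-align="start">
      <fo:retrieve-marker retrieve-class-name="section-title"
                          retrieve-position="first-including-carryover"
                          retrieve-boundary="page-sequence"/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>
```

With `first-including-carryover`, a page on which no new section starts shows the section carried over from earlier pages.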
Careful inspection of the stylesheet will reveal several places where I've taken care to avoid an obvious compositional faux pas like leaving a single list item on the top or bottom of a page or allowing a page break to occur immediately after a section title.
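One common way to express that kind of care in FO is with keep properties. The sketch below is an assumption about how it could be done, not a copy of what the stylesheet does, and it presumes a separate template that wraps the items in an `fo:list-block`.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h">

  <!-- Never allow a page break immediately after a section title. -->
  <xsl:template match="h:h2">
    <fo:block font-weight="bold" keep-with-next.within-page="always">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <xsl:template match="h:ul/h:li">
    <fo:list-item>
      <!-- Tie the first item to the second so it is not stranded at
           the bottom of a page. -->
      <xsl:if test="not(preceding-sibling::h:li) and following-sibling::h:li">
        <xsl:attribute name="keep-with-next.within-page">always</xsl:attribute>
      </xsl:if>
      <!-- Tie the last item to the one before it so it is not stranded
           at the top of a page. -->
      <xsl:if test="not(following-sibling::h:li) and preceding-sibling::h:li">
        <xsl:attribute name="keep-with-previous.within-page">always</xsl:attribute>
      </xsl:if>
      <fo:list-item-label end-indent="label-end()">
        <fo:block>&#x2022;</fo:block>
      </fo:list-item-label>
      <fo:list-item-body start-indent="body-start()">
        <fo:block><xsl:apply-templates/></fo:block>
      </fo:list-item-body>
    </fo:list-item>
  </xsl:template>

</xsl:stylesheet>
```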
Readers with design skill could certainly improve the presentation.
Getting the bits
If you want to play with it, you can get both the document and the stylesheet from the W3C site. I've also got a local copy of the stylesheet.
Note that it's designed to format exactly the WebArch document; it is not a general-purpose HTML stylesheet. But you might be able to turn it into one by adding the appropriate templates. I've only provided templates for exactly the XHTML elements used in WebArch.
If you want to see the results on A4 paper, you can simply set the paper.type parameter to “A4”.
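A parameter like that is typically wired up along the following lines. This is a simplified sketch in the spirit of the DocBook stylesheets. The `page.width` and `page.height` parameter names, the `USletter` default, and the named template are assumptions here rather than details confirmed by the stylesheet.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Assumed default; override it from the command line or an
       importing stylesheet. -->
  <xsl:param name="paper.type" select="'USletter'"/>

  <!-- Page dimensions are derived from the paper type. -->
  <xsl:param name="page.width">
    <xsl:choose>
      <xsl:when test="$paper.type = 'A4'">210mm</xsl:when>
      <xsl:otherwise>8.5in</xsl:otherwise>
    </xsl:choose>
  </xsl:param>

  <xsl:param name="page.height">
    <xsl:choose>
      <xsl:when test="$paper.type = 'A4'">297mm</xsl:when>
      <xsl:otherwise>11in</xsl:otherwise>
    </xsl:choose>
  </xsl:param>

  <!-- The page masters then pick up the computed dimensions. -->
  <xsl:template name="page-master-odd">
    <fo:simple-page-master master-name="body-odd"
                           page-width="{$page.width}"
                           page-height="{$page.height}">
      <fo:region-body/>
    </fo:simple-page-master>
  </xsl:template>

</xsl:stylesheet>
```

With xsltproc, for example, the parameter can be passed as `--stringparam paper.type A4`; other XSLT processors have their own equivalent.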
The page master markup is a simplified copy of the markup from the DocBook stylesheets. I've preserved many but, in the interest of simplicity, not all of the parameters.
Share and enjoy.
For the pedantic, a book written in a language presented left-to-right and top-to-bottom. Books with other orientations or writing directions will likely show analogous variation, though that is by no means true for all languages in all writing directions.
No, there won't actually be any blank pages, but I defined the master anyway. It doesn't do any harm.