Stylesheet organization

Volume 11, Issue 001; 01 Jan 2008; last modified 08 Oct 2010

The XSLT 2.0 stylesheets for DocBook are broken. They have been for a while, but I think maybe I've figured out how to fix them.

There's a pervasive complexity in the XSLT 1.0 stylesheets. It's not a bad or unnecessary complexity; it's caused, at least in part, by the fact that there's a lot of flexibility in DocBook.

A small example: when a chapter is formatted, the chapter will have a title and might have a subtitle (among other things); these elements each individually might or might not be inside an info element. For some elements, like bibliography, the situation is even more complicated because the title is optional (if not present, a locale-specific default title must be used).

In designing the XSLT 2.0 stylesheets for DocBook, I wanted to factor out some of this complexity. The result is a multi-phase transformation:

  1. The first phase adds an xml:base attribute to the root element so that URI resolution in subsequent phases will know the correct base URI.

    It resolves entityref attributes into fileref attributes because subsequent phases won't have access to the entity declarations in the original document.

    If the root element does not declare the DocBook namespace, it moves all elements in no namespace into the DocBook namespace. This allows the XSLT 2.0 stylesheets to format DocBook V4.x documents; in fact, not only are the elements moved into the DocBook namespace, but they're transformed a bit too, so that they (are more likely to) conform to DocBook V5.x. This is only a convenience; the XSLT 2.0 stylesheets aren't guaranteed to format DocBook V4.x correctly.

  2. The second phase applies profiling.

  3. The third phase normalizes markup. This is the phase that reduces the complexity described above: info elements are added uniformly (so all elements that can be inside or outside of an info are always inside), defaulted titles are made explicit, and a number of other changes are made to make the markup more regular.
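
The three phases above can be sketched in a single XSLT 2.0 stylesheet by holding each intermediate result in a variable and processing it in its own mode. This is only an illustration under stated assumptions: the ex: namespace, the mode names, and the hard-coded "Bibliography" title are placeholders, not names taken from the DocBook stylesheets, and the profiling phase is reduced to an identity copy.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:db="http://docbook.org/ns/docbook"
                xmlns:ex="http://example.com/ns/sketch"
                exclude-result-prefixes="ex"
                version="2.0">

  <!-- The root template runs the phases; each result feeds the next one. -->
  <xsl:template match="/">
    <xsl:variable name="cleaned">
      <xsl:apply-templates select="." mode="ex:cleanup"/>
    </xsl:variable>
    <xsl:variable name="profiled">
      <xsl:apply-templates select="$cleaned" mode="ex:profile"/>
    </xsl:variable>
    <xsl:variable name="normalized">
      <xsl:apply-templates select="$profiled" mode="ex:normalize"/>
    </xsl:variable>
    <xsl:apply-templates select="$normalized/*" mode="ex:root"/>
  </xsl:template>

  <!-- Default behaviour in every phase: copy the input unchanged. -->
  <xsl:template match="@*|node()" mode="ex:cleanup ex:profile ex:normalize">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Phase 1: record the base URI on the document element. -->
  <xsl:template match="/*" mode="ex:cleanup">
    <xsl:copy>
      <xsl:apply-templates select="@*" mode="#current"/>
      <xsl:attribute name="xml:base" select="base-uri(.)"/>
      <xsl:apply-templates select="node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Phase 3: make a defaulted title explicit. Only the simplest case
       (no title and no info at all) is handled here, and the title text
       is hard-coded instead of being looked up for the locale. -->
  <xsl:template match="db:bibliography[not(db:title) and not(db:info)]"
                mode="ex:normalize">
    <xsl:copy>
      <xsl:apply-templates select="@*" mode="#current"/>
      <db:info>
        <db:title>Bibliography</db:title>
      </db:info>
      <xsl:apply-templates select="node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Stand-in for the real formatting templates: emit the normalized
       document so the effect of the phases is visible. -->
  <xsl:template match="*" mode="ex:root">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Because the phases hang off the template that matches "/", anything that replaces that template also removes the phases.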

The problem with this approach is that it interferes with users' expectations. Not stylesheet users per se, but stylesheet customizers. If a customizer wants to change the root template (to tinker with the HTML page metadata, for example), he or she is naturally going to add a new root template to their customization layer:

<xsl:template match="/"> ... </xsl:template>

Trouble is, in this multi-phase approach, that just completely breaks everything. The root template expects to run the phases, and if you steal that, it all goes pear-shaped.

What you have to do instead is add this template to your customization layer:

<xsl:template match="*" mode="m:root"> ... </xsl:template>

Now, I suppose, in the grand scheme of things that's not so bad, but it's ugly.

My first approach to fixing this problem was to break the DocBook stylesheet into explicit, discrete phases. Unfortunately, this requires the user to actually set up a pipeline and run a series of separate transformations. In a post-XProc world, this might actually be OK, but today, it's setting the bar awfully high for your average user.

So I have a new approach. Instead of using the root template to run the phases, I use a named template. One of the features of XSLT 2.0 is that you can start a transformation with a named template.

Using this approach, you must tell the processor to start with the right template (using -it:format-docbook in the Saxon command-line case), but otherwise, you're free to customize the “/” template to your heart's content.

But you're hosed if you forget to start at the right named template. To detect this error, the format-docbook template injects a harmless processing instruction before the document element so that subsequent processing can determine if the user forgot to use the named template.
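
A minimal sketch of the named-template entry point and the marker check, assuming the source document is supplied as the context node when the transformation starts. The template name format-docbook comes from the text above; the ex: namespace, the ex-phases-done processing instruction name, the single placeholder phase, and the place where the check happens are all assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:ex="http://example.com/ns/sketch"
                exclude-result-prefixes="ex"
                version="2.0">

  <!-- Entry point: the processor is told to start here by name. -->
  <xsl:template name="format-docbook">
    <!-- Run the preparatory phases (a single identity phase in this sketch). -->
    <xsl:variable name="normalized">
      <xsl:apply-templates select="/" mode="ex:normalize"/>
    </xsl:variable>
    <!-- Inject a marker before the document element. -->
    <xsl:variable name="marked">
      <xsl:processing-instruction name="ex-phases-done"/>
      <xsl:copy-of select="$normalized/node()"/>
    </xsl:variable>
    <!-- Hand the marked document to the ordinary templates. -->
    <xsl:apply-templates select="$marked"/>
  </xsl:template>

  <!-- Placeholder phase: copy everything unchanged. -->
  <xsl:template match="@*|node()" mode="ex:normalize">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- The root template no longer runs the phases, so a customization
       layer can replace it without breaking them. -->
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- The check: if the marker is missing, the transformation did not
       start at format-docbook and the phases never ran. -->
  <xsl:template match="/*">
    <xsl:if test="empty(preceding-sibling::processing-instruction('ex-phases-done'))">
      <xsl:message terminate="yes">Start the transformation at the format-docbook template.</xsl:message>
    </xsl:if>
    <!-- Stand-in for the real formatting templates. -->
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Started at format-docbook, this copies the document through; started the ordinary way at "/", the document element template finds no marker and stops with the message. The marker is harmless because the built-in template rule for processing instructions produces no output.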

In the short and medium term, I think this is the right approach. In the long term, XProc will rule the world and we can simplify things further with a series of explicit transformations.

I've got this implemented in the “xsl2-namedt” branch of the repository.

If no one tells me why this approach is either stupid or insane, or both, I'll probably move it to the trunk in a couple of weeks.


Hi Norman,

I must admit I don't fully understand your point here. I don't understand at all why you say "that's not so bad, but it's ugly." Actually, I even think this could be better...

First, using a PI to check that the main processing has been set up as expected sounds rather convoluted to me. But that's about aesthetics, and it's not so important, since it should work at first glance.

More important is that you try to hide the processing architecture when it would instead be interesting for the customization layer. If the DocBook stylesheets need to use several passes, maybe the user would want to benefit from those passes. A user might want to translate one extension into plain DocBook elements in the "normalization" pass, and deal with another extension in the normalized document.

So it might be more interesting to document the several passes accurately and say to the beginner, "just put your template rules into the mode db:xxx."

But maybe I just didn't understand the problem?



—Posted by Florent Georges on 04 Mar 2008 @ 10:23 UTC #

Yes, perhaps the problem is less significant than it felt at first. I'll poke at it a bit before I commit. :-)

—Posted by Norman Walsh on 05 Mar 2008 @ 12:48 UTC #