Versioning DocBook

Volume 9, Issue 93; 06 Oct 2006; last modified 08 Oct 2010

Thoughts on the evolution of a technical documentation standard.

The TAG spent a fair chunk of our just-concluded face-to-face meeting discussing various aspects of the large and complex problem of versioning. I think we're making progress, but I won't try to summarize that discussion; you can find that in the minutes when they're published.

However, the question was asked in passing, “to what extent would we each tell a different story about versioning?” With that in mind, I thought I'd write down some thoughts about DocBook versioning. I've probably said some of this before, but I'll try to stick any future updates in this essay.

DocBook has experienced grammar changes, both visible and invisible. It has been subsetted and extended, and it has transitioned from SGML to XML, from not having a namespace to having one, and from DTDs as the normative schema language to RELAX NG.

On balance, I think that we've made good choices. At least, in a decade of use, I've never felt that we would have been better off if we'd made different choices.

What's a version?

DocBook uses version numbers of the major/minor variety: 4.5, for example. We take the numbers fairly seriously and describe our contract with users in terms of those numbers. Every now and then we've had version numbers in three parts, 4.1.2, for example, but those only arise to correct bugs in the distribution: fixing a catalog file or changing comments. They aren't part of the usual story and don't represent changes to the schema proper at all.

Version numbers used to be exposed only in the public identifier. Now they're present on the root element of every DocBook document.

Experience with XSLT suggests that this is a practical strategy. It feels like the right choice for DocBook, but I concede that this is a relatively recent change and one with which we have relatively little direct experience.
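
As a rough illustration of what having the version on the root element buys a consumer, here is a minimal sketch in Python. It assumes a DocBook V5.0 document in the http://docbook.org/ns/docbook namespace; the sample document and the helper name docbook_version are made up for this example and are not part of any DocBook tooling.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical DocBook V5.0 document used only for illustration.
SAMPLE = """<book xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Example</title>
</book>"""


def docbook_version(xml_text):
    """Return the version attribute of the root element, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.get("version")


print(docbook_version(SAMPLE))  # prints 5.0
```

A document that identifies its version only through a public identifier in the DOCTYPE declaration would return None here, which is the situation the root attribute is meant to avoid.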

What's a change?

From a schema design perspective, DocBook is about document validity. We consider one version of DocBook different from another to the extent that it accepts as valid a different set of documents. In particular, we have never promised to preserve the “invisible” markup constructs: in DTDs, the parameter entity structure; in RELAX NG, the set of patterns. We don't make invisible changes capriciously, but we treat them as editorial.

We classify changes as backwards compatible (or incompatible) in the traditional manner. If every document that is considered valid by the current version of the schema will also be considered valid by the next version, the next version is backwards compatible with the current version. If there may exist a document that is considered valid by the current version of the schema that would be considered invalid by the next version, the next version is backwards incompatible with the current version.
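
The definition above can be sketched in a few lines of Python. This is a toy model, not how DocBook is actually tested: a "schema" is reduced to the set of element names it allows, a "document" to the set of element names it uses, and the names accepts and is_backwards_compatible are hypothetical. A finite corpus can demonstrate incompatibility, but it can never prove compatibility for every possible document.

```python
def accepts(allowed):
    """Build a toy validator: a document is valid if it uses only allowed names."""
    allowed = frozenset(allowed)
    return lambda doc: doc <= allowed


def is_backwards_compatible(corpus, valid_now, valid_next):
    """True if every corpus document valid now is still valid in the next version."""
    return all(valid_next(doc) for doc in corpus if valid_now(doc))


# Documents are modelled as the set of element names they use.
corpus = [frozenset({"para"}), frozenset({"para", "warning"})]

current = accepts({"para", "warning"})
adds_element = accepts({"para", "warning", "note"})  # only accepts more documents
drops_element = accepts({"para"})                    # rejects documents using warning

print(is_backwards_compatible(corpus, current, adds_element))   # True
print(is_backwards_compatible(corpus, current, drops_element))  # False
```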

We don't have any notion of forwards compatibility. DocBook doesn't employ wildcards or open content models of any sort, its semantics do not include any form of “must ignore”, and there aren't any areas where validity is “lax” or “skip”. (Wildcards in the definition of SVG and MathML are a special case. Conceptually, those are shorthand for the appropriate schema; they aren't intended to be open content models.) This means that if you send “new markup” to an “old consumer”, the old consumer is expected simply to reject the document.

This is reasonable because DocBook is about prose authored by human beings. There are areas where I could imagine a plausible forward compatibility story; for example, I could imagine allowing arbitrary markup in info elements with a must ignore rule. But I'm not inclined to introduce this sort of facility. Human beings are unreliable creatures. If you're going to attempt to process a document in which an author has included markup that you haven't taken the trouble to allow explicitly, I think it's reasonable to require you to take the trouble to review it, determine what processing expectations the author intended, and adjust your systems so that the markup is explicitly allowed. Or remove the unexpected markup.

That's a little draconian, but interchange of prose documentation is difficult in practice. If you don't believe me, consider the interchange checklist in the reference documentation. And bear in mind that the checklist hasn't really been updated since the SGML days. I'm sure it should be longer.

What about customization layers?

We expect DocBook to be customized. (I assert that it works just fine out of the box and that, in fact, most users don't actually need to change anything, but that's not what this essay is about.) User customization introduces new issues.

Fortunately, our somewhat draconian attitude towards compatibility insulates us from many of the harder problems. Irrespective of what customization you've performed, before I put your document through my system, I'm going to validate it with the version (and possibly customization) of DocBook that I understand. If you've made a subset, or at least haven't used a superset, of what I understand, then everything will naturally work. If you've used an extension, I expect validity to fail and I can use the points of failure to determine what changes I'll need to make to process your documents.
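
To make "points of failure" concrete, here is a deliberately crude sketch. It is not validation: a real system would run a DTD or RELAX NG validator against the schema it understands. This version only compares element names against a vocabulary, and both KNOWN_ELEMENTS and unknown_elements are hypothetical names invented for the example. The hotkey element stands in for a customizer's extension.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "the DocBook that I understand".
KNOWN_ELEMENTS = {"article", "title", "para", "emphasis"}

# A document from someone whose customization adds a hotkey element.
INCOMING = (
    "<article><title>Shortcuts</title>"
    "<para>Press <hotkey>F1</hotkey> for help.</para></article>"
)


def unknown_elements(xml_text, known=KNOWN_ELEMENTS):
    """Return the sorted element names in the document that are not in `known`."""
    root = ET.fromstring(xml_text)
    return sorted({elem.tag for elem in root.iter() if elem.tag not in known})


print(unknown_elements(INCOMING))  # ['hotkey']
```

An empty list corresponds to the subset case, where everything just works; a non-empty list is the to-do list for adjusting the receiving system or removing the markup.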

We encourage customizers to document their intentions in the version identifier. This isn't expected to enable automated processing as much as give users a clue about what to expect in documents that use the customization.

What about semantics?

Yes, indeed, what about them? DocBook takes a fairly relaxed view of the semantics of its elements and attributes. It has to. Given the enormous variety of technical documentation and documentation styles, any attempt to rigorously bolt down the semantics of each element would inevitably lead to a huge proliferation of distinct elements with subtly variant semantics. That way lies madness for authors.

In practice, I think we've very rarely made incompatible changes to the documented semantics of existing elements (I can't actually think of any examples, but I won't assert that there aren't any).

If we decided to make a radical change to the semantics of an element, I think we'd probably introduce a new element instead. If we didn't or couldn't, I think we'd treat it like a backwards incompatible markup change, but I suppose it would be a judgement call since our working definition of compatibility is closely tied to validation.

Compatibility contract

So what do we actually say about compatibility and versioning? We have a straightforward set of rules designed to impose stability on DocBook evolution and provide security for users.

  1. Point releases are always backwards compatible with the previous major release. This means you can upgrade from version 4.0 to version 4.1, for example, confident that no existing 4.0 document will now be considered invalid.

  2. Major releases may introduce backwards incompatible changes. This means that the upgrade from 3.1 to 4.0, for example, requires more careful consideration. However, we ameliorate that somewhat with an additional rule:

  3. All of the backwards incompatible changes in a major version must have been announced in the previous major version. This means that when you upgrade to version 3.0, you'll already know all of the backwards incompatible changes for which you'll have to be prepared in version 4.0. (As I've said several times, we've given ourselves special dispensation to break this rule for the fairly radical transition to RELAX NG in V5.0.)

    We used to have the additional rule that major versions would be at least one calendar year apart, but there's no evidence to suggest that we're ever likely to run afoul of that one.
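
Read as a decision procedure, the first two rules say that only a change in the major number can invalidate existing documents. A minimal sketch, assuming version strings of the form "major.minor" (with an optional third part for distribution fixes) and assuming the new version is later than the old one; parse_version and upgrade_may_break_documents are hypothetical helper names:

```python
def parse_version(text):
    """Split a version string such as '4.1.2' into (major, minor) integers."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def upgrade_may_break_documents(old, new):
    """True if moving from `old` to `new` crosses a major version boundary."""
    old_major, _ = parse_version(old)
    new_major, _ = parse_version(new)
    return new_major > old_major


print(upgrade_may_break_documents("4.0", "4.1"))  # False: point release
print(upgrade_may_break_documents("3.1", "4.0"))  # True: major release
```

The third rule isn't captured here; it is a promise about announcements, not something a version comparison can check.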

SGML to XML

The transition from SGML to XML didn't remove or change the SGML DTD, so we can say that we held up our end of the contract. But the DocBook SGML DTD uses inclusions and exclusions, neither of which has any analog in XML.

Exclusions prevent markup from occurring where it would ordinarily be allowed. Paragraphs can contain warnings and warnings contain paragraphs, so warnings are recursive according to the content models, but it doesn't make sense to allow warnings to contain warnings, so the SGML exclusion feature is used to prevent them from appearing inside themselves.

Not having exclusions in XML is a backwards compatible change. It reduces the expressive power of the schema and allows more documents to be valid. This is unfortunate, but it's nothing new. There are already DocBook constraints that can't be expressed in the DTD.
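
A constraint that the DTD can no longer express can still be checked after validation. The following is only a sketch of that idea, using unnamespaced V4.x-style markup and a hypothetical helper, nested_violations, that counts warnings appearing anywhere inside another warning:

```python
import xml.etree.ElementTree as ET

# Valid against an XML DTD without exclusions, but not what anyone intends.
SAMPLE = (
    "<chapter><warning><para>Outer."
    "<warning><para>Inner.</para></warning>"
    "</para></warning></chapter>"
)


def nested_violations(root, name="warning"):
    """Count elements called `name` that have an ancestor also called `name`."""
    def walk(elem, inside):
        is_target = elem.tag == name
        count = 1 if (is_target and inside) else 0
        for child in elem:
            count += walk(child, inside or is_target)
        return count

    return walk(root, False)


print(nested_violations(ET.fromstring(SAMPLE)))  # 1
```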

Inclusions allow markup to occur where it would ordinarily not be allowed. For example, saying that book includes indexterm means that an index term can occur anywhere inside book or any of its descendants regardless of whether it is explicitly allowed in a content model or not.

They're a convenience for schema designers, but they make implementing processors more difficult. Back in the SGML days, I got bitten more than once by an indexterm in an unexpected place. One of the problems with inclusions is that they allow markup in illogical places. What does it mean for an index term to occur between two rows in a table? If there's a page break between those rows, which page is the index term on?

Not having inclusions in XML is a backwards incompatible change. It treats as invalid some documents that are valid in the presence of inclusions. We added index terms to more content models and this hasn't turned out to be a big deal in practice, although we did recently get a bug report about the fact that there are a handful of DocBook V4.x elements that have very small content models that don't include index terms. There's no good reason for not allowing them; they just didn't get added when we moved away from inclusions.

Adding a namespace

The compatibility impact of adding or changing a namespace is pretty dramatic. At a high level, it's not really a version of the original language, it's a new language.
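
A namespace-aware parser makes the point directly: what looks like the same element to a human reader has a different name to a processor. A small sketch, assuming the DocBook V5.0 namespace http://docbook.org/ns/docbook and two hand-written one-line documents:

```python
import xml.etree.ElementTree as ET

# The "same" paragraph, first in no namespace, then in the DocBook namespace.
old = ET.fromstring("<para>Hello</para>")
new = ET.fromstring('<para xmlns="http://docbook.org/ns/docbook">Hello</para>')

print(old.tag)             # para
print(new.tag)             # {http://docbook.org/ns/docbook}para
print(old.tag == new.tag)  # False
```

Any stylesheet or tool that matches on the unqualified name para will simply not see the namespaced element.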

Nevertheless, we decided to do it. It's the twenty-first century and all that.

Moving to RELAX NG

This transition is tied up to some extent with adding a namespace. There's nothing about moving to RELAX NG that required any backwards incompatible changes. Anything you can express in DTDs, you can express in RELAX NG.

But as with any large engineering task, growth by accretion eventually becomes a burden. Though it provides for stability and graceful change over time, each of those changes may involve compromises. Over time, the effects of those compromises become significant and it's valuable to engage in a more dramatic overhaul.

The transition from DocBook as defined by DTDs in no namespace to DocBook as defined by RELAX NG in a namespace is our opportunity to do this overhaul. I suppose the world might decide not to follow, but that seems unlikely.