On the changing nature of DocBook standardization

Volume 14, Issue 29; 22 Jul 2011

The focus of the DocBook Technical Committee recently has been document assembly and transclusion. That's…different.

For very nearly 20 years, DocBook development has focused almost exclusively on how to model the content of technical documents, specifically books about computer hardware and software.

DocBook has always enjoyed broader use than that, of course, because the basic structures of a technical book are the same as those of most other kinds of books: chapters, appendixes, sections, paragraphs, figures, tables, etc. We've even branched out recently to explicitly support a broader range of publishing content.

DocBook markup may be verbose, but it's always been easy to understand:

<chapter>
  <title>This is my chapter title</title>
  <para>This is the first paragraph of my chapter.</para>
  <para>Etcetera.</para>
</chapter>

Sure, there are dark corners in DocBook, as there are in most schemas. [I'm looking at you, msgset and funcsynopsis. -ed] And you might argue that some of the element names are longer than really necessary. I wish we'd chosen “p” as the name for paragraphs, for example. But DocBook's genesis was document interchange, not authoring, so the long names were considered a benefit: they would help interchange partners understand what was intended. You might think that pl would have been a better name than programlisting, and if you write a lot of documents that contain program listings, you might be right. But if you're shipping your documents over to some business partner who's going to translate them into a different schema, the long element names are appreciated. Trust me.

Against this long history of document modeling, it struck me the other day that our current big items for DocBook V5.1 aren't the same kind of modeling at all. The big one is document assembly. This is intended to provide a mechanism for improved reuse, the ability to construct new documents “on the fly” from a collection of “resources”.

The canonical use case for assembly is the software vendor who ships the same product for several different platforms, or several slightly different products customized for particular classes of customer. In either case, the vendor wants to provide documentation specific to each product. So users of platform A get the platform A documentation, users of platform B get the platform B documentation, etc. But from the vendor's perspective, the documentation for A and B is 80% the same. Assembly allows the vendor to maintain, let's say, 20 chapters of content and build the documentation for platform A from chapters 1, 3, 5-17, and 20 and the documentation for platform B from chapters 1, 2, 4, and 7-19.
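
To make that kind of selection concrete, here is a minimal Python sketch. It is not the DocBook assembly mechanism, only an illustration of the idea: the file names, the PLATFORMS table, and the helper functions are all hypothetical, and the chapter lists are simply the ones from the example above.

```python
def parse_chapters(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-17,20" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # An inclusive range such as "5-17"
            first, last = (int(n) for n in part.split("-", 1))
            chapters.update(range(first, last + 1))
        else:
            chapters.add(int(part))
    return sorted(chapters)


# Hypothetical pool of 20 maintained chapters (file names are assumptions).
RESOURCES = {n: f"chapter{n:02d}.xml" for n in range(1, 21)}

# Chapter selections per platform, taken from the example in the text.
PLATFORMS = {
    "A": "1,3,5-17,20",
    "B": "1,2,4,7-19",
}


def assemble(platform: str) -> list[str]:
    """Return the ordered list of chapter files that make up one platform's book."""
    return [RESOURCES[n] for n in parse_chapters(PLATFORMS[platform])]


def shared_chapters(a: str, b: str) -> list[int]:
    """Return the chapter numbers that both platforms' books draw on."""
    return sorted(set(parse_chapters(PLATFORMS[a])) & set(parse_chapters(PLATFORMS[b])))
```

Calling assemble("A") returns the sixteen chapter files for platform A in order, and shared_chapters("A", "B") lists the chapters that both builds reuse.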

There are lots of non-technical issues surrounding the development of high-quality documentation that can be reused in this way. I'm skeptical, but never mind that for now.

From a standardization perspective, the problem with assemblies is that they're a lot harder to describe in ways that are independent of processing. For DocBook content you can say: this is a book. It's an artifact, whole unto itself, with no special semantics beyond our ordinary, everyday understanding of what “book” means. (That's not entirely true; there are elements like annotation and cmdsynopsis with fairly rich processing expectations, but I still think of those as being declarative.)

The same is not true of assemblies. An assembly has complex relationships and prescribed processing semantics that don't feel as cleanly declarative. I worked fairly hard, unsuccessfully, to separate the assembly interpretation process from the document interpretation process. They're going to be inter-related.

Our other significant work item is transclusion. That too will result in new DocBook elements (or attributes or something) that have specific, prescribed processing expectations. The meaning of a document will depend on having tools that are able to process DocBook transclusions correctly.

I don't think it's wrong for the DocBook Technical Committee to extend DocBook in these ways. We're being driven by the evolving requirements of documentation producers and consumers.

But it's different, and I think it's worth acknowledging specifically that what we're doing is different. I think there's a greater risk of failure here and we should acknowledge that and proceed with keen awareness that it may take a few iterations to get this exactly right.

Comments

The main weak points I have found in DocBook over the last few years were more in how it is used than in DocBook itself. They were the opposite of document assembly.

The first came with XML databases, where DocBook files are manipulated directly by authors instead of the data being stored in its original format. The result is data loss, differing interpretations of the schema, and DocBook forks such as livredoc (a limited DocBook version with French names for tags and attributes).

The second is a side effect of the first: a DocBook file no longer describes a book, only an article or sometimes less. The main target is often a website, through CMS or ECM integration, or a web service. In that case, the transformations available for DocBook are at once too refined and not refined enough. Two examples from the XHTML transformation: there is no option to output only the content of the body, yet for an article's metadata (authors, contributors, publication date, ...) you get extra div or span tags that group some of the metadata together instead of leaving it all on the same level.

The last one comes back to document assembly. Outside a book, articles have complex lives and even more complex relationships. What used to be a book has become a specific view over articles, like a faceted selection on article metadata. Ontologies are used to describe this metadata, and the complexity of document assembly is a function of the complexity of the ontology. Each time, we created specific rules to reduce that complexity. That happened outside DocBook because the ontologies were outside DocBook. We think that was also easier to maintain and to test, with authorship on one side and structural features on the other. Putting them together will increase the complexity of the whole and still not cover all the use cases.