Ruminations on DocBook V.next

Volume 6, Issue 19; 21 May 2003

There comes a point in the life cycle of any system when adding one more patch is the wrong solution to every problem. Eventually, it's time to rethink, refactor, and rewrite. For DocBook, I think that time has come.

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

—Martin Fowler

The DocBook TC has been kicking the idea of DocBook V5.0 around for a long time. I think I've figured out why.

Considering the Past

These are my recollections of how DocBook developed. I do not claim that these are all facts, only that they are the most factual memories that I have.

It Was a Long Time Ago...

DocBook is more than ten years old; its design stretches back to the early 90's. Back then, men were real men, women were real women, and SGML applications were really rare and expensive. (What about XML, you ask? I'm not just talking pre-XML here, I'm talking pre-HTML.)

Hampered by the dearth and cost of commercial SGML applications, I eventually built my first publishing system with baling wire and duct tape instead (SP output and beta versions of Perl 5). I recall struggling to get SP through gcc so that I could get at the ESIS output of the parser.

The Tools Were Weak

The limitations of tools, and the limitations of SGML DTDs, were a constant influence on our design.

DocBook was for Exchange

The original vision for DocBook was that it would be principally an exchange DTD. Different vendors (of things like Unix and X Windows) would all use DocBook to share content and build common documentation libraries.

DocBook is a Victim of its Own Success

Over the years, DocBook has experienced “growth by accretion.” Decisions that were made early on (like allowing some elements to have titles both inside and outside of the info wrappers) seemed fine at the time, when there were probably only a handful of elements that had titles. But now those choices seem like inconsistent warts.

We Stumbled Once Before

We're also suffering from the consequences of an earlier refactoring attempt. The first refactoring of DocBook occurred between the 2.4.1 and 3.1 releases. Eve Maler rationalized the parameter entity structure and applied the methodology she developed with Jeanne El Andaloussi for developing SGML DTDs. (See Developing SGML DTDs: From Text to Model to Markup, published by Prentice-Hall PTR (1996, ISBN: 0-13-309881-8). Out of print, but still a valuable resource if you can get your hands on one.)

This refactoring was necessary and valuable, but it was never entirely complete. It left us with some pretty awkward content models:

<!element glossterm - o
  (#PCDATA|FootnoteRef|XRef|Abbrev
  |Acronym|Citation|CiteRefEntry
  |CiteTitle|Emphasis|FirstTerm
  |ForeignPhrase|GlossTerm|Footnote
  |Phrase|Quote|Trademark|WordAsWord
  |Link|OLink|ULink|Action|Application
  |ClassName|Command|ComputerOutput
  |Database|Email|EnVar|ErrorCode
  |ErrorName|ErrorType|Filename
  |Function|GUIButton|GUIIcon|GUILabel
  |GUIMenu|GUIMenuItem|GUISubmenu
  |Hardware|Interface|InterfaceDefinition
  |KeyCap|KeyCode|KeyCombo|KeySym
  |Literal|Constant|Markup|MediaLabel
  |MenuChoice|MouseButton|MsgText|Option
  |Optional|Parameter|Prompt|Property
  |Replaceable|ReturnValue|SGMLTag
  |StructField|StructName|Symbol
  |SystemItem|Token|Type|UserInput
  |VarName|Anchor|Author|AuthorInitials
  |CorpAuthor|ModeSpec|OtherCredit
  |ProductName|ProductNumber|RevHistory
  |Comment|Subscript|Superscript
  |InlineGraphic|InlineMediaObject
  |InlineEquation|Synopsis
  |CmdSynopsis|FuncSynopsis|IndexTerm)+>

(A command synopsis inside a glossary term? Unlikely.)

In the intervening years, we've talked many times about “reworking the parameter entities”, but we've postponed it indefinitely as we've fixed bugs and added features.

Considering the Present

Today, HTML exists. A lot more developers have gotten used to the idea of writing structured documentation. (Say what you want about the structure of most HTML, it did expose people to the idea of putting elements and attributes in their documents and separating structure from presentation, at least a little bit.)

Today, XML exists. XML has supplanted SGML in every significant way. XML parsers are nearly ubiquitous. The state of the art in tools for manipulating XML includes powerful tools like SAX, StAX, various flavors of DOM, and things like JAXB. On top of that platform, we have XSLT, XSL-FO, and support for transformation and rendering of XML in the browser.

Today, a lot of people author in DocBook. They do this for many reasons, and one of them is exchange, but they aren't principally writing in some private tag set, or deep customization of DocBook, and then converting to the standard to pass documents to other interchange partners. They're writing directly in standard DocBook.

A Modern Approach

If we were starting over, I think we'd approach the problem much differently:

We'd use XML.
We'd use RELAX NG.
We'd design for the web.
We'd design for regularity and consistency at the current scale. (Designing a schema of roughly 400 elements is different than designing a schema of roughly 100.)
We'd almost certainly put it in a namespace.
Perhaps controversially, we might allow foreign namespace elements to creep in. We might, for example, allow Dublin Core in metadata.
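
As a rough illustration of the last two points, here is what a namespaced document with Dublin Core metadata might look like. This is only a sketch: the `http://example.org/ns/docbook-vnext` namespace URI is a placeholder invented for this example, the author name is a placeholder too, and the decision to put `dc:` elements directly inside `info` is an assumption rather than a settled design. The Dublin Core namespace URI is the real one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The default namespace below is hypothetical; no V.next namespace has been chosen. -->
<article xmlns="http://example.org/ns/docbook-vnext"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <info>
    <title>Ruminations on DocBook V.next</title>
    <!-- Foreign-namespace metadata sitting beside native metadata. -->
    <dc:creator>A. N. Author</dc:creator>
    <dc:date>2003-05-21</dc:date>
  </info>
  <para>Body text goes here.</para>
</article>
```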

Design Principles

A good place to start would be some design principles. If 100 people are going to ask you to make 100 different changes, it's nice to have some rules for sorting out which ones make sense and which ones don't.

Whatever we do, it should still look and feel like DocBook. In all fairness, when I said “starting over”, I wasn't really thinking of going back to first principles and reinventing all the elements and content models. I think one of the goals should be that most valid DocBook documents can be transformed into new valid V.next documents with XSLT.
There are only a few kinds of elements: set and book; divisions (part and reference); components (preface, chapter, etc.); formal blocks (figure, example, etc.); blocks (para, blockquote, etc.); and inlines.
There are only three kinds of inlines: “just text”, general inlines, and domain-specific inlines.
All the metadata goes in an info wrapper. RELAX NG lets us have different content models for info depending on the context. (So it can have a required title for some elements, an optional title for others, and a forbidden title for yet others.)
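
The context-dependent `info` wrapper described in the last principle can be sketched in RELAX NG (XML syntax) along the following lines. The pattern names (`info.title.required` and friends), the choice of which element gets which flavor, and the single `author` metadata element are all assumptions made for illustration; namespaces are left out for brevity.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="chapter"/></start>

  <define name="title">
    <element name="title"><text/></element>
  </define>

  <!-- Stand-in for the rest of the metadata vocabulary. -->
  <define name="other.metadata">
    <zeroOrMore>
      <element name="author"><text/></element>
    </zeroOrMore>
  </define>

  <!-- Three patterns, all producing an element named "info". -->
  <define name="info.title.required">
    <element name="info">
      <ref name="title"/>
      <ref name="other.metadata"/>
    </element>
  </define>

  <define name="info.title.optional">
    <element name="info">
      <optional><ref name="title"/></optional>
      <ref name="other.metadata"/>
    </element>
  </define>

  <define name="info.title.forbidden">
    <element name="info">
      <ref name="other.metadata"/>
    </element>
  </define>

  <!-- Each parent picks the flavor of info that suits it. -->
  <define name="chapter">
    <element name="chapter">
      <ref name="info.title.required"/>
      <oneOrMore>
        <choice>
          <ref name="para"/>
          <ref name="blockquote"/>
        </choice>
      </oneOrMore>
    </element>
  </define>

  <define name="blockquote">
    <element name="blockquote">
      <optional><ref name="info.title.optional"/></optional>
      <oneOrMore><ref name="para"/></oneOrMore>
    </element>
  </define>

  <define name="para">
    <element name="para">
      <optional><ref name="info.title.forbidden"/></optional>
      <text/>
    </element>
  </define>
</grammar>
```

A DTD cannot express this, because a DTD allows only one declaration per element name. RELAX NG ties the content model to the pattern rather than to the name, which is what makes a single `info` element workable.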

Some Open Questions

I expect this section to get longer as I fiddle with instantiating an experimental V.next. These questions are in no particular order.

1.	Is the distinction between formal/informal useful anymore? I think it's a holdover from the days when building a “list of titles” based on whether or not the elements actually had titles was considered too hard. That's hardly the case these days.
2.	Are varying content models, such as described above for `info`, harder for users to understand? My intuition is no. I don't think most users envision things in terms of content models (“Oh, this is an `info` wrapper so it must (or must not) have a title.”); I think they envision things in terms of more semantic structures (“figures must have titles, titles go in the `info` wrapper.”).
3.	Ubiquitous linking is a no-brainer, at least on inlines. Does it make sense on blocks too? If we're going to allow `<phrase href="...">`, is there any reason not to allow `<chapter href="...">`? And if you say “yes”, what is the design principle that you use to distinguish between the two cases?
4.	Are inlines cheap? This is more of a long-term maintenance question, but we have a large pool of inlines in DocBook and enough elements to make it hard for new users to see what goes where. So, on one hand, adding new inlines gives better semantic markup for the users that need those inlines. On the other hand, it's yet more tags for new users to learn.
5.	Brace yourself. We're just about to slam squarely into the character entity problem. No DTD means no named character entities. I think this just adds fuel to the fire that says the right answer here is to publish some normative entity sets separately from the DTD. Then you can include the sets you need directly: `<!DOCTYPE article [ <!ENTITY % iso-lat1.ent PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML" "http://example.com/path/to/iso-lat1.ent"> %iso-lat1.ent; ]> <article xmlns="..."> ... </article>`
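
To make question 3 a little more concrete, here is a minimal RELAX NG sketch of how ubiquitous linking might be modeled. The `linking.attributes` pattern name and the use of `anyURI` for `href` are assumptions for the sake of the example, not proposals. The point is only that the schema cost of extending linking from inlines to blocks is a single reference, so the real question is one of semantics, not mechanics.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start><ref name="chapter"/></start>

  <!-- One shared pattern holds the linking attribute. -->
  <define name="linking.attributes">
    <optional>
      <attribute name="href"><data type="anyURI"/></attribute>
    </optional>
  </define>

  <!-- Inline: linking allowed. -->
  <define name="phrase">
    <element name="phrase">
      <ref name="linking.attributes"/>
      <text/>
    </element>
  </define>

  <define name="para">
    <element name="para">
      <mixed>
        <zeroOrMore><ref name="phrase"/></zeroOrMore>
      </mixed>
    </element>
  </define>

  <!-- Block: linking would be enabled by restoring the reference below. -->
  <define name="chapter">
    <element name="chapter">
      <!-- <ref name="linking.attributes"/> -->
      <oneOrMore><ref name="para"/></oneOrMore>
    </element>
  </define>
</grammar>
```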

Frankly, I Like the Timing

The imminent release of XSLT 2.0 is an ideal opportunity to refactor the DocBook XSL Stylesheets. Supporting a refactored DocBook schema at the same time makes good engineering sense.

Considering the Future

The XML world will continue to evolve. We should bear that in mind. Designing so we can add new features incrementally will keep DocBook stable and useful for another 10 years. Until the next refactoring.

Herewith, some things to bear in mind.

Using Schematron assertions with existing RELAX NG grammars gives us the ability to validate conditions that aren't easily modeled with grammar-based languages. For example, typed links (glossterms should only point to glossentrys, etc.).
A future version of RELAX NG might give us back our exclusions.
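
The typed-link example can be sketched as a Schematron rule. This is a minimal sketch under a few assumptions: it uses the ISO Schematron namespace, it assumes the legacy `linkend` and `id` attribute names, and it leaves the DocBook elements un-namespaced for brevity (a namespaced V.next would also need an `ns` declaration and prefixed names).

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Only glossterms that actually carry a link are checked. -->
    <rule context="glossterm[@linkend]">
      <!-- True when some glossentry in the document has an id equal to the linkend value. -->
      <assert test="@linkend = //glossentry/@id">
        A glossterm should only point to a glossentry.
      </assert>
    </rule>
  </pattern>
</schema>
```

A grammar can say that `linkend` must hold a reference to some ID, but not what kind of element owns that ID. That is the gap the assertion fills.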

Are You On Crack?

It's certainly fair to ask: should we do this at all?

There's a lot of legacy out there. Of course, nothing that's suggested here will ever break that legacy. It'll still be valid DocBook and there will still be tools that process it. The only concern I really have on this front is how painful it will be for users of legacy systems to move forward.
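
The design goal that most valid DocBook documents can be transformed into V.next with XSLT suggests that moving forward need not be too painful. A minimal XSLT 1.0 sketch of such a migration follows. It assumes a hypothetical target namespace (`http://example.org/ns/docbook-vnext`), assumes the legacy input is un-namespaced, and handles only one structural change: moving a chapter's title into a single `info` wrapper. A real conversion would need many more templates.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://example.org/ns/docbook-vnext">

  <!-- Default: recreate each element under the same local name in the
       (hypothetical) V.next namespace, keeping its attributes. Text is
       copied by the built-in rules; comments and processing
       instructions are dropped in this sketch. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}"
                 namespace="http://example.org/ns/docbook-vnext">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- Legacy chapters carry a title outside the info wrapper.
       Gather the title and any chapterinfo content into one info. -->
  <xsl:template match="chapter[title]">
    <chapter>
      <xsl:copy-of select="@*"/>
      <info>
        <xsl:apply-templates select="title"/>
        <!-- Skip a duplicate title inside chapterinfo, if present. -->
        <xsl:apply-templates select="chapterinfo/*[not(self::title)]"/>
      </info>
      <xsl:apply-templates
          select="node()[not(self::title or self::chapterinfo)]"/>
    </chapter>
  </xsl:template>

</xsl:stylesheet>
```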

Maybe it would be better to just declare DocBook finished and move on? I've pretty well convinced myself that piling on yet more fixes is not practical.