More Ruminations on DocBook

Volume 6, Issue 24; 29 May 2003

Some ideas about what a refactored DocBook might look like, and a prototype.

He would like to start from scratch. Where is scratch?

Elias Canetti

It doesn't seem quite fair to suggest scrapping DocBook without at least considering what should replace it. And rather than waiting until I'm finished, I think it probably makes sense to publish what I've cooked up. It's maybe three-quarters finished, maybe a little more. In any event, it's just one guy's idea.

In general terms, my changes fall into four categories: rationalize the content model of inlines, normalize the metadata, discard cruft, and make changes that appear (to me) to simplify things.

Rationalizing Inlines

I've divided inlines into three classes: ubiquitous inlines (ones that should be available everywhere), general inlines, and domain-specific inlines.

In trying to find a design principle to discriminate between what should go in the content model of a particular inline and what should not, I eventually settled on a simple one: any given inline contains just text or it contains every inline. In my prototype, a lot of inlines contain just text.

Just Text

Given that there are some ubiquitous elements, what does “just text” mean? It means the following:

ubiq.inlines = db.inlinemediaobject
             | db.anchor
             | db.indexterm
             | db.remark
docbook.text = text | ubiq.inlines
             | text.phrase | text.replaceable

Anywhere that character data is allowed, so are inlinemediaobject, anchor, indexterm, remark, and special forms of phrase and replaceable. I've included inlinemediaobject because it's the traditional DocBook way of allowing special characters: less necessary in XML, but still valuable enough in legacy terms to justify inclusion.

What's special about phrase and replaceable in this context is that they contain “just text”. In contexts where all inlines are allowed, they're allowed inside phrase and replaceable too.
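
To make the two cases concrete, here is a minimal sketch in RELAX NG compact syntax. It is not taken from the prototype: the stand-in definitions for the ubiquitous inlines, the filename and emphasis elements, and the pattern names all.inlines, db.filename, db.phrase, db.emphasis, and db.para are assumptions chosen for illustration.

```rnc
# Sketch only: everything except docbook.text, ubiq.inlines, text.phrase,
# and text.replaceable is an assumed name, not the prototype's.
start = db.para

# Stand-ins for the ubiquitous inlines (real content models omitted).
ubiq.inlines = element anchor { attribute id { text } }
             | element remark { text }

# "Just text": character data, ubiquitous inlines, and the text-only
# forms of phrase and replaceable.
docbook.text = text | ubiq.inlines | text.phrase | text.replaceable

text.phrase      = element phrase { docbook.text* }
text.replaceable = element replaceable { docbook.text* }

# An inline that contains just text.
db.filename = element filename { docbook.text* }

# "Every inline": where all inlines are allowed, phrase allows them too.
all.inlines = text | ubiq.inlines | db.filename | db.phrase | db.emphasis
db.phrase   = element phrase { all.inlines* }
db.emphasis = element emphasis { all.inlines* }

db.para = element para { all.inlines* }
```

With these definitions, a phrase inside filename may hold only text and the ubiquitous inlines, while a phrase inside para may hold emphasis or filename as well. The element name is the same in both places; only the pattern differs.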

Normalizing Metadata

DocBook has a dozen or more flavors of metadata wrapper (bookinfo, chapterinfo, sidebarinfo, etc.). It has all these flavors because DTDs only allow one content model per element name and we wanted to provide some way for customizers to require or restrict metadata in different contexts.

RELAX NG removes the restriction that there can only be one content model per element name and allows us to replace all of these multifarious elements with a single wrapper: info.

Out of the box, info comes in three flavors: with a required title, with an optional title, and with titles forbidden. The grammar is arranged so that customizers who need or want to add more flavors can easily do so, without adding more element names.

I've also taken the liberty of enforcing two additional rules: title, titleabbrev, and subtitle must appear first (and in that order) if they're allowed or required, and they may appear only once.

And titles are only allowed inside info. You can't have them outside anymore.
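
The three flavors might be sketched like this in RELAX NG compact syntax. The pattern names (titlereq.info, titleopt.info, titleforbidden.info, info.other) and the choice of which elements use which flavor are assumptions for illustration, and the rest of the metadata is reduced to a single stand-in.

```rnc
# Sketch only: pattern names and flavor assignments are assumed.
start = db.book

db.title       = element title { text }
db.titleabbrev = element titleabbrev { text }
db.subtitle    = element subtitle { text }

# Stand-in for all the other metadata an info may carry.
info.other = element author { text }*

# Titles come first, in a fixed order, and at most once each.
titlereq.info       = element info { db.title, db.titleabbrev?, db.subtitle?, info.other }
titleopt.info       = element info { (db.title, db.titleabbrev?, db.subtitle?)?, info.other }
titleforbidden.info = element info { info.other }

# One element name, info, with a different pattern in each context.
db.book    = element book    { titlereq.info, db.chapter+ }
db.chapter = element chapter { titlereq.info, (db.para | db.sidebar)+ }
db.sidebar = element sidebar { titleopt.info?, db.para+ }
db.para    = element para    { text }
```

A customizer who wanted a fourth flavor would add another pattern that also produces an info element and refer to it wherever it applies. This sketch assumes that titleabbrev and subtitle are only allowed when a title is present, which the text above doesn't settle.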

Discarding Cruft

Some stuff just has to go. I have no doubt that for every element in DocBook, there's a user somewhere. But I believe experience suggests that some of them are not worth the complexity they carry.

My list of candidates for deletion:

  • msgset. And, perhaps more controversially, simplemsgset.

  • graphic, inlinegraphic, graphicco.

  • sgmltag. Replaced by xmltag.

  • authorblurb. Replaced by personblurb.

  • toc and lot. Replaced by much simpler toc markup.

  • caption. Maybe we should allow captions on figures, but allowing them on mediaobject is clunky.

  • modespec; invpartnumber, pubsnumber, isbn, and issn (use biblioid); structname, structfield, medialabel, interface, action, property, otheraddr, contractnum, contractsponsor, corpauthor (author now allows either a personname or an orgname), corpname (replaced by orgname), beginpage (good riddance!), ackno, alt, and collabname.

  • Also segmentedlist.

  • link, olink, ulink. Replaced by ubiquitous linking. Every element can have either a linkend attribute or an href attribute.

  • Enumerated section elements (sect1, sect2, refsect1, etc.). Again, these exist because there was no other way to limit recursive depth in DTDs. In RELAX NG, you can do it without forcing the author to think about the element names.
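
As an illustration of the last two items, here is a minimal RELAX NG compact sketch of ubiquitous linking and of depth-limited sections that share one element name. The pattern names (common.linking, block, section.1 through section.3) and the depth of three are assumptions, and the info and title markup is left out to keep it short.

```rnc
# Sketch only: pattern names and the depth limit are assumed.
start = element chapter { common.linking, section.1* }

# Ubiquitous linking: at most one of linkend or href on any element.
common.linking = (attribute linkend { xsd:IDREF }
                | attribute href { xsd:anyURI })?

block = element para { common.linking, text }

# One element name, three patterns: nesting stops at depth three.
section.1 = element section { common.linking, block*, section.2* }
section.2 = element section { common.linking, block*, section.3* }
section.3 = element section { common.linking, block* }
```

The author only ever types section. A fourth level of nesting fails validation because section.3 has no place for another section, and a customizer could change the limit by adding or removing a pattern.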

Too aggressive, perhaps. Or not aggressive enough. Certainly not a finished, final list.


Finally, I've made some organizational changes. Some of these are documented as future use changes in V4.0, some are not.

In no particular order:

  • The components of a personal name (firstname, surname, etc.) are no longer allowed free-standing. You have to wrap them in a personname.

  • I've explicitly allowed both CALS and HTML table models. RELAX NG lets us segregate them so there's no overlap: it's exactly one or exactly the other. Perhaps HTML tables should (also or only?) be allowed in the XHTML namespace?

  • I removed the format attribute from verbatim environments.

  • I dropped the class attribute from productname.

  • I made title mandatory on equation.

  • I removed the srccredit attribute from imagedata and friends. Those elements now allow info and the credit can more properly go there.

  • I removed contrib; use othercredit instead.
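
Two of these changes lend themselves to a short RELAX NG compact sketch: name parts that must sit inside personname, and CALS and HTML tables kept apart under the same element name. The pattern names and the heavily reduced content models are assumptions for illustration; real table models carry far more structure and attributes.

```rnc
# Sketch only: pattern names and content models are assumed and simplified.
start = element article { (db.author | db.table)* }

# Name parts appear only inside personname, never free-standing.
db.personname = element personname {
  element firstname { text }?,
  element surname { text }
}
db.orgname = element orgname { text }

# author takes a person or an organization, which is what lets corpauthor go.
db.author = element author { db.personname | db.orgname }

# Two patterns share the element name "table". Any given table matches
# one pattern or the other, never a mixture of the two.
cals.table = element table {
  element tgroup {
    attribute cols { xsd:positiveInteger },
    element tbody { element row { element entry { text }+ }+ }
  }+
}
html.table = element table {
  element tr { element td { text }+ }+
}
db.table = cals.table | html.table
```

A table containing both tgroup and tr children matches neither pattern, so it is rejected.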

A Prototype

All this work resulted in a prototype and a stylesheet that converts (some) DocBook V4.2 documents to conform to the prototype.

One important change that I haven't made (yet) is putting DocBook in a namespace. But we should.