Some ideas about what a refactored DocBook might look like, and a prototype.

He would like to start from scratch. Where is scratch?

Elias Canetti

It doesn't seem quite fair to suggest scrapping DocBook without at least considering what should replace it. And rather than waiting until I'm finished, I think it probably makes sense to publish what I've cooked up. It's maybe three-quarters finished, maybe a little more. In any event, it's just one guy's idea.

In general terms, my changes fall into four categories: rationalize the content model of inlines, normalize the metadata, discard cruft, and make changes that appear (to me) to simplify things.

Rationalizing Inlines 

I've divided inlines into three classes: ubiquitous inlines (ones that should be available everywhere), general inlines, and domain-specific inlines.

In trying to find a design principle to discriminate between what should go in the content model of a particular inline and what should not, I eventually settled on a simple one: any given inline contains just text or it contains every inline. In my prototype, a lot of inlines contain just text.

Just Text 

Given that there are some ubiquitous elements, what does “just text” mean? It means the following:

  ubiq.inlines = db.inlinemediaobject
               | db.anchor
               | db.indexterm
               | db.remark
  docbook.text = text | ubiq.inlines
               | text.phrase | text.replaceable

Anywhere that character data is allowed, so is <inlinemediaobject> (because it's the traditional DocBook way of allowing special characters; less necessary in XML but still valuable enough in legacy terms to justify inclusion), <anchor>, <indexterm>, <remark>, and special forms of <phrase> and <replaceable>.

What's special about <phrase> and <replaceable> in this context is that they contain “just text”. In contexts where all inlines are allowed, they're allowed inside <phrase> and <replaceable> too.
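
As a rough sketch of how the two forms can coexist (this is not taken from the prototype itself), the fragment below gives <phrase> and <replaceable> one pattern for “just text” contexts and another for contexts where every inline is allowed. The names all.inlines, db.phrase, db.replaceable, db.emphasis, and db.filename are assumptions made for illustration, as is the choice of which inline lands on which side of the rule; the ubiquitous inlines are reduced to stand-ins.

  # Stand-ins for the four ubiquitous inlines; the real patterns are richer.
  ubiq.inlines = element inlinemediaobject { empty }
               | element anchor { empty }
               | element indexterm { text }
               | element remark { text }

  # "Just text": character data, the ubiquitous inlines, and the
  # text-only forms of phrase and replaceable.
  docbook.text = text | ubiq.inlines | text.phrase | text.replaceable
  text.phrase = element phrase { docbook.text* }
  text.replaceable = element replaceable { docbook.text* }

  # "Every inline": the same element names with the wider content model.
  # (all.inlines and the db.* names here are hypothetical.)
  all.inlines = docbook.text | db.phrase | db.replaceable
              | db.emphasis | db.filename
  db.phrase = element phrase { all.inlines* }
  db.replaceable = element replaceable { all.inlines* }
  db.emphasis = element emphasis { all.inlines* }    # contains every inline
  db.filename = element filename { docbook.text* }   # contains just text

  start = element para { all.inlines* }

Under this sketch, an <emphasis> inside a <filename> is invalid, while a <phrase> inside a <filename> is valid only as long as that <phrase> itself holds just text.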

Normalizing Metadata 

DocBook has a dozen or more flavors of metadata wrapper (<bookinfo>, <chapterinfo>, <sidebarinfo>, etc.). It has all these flavors because DTDs only allow one content model per element name and we wanted to provide some way for customizers to require or restrict metadata in different contexts.

RELAX NG removes the restriction that there can only be one content model per element name and allows us to replace all of these multifarious elements with a single wrapper: <info>.

Out-of-the-box, <info> comes in three flavors: with a required title, with an optional title, and with titles forbidden. The grammar is arranged so that customizers who need or want to add more flavors can easily do so, without adding more element names.

I've also taken the liberty of enforcing two additional rules: <title>, <titleabbrev>, and <subtitle> must appear first (and in that order) if they're allowed or required, and they may appear only once.

And titles are only allowed inside <info>. You can't have them outside anymore.
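
A minimal sketch of the three flavors, assuming hypothetical pattern names (info.titles, info.titlereq, info.titleopt, info.notitle, info.elements) and a drastically reduced set of metadata elements. It also assumes that <titleabbrev> and <subtitle> only make sense when a <title> is present, and that a <footnote> is a plausible place to forbid titles; the text above does not settle either point.

  # Stand-ins for the title elements and for the rest of the metadata.
  db.title = element title { text }
  db.titleabbrev = element titleabbrev { text }
  db.subtitle = element subtitle { text }
  info.elements = element author { text } | element pubdate { text }

  # Titles come first, in this order, at most once each.
  info.titles = db.title, db.titleabbrev?, db.subtitle?

  # Three content models, one element name.
  info.titlereq = element info { info.titles, info.elements* }
  info.titleopt = element info { info.titles?, info.elements* }
  info.notitle = element info { info.elements* }

  # Each context picks the flavor it needs.
  db.para = element para { text }
  db.sidebar = element sidebar { info.titleopt?, db.para+ }
  db.footnote = element footnote { info.notitle?, db.para+ }
  start = element chapter {
      info.titlereq,
      (db.para | db.sidebar | db.footnote)+
  }

A customizer who wants a fourth flavor would define one more pattern that also uses the element name info and reference it from whichever contexts need it.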

Discarding Cruft 

Some stuff just has to go. I have no doubt that for every element in DocBook, there's a user somewhere. But I believe experience suggests that some of them are not worth the complexity they carry.

My list of candidates for deletion:

  • <msgset>. And perhaps more controversially <simplemsgset>.

  • <graphic>, <inlinegraphic>, <graphicco>.

  • <sgmltag>. Replaced by <xmltag>.

  • <authorblurb>. Replaced by <personblurb>.

  • <toc> and <lot>. Replaced by much simpler <toc> markup.

  • <caption>. Maybe we should allow captions on figures, but allowing them on <mediaobject> is clunky.

  • <modespec>, <invpartnumber>, <pubsnumber>, <isbn>, and <issn> (use <biblioid>), <structname>, <structfield>, <medialabel>, <interface>, <action>, <property>, <otheraddr>, <contractnum>, <contractsponsor>, <corpauthor> (<author> now allows either a <personname> or an <orgname>), <corpname> (replaced by <orgname>), <beginpage> (good riddance!), <ackno>, <alt>, and <collabname>.

  • Also <segmentedlist>.

  • <link>, <olink>, <ulink>. Replaced by ubiquitous linking. Every element can have either a linkend attribute or an href attribute.

  • Enumerated section elements (<sect1>, <sect2>, <refsect1>, etc.). Again, these exist because there was no other way to limit recursive depth in DTDs. In RELAX NG, you can do it without forcing the author to think about the element names.

Too aggressive, perhaps. Or not aggressive enough. Certainly not a finished, final list.
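
Two of the items above lend themselves to a short illustration: ubiquitous linking and depth-limited sections. The fragment is a sketch under stated assumptions, not the prototype's grammar. The names linking.attributes and section.level1 through section.level3 are hypothetical, the depth limit of three is arbitrary, and the datatypes on linkend and href are a guess.

  # At most one of linkend or href on any element that includes this pattern.
  linking.attributes =
      (attribute linkend { xsd:IDREF } | attribute href { xsd:anyURI })?

  db.info = element info { element title { text } }
  db.para = element para { linking.attributes, text }

  # One element name, three patterns: nesting stops at the third level.
  section.level1 = element section {
      linking.attributes, db.info, db.para*, section.level2*
  }
  section.level2 = element section {
      linking.attributes, db.info, db.para*, section.level3*
  }
  section.level3 = element section {
      linking.attributes, db.info, db.para*
  }

  start = element article { db.info, db.para*, section.level1* }

The author writes <section> at every level; a fourth nested <section> simply fails to validate, and an element carrying both linkend and href is rejected as well.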


Simplifying Things

Finally, I've made some organizational changes. Some of these are documented as future use changes in V4.0, some are not.

In no particular order:

  • The components of a personal name (<firstname>, <surname>, etc.) are no longer allowed free-standing. You have to wrap them in a <personname>.

  • I've explicitly allowed both CALS and HTML table models. RELAX NG lets us segregate them so there's no overlap: it's exactly one or exactly the other. Perhaps HTML tables should (also or only?) be allowed in the XHTML namespace?

  • I removed the format attribute from verbatim environments.

  • I dropped the class attribute from <productname>.

  • I made <title> mandatory on <equation>.

  • I removed the srccredit attribute from <imagedata> and friends. Those elements now allow <info> and the credit can more properly go there.

  • I removed <contrib>; use <othercredit> instead.
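
To make two of these changes concrete, here is a small sketch (not the prototype's grammar) of the <personname> wrapper and of the segregated table models. The pattern names cals.table and html.table are assumptions, and both table models are cut down to a handful of elements.

  # Name parts are only reachable through personname.
  db.personname = element personname {
      element firstname { text }?,
      element surname { text }
  }
  db.orgname = element orgname { text }
  db.author = element author { db.personname | db.orgname }

  # Two content models for <table>; a given table matches one or the other.
  cals.table = element table {
      element tgroup {
          attribute cols { xsd:positiveInteger },
          element tbody { element row { element entry { text }+ }+ }
      }+
  }
  html.table = element table {
      element tr { (element th { text } | element td { text })+ }+
  }
  db.table = cals.table | html.table

  start = element article { db.author, db.table* }

Because the two patterns share no child elements, a <table> that mixes <tgroup> with <tr> matches neither and is rejected.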

A Prototype 

All this work resulted in a prototype and a stylesheet that converts (some) DocBook V4.2 documents to conform to the prototype.

One important change that I haven't made (yet) is putting DocBook in a namespace. But we should.
