Creating a DocBook V5.0 DTD

Volume 13, Issue 10; 18 Mar 2010; last modified 08 Oct 2010

Taking another stab at the long-standing problem of producing DTD (and XSD) versions of the DocBook V5.0 family of schemas.

In the course of preparing the DocBook V5.0 schemas, I devised a process that would convert the DocBook V5.0 RELAX NG grammar into an XML DTD (we get the XSD by running Trang over the DTD). Closer inspection quickly reveals two flaws in this process:

  1. It's incredibly brittle; while it successfully converts the base DocBook schema, it's utterly useless on even a simple customization layer.

  2. It produces utterly crap DTDs.

The former problem is the one that's really causing me pain, though I admit I'm a little embarrassed by the second.

The Publishers Subcommittee identified an XML DTD as a requirement (Come on tool vendors! Get your act together. It's the twenty-first fscking century, already!). My hopes that someone else would fix the problem before I got to it went unfulfilled, so over the last few days I've turned my attention seriously to the problem. A few observations:

  • The general problem may be insoluble. It may simply be that there's no algorithmic path from the simple and expressive constraints of RELAX NG to the much less expressive constraints of DTDs. Or maybe there are several paths, but the results will never look rational to a human observer. Or maybe the solution is out there, just waiting for some enterprising grad student to find it (nudge, nudge). I'm not sure. I decided that I didn't have time to look for that solution.

  • If the computer can't solve the whole problem, then we'll have to rely on human intervention to solve the hardest parts. In the DocBook family of schemas, the apparent hard parts are attribute co-constraints and multiple patterns for the same element name.

  • The solution in both cases is to replace the problematic patterns with a single pattern that is the union of the various options. This creates content models that are too broad: they accept all valid documents, but they also accept some invalid documents. The union of the CALS and HTML table models, for example, is a model in which a tbody element can contain a mixture of CALS row elements and HTML tr elements, among other atrocities. Such was it always with XML DTDs.

  • The constraints on mixed content in XML DTDs are a pain in the *ss. In a DTD, mixed content must be expressed as (#PCDATA | a | b | c | …)*. The #PCDATA token must come first, the alternatives must be at the top level (no nested parentheses), and there must be no duplicates among the alternatives.

  • There are a few places where we allow extensions in “other namespaces.” The info elements can contain arbitrary additional metadata elements, for example, and equations can contain any MathML markup. DTDs and namespaces do not play well together. It might be possible to create a DTD that allowed MathML in the appropriate places, but that's more than the minimum needed to declare victory.

  • Patterns are a little bit like parameter entities. (Ok, a really, really little bit.) It would be nice, where possible, to represent the patterns as parameter entities in the resulting schema. At worst, it does no harm; at best, it makes the DTD easier to read and may allow some small amount of customization of the DTD, not that I'd recommend that! And in any event, it will solve at least a tiny part of the second problem mentioned above.
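To make the "too broad" point about unions concrete, here is a minimal Python sketch. It is not part of the conversion process; it models a drastically simplified tbody as a regular expression over child element names. The real CALS and HTML table models contain more than this, and the names CALS_TBODY, HTML_TBODY, UNION_TBODY, and accepts are invented for illustration.

```python
import re

# Simplified content models, written as regular expressions over a
# space-terminated sequence of child element names. The real DocBook
# models are richer; these are assumptions for illustration only.
CALS_TBODY = re.compile(r"(row )+")        # CALS: one or more row
HTML_TBODY = re.compile(r"(tr )+")         # HTML: one or more tr
UNION_TBODY = re.compile(r"((row|tr) )+")  # DTD-style union of both


def accepts(model, children):
    """Return True if the sequence of child names matches the model."""
    text = "".join(name + " " for name in children)
    return model.fullmatch(text) is not None


mixed = ["row", "tr", "row"]
print(accepts(CALS_TBODY, mixed))   # False
print(accepts(HTML_TBODY, mixed))   # False
print(accepts(UNION_TBODY, mixed))  # True
```

Every sequence that either original model accepts is still accepted by the union, but so is the mixture that neither original allows.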

With these things in mind, I decided to adopt the following approach:

  1. Create a “DTD” customization layer in RELAX NG that removes the most difficult problems: create unions for the attribute co-constraints, create unions for the elements that are defined by several patterns, remove the elements in “other namespaces” extension points, etc.

  2. Create an “override” document for describing a few more operations. For example, removing the db._phrase pattern and changing all the patterns that use it so that they use db.phrase instead. (There may be an easy way to accomplish that in the RELAX NG customization layer, but it eluded me.)

  3. Massage the modified schema until it's possible to create a DTD from it.

I'd like to say that there was some deep, theoretical insight in the last step, but there wasn't. I just built a pipeline of transformations that got me from A to B. I looked at the document, found something that wouldn't work in a DTD, wrote a transformation to remove it, and added that transformation to the pipeline. Repeat until done. When the result was accepted by an XML parser and accepted a small, valid DocBook document, I called it done.

Here's a 10,000 foot summary of the process.

  1. Starting with the “DTD” customization layer, perform some simplifications. Discard documentation, Schematron rules, the start pattern, divisions, etc. Turn interleaves into choices; this is a little risky, but seems to be ok in the DocBook family of schemas. Extract the content of pattern definitions, producing a set of elements and a set of “parameter entities”. Drop schema facet constraints and notAllowed content on the floor. Etc.

  2. Apply the overrides, as described above.

  3. Remove “choice” wrappers from around attributes. Fiddle with how “optional” is expressed. In RELAX NG, it's a wrapper; for some steps in this process, it's more convenient to make it an attribute.

  4. Remove parameter entities that are no longer referenced.

  5. Fiddle with “optional” again, moving the optionality down to the references. (An optional reference to something is the same as a reference to an optional something.)

  6. In subsequent steps, it's going to be convenient to be able to distinguish references to attributes from other references, so turn all ref elements that point exclusively to attributes into attref elements.

  7. Flatten chains of references to attributes. (If A points to B points to C points to D which is an attribute, then just make A point to D.)

  8. Fiddle with “optional” again. This time move the optionality up to the attribute declaration. This may require splitting a declaration.

  9. Check for element names defined by more than one pattern. There better not be any.

  10. Remove “empty” parameter entities and references to them.

  11. Pull “text” up. Replace any reference to a parameter entity that contains #PCDATA with a copy of what the parameter entity contains.

  12. Unwrap nested “zero-or-more” elements.

  13. Sort the parameter entities so that we never attempt to use one before it's been declared.

  14. Convert the resulting document into a DTD. Turn parameter entities into !ENTITY declarations, turn elements into !ATTLIST and !ELEMENT declarations, substitute DTD attribute types for the specified types, expand mixed content models, etc.
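Two of the steps above are easy to sketch outside of XSLT. The following Python is an illustration only, not the actual stylesheets: the data structures (plain dictionaries and sets) and the entity names in the example are assumptions, and the sort uses the standard library's topological sorter rather than whatever the sort-pe stylesheet does.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def flatten_attrefs(refs, attributes):
    """Step 7: make every reference point directly at an attribute.

    refs maps a name to the name it references; attributes is the set
    of names that are real attribute declarations. Assumes every chain
    ends at an attribute and that there are no cycles.
    """
    flat = {}
    for name, target in refs.items():
        while target not in attributes:
            target = refs[target]
        flat[name] = target
    return flat


def sort_entities(uses):
    """Step 13: order entities so each is declared before it is used.

    uses maps an entity name to the set of entity names it references.
    """
    return list(TopologicalSorter(uses).static_order())


# A points to B points to C points to D, which is an attribute.
print(flatten_attrefs({"A": "B", "B": "C", "C": "D"}, {"D"}))

# Hypothetical entity names: each one uses the next.
order = sort_entities({
    "para.content": {"inlines"},
    "inlines": {"phrase.names"},
    "phrase.names": set(),
})
print(order)
```

In the flattened result A, B, and C all point at D, and in the sorted result phrase.names comes before inlines, which comes before para.content.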

The most disappointing part of that last step is fully expanding the content model of every element that contains mixed content. I'd been working pretty hard all along to preserve as many pattern names as possible as parameter entities.

The problem is that even though all the relevant parameter entities are simple lists of elements (so they could appear in a mixed content element declaration), sometimes the same element name appears in more than one pattern. So I punted, expanded them all, and removed duplicates. I still think it might be possible to do better.
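As a rough sketch of that expand-and-deduplicate fallback (again in Python rather than the real XSLT, with hypothetical entity names), the function below expands every parameter entity reference, drops repeated element names, and emits a model in the only shape a DTD permits for mixed content.

```python
def mixed_content_model(names, entities):
    """Expand entity references and build a DTD mixed content model.

    names is a list of element names and "%entity;" references;
    entities maps an entity name to a list of the same kind of items.
    Assumes the entity definitions contain no cycles.
    """
    seen = []

    def expand(items):
        for item in items:
            if item.startswith("%") and item.endswith(";"):
                expand(entities[item[1:-1]])
            elif item not in seen:
                seen.append(item)  # keep first occurrence, skip duplicates

    expand(names)
    # #PCDATA first, a flat list of alternatives, no duplicates.
    return "(" + " | ".join(["#PCDATA"] + seen) + ")*"


# "phrase" appears in both hypothetical entities, so referencing both
# directly in a mixed content declaration would produce a duplicate.
entities = {
    "inlines": ["emphasis", "phrase"],
    "links": ["link", "phrase"],
}
print(mixed_content_model(["%inlines;", "%links;"], entities))
# (#PCDATA | emphasis | phrase | link)*
```

The cost is visible in the output: the entity names are gone, which is exactly the loss of structure described above.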

In retrospect, this isn't too surprising. If you go back to the DocBook V4.x DTDs and study the parameter entity structure [“masochist” -ed], you'll find a few places where we twisted the parameter entity structure pretty hard to avoid exactly this problem.

In any event, the DTD that results from this process is an XML DTD that appears to validate DocBook documents. With different customization and overrides, the DTD version of the publishers schema also seems to work.

I'll get it out in the next day or so for wider testing. It's very likely that there are places where it's not quite right. But it's definitely an improvement over the old process.

Pipeline Notes

The process described above is not wholly unlike what I did before. One significant factor that made this attempt more successful was XProc. It's not impossible to chain together 14 transformations with a big XSLT 2.0 stylesheet and a bunch of modes, but it's a whole lot harder to manage.

Speaking of XProc, I cheated. The pipeline I'm using today will only work in XML Calabash because it relies on a compound extension step: cx:until-unchanged. That step is a little bit like p:for-each except that after each iteration it compares the input document to the result of applying the pipeline and repeats the process (using the output of one iteration as the input of the next) until the result is the same as the input.

It's not impossible to do this without extending XProc, but it requires writing a different recursive pipeline for each looping step. It was more interesting (for me) to see how hard it would be to write a compound extension step. (So shoot me.)
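The looping behavior described for cx:until-unchanged is a fixed-point iteration. Here is a minimal Python sketch of the idea, not the XProc step itself: the function name until_unchanged, the max_iterations safety limit, and the toy unwrap_once step are all assumptions made for the example.

```python
def until_unchanged(step, document, max_iterations=100):
    """Apply step repeatedly until its output equals its input.

    Returns the stable document and the number of times step ran.
    max_iterations is a safety limit added for this sketch.
    """
    for runs in range(1, max_iterations + 1):
        result = step(document)
        if result == document:
            return result, runs
        document = result  # output of this pass feeds the next one
    raise RuntimeError("no fixed point reached")


def unwrap_once(text):
    """Toy step: remove one level of doubled parentheses, if any."""
    return text.replace("((", "(", 1).replace("))", ")", 1)


print(until_unchanged(unwrap_once, "(((a)))"))
# ('(a)', 3)
```

Note that the final run is always the one that changes nothing, so a step that needs two passes to finish its work runs three times.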

By the way, if you're curious, converting the base DocBook schema to a DTD is a 40 step pipeline (more or less):

INFO: Running pipeline main
INFO: Running xslt rng2dtx
INFO: Running xslt override
INFO: Running cx:until-unchanged remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running cx:until-unchanged remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-optional-to-ref
INFO: Running cx:until-unchanged to-attref
INFO: Running xslt ref-to-attref
INFO: Running xslt ref-to-attref
INFO: Running cx:until-unchanged flatten
INFO: Running xslt flatten-attref
INFO: Running xslt flatten-attref
INFO: Running xslt flatten-attref
INFO: Running xslt attr-optional-to-decl
INFO: Running xslt multiple-gis
INFO: Running xslt remove-empty-pes
INFO: Running cx:until-unchanged pull-up
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running cx:until-unchanged unwrap
INFO: Running xslt unwrap-zeroormore
INFO: Running xslt unwrap-zeroormore
INFO: Running cx:until-unchanged sort
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt dtx2dtd

What about a better W3C XML Schema?

The DTD that results from this conversion process is a little bit unsatisfying. It's just not what a human being would do if they started from scratch. On the other hand, it doesn't matter much; there's no widespread use for DTDs beyond validation and perhaps guided authoring. (And you ought to be using RELAX NG for that, see previous comment about the twenty-first century.)

The same is not true of W3C XML Schemas. There would (just possibly, maybe) be value in having a better XSD for DocBook. There are data binding tools and other applications that would fare much, much better with DocBook if they were given something that took proper advantage of XSD's native facilities.

I've never had much interest in writing an XSD for DocBook. I'm not likely to ever be persuaded that “type inheritance” is a satisfying abstraction for how content models are related. I never felt that XSD 1.0 was a good foundation for the kind of “human prose” schemas of which DocBook is a typical example. But I'm almost convinced that XSD 1.1 has fixed some of the most inconvenient deficiencies.

My bailing wire and duct tape solution for generating DTDs doesn't seem like it's ever going to be up to the task of doing the conversion properly. I'd be delighted if the aforementioned enterprising grad student built a tool to do the conversion automatically, but if not, I just might (someday) take a crack at hand authoring a proper XSD 1.1 schema for DocBook.

Comments

>no widespread use for DTDs beyond validation and perhaps guided authoring.

Which is plenty. A lot of DTD-based systems aren't broke, so they never got fixed. People behind those systems feel pressure to move to XSD, but heard that RNG is better... and then they remember that what they have isn't broke, and the few extra features they'd like would be so much extra trouble that they stay where they are.

The day after St. Patrick's Day, let me use an Irish expression to say that for doing this, you are a friggin' saint, because this work is a key bridge to ease the transition from DTDs to RNG-based systems for a lot of people. DocBook is almost perfect for a lot of shops, and once they add a few customizations to make it perfect they still want the DTD versions of the complete schemas to work with their existing workflows, so this work really will benefit a lot of people.

—Posted by Bob DuCharme on 19 Mar 2010 @ 01:37 UTC #

The process for generating the MathML DTD (and XSD) is similar from the high level view: XSLT to simplify the relaxNG as necessary then convert to final form. One difference was that I continued to use trang for the final step rather than getting the xslt to write out DTD or XSD directly, although I wasn't always sure that was a good idea.

>It might be possible to create a DTD that allowed MathML in the appropriate places, but that's more than the minimum needed to declare victory.

Hmph you set your standards so low:-)

Hopefully though all that should be needed is a (empty) parameter entity as an extension point in the content model of the equation element and then extending that to include math from the mathml dtd should be an easy customisation for those that need it?

—Posted by David Carlisle on 19 Mar 2010 @ 11:29 UTC #

Hi Norm, I'm very interested in this subject.

We use XMetaL and have tricked it out over the years with all sorts of convenience features. Unfortunately, XMetaL only supports DTDs and XSD. They seem unlikely to add support for RelaxNG (though they did recently add basic support for xinclude).

I've been planning to move to DocBook 5.0 this summer as part of a general update of our tool chain, but hadn't realized that creating DTDs/XSDs from a DocBook 5.0 customization layer was problematic. My customization of DocBook will mainly be to turn certain features off (e.g. html tables) and adding some common attributes.

Some questions:

It wasn't clear to me (in part because I haven't yet worked with a RelaxNG schema) which parts were the manual, human parts v. the parts handled automatically by the pipeline.

Will you make the pipeline and xslts available? (The intention is for people to customize the RelaxNG and then generate a DTD, right? Not recustomize a generated-by-Norm DTD with every new DocBook 5.x release.)

Should I assume from your last sentence that a hand authored xsd 1.1 schema would be the ideal solution? (i.e. that generating an xsd from the RelaxNG would be just as messy as generating a DTD)

Thanks, David

—Posted by David Cramer on 19 Mar 2010 @ 12:39 UTC #

Shame on me. You're absolutely right, David. I'll see if I can tweak things to embed an empty PE at the point where MathML (and elsewhere, SVG) could be inserted.

I generate the DTD directly because I decided not to even try to make the intermediate (or even final) stages valid RNG. The first step actually translates out of RNG into my own ad hoc vocabulary.

—Posted by Norman Walsh on 19 Mar 2010 @ 01:21 UTC #

Nice work Norm.

I'm curious why the Publishers Subcommittee ranked DTDs high on the list. At my place of employment, a textbook publisher, the introduction of DocBook 5 resulted in a hue and cry for XSD schemas.

I ended up customizing the publishers RNG to work around the numerous UPA issues. Not ideal, but a decent compromise.

—Posted by bill on 19 Mar 2010 @ 01:31 UTC #

The manual human parts are the simplification of complex content models, like choices between two patterns for the same element and attribute co-constraints.

I checked the pipeline and all the stylesheets into the DocBook project at SourceForge last night. They won't actually run until I publish a new version of XML Calabash though. I'll try to get to that this weekend.

If XMetaL supports XSD 1.1, then a hand authored XSD is likely to be better. I don't know how much better because I don't actually have a good sense of how good or bad the mechanically produced version really is.

—Posted by Norman Walsh on 19 Mar 2010 @ 01:33 UTC #

Bill, I can certainly comment on the Publishers SC decision to provide a DTD:

Back in August, when the Publishers spec was published for Public Review, we received a comment that a DTD should be provided for the sake of better adoption. The only reason is because of tools support.

I feel equally as strongly as Norm about RelaxNG and the 21st century. In fact, I've been bugging the major tools vendors to support RelaxNG for at least 6 years! I bug them at every conference I see them at. The response is always the same: "We don't have any paying customers that are requesting this feature." I guess we need a grassroots effort or pitchforks and torches to get them to add support!

I also want to echo Bob's sentiments that Norm is a saint for taking on this frustrating and monumental effort!

Due to the difficulty to create the DTD and the amount of elapsed time, the SC decided that providing an XSD from the RNG would be an acceptable substitute, so we do have options.

—Posted by Scott Hudson on 19 Mar 2010 @ 05:09 UTC #

Why is it ever unsafe to turn interleaves into choices? I suppose that by a choice you mean (a|b|c|...)* rather than just (a|b|c|...), which is clearly not equivalent.

—Posted by John Cowan on 19 Mar 2010 @ 06:46 UTC #

Hi Norm,

You may try to retain the documentation annotations for elements and add them back in the generated DTD as documentation comments (as in the DocBook 4 DTD).

Thanks, George

—Posted by George Bina on 19 Mar 2010 @ 09:11 UTC #

Norm, is your Definitive DocBook 5 book covering the RelaxNG only? If so, will you provide a bit of content about the DTD and how to get it on your O'Reilly page?

—Posted by Dorothy Hoskins on 09 Jun 2010 @ 09:17 UTC #