<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xml:lang="en"
       version='5.0'>
<info>
<title>Creating a DocBook V5.0 DTD</title>
<volumenum>13</volumenum>
<issuenum>10</issuenum>
<pubdate>2010-03-18T15:38:13-04:00</pubdate>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2010</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>Taking another stab at the long-standing problem of producing
DTD (and XSD) versions of the DocBook V5.0 family of schemas.</para>
</abstract>
</info>

<para xml:id='p1'>In the course of preparing the DocBook V5.0 schemas,
I devised a process that would convert the DocBook V5.0 RELAX NG
grammar into an XML DTD (we get the XSD by running
<link xlink:href="http://code.google.com/p/jing-trang/">Trang</link> over the
DTD). Closer inspection quickly reveals two flaws in this
process:</para>

<orderedlist>
<listitem>
<para xml:id='p2'>It's incredibly brittle; while it successfully converts the base
DocBook schema, it's utterly useless on even a simple customization
layer.</para>
</listitem>
<listitem>
<para xml:id='p3'>It produces utterly crap DTDs.</para>
</listitem>
</orderedlist>

<para xml:id='p4'>The former problem is the one that's really causing me pain, though
I admit I'm a little embarrassed by the second.</para>

<para xml:id='p5'>The
<link xlink:href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=docbook-publishers">Publishers Subcommittee</link>
identified an XML DTD as a
requirement (Come <emphasis>on</emphasis> tool vendors! Get your act
together. It's the twenty-first fscking century, already!). My hopes
that someone else would fix the problem before I got to it went
unfulfilled, so over the last few days I've turned my attention
seriously to the problem. A few observations:</para>

<itemizedlist>
<listitem>
<para xml:id='p6'>The general problem may be insoluble. It may simply be that
there's no algorithmic path from the simple and expressive constraints
of RELAX NG to the much less expressive constraints of DTDs. Or maybe
there are several paths, but the results will never look rational to a
human observer. Or maybe the solution is out there, just waiting for
some enterprising grad student to find it (nudge, nudge). I'm not
sure. I decided that I didn't have time to look for
<emphasis>that</emphasis> solution.</para>
</listitem>
<listitem>
<para xml:id='p7'>If the computer can't solve the whole problem, then we'll have to rely
on human intervention to solve the hardest parts. In the DocBook family of
schemas, the apparent hard parts are attribute co-constraints and
multiple patterns for the same element name.</para>
</listitem>
<listitem>
<para xml:id='p8'>The solution in both cases is to replace the problematic
patterns with a single pattern that is the union of the various
options. This creates content models that are too broad: they accept
all valid documents, but they also accept some invalid documents. The union
of the CALS and HTML table models, for example, is a model in which a
<tag>tbody</tag> element can contain a mixture of CALS <tag>row</tag> elements
and HTML <tag>tr</tag> elements, among other atrocities. Such was it always
with XML DTDs.</para>
</listitem>
<listitem>
<para xml:id='p9'>The constraints on mixed content in XML DTDs are a pain in the *ss.
In a DTD, mixed content <emphasis>must</emphasis> be expressed as
<code>(#PCDATA | a | b | c | …)*</code>. The <code>#PCDATA</code>
token must come first, the alternatives must be at the top level (no nested
parentheses), and there must be no duplicates among the alternatives.</para>
</listitem>
<listitem>
<para xml:id='p10'>There are a few places where we allow extensions in “other
namespaces.” The <tag>info</tag> elements can contain arbitrary
additional metadata elements, for example, and equations can contain
any MathML markup. DTDs and namespaces do not play well together.
It might be possible to create a DTD that allowed MathML in the appropriate
places, but that's more than the minimum needed to declare victory.</para>
</listitem>
<listitem>
<para xml:id='p11'>Patterns are a little bit like parameter entities. (Ok, a really, really
little bit.) It would be nice, where possible, to represent the patterns as
parameter entities in the resulting schema. At worst, it does no harm; at
best, it makes the DTD easier to read and may allow some small amount of
customization of the DTD, not that I'd recommend that! And in any event,
it will solve at least a tiny part of the second problem mentioned above.
</para>
</listitem>
</itemizedlist>
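
<para>To make the mixed content rules concrete, here is a minimal sketch
in DTD syntax. The element names are chosen only for illustration; the
declarations are not copied from the generated DTD.</para>

<programlisting language="xml"><![CDATA[<!-- Valid: #PCDATA comes first, the alternatives form one flat list,
     no name is repeated, and the whole group is followed by "*". -->
<!ELEMENT para (#PCDATA | emphasis | link | xref)*>

<!-- Not valid: #PCDATA is not the first alternative.
<!ELEMENT para (emphasis | #PCDATA | link)*>
-->

<!-- Not valid: the alternatives are nested in a second group.
<!ELEMENT para (#PCDATA | (emphasis | link) | xref)*>
-->

<!-- Not valid: "emphasis" appears twice.
<!ELEMENT para (#PCDATA | emphasis | link | emphasis)*>
-->]]></programlisting>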

<para xml:id='p12'>With these things in mind, I decided to adopt the following
approach:</para>

<orderedlist>
<listitem>
<para xml:id='p13'>Create a “DTD” customization layer in RELAX NG that removes the
most difficult problems: create unions for the attribute co-constraints,
create unions for the elements that are defined by several patterns,
remove the elements in “other namespaces” extension points, etc.</para>
</listitem>
<listitem>
<para xml:id='p14'>Create an “override” document for describing a few more operations.
For example, removing the <code>db._phrase</code> pattern and changing
all the patterns that use it so that they use <code>db.phrase</code>
instead. (There may be an easy way to accomplish that in the RELAX NG
customization layer, but if so, it eluded me.)
</para>
</listitem>
<listitem>
<para xml:id='p15'>Massage the modified schema until it's possible to create a
DTD from it.</para>
</listitem>
</orderedlist>

<para xml:id='p16'>I'd like to say that there was some deep, theoretical insight in
the last step, but there wasn't. I just built a pipeline of
transformations that got me from A to B. I looked at the document,
found something that wouldn't work in a DTD, wrote a transformation to
remove it, and added that transformation to the pipeline. Repeat until done.
When the
result was accepted by an XML parser and accepted a small, valid
DocBook document, I called it done.</para>

<para xml:id='p17'>Here's a 10,000-foot summary of the process.</para>

<orderedlist>
<listitem>
<para xml:id='p18'>Starting with the “DTD” customization layer, perform some
simplifications. Discard documentation, schematron rules, the start
pattern, divisions, etc. Turn interleaves into choices; this is a
little risky, but seems to be ok in the DocBook family of schemas.
Extract the content of pattern definitions, producing a set of
elements and a set of “parameter entities”. Drop schema facet
constraints and <tag>notAllowed</tag> content on the floor. Etc.</para>
</listitem>
<listitem>
<para xml:id='p19'>Apply the overrides, as described above.
</para>
</listitem>
<listitem>
<para xml:id='p20'>Remove “choice” wrappers from around attributes. Fiddle with how
“optional” is expressed. In RELAX NG, it's a wrapper; for some steps
in this process, it's more convenient to make it an attribute.
</para>
</listitem>
<listitem>
<para xml:id='p21'>Remove parameter entities that are no longer referenced.
</para>
</listitem>
<listitem>
<para xml:id='p22'>Fiddle with “optional” again, moving the optionality down to the
references. (An optional reference to something is the same as a reference
to an optional something.)
</para>
</listitem>
<listitem>
<para xml:id='p23'>In subsequent steps, it's going to be convenient to be able to
distinguish references to attributes from other references, so turn
all <tag>ref</tag> elements that point exclusively to attributes into
<tag>attref</tag> elements.
</para>
</listitem>
<listitem>
<para xml:id='p24'>Flatten chains of references to attributes. (If A points to B points
to C points to D which is an attribute, then just make A point to D.)
</para>
</listitem>
<listitem>
<para xml:id='p25'>Fiddle with “optional” again. This time move the optionality up to
the attribute declaration. This may require splitting a declaration.
</para>
</listitem>
<listitem>
<para xml:id='p26'>Check for element names defined by more than one pattern. There better
not be any.
</para>
</listitem>
<listitem>
<para xml:id='p27'>Remove “empty” parameter entities and references to them.
</para>
</listitem>
<listitem>
<para xml:id='p28'>Pull “text” up. Replace any reference to a parameter entity that contains
<code>#PCDATA</code> with a copy of what the parameter entity contains.
</para>
</listitem>
<listitem>
<para xml:id='p29'>Unwrap nested “zero-or-more” elements.
</para>
</listitem>
<listitem>
<para xml:id='p30'>Sort the parameter entities so that we never attempt to use one before
it's been declared.
</para>
</listitem>
<listitem>
<para xml:id='p31'>Convert the resulting document into a DTD. Turn parameter entities into
<code>!ENTITY</code> declarations, turn elements into <code>!ATTLIST</code>
and <code>!ELEMENT</code> declarations, substitute DTD attribute types for
the specified types, expand mixed content models, etc.
</para>
</listitem>
</orderedlist>
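
<para>As a rough illustration of where those steps lead, the fragment
below shows the general shape of the output. It is a sketch under stated
assumptions, not an excerpt from the real DTD: the entity and element
names (<code>db.example.links</code>, <code>db.example.inlines</code>,
<code>example.group</code>) are hypothetical, and the declarations for
the referenced elements are left out.</para>

<programlisting language="xml"><![CDATA[<!-- A parameter entity has to be declared before anything refers to it,
     which is why the entities are sorted before the DTD is written. -->
<!ENTITY % db.example.links   "link | xref">
<!ENTITY % db.example.inlines "emphasis | %db.example.links;">

<!-- An element-only content model can keep its entity reference. -->
<!ELEMENT example.group (%db.example.inlines;)+>

<!-- Optionality ends up on the attribute declaration itself, and
     schema datatypes are replaced by DTD attribute types. -->
<!ATTLIST example.group
    role     CDATA  #IMPLIED
    linkend  IDREF  #REQUIRED>]]></programlisting>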

<para xml:id='p32'>The most disappointing part of that last step is fully expanding the
content model of every element that contains mixed content. I'd been working
pretty hard all along to preserve as many pattern names as possible as
parameter entities.</para>

<para xml:id='p33'>The problem is that even though all the relevant parameter
entities are simple lists of elements (so they could appear in a mixed
content element declaration), sometimes the same element name appears
in more than one pattern. So I punted, expanded them all, and removed
duplicates. I still think it might be possible to do better.</para>
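
<para>A small sketch of the overlap problem, using hypothetical entity
names rather than anything from the real schema:</para>

<programlisting language="xml"><![CDATA[<!ENTITY % db.example.inlines "emphasis | link">
<!ENTITY % db.example.links   "link | xref">

<!-- Not valid: once both entities are expanded, "link" appears twice
     in the same mixed content declaration.
<!ELEMENT para (#PCDATA | %db.example.inlines; | %db.example.links;)*>
-->

<!-- The fallback: expand everything and remove the duplicates. -->
<!ELEMENT para (#PCDATA | emphasis | link | xref)*>]]></programlisting>

<para>Each entity is harmless on its own; it is only the combination
that breaks the rule against duplicates, which is why the entity names
are lost in the mixed content declarations.</para>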

<para xml:id='p34'>In retrospect, this isn't too surprising. If you go back to the
DocBook V4.x DTDs and study the parameter entity structure
[“masochist” -ed], you'll find a few places where we twisted the
parameter entity structure pretty hard to avoid exactly this
problem.</para>

<para xml:id='p35'>In any event, the DTD that results from this process is an XML
DTD that appears to validate DocBook documents. With different
customization and overrides, the DTD version of the publishers schema
also seems to work.</para>

<para xml:id='p36'>I'll get it out in the next day or so for wider testing. It's
very likely that there are places where it's not quite right. But it's
definitely an improvement over the old process.</para>

<section xml:id="pipenotes">
<title>Pipeline Notes</title>

<para xml:id='p37'>The process described above is not wholly unlike what I did
before. One significant factor that made this attempt more successful
was
<wikipedia page="XML_pipeline">XProc</wikipedia>.
It's not impossible to chain together 14 transformations
with a big XSLT 2.0 stylesheet and a bunch of modes, but it's
<emphasis>a whole lot harder</emphasis> to manage.</para>

<para xml:id='p38'>Speaking of XProc, I cheated. The pipeline I'm using today will
only work in XML Calabash because it relies on a compound extension step:
<code>cx:until-unchanged</code>. That step is a little bit like
<code>p:for-each</code> except that after each iteration it compares the
input document to the result of applying the pipeline and repeats the process
(using the output of one iteration as the input of the next) until the
result is the same as the input.</para>
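
<para>A use of such a step might look roughly like the sketch below. This
is not the actual pipeline. The step names match the log output shown
later in this section, the stylesheet file name is hypothetical, and the
markup allowed inside <code>cx:until-unchanged</code> is assumed here to
resemble <code>p:for-each</code>; check the XML Calabash documentation
before relying on it.</para>

<programlisting language="xml"><![CDATA[<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Run the subpipeline repeatedly, feeding each result back in as
       the next input, until an iteration changes nothing. -->
  <cx:until-unchanged name="remove-choice">
    <p:xslt name="attr-remove-choice">
      <p:input port="stylesheet">
        <!-- hypothetical file name -->
        <p:document href="attr-remove-choice.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>
  </cx:until-unchanged>
</p:declare-step>]]></programlisting>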

<para xml:id='p39'>It's not impossible to do this without extending XProc, but it
requires writing a different recursive pipeline for each looping step.
It was more interesting (for me) to see how hard it would be to write
a compound extension step. (So shoot me.)</para>

<para xml:id='p40'>By the way, if you're curious, converting the base DocBook schema
to a DTD
is a 40-step pipeline (more or less):
</para>

<screen>INFO: Running pipeline main
INFO: Running xslt rng2dtx
INFO: Running xslt override
INFO: Running cx:until-unchanged remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running xslt attr-remove-choice
INFO: Running cx:until-unchanged remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-remove-unused
INFO: Running xslt attr-optional-to-ref
INFO: Running cx:until-unchanged to-attref
INFO: Running xslt ref-to-attref
INFO: Running xslt ref-to-attref
INFO: Running cx:until-unchanged flatten
INFO: Running xslt flatten-attref
INFO: Running xslt flatten-attref
INFO: Running xslt flatten-attref
INFO: Running xslt attr-optional-to-decl
INFO: Running xslt multiple-gis
INFO: Running xslt remove-empty-pes
INFO: Running cx:until-unchanged pull-up
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running xslt pull-up-text
INFO: Running cx:until-unchanged unwrap
INFO: Running xslt unwrap-zeroormore
INFO: Running xslt unwrap-zeroormore
INFO: Running cx:until-unchanged sort
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt sort-pe
INFO: Running xslt dtx2dtd</screen>

</section>

<section xml:id="xsd11">
<title>What about a better W3C XML Schema?</title>

<para xml:id='p41'>The DTD that results from this conversion process is a little
bit unsatisfying. It's just not what a human being would do if they
started from scratch. On the other hand, it doesn't matter much;
there's no widespread use for DTDs beyond validation and perhaps
guided authoring. (And you ought to be using RELAX NG for that; see
previous comment about the twenty-first century.)</para>

<para xml:id='p42'>The same is not true of W3C XML Schemas. There
<emphasis>would</emphasis> (just possibly, maybe) be value in having a
better XSD for DocBook. There are data binding tools and other applications that
would fare much, much better with DocBook if they were given something
that took proper advantage of XSD's native facilities.</para>

<para xml:id='p43'>I've never had much interest in writing an XSD for DocBook. I'm
not likely to ever be persuaded that “type inheritance” is a
satisfying abstraction for how content models are related. I never
felt that XSD 1.0 was a good foundation for the kind of “human prose”
schemas of which DocBook is a typical example. But I'm almost
convinced that XSD 1.1 has fixed some of the most inconvenient
deficiencies.</para>

<para xml:id='p44'>My baling wire and duct tape solution for generating DTDs
doesn't seem like it's ever going to be up to the task of doing the
conversion properly.
I'd be delighted if the aforementioned enterprising grad student built
a tool to do the conversion automatically, but if not,
I just might (someday) take a crack at hand authoring a
proper XSD 1.1 schema for DocBook.</para>
</section>

</essay>
