<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       version="pto">
<info>
<title>Ruminations on DocBook V.next</title>
<volumenum>6</volumenum>
<issuenum>19</issuenum>
<pubdate>2003-05-21</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2003</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>There comes a point in the life cycle of any system when adding
one more patch is the wrong solution to every problem. Eventually,
it's time to rethink, refactor, and rewrite. For DocBook, I think that time
has come.</para>
</abstract>
</info>
<epigraph>
<attribution>Martin Fowler</attribution>
<para xml:id='p1'><indexterm><primary>Fowler</primary><secondary>Martin</secondary></indexterm>Any
fool can write code that a computer can understand. Good
programmers write code that humans can understand.</para>
</epigraph>

<para xml:id='p2'>The DocBook TC has been kicking the idea of DocBook V5.0 around
for a long time. I think I've figured out why.</para>

<para xml:id='p3'>There comes a point in the life cycle of any system when adding
one more patch is the wrong solution to every problem. Eventually,
it's time to rethink, refactor, and rewrite. For DocBook, I think that time
has come.</para>

<section xml:id="past">
<title>Considering the Past</title>

<para xml:id='p4'>These are my recollections of how DocBook developed. I do
not claim that these are all <emphasis>facts</emphasis>, only that they are the
most factual memories that I have.</para>

<section xml:id='s1'>
<title>It Was a Long Time Ago...</title>

<para xml:id='p5'>DocBook is more than ten years old; its design stretches back to
the early 90's. Back then, men were real men, women were real women,
and <application>SGML</application> applications were really rare and
expensive. (What about XML, you ask? I'm not just talking
pre-XML here, I'm talking pre-HTML.)
</para>

<para xml:id='p6'>Hampered by the dearth and cost of commercial SGML applications,
I eventually built my first publishing system with baling wire and
duct tape instead (<application>SP</application> output and beta versions of Perl 5).
I recall struggling to get <application>SP</application> through
<command>gcc</command> so that I could get at the <acronym>ESIS</acronym> output of the
parser.</para>

</section>

<section xml:id='s2'>
<title>The Tools Were Weak</title>

<para xml:id='p7'>The limitations of tools, and the limitations of
<application>SGML</application> DTDs, were a constant influence on our
design.</para>

</section>

<section xml:id='s3'>
<title>DocBook was for Exchange</title>
<para xml:id='p8'>The original vision for DocBook was that it would be principally
an exchange DTD. Different vendors (of things like Unix and X Windows)
would all use DocBook to share content and build common documentation
libraries.</para>

</section>

<section xml:id='s4'>
<title>DocBook is a Victim of its Own Success</title>

<para xml:id='p9'>Over the years, DocBook has experienced <quote>growth by accretion.</quote>
Decisions that were made early on (like allowing some elements to have
<tag>title</tag>s both inside and outside of the info
wrappers) seemed fine at the time, when there were probably only a
handful of elements that had titles. But now those choices seem
like inconsistent warts.</para>

</section>

<section xml:id='s5'>
<title>We Stumbled Once Before</title>

<para xml:id='p10'>We're also suffering from the consequences of an earlier refactoring
attempt. The first refactoring of DocBook occurred between the
2.4.1 and 3.1 releases.
<personname><firstname>Eve</firstname><surname>Maler</surname></personname>
rationalized the parameter entity structure
and applied the methodology she developed with
<personname><firstname>Jeanne</firstname><surname>El Andaloussi</surname></personname>
for developing <application>SGML</application> DTDs<footnote>
<para xml:id='p11'><citetitle>Developing SGML DTDs: From Text to Model to Markup</citetitle>
published by Prentice-Hall PTR (1996, ISBN: 0-13-309881-8).
Out of print, but still a valuable resource if you can get your hands on one.
</para></footnote>.</para>

<para xml:id='p12'>This refactoring was necessary and valuable, but it was never
entirely complete. It left us with some pretty awkward content
models:</para>

<programlisting>&lt;!element glossterm - o
  (#PCDATA|FootnoteRef|XRef|Abbrev
  |Acronym|Citation|CiteRefEntry
  |CiteTitle|Emphasis|FirstTerm
  |ForeignPhrase|GlossTerm|Footnote
  |Phrase|Quote|Trademark|WordAsWord
  |Link|OLink|ULink|Action|Application
  |ClassName|Command|ComputerOutput
  |Database|Email|EnVar|ErrorCode
  |ErrorName|ErrorType|Filename
  |Function|GUIButton|GUIIcon|GUILabel
  |GUIMenu|GUIMenuItem|GUISubmenu
  |Hardware|Interface|InterfaceDefinition
  |KeyCap|KeyCode|KeyCombo|KeySym
  |Literal|Constant|Markup|MediaLabel
  |MenuChoice|MouseButton|MsgText|Option
  |Optional|Parameter|Prompt|Property
  |Replaceable|ReturnValue|SGMLTag
  |StructField|StructName|Symbol
  |SystemItem|Token|Type|UserInput
  |VarName|Anchor|Author|AuthorInitials
  |CorpAuthor|ModeSpec|OtherCredit
  |ProductName|ProductNumber|RevHistory
  |Comment|Subscript|Superscript
  |InlineGraphic|InlineMediaObject
  |InlineEquation|Synopsis
  |CmdSynopsis|FuncSynopsis|IndexTerm)+&gt;</programlisting>

<para xml:id='p13'>(A command synopsis <emphasis>inside</emphasis> a glossary term? Unlikely.)</para>

<para xml:id='p14'>In the intervening years, we've talked many times about
<quote>reworking the parameter entities</quote>, but we've postponed
it indefinitely as we've fixed bugs and added features.</para>
</section>
</section>

<section xml:id="present">
<title>Considering the Present</title>

<para xml:id='p15'>Today, HTML exists. A lot more developers have gotten used to the idea
of writing structured documentation. (Say what you want about the structure of most
HTML, it did expose people to the idea of putting elements and attributes in their
documents and separating structure from presentation, at least a little bit.)</para>

<para xml:id='p16'>Today, XML exists. XML has supplanted
<application>SGML</application> in every significant way. XML parsers
are nearly ubiquitous. The state of the art in tools for manipulating
XML includes powerful tools like
<application>SAX</application>, <application>StAX</application>,
various flavors of <application>DOM</application>, and things like
<application>JAXB</application>. On top of that platform, we have
<application>XSLT</application>, <application>XSL-FO</application>,
and support for transformation and rendering of XML in the
browser.</para>

<para xml:id='p17'>Today, a lot of people <emphasis>author</emphasis> in DocBook.
They do this for many reasons, and one of them is exchange, but
they aren't principally writing in some private tag set, or deep
customization of DocBook, and then converting to the standard to pass
documents to other interchange partners. They're writing directly in
standard DocBook.</para>

<section xml:id='s6'>
<title>A Modern Approach</title>

<para xml:id='p18'>If we were starting over, I think we'd approach the problem much
differently:</para>

<itemizedlist>
<listitem>
<para xml:id='p19'>We'd use XML.
</para>
</listitem>
<listitem>
<para xml:id='p20'>We'd use RELAX NG<indexterm><primary>RELAX NG</primary></indexterm>.
</para>
</listitem>
<listitem>
<para xml:id='p21'>We'd design for the web.
</para>
</listitem>
<listitem>
<para xml:id='p22'>We'd design for regularity and consistency at the current scale.
(Designing a schema of roughly 400 elements is different than
designing a schema of roughly 100.)
</para>
</listitem>
<listitem>
<para xml:id='p23'>We'd almost certainly put it in a namespace.
</para>
</listitem>
<listitem>
<para xml:id='p24'>Perhaps controversially, we might allow foreign namespace
elements to creep in. We might, for example, allow
<link xlink:href="http://www.dublincore.org/">Dublin Core</link> in metadata.
</para>
</listitem>
</itemizedlist>
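<para>To make the last two points concrete, here is a rough sketch of
what such a document might look like. Nothing here is decided: the
DocBook namespace URI is simply the one this essay's own markup
happens to use, the Dublin Core elements are the standard
<tag>dc:creator</tag> and <tag>dc:date</tag>, and the author name and
title are invented for illustration.</para>

<programlisting><![CDATA[<article xmlns="http://docbook.org/ns/docbook"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <info>
    <title>An Example</title>
    <!-- Foreign-namespace metadata. Whether V.next should allow
         this at all is still an open question. -->
    <dc:creator>Jane Author</dc:creator>
    <dc:date>2003-05-21</dc:date>
  </info>
  <para>Ordinary DocBook content, now in a namespace.</para>
</article>]]></programlisting>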
</section>

<section xml:id="principles">
<title>Design Principles</title>

<para xml:id='p25'>A good place to start would be some design principles. If 100
people are going to ask you to make 100 different changes, it's nice
to have some rules for sorting out which ones make sense and which
ones don't.</para>

<itemizedlist>
<listitem>
<para xml:id='p26'>Whatever we do, it should still look and feel like DocBook. In
all fairness, when I said <quote>starting over</quote>, I wasn't
really thinking of going back to first principles and reinventing all
the elements and content models. I think one of the goals should be
that most valid DocBook documents can be transformed into new valid
<emphasis>V.next</emphasis> documents with XSLT.</para>
</listitem>
<listitem>
<para xml:id='p27'>There are only a few kinds of elements: <tag>set</tag> and
<tag>book</tag>; divisions (<tag>part</tag> and
<tag>reference</tag>); components (<tag>preface</tag>,
<tag>chapter</tag>, etc.); formal blocks (<tag>figure</tag>,
<tag>example</tag>, etc.); blocks (<tag>para</tag>,
<tag>blockquote</tag>, etc.); and inlines.
</para>
</listitem>
<listitem>
<para xml:id='p28'>There are only three kinds of inlines: <quote>just text</quote>,
general inlines, and domain-specific inlines.
</para>
</listitem>
<listitem>
<para xml:id='p29'>All the metadata goes in an <tag>info</tag> wrapper. RELAX NG lets
us have different content models for <tag>info</tag> depending on the context.
(So it can have a required title for some elements, an optional title for others, and
a forbidden title for yet others.)
</para>
</listitem>
</itemizedlist>
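<para>The context-dependent <tag>info</tag> wrapper is easier to see
than to describe. What follows is only a minimal sketch in the RELAX NG
XML syntax, not a proposed schema. The pattern names
(<literal>info.title.required</literal> and friends), the choice of
which element gets which variant, and the stand-in
<tag>pubdate</tag> metadata are all assumptions made up for
illustration.</para>

<programlisting><![CDATA[<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://docbook.org/ns/docbook">
  <start>
    <ref name="chapter"/>
  </start>

  <!-- One element name, info, with three different content models -->
  <define name="info.title.required">
    <element name="info">
      <ref name="title"/>
      <ref name="other.metadata"/>
    </element>
  </define>
  <define name="info.title.optional">
    <element name="info">
      <optional><ref name="title"/></optional>
      <ref name="other.metadata"/>
    </element>
  </define>
  <define name="info.title.forbidden">
    <element name="info">
      <ref name="other.metadata"/>
    </element>
  </define>

  <!-- Each context picks the variant it needs -->
  <define name="chapter">
    <element name="chapter">
      <ref name="info.title.required"/>
      <oneOrMore>
        <choice>
          <ref name="para"/>
          <ref name="blockquote"/>
        </choice>
      </oneOrMore>
    </element>
  </define>
  <define name="blockquote">
    <element name="blockquote">
      <optional><ref name="info.title.optional"/></optional>
      <oneOrMore><ref name="para"/></oneOrMore>
    </element>
  </define>
  <define name="para">
    <element name="para">
      <optional><ref name="info.title.forbidden"/></optional>
      <text/>
    </element>
  </define>

  <!-- Hypothetical stand-ins for the real title and metadata patterns -->
  <define name="title">
    <element name="title"><text/></element>
  </define>
  <define name="other.metadata">
    <zeroOrMore>
      <element name="pubdate"><text/></element>
    </zeroOrMore>
  </define>
</grammar>]]></programlisting>

<para>A DTD can't express this, because a DTD allows only one
declaration per element name. RELAX NG attaches the content model to
the pattern rather than to the name, so the same <tag>info</tag> tag
can mean three slightly different things.</para>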
</section>

<section xml:id="questions">
<title>Some Open Questions</title>

<para xml:id='p30'>I expect this section to get longer as I fiddle with
instantiating an experimental <emphasis>V.next</emphasis>. These questions are in no
particular order.</para>

<qandaset>
<?dbhtml toc='0'?>
<qandaentry>
<question>
<para xml:id='p31'>Is the distinction between formal/informal useful anymore? I think it's
a holdover from the days when building a <quote>list of titles</quote> based on
whether or not the elements actually had titles was considered too hard. That's
hardly the case these days.</para>
</question>
</qandaentry>
<qandaentry>
<question>
<para xml:id='p32'>Are varying content models, such as described above for <tag>info</tag>,
harder for users to understand? My intuition is no. I don't think most users envision
things in terms of content models (<quote>Oh, this is an <tag>info</tag>
wrapper so it must (or must not) have a title</quote>). I think they envision
things in terms of more semantic structures (<quote>figures must have titles,
titles go in the <tag>info</tag> wrapper</quote>).</para>
</question>
</qandaentry>
<qandaentry>
<question>
<para xml:id='p33'>Ubiquitous linking is a no-brainer, at least on inlines. Does it make sense
on blocks too? If we're going to allow <literal>&lt;phrase href="..."&gt;</literal>,
is there any reason <emphasis>not</emphasis> to allow
<literal>&lt;chapter href="..."&gt;</literal>? And if you say <quote>yes</quote>,
what is the design principle that you use to distinguish between the two cases?
</para>
</question>
</qandaentry>
<qandaentry>
<question>
<para xml:id='p34'>Are inlines cheap? This is more of a long-term maintenance question, but
we have a large pool of inlines in DocBook and enough elements to make it hard for
new users to see what goes where. So, on one hand, adding new inlines gives better
semantic markup for the users that need those inlines. On the other hand, it's yet
more tags for new users to learn.
</para>
</question>
</qandaentry>
<qandaentry>
<question>
<para xml:id='p35'>Brace yourself. We're just about to slam squarely into the character entity
problem. No DTD means no named character entities. I think this just adds fuel to
the fire that says the right answer here is to publish some normative entity sets
separately from the DTD. Then you can include the sets you need directly:</para>

<programlisting><![CDATA[<!DOCTYPE article [
<!ENTITY % iso-lat1.ent PUBLIC
 "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
 "http://example.com/path/to/iso-lat1.ent">
%iso-lat1.ent;
]>
<article xmlns="...">
...
</article>]]></programlisting>
</question>
</qandaentry>
</qandaset>
</section>

<section xml:id='s7'>
<title>Frankly, I Like the Timing</title>

<para xml:id='p36'>The imminent release of
XSLT<indexterm><primary>XSLT</primary><secondary>2.0</secondary></indexterm> 2.0
is an ideal opportunity to
refactor the DocBook XSL Stylesheets<indexterm><primary>Stylesheets</primary>
<secondary>DocBook XSL</secondary></indexterm>. Supporting a refactored DocBook
schema at the same time makes good engineering sense.</para>
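<para>For what it's worth, the conversion side of this doesn't need
anything exotic. Assuming <emphasis>V.next</emphasis> keeps most
element names and simply moves them into a namespace (an assumption,
not a decision), the first step of migrating an existing document
could be sketched in plain XSLT 1.0 like this. The namespace URI is
borrowed from this essay's own markup. Real content model changes
would need more templates than this.</para>

<programlisting><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Recreate every un-namespaced element in the DocBook namespace,
       keeping its local name, attributes, and children -->
  <xsl:template match="*[namespace-uri() = '']">
    <xsl:element name="{local-name()}"
                 namespace="http://docbook.org/ns/docbook">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy everything else (attributes, text, comments, processing
       instructions, already-namespaced elements) unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>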
</section>
</section>

<section xml:id="future">
<title>Considering the Future</title>

<para xml:id='p37'>The XML world will continue to evolve. We should bear that in
mind. Designing so we can add new features incrementally will keep
DocBook stable and useful for another 10 years. Until the next
refactoring.</para>

<para xml:id='p38'>Herewith, some things to bear in mind.</para>

<itemizedlist>
<listitem>
<para xml:id='p39'>Using Schematron<indexterm><primary>Schematron</primary></indexterm>
assertions with existing RELAX NG grammars gives us
the ability to validate conditions that aren't easily modeled with grammar-based
languages. For example, typed links (<tag>glossterm</tag>s should only point
to <tag>glossentry</tag>s, etc.).</para>
</listitem>
<listitem>
<para xml:id='p40'>A future version of RELAX NG might give us back our exclusions.</para>
</listitem>
</itemizedlist>
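<para>As a rough illustration of the typed-link check, here is what a
Schematron rule might look like. This is a sketch under several
assumptions: it uses the ISO Schematron namespace with the default
XSLT query binding, it assumes <tag>glossterm</tag> links through a
<literal>linkend</literal> attribute, that targets are identified by
<literal>xml:id</literal>, and that everything lives in the namespace
this essay's own markup uses.</para>

<programlisting><![CDATA[<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="db" uri="http://docbook.org/ns/docbook"/>

  <pattern>
    <!-- Only glossterms that actually link somewhere are checked -->
    <rule context="db:glossterm[@linkend]">
      <!-- The target must exist and must be a glossentry -->
      <assert test="//db:glossentry[@xml:id = current()/@linkend]">
        A glossterm must point to a glossentry.
      </assert>
    </rule>
  </pattern>
</schema>]]></programlisting>

<para>A grammar can say that <literal>linkend</literal> holds an ID
reference, but not what kind of element that ID belongs to. An
assertion like this one fills that gap.</para>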
</section>

<section xml:id='s8'><title>Are You On Crack?</title>

<para xml:id='p41'>It's certainly fair to ask: should we do this at all?</para>

<para xml:id='p42'>There's <emphasis>a lot</emphasis> of legacy out there. Of
course, nothing that's suggested here will ever break that legacy.
It'll still be valid DocBook and there will still be tools that
process it. The only concern I really have on this front is how painful
it will be for users of legacy systems to move forward.</para>

<para xml:id='p43'>Maybe it would be better to just declare DocBook finished and
move on? I've pretty well convinced myself that piling on yet more
fixes is not practical.</para>

</section>

</essay>
