Implementing the Darwin Information Typing Architecture for DocBook.

I've been trying to get my head around DITA for a while now. The trouble is, DocBook isn't my day job, and the DITA spec is fairly hefty, so it's taken rather longer than I would have liked. I've also been struggling with an emotional impediment: DocBook has never really had any competition before and I don't relish the thought of fighting with anyone about it. But like it or not, DocBook and DITA are competitors, at least to the extent that both are aimed at the technical documentation market. At the end of the day, though, the markup vocabulary you choose is your business and I don't suffer if you choose not to use DocBook.

That said, I think I've got my head around DITA now and if you line DocBook and DITA up, I think DITA can point to four technical differences that are arguably features in its favor:

  1. A topic-oriented authoring paradigm.

  2. A cross-referencing scheme that's more practical than XML's flat ID space.

  3. SGML's conref, reinvented.

  4. An extensibility model based on “specialization”.

Well, heck, if that's all DITA has going for it, DocBook can do those things. :-)

Topics 

DocBook's legacy is certainly big, linear documents: it even has the word “book” in its name. But there's nothing that prevents you from writing modern, topic-oriented, highly modular documentation in DocBook. Nothing except, perhaps, the emotional weight of the tag names. “Book”, “Chapter”, “Section” all sound like monolithic, linear structures. Even “Article” feels a little bit like ink on dead trees.

Fine. I can invent a “Topic” element to fix that:

dita.topic =
   element topic {
      dita.topic.attlist,
      dita.topic.info,
      db.all.blocks*,
      db.section*
   }

The <topic> element is only half the story, though. DITA also has a complicated system for combining topics together based on map files. A map file identifies the topics that are part of a given deliverable (set of web pages, help system, etc., even a book).

No problemo. I can have map files too:

dita.map =
   element map {
      dita.map.attlist,
      dita.map.info,
      dita.topicref+
   }

dita.topicref =
   element topicref {
      dita.topicref.attlist,
      dita.topicref*
   }

There's a bit more to DITA map files than <topicref>s, but I think that's the most significant part. Other parts, such as the mechanism for tabulating relationships between topics, are equally easy to construct.
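As a rough illustration of what a processor does with such a map, here is a minimal Python sketch that walks the nested <topicref>s and lists the topics in order. The `href` attribute is an assumption borrowed from DITA (the attribute list isn't shown above), and `flatten_map` is a hypothetical helper, not part of any stylesheet.

```python
import xml.etree.ElementTree as ET


def flatten_map(map_el, depth=0):
    """Yield (depth, href) for every topicref, in document order.

    Assumes each topicref names its topic with an 'href' attribute;
    that attribute is borrowed from DITA and is not defined in the
    schema fragment shown above.
    """
    for ref in map_el.findall("topicref"):
        yield depth, ref.get("href")
        # Nested topicrefs become children of the current topic.
        yield from flatten_map(ref, depth + 1)


example_map = ET.fromstring(
    '<map>'
    '<topicref href="main.xml"><topicref href="sub.xml"/></topicref>'
    '<topicref href="task.xml"/>'
    '</map>'
)

for depth, href in flatten_map(example_map):
    print("  " * depth + href)
```

The depth value is enough to rebuild a table of contents or to decide how deeply to nest each topic in the combined output.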

Cross References 

XML IDs are required to be globally unique. In a system for reusable, modular documentation, that can be a real drag. Even assuming you can manage globally unique IDs across a large number of independent topic files, reuse can break the flat ID space.

Consider a unit of content that you might want to reuse, a <note> or <table> or something. If it has an ID, and if you pull that element into several different topics and those topics get pulled together by the map file, you're guaranteed to have the same ID appearing several times in your final, combined set of topics.

The DITA solution is clever: scoped IDs. Given that the topic is the unit of documentation, I can say that <topic>s must have globally unique IDs, but that every other element will be referenced within the scope of its containing topic. This is accomplished by inventing a fragment identifier syntax. Consider this topic:

<topic xml:id="topic1">
<info>
<title>Example Topic 1</title>
</info>
<para>Some topic content.</para>
<note xml:id="usefulnote">
<para>This note isn't really useful, but pretend it is.
</para>
</note>
</topic>

The ID/IDREF way of referring to that note would be with its ID value: <link linkend="usefulnote">this note</link>. But that's ambiguous if the <note> appears in more than one topic, so instead I use: <link xlink:href="#topic1/usefulnote">this note</link>. The semantics of this fragment identifier syntax are straightforward: find the second ID (usefulnote) inside the topic with the first ID (topic1). Then if I say that this is the fragment identifier syntax for documents in this system (i.e. with some media type that I still have to invent), I've closed the loop (web-) architecturally.
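A minimal sketch of that lookup in Python may make the semantics concrete. It assumes un-namespaced <topic> elements and a hypothetical <combined> wrapper holding the assembled topics; the real work happens in the stylesheets, so treat `resolve_fragment` as an illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_fragment(root, fragment):
    """Resolve '#topicid/elementid' (or just '#topicid') within root.

    Returns the matching element, or None if nothing matches.
    """
    topic_id, _, local_id = fragment.lstrip("#").partition("/")
    for topic in root.iter("topic"):
        if topic.get(XML_ID) != topic_id:
            continue
        if not local_id:
            return topic
        # The second ID is only searched for inside the matching topic.
        for el in topic.iter():
            if el is not topic and el.get(XML_ID) == local_id:
                return el
        return None
    return None


# <combined> is a hypothetical wrapper for topics pulled together by a map.
combined = ET.fromstring(
    '<combined>'
    '<topic xml:id="topic1"><note xml:id="usefulnote"><para>First.</para></note></topic>'
    '<topic xml:id="topic2"><note xml:id="usefulnote"><para>Second.</para></note></topic>'
    '</combined>'
)

note = resolve_fragment(combined, "#topic2/usefulnote")
print(note.find("para").text)  # prints "Second."
```

Both notes carry the same ID, yet each reference stays unambiguous because the search never leaves the named topic.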

Now, in theory, I've still got the technical problem that I have multiple xml:id attributes with the same value in the combined set of topics. I could only avoid this by using a different attribute name. But I actually think it's better to ignore this theoretical problem. In practice, what this means is that the validator will check the uniqueness of IDs as long as I validate individual topics. That's going to catch cut-and-paste errors, and I think that's worth bending the rules slightly at build time.
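To show the difference between checking a single topic and checking the combined set, here is a small Python sketch. It stands in for what a validator does with ID uniqueness and is not how any particular validator is implemented; `duplicate_ids` and the <combined> wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(scope):
    """Return the sorted xml:id values that occur more than once in scope."""
    ids = [el.get(XML_ID) for el in scope.iter() if el.get(XML_ID) is not None]
    return sorted(value for value, count in Counter(ids).items() if count > 1)


# <combined> is a hypothetical wrapper for the assembled topics.
combined = ET.fromstring(
    '<combined>'
    '<topic xml:id="topic1"><note xml:id="usefulnote"/></topic>'
    '<topic xml:id="topic2"><note xml:id="usefulnote"/></topic>'
    '</combined>'
)

# Each topic is clean on its own; only the combined tree repeats an ID.
per_topic = [duplicate_ids(topic) for topic in combined.findall("topic")]
whole_set = duplicate_ids(combined)
print(per_topic, whole_set)
```

Running the check per topic still catches a pasted-in duplicate inside one file, which is the error worth catching.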

I can implement this system by adjusting the stylesheets to understand these fragment identifiers and by turning off ID/IDREF linking:

db.linkend.attribute = notAllowed
db.linkends.attribute = notAllowed
db.endterm.attribute = notAllowed

There. That was easy.

Conref 

Conceptually, “conref” (or content reference) is a kind of cross reference. But instead of pointing to its content, it transcludes it. The practical benefit of conref is that it replaces some uses of entities or XInclude. DITA's reinvention of conref has a couple of interesting features:

  • It transcludes the content of the element it points to, but not the element itself. This means you can reuse an element without reusing its ID or other attribute values.

  • A conref must point to an element of the same type. In other words, you can conref from one <para> to another, but not from a <para> to a <note>.

Consider the useful note from above. If I wanted to reuse it in a new topic, how would I do it? I could put it in an entity and reference it in both places, or I could use XInclude. But neither of these would have the features above, so instead, I use a new conref attribute:

<note conref="#topic1/usefulnote"/>

That's easy to add to DocBook:

db.common.attributes &= dita.conref.attribute?
db.common.idreq.attributes &= dita.conref.attribute?

An additional semantic of conref is that an element with a conref attribute must be empty. Although RELAX NG could be persuaded to enforce that constraint, it seems tedious to do so for a common attribute. Instead, I'll eventually rely on Schematron assertions to test for that (I haven't written them just yet). In the meantime, I've made the stylesheet that performs the transclusion enforce that constraint.
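The three rules (same element type, empty referring element, content copied without the target's attributes) can be sketched in Python as follows. This is an illustration under the same assumptions as the topic example earlier, with a hypothetical <combined> wrapper and hypothetical helper names; it is not the stylesheet code.

```python
import copy
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_target(root, conref):
    """Find the element named by a '#topicid/elementid' reference."""
    topic_id, _, local_id = conref.lstrip("#").partition("/")
    for topic in root.iter("topic"):
        if topic.get(XML_ID) == topic_id:
            for el in topic.iter():
                if el.get(XML_ID) == local_id:
                    return el
    return None


def apply_conref(el, root):
    """Replace el's (empty) content with a copy of its conref target's content."""
    target = find_target(root, el.attrib["conref"])
    if target is None:
        raise ValueError("conref target not found")
    if target.tag != el.tag:
        raise ValueError("conref must point to an element of the same type")
    if len(el) or (el.text or "").strip():
        raise ValueError("an element with conref must be empty")
    # Copy the content only; the target's xml:id and other attributes stay behind.
    el.text = target.text
    el.extend([copy.deepcopy(child) for child in target])
    del el.attrib["conref"]


combined = ET.fromstring(
    '<combined>'
    '<topic xml:id="topic1"><note xml:id="usefulnote"><para>Pretend.</para></note></topic>'
    '<topic xml:id="topic2"><note conref="#topic1/usefulnote"/>'
    '<para conref="#topic1/usefulnote"/></topic>'
    '</combined>'
)

reused_note = combined.findall("topic")[1].find("note")
apply_conref(reused_note, combined)
print(ET.tostring(reused_note, encoding="unicode"))
```

The <para> that points at a <note> in the example is deliberately invalid; calling `apply_conref` on it raises an error rather than transcluding mismatched content.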

Specialization 

DITA's extensibility mechanism is perhaps its most clever invention. While it's easy to extend DocBook, for example, to add a new element, doing so introduces an interoperability problem.

Suppose you invent a new kind of list, a product list. Imagine that the important semantic of a product list on your system is that products named in a product list are automatically verified against a manifest. In all other respects, it's just a regular ordered list.

The DocBook way to do this in a portable manner is with the role attribute:

<orderedlist role="productlist">
<listitem><para>1 <productname>oscillation overthruster</productname>
</para></listitem>
<listitem><para>4 <productname>#11 screws</productname>
</para></listitem>
<listitem><para>1 <productname>watermelon</productname>
</para></listitem>
</orderedlist>

The problem is, if the reason you're inventing the new kind of element is to give it a slightly different content model, this approach doesn't really work. (In fact, you can make it work in RELAX NG, but it'd be really ugly for authors.)

What you'd like to do instead is just invent a new tag, <productlist>, and use that:

<productlist>
<listitem><para>1 <productname>oscillation overthruster</productname>
</para></listitem>
<listitem><para>4 <productname>#11 screws</productname>
</para></listitem>
<listitem><para>1 <productname>watermelon</productname>
</para></listitem>
</productlist>

Now the problem is, if you want to format that element, you have to modify the stylesheets and if you want to interchange your topics with others, they all have to have your stylesheet customizations too.

DITA overcomes this by describing extensions in terms of specialization or subtyping. When you invent a new element, you also say what kind of element it specializes. When the stylesheets (or other tools) don't know what to do with your special element, they can automatically treat it as if it was the more general element that it specializes.

The DITA mechanism for accomplishing this is an ingenious, if elaborate, system of fixed attribute values in the DTD. This leads to odd looking stylesheets that almost exclusively use patterns of the form:

<xsl:template match="*[contains(@class, 'some/value ')]">
  ...
</xsl:template>
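The fallback behaviour behind those patterns can be sketched in a few lines of Python. The sketch assumes DITA class values are whitespace-separated module/element tokens ordered from most general to most specific, as in `- topic/li task/step `; the handler table and `pick_handler` are hypothetical, and real DITA processors get the same effect through template matching rather than a lookup like this.

```python
def pick_handler(class_attr, handlers):
    """Return the handler for the most specific class token that has one."""
    # Keep only the module/element tokens; the leading marker is skipped.
    tokens = [token for token in class_attr.split() if "/" in token]
    for token in reversed(tokens):
        if token in handlers:
            return handlers[token]
    return None


# A processor that only knows about base list items...
base_only = {"topic/li": "format as list item"}
# ...and one that also knows about the specialization.
with_steps = {"topic/li": "format as list item", "task/step": "format as step"}

step_class = "- topic/li task/step "
print(pick_handler(step_class, base_only))
print(pick_handler(step_class, with_steps))
```

With only the base handler available, the specialized element is still formatted, just as its more general ancestor.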

In addition to a sort of baroque scheme for implementing this in DTDs, DITA also appears to have the limitation that specializations must be isomorphic to something in the base system. That, in turn, forces some of the elements in the base system to have…interesting content models.

Consider DITA's <topic> for example. The content model of a topic body is “(p|note|...|section|example)*”. On the face of it, that allows topics to contain a free mixture of sections and paragraphs, which one wouldn't ordinarily consider a good thing. I gather that this is necessary so that some specialization of <topic> can have an element that's required to occur last (after <section>), that is itself a specialization of <p>. But I could be wrong about that.

Anyway, the idea of specialization is useful and interesting, and I can accomplish the same thing on top of DocBook by taking advantage of annotations in the schema. In broad strokes:

  1. I add annotations to the RELAX NG grammar for the extensions. These annotations describe how to transform each new element back to some base element in DocBook.

  2. I add a parameter to the stylesheets so that they can know what schema is being used for the document. This is conceptually no different than the DITA case where the DTD for the extension is, in practice, required.

  3. The stylesheets already have a “normalization” phase that adjusts content in the source document; I extended that phase to include handling “unknown” elements by mapping them back to DocBook as described by the annotations.

So all you have to do is add the annotation to your extension:

dita.productlist =
   [
      r:remap [ db:orderedlist [] ]
   ]
   element productlist {
      dita.productlist.attlist,
      dita.productlist.info,
      db.all.blocks*,
      db.listitem+
   }

And you're done. A <productlist> will be treated exactly like an <orderedlist>.

Although I didn't show it above, this technique is used in the definition of <topic> to map topics back to sections. And for my DITA experiment, where I created DocBook <task>, <concept>, and <reference> specializations of <topic>, I used exactly the same technique. A <task> is remapped to a <topic> if there's no template for <task>, which is, in turn, remapped to a <section>, if there's no template for <topic>.
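The chain of fallbacks can be sketched in Python like this. The `REMAP` table is a hand-written stand-in for what the stylesheets would read from the schema annotations, and `effective_name` is a hypothetical helper; only the element names come from the text above.

```python
# Hand-written stand-in for the remap annotations in the schema.
REMAP = {
    "task": "topic",
    "concept": "topic",
    "reference": "topic",
    "topic": "section",
    "productlist": "orderedlist",
}


def effective_name(name, known, remap=REMAP):
    """Follow remap annotations until reaching an element with a template.

    'known' is the set of element names the stylesheet can format.
    """
    seen = set()
    while name not in known:
        if name in seen or name not in remap:
            raise LookupError("no template and no remapping for " + name)
        seen.add(name)
        name = remap[name]
    return name


# A stylesheet with no special templates falls all the way back to DocBook.
print(effective_name("task", {"section", "orderedlist"}))           # prints "section"
# A stylesheet that knows about topics stops one step earlier.
print(effective_name("task", {"topic", "section", "orderedlist"}))  # prints "topic"
```

Adding a template for <task> later changes nothing else: the lookup simply stops at the first name that has one.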

Using the annotation technique, there's no requirement that extensions be isomorphic to something already in DocBook, though that's the simplest case. Consider the DITA <relatedlinks> tag that can occur at the end of a topic. Suppose you wanted to turn this list of links into a section with a default title? You can use a slightly more complicated remap annotation to accomplish that:

dita.relatedlinks =
   [
      r:remap [
         db:section [
            role="dita-relatedlinks"
            db:info [
               db:title [ "Related Links" ]
            ]
            db:para [
               r:content []
            ]
         ]
      ]
   ]
   element relatedlinks {
      dita.relatedlinks.attlist,
      dita.relatedlinks.info,
      db.link.inlines+
   }

That annotation will wrap the body of the <relatedlinks> element inside a <para> inside a <section> with the <title> “Related Links”.
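The effect of that wrapping remap can be sketched in Python. The function below hard-codes the structure that the annotation describes instead of interpreting the annotation itself, so it is only a picture of the result; `remap_relatedlinks` is a hypothetical name and the elements are un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET


def remap_relatedlinks(el):
    """Wrap the content of <relatedlinks> as section/info/title + para."""
    section = ET.Element("section", {"role": "dita-relatedlinks"})
    info = ET.SubElement(section, "info")
    ET.SubElement(info, "title").text = "Related Links"
    # The para takes the place of r:content: it receives the original body.
    para = ET.SubElement(section, "para")
    para.text = el.text
    para.extend(list(el))
    return section


links = ET.fromstring(
    '<relatedlinks><link href="a.xml">A</link>, '
    '<link href="b.xml">B</link></relatedlinks>'
)
section = remap_relatedlinks(links)
print(ET.tostring(section, encoding="unicode"))
```

After the remapping, the ordinary templates for <section>, <title>, and <para> do all the formatting.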

The extent of the transformations that you can do today is fairly limited (isomorphism or wrapping the content in some structure). But if I imagine a world in the not too distant future where there's a standard XML Pipeline language for processing a sequence of transformations, it's easy to imagine that XSLT templates could be used as annotations, giving extension writers almost complete freedom.

What's Left? 

With a couple of hours of hacking, I've implemented on top of DocBook the four key features of DITA that I could identify. (If there are more, bring them on, DocBook can do them too!) In doing so, I've attempted to remain true to the spirit of DocBook, so my content models aren't exactly the same as the DITA models, but I think the analogies are sound.

That means the choice of which vocabulary to use, DocBook or DITA, comes down simply to the actual terms in the vocabulary, the elements and attributes provided, their semantics, and their relationships to each other. On that score, I think DocBook is the hands-down winner.

But I was bound to say that, wasn't I?

The Source 

My experiment to implement DITA on top of DocBook includes:

A schema (RNG or RNC)

The schema is a DocBook 5.0 extension that defines a new top-level element, the <topic>. In the interest of modelling DITA, it also defines a <task> with the same general structure as a DITA task, a <concept>, and a <reference> as specializations of <topic>.

I really don't understand the structure of a DITA <task> with its body elements that are just like paragraphs. How a technical vocabulary could expect every task to have pre- and post-requisites, a context, a result, and (a single!) example such that each fits into a single paragraph is beyond me. If there's ever a move to standardize my DITA customizations of DocBook, I think <task> can be done better. (There's also the issue of the existing, distinct <task> element already in DocBook, but that's a different problem.)

A stylesheet

The stylesheet is a customization of the DocBook XSLT2 Stylesheets. It handles the semantics of the simple map files I outlined above, supports conref, and implements the DITA fragment identifier syntax. I incorporated the schema support into the base stylesheets.

An example

My example is just a toy, but it has several parts: a map, a “main” topic, a “subordinate” topic, and a task.

Run them through the stylesheets and you get a “normalized” document which is formatted as you'd expect.

Of all the pieces involved, supporting a more robust map file is probably the most interesting. But it wouldn't be difficult.

Comments:

i think you've got the main advantage of using DITA. it's the conceptual approach (information based, specialization based)

having this in mind and applying this to docbook you see that docbook is the opposite of this. output based, generic based

in DITA you only get a small set of vocabulary and some example specialization from a software company (ibm). each company planning to use DITA has to focus on its particular, special requirements and enhancements on a limited set of basic concepts without losing interoperability with other DITA users. in DocBook you have to focus on not needed requirements which is much harder to do in my point of view.

the specialization provided by DITA OT is very related to software domain of course, that's why the default task looks the way it looks ;-) but you have the choice to create your own task based on default topic....

in detail there are many more differences between DITA and DocBook and there are of course use-cases for both of them. but to be honest the DITA approach is much more scalable by design than DocBook ever was.

Posted by alex witzigmann on 21 Oct 2005 @ 08:30pm UTC #

A way of handling hierarchical IDs, less hairy than creating your own media type and fragment id standard, would be to make the IDs hierarchical at the point of definition as well as the point of use. Using _, -, ., or · (MIDDLE DOT) as the hierarchy delimiter would make it straightforward to verify the correctness of the hierarchy using XSLT.

Posted by John Cowan on 21 Oct 2005 @ 10:44pm UTC #

Oh, sure, there are lots of ways to manage IDs. Part of the pleasure of this exercise was working out how to implement some of the features of DITA. If DocBook wants to adopt some or any of these ideas in principle, then we can look at the technical solutions on their own merits.

Posted by Norman Walsh on 21 Oct 2005 @ 10:54pm UTC #

In addition to conref, isn't DITA re-inventing architectural forms? I'll also have to bring up Information Mapping, the Unification Church of granular content architecture. (Sorry, I just read Andrew Orlowski pointing out that if Web 2.0 is people, it has a lot in common with Soylent Green, so I'm free associating.)

It was interesting to see how much DITA comes up in the schedule for XML 2005, so we'll all be talking about it there.

Bob

Posted by Bob DuCharme on 24 Oct 2005 @ 01:14am UTC #

Yes, the system of fixed attributes in the DTD is, if not actually architectural forms, very much like architectural forms.

Posted by Norman Walsh on 24 Oct 2005 @ 02:50am UTC #

That's an impressive plunge into some of the core DITA principles, Norm. Also, that's a good demonstration of DocBook's well-designed customization mechanisms — that it's possible to create a hybrid with some DocBook features and some DITA features.

To fully embrace specialization and the topic paradigm in DocBook would take more effort (of course) than adding an attribute that's equivalent to the DITA class attribute and a tree structure that's equivalent to the DITA map. The DocBook committee would want to check each of the DocBook elements to make sure that the assumptions are still valid when topics are reused in many contexts. For instance, the committee would want to think hard about managing links outside of the topic content so embedded links don't become a constraint on reuse. The committee would want to look seriously at loosening up the content models to enable specialization. The committee would want to look at reorganizing the DocBook schemas as pluggable modules, possibly refactoring some of the existing elements (for instance, the inline and reference elements) as specializations. The DocBook transforms would have to be rewritten so the processing of base elements applies to specialized elements. In every change, of course, the committee would want to manage backward incompatibility.

The DocBook committee certainly has the ability and could take the time to do that work, but I'd ask whether adding topic orientation to DocBook really applies the limited resources of the DocBook (and for that matter DITA) committees to the best advantage of our communities.

Instead, would it be better to implement processing strategies that encourage interoperability between DocBook and DITA?

* Specializing the DITA elements with DocBook element names so people can create topics that look a lot like simplified DocBook sections and refsections, are valid DITA specializations, and are 100% interoperable between the two vocabularies (meaning that a roundtripping transform is possible for the simplified content).

* Supporting DocBook books that include content by reference from a DITA map. If we had a high-fidelity DITA-to-DocBook output transform, we could preprocess DITA topics to produce DocBook content and then process the result with DocBook tools.

* Supporting references from a DITA map to DocBook articles. If we had a high-fidelity DocBook-to-DITA output transform, we could preprocess the DocBook articles to produce DITA and then process the result with DITA tools.

That way, users could author DocBook-like topics, still take advantage of other DITA specializations, pull DocBook articles into DITA outputs, and pull DITA topics (including DocBook-like topics) into DocBook outputs. Better for everyone, no?

Posted by Erik Hennum on 24 Oct 2005 @ 02:58am UTC #

Even though it seems feasible to customize DocBook to cover most of DITA's features, it seems to me that DocBook is not particularly suited for topic-driven content. As indicated above, "DocBook's legacy" (coming from TeX perhaps) makes it not a natural fit for "topic driven" content.

Posted by Javad K. Heshmati on 10 Mar 2006 @ 12:27pm UTC #

I just flatly, unflinchingly, completely and entirely disagree. Given a <topic> element that lacks the traditional semantics of occurring in a sequential flow (as chapter and section have, for example), there's absolutely nothing that I can think of that makes DocBook less suitable.

In fact, DocBook already has such an element, <article>, but that doesn't seem to suit the topic-oriented proponents. Perhaps because it can't obviously be subclassed into specific types of article.

The idea of building non-linear structures with DocBook and pulling them together with an explicit map file has existed, in practice, since at least 2001 when the "Website" doctype was released.

Posted by Norman Walsh on 10 Mar 2006 @ 12:51pm UTC #

I've been struggling with this question for a real implementation for some time. While I agree that DocBook can be made to behave like DITA, I really think there is a fundamental philosophical difference between these two ways of creating and thinking about technical content. To use some crude analogies, you can eat pasta with your fingers or go off roading in a sports car too. But these are not the best tools for the job.

The tag names in DocBook carry more than emotional weight; they carry meaning, and that meaning is very specifically tied to the ancient, venerable, honorable and glorious book. DITA is designed for the creation of modular topics, not as an overlay but as its core purpose. It is designed to move us away from the book as the primary entity through which technical knowledge is conceived, created and disseminated.

I was involved in a pilot in the late 90s where we invented something very similar to DITA, and have spoken to writers from other companies who did the same thing. People were independently inventing these things because DITA addresses a real and pressing need. And as we move increasingly into an online world where we digest information in screen-sized chunks, I believe the logic behind structuring information in a way that is not tied to the book and that intrinsically addresses the non-linearity of online navigation will become increasingly compelling.

All that said, for my real world implementation, I am leaning towards DocBook, simply because my company’s existing technical library is … books. So, in my view, the adoption of DITA over DocBook will be a slow process. But then, so was the introduction of utensils for eating.

Posted by Bob Murray on 26 May 2006 @ 08:06pm UTC #