One Namespace or Many?

Volume 6, Issue 35; 11 Jun 2003

One approach to simplifying a markup vocabulary is to divide it into discrete pieces. Rather than defining a single large vocabulary that is the union of all things, define a set of modules that can be combined into conformant variations. Some would suggest that the right implementation of that design is to put the modules in separate namespaces. I'm not convinced.

Nearly every complex solution to a programming problem that I have looked at carefully has turned out to be wrong.

—Brent Welch

I'm slowly sifting through the responses to my earlier comments about DocBook. One of the consistent themes is that DocBook should be smaller and simpler. As it currently stands, DocBook defines a large number of elements, and most users, for most documents, use only a small subset of them. The argument goes: rather than having something big from which it is possible to remove components, why not have something small to which it is possible to add components?

Part of the reason is that subsetting is always “safe”. If DocBook consists of A, B, C, D and E and you remove C and E, you know that your new schema that consists of A, B, and D is still valid DocBook: it is not possible to write a document that conforms to your schema that doesn't also conform to DocBook. And if DTDs are your normative representation, subtraction is easy to define. Parameter entities don't provide any kind of constraint; they're really just string substitution, so if you attempt something additive, there's nothing to help you make sure that you aren't breaking something. I'm not sure that you couldn't devise an additive approach with DTDs, but in any event, DocBook didn't.

A little experimentation with RELAX NG suggests that the additive approach is feasible.

Consider this schema for documents:


    # A simple document schema

    start = doc

    doc      = element doc { blocks+ }

    blocks   = para|verbatim|ext.blocks
    para     = element para     { (text|inlines|blocks)* }
    verbatim = element verbatim { (text|inlines)* }
    ext.blocks = notAllowed

    inlines  = text|emphasis|literal|ext.inlines
    emphasis = element emphasis { (inlines)* }
    literal  = element literal  { (inlines)* }
    ext.inlines = notAllowed

It says informally that a document is a doc that contains blocks (para or verbatim elements), blocks contain blocks or inlines (emphasis or literal elements) and inlines contain text or inlines.

The extra definitions, ext.blocks and ext.inlines, are hooks for future extensions. In this schema, they're defined as “not allowed” so they never match anything.

Here's a test document in our simple document schema:


    <?xml version="1.0" encoding="UTF-8"?>
    <doc xml:lang="en">
    <para>Some text.</para>
    <para>Some <emphasis>more</emphasis>
    text.</para>
    </doc>

Now suppose that you wanted to add a new kind of markup to your documents: for the sake of argument, EBNF-style productions like those in the XML Recommendation.

You could define a separate vocabulary for those elements. It might look something like this:

    # A simple EBNF-style production schema

    prod = element prod { lhs, rhs }
    lhs  = element lhs { text }
    rhs  = element rhs { (stuff|nt)* }
    nt   = element nt { text }

At this point, several readers suggest that the correct thing to do is put this production markup in a separate namespace. Apparently they imagine a world where there is a core DocBook schema and additional expansion modules, each defined in its own namespace.

Technically, that's entirely feasible. But it raises at least two issues.

First, it adds significantly to the burden on authors. Not only are they required to learn the names of the elements that they are going to use, but also the namespaces that those elements are in. This becomes a pervasive problem and not just a matter of declaring the right namespaces at the top of the document.

In a mixed namespace world, we end up with markup like this:
```
<?xml version="1.0" encoding="UTF-8"?>
<doc xml:lang="en" xmlns="http://example.org/xmlns/doc" xmlns:p="http://example.org/xmlns/prod">
  <para>Some text.</para>
  <p:prod>
    <p:lhs>expr</p:lhs>
    <p:rhs>
      <p:nt>number</p:nt> + <p:nt>number</p:nt>
    </p:rhs>
  </p:prod>
</doc>
```
Alternatively, we could declare and redeclare the default namespace several times to avoid the prefixes. But with either approach, we've made the author's job harder.

Second, how are we going to define the content models of elements in which we want to allow DocBook elements? What is the stuff in the production schema above? The requirement is to allow authors to write DocBook inlines there, but how are we going to define it in the schema?

I see two possibilities:
1. Use an “any” content model. This is the generic approach. It says that productions have this general size and shape, but you're allowed to put your own stuff here.
  
  I think this is entirely unacceptable. If we say that anything can go there, authors will be able to put DocBook elements there, that's true, but they'll also be able to put any other element there (not to mention any DocBook element, even book, not just inlines). Proper interpretation of documents requires that the reader know what to expect. A schema for document markup should be tightly constrained.
2. The other possibility is to explicitly put the DocBook inlines there. In fact, I think this is the only practical answer. But then, what have we gained from changing namespaces? We've got our productions in a separate namespace, but they have DocBook around them and DocBook inside them. I suppose we've gained the ability to define a “prod” element in some other extension module without any name collisions. But we aren't really going to do that anyway, are we?
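
To make the two possibilities concrete, here is a sketch of each in RELAX NG compact syntax, with the productions in a separate namespace. The namespace URI is the one from the mixed-namespace document above; the file names and the anyElement pattern name are hypothetical, chosen only for illustration.

```rnc
# prod-any.rnc (hypothetical name): possibility 1, an "any" content model
namespace p = "http://example.org/xmlns/prod"

prod = element p:prod { lhs, rhs }
lhs  = element p:lhs { text }
rhs  = element p:rhs { (text | nt | anyElement)* }
nt   = element p:nt { text }

# Matches any element in any namespace, with any attributes and any content
anyElement = element * { (attribute * { text } | text | anyElement)* }
```

```rnc
# prod-ns.rnc (hypothetical name): possibility 2, explicit DocBook inlines
namespace p = "http://example.org/xmlns/prod"

# "inlines" is not defined here; the including grammar must supply it
prod = element p:prod { lhs, rhs }
lhs  = element p:lhs { text }
rhs  = element p:rhs { (inlines | nt)* }
nt   = element p:nt { text }
```

The first version accepts a whole book inside rhs just as readily as an emphasis, which is the objection raised above. The second is as tightly constrained as the core schema, but it only makes sense when the including grammar defines inlines, so the module is tied to DocBook regardless of its namespace.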

I think the right answer, in this case, is to leave the modules in the DocBook namespace. For example, here's the production schema again, this time leaving the definition of inlines to the caller. (Some readers will no doubt observe that the “caller must define ‘inlines’” technique would work equally well with the productions in a separate namespace. That's why I said it was technically feasible. I just don't think the benefits outweigh the costs.)


    # A simple EBNF-style production schema
    #
    # This module is designed to be included.
    # It is the responsibility of the including
    # grammar to define the "inlines".

    prod = element prod { lhs, rhs }
    lhs  = element lhs { text }
    rhs  = element rhs { (inlines|nt)* }
    nt   = element nt { text }

And here's a driver that combines productions with the core document schema to produce a new extended schema:


    # A schema that combines doc and prod

    include "doc.rnc"
    include "prod.rnc"

    ext.blocks |= prod
    ext.inlines |= nt

By extending the extension hooks with |=, we allow prod to occur anywhere that blocks can occur and nt to occur anywhere that inlines can occur. This works by making the definition of each hook a choice between notAllowed and an element.
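
As an illustration of what that choice amounts to, the sketch below writes out the hook definitions as they stand once the driver's combinations are applied. It is not a file that exists anywhere in the schema; it only spells out the result, relying on the RELAX NG rule that a choice with notAllowed simplifies to the other alternative.

```rnc
# What the combined grammar amounts to (illustrative only)
ext.blocks  = notAllowed | prod     # simplifies to: prod
ext.inlines = notAllowed | nt       # simplifies to: nt

# so the patterns in doc.rnc now behave as if they had been written:
blocks  = para | verbatim | prod
inlines = text | emphasis | literal | nt
```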

This test document conforms to the combined schema:


    <?xml version="1.0" encoding="UTF-8"?>
    <doc xml:lang="en">
    <para>Some text.</para>
    <prod>
      <lhs>expr</lhs>
      <rhs><nt>number</nt> +
           <nt>number</nt>
      </rhs>
    </prod>
    <para>Some <emphasis>more</emphasis>
    text.</para>
    </doc>

Yet another of the nice features of RELAX NG is that this module can be further combined with other modules to build the “core document plus the extension modules we need” schema. For example, if there were some additional “function synopsis” schema:


    # A simple function schema
    #
    # This module is designed to be included.
    # It is the responsibility of the including
    # grammar to define the "inlines".

    funcsyn   = element funcsyn  { function, param* }
    function  = element function { (inlines)* }
    param     = element param    { (inlines)* }

We could build a “core document plus function synopsis schema” analogous to the core document plus productions schema:


    # A schema that combines doc and func

    include "doc.rnc"
    include "func.rnc"

    ext.blocks |= funcsyn
    ext.inlines |= function

This schema validates documents that use functions. But the important feature of this approach is that we can build a “core document plus productions plus function synopsis schema” this way:


    # A schema that combines doc and prod and func

    include "doc+prod.rnc"
    include "func.rnc"

    ext.blocks |= funcsyn
    ext.inlines |= function

That schema validates documents like this one:


    <?xml version="1.0" encoding="UTF-8"?>
    <doc xml:lang="en">
    <para>Some text.</para>
    <prod>
      <lhs>expr</lhs>
      <rhs><nt>number</nt> +
           <nt>number</nt>
      </rhs>
    </prod>
    <para>A <function>function</function>.</para>
    <funcsyn>
      <function>function</function>
      <param>param1</param>
    </funcsyn>
    <para>Some <emphasis>more</emphasis>
    text.</para>
    </doc>
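
Layering one driver on top of another is not the only way to get there. As a sketch, assuming the module files are named doc.rnc, prod.rnc, and func.rnc as in the drivers above, a single flat driver could include all three directly and give each hook both alternatives at once. The file name in the comment is hypothetical.

```rnc
# doc+prod+func.rnc (hypothetical name): a flat driver

include "doc.rnc"
include "prod.rnc"
include "func.rnc"

# Each hook gains one alternative per extension module
ext.blocks  |= prod | funcsyn
ext.inlines |= nt | function
```

This should accept the same documents as the layered version, since in both cases each hook ends up as a choice among notAllowed and the elements contributed by the two modules.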

The extension mechanisms in RELAX NG appear to make the problem of maintaining consistency across the various modular extensions manageable.

I think using a single namespace, instead of many, will make the problem of authoring in a modular world manageable as well.