Drop the <!DOCTYPE>

Volume 9, Issue 4; 06 Jan 2006; last modified 08 Oct 2010

If we're going to drop the document type declaration, we need to provide something that behaves like entity expansion. With a little XSLT 2.0, that's not hard. With a pipeline language, we could even do it in a standard way.

Tim says we should drop it. I've expressed sympathy for that position. But Richard fired right back: “Absolutely not!”

Richard goes on to point out that until “there's some other lightweight macro-like facility, DTDs are essential.” I'm not sure I'd go so far as “essential”; I think I could live without them, but it wouldn't be pleasant. For an example of why, take a look at the top of your average W3C specification authored in XML Spec:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE spec SYSTEM "http://www.w3.org/2002/xmlspec/dtd/2.10/xmlspec.dtd" [
<!ENTITY draft.DD "05">
<!ENTITY draft.MM "01">
<!ENTITY draft.day "5">
<!ENTITY draft.month "January">
<!ENTITY draft.year "2006">
<!ENTITY iso6.doc.date "&draft.year;-&draft.MM;-&draft.DD;">
<!ENTITY http-ident "http://example.org/TR/NOTE-example">
]>
<spec w3c-doctype='note'>
<header>
<title>Example Specification</title>
<version>Version 1.0</version>
<w3c-designation>&http-ident;-&iso6.doc.date;</w3c-designation>
<w3c-doctype>W3C NOTE</w3c-doctype>
<pubdate>
<day>&draft.day;</day>
<month>&draft.month;</month>
<year>&draft.year;</year>
</pubdate>
<publoc>
  <loc href="&http-ident;-&iso6.doc.date;">&http-ident;-&iso6.doc.date;</loc>
</publoc>
<altlocs>
  <loc href="&http-ident;.XML">XML</loc>
</altlocs>
<latestloc>
  <loc href="&http-ident;">&http-ident;</loc>
</latestloc>
…

You don't need all those entities, but keeping all the date-related URIs and publication metadata accurate sure would be more tedious without them. Especially when you consider that as a specification develops it gathers a collection of “previous locations” which all have dates too, so the header becomes a real date soup.

And, of course, you don't need entity expansion to accomplish this. You could use m4 or cpp or any other text replacement tool, even simply sed. But those tools aren't XML-aware and really, you'd like to do this in an XML-aware fashion. (You don't want to do the replacement in the middle of an element name or produce well-formedness errors.)

My solution to this problem was to whip up a little XSLT to do the substitution. The stylesheet, ml-macro.xsl (I was very tempted to use “xml-macro”, but “xml” is a reserved prefix and my ego isn't quite big enough to willfully break that rule), searches for macro names, delimited by two regular expressions, in attribute values and text content, and (recursively) expands them. Macros can be defined in the source document, in an external macro file, or directly in the stylesheet. The last option can be used to build dynamic replacement text, for example, the current date and time.

For my document collection “[[” and “]]” are reasonable delimiters, so I made them the default. You can change them, even on a per-document basis.
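As a rough sketch of a per-document override, a document that prefers “{{” and “}}” might start like this. It uses the delimiter processing instructions described below and assumes the PI content is taken verbatim as the regular expression, which is what the stated defaults suggest; the doc and loc elements are only illustrative.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Assumption: the PI content is the regular expression itself.
     Braces are escaped because they are special in regular expressions. -->
<?ml-macro-odre \{\{?>
<?ml-macro-cdre \}\}?>
<?ml-macro name="http-ident" text="http://example.org/TR/NOTE-example"?>
<doc>
  <!-- Macros are expanded in attribute values and in text content -->
  <loc href="{{http-ident}}">{{http-ident}}</loc>
</doc>
```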

The stylesheet recognizes the following constructs:

<?ml-macro name="macroname" text="replacement text"?>: Defines the macro “macroname” with the replacement text “replacement text”. The replacement text may contain other macros, but they must not be used recursively. (Yes, I'm using processing instructions. I think they're the right tool for this job. If PIs offend your aesthetic sensibilities, get over it.)
<?ml-macro href="someURI"?>: Loads macros defined externally in “someURI”. That document should consist of an ml:collection element containing one or more ml:macro elements. Each ml:macro element has a mandatory name attribute containing the name of the macro. The content of the element is the replacement text. In this case, the replacement text can be any well-formed XML fragment, including element content. The replacement text may contain other macros, but they must not be used recursively.
<?ml-macro-odre?>: Defines the open delimiter regular expression. The default is effectively <?ml-macro-odre \[\[?>.
<?ml-macro-cdre?>: Defines the close delimiter regular expression. The default is effectively <?ml-macro-cdre \]\]?>.
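
To make the external form concrete, a macro file might look something like the following. This is only a sketch: the namespace URI bound to the ml prefix is a placeholder (the real one is whatever ml-macro.xsl declares), and the file name, the “status” macro, and the emph element are made up for the example.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- macros.xml: a hypothetical external macro file.
     The namespace URI is a placeholder, not the stylesheet's real one. -->
<ml:collection xmlns:ml="http://example.org/ns/ml-macro">
  <!-- Plain text replacement -->
  <ml:macro name="http-ident">http://example.org/TR/NOTE-example</ml:macro>
  <!-- Replacement text may use other macros, but not recursively -->
  <ml:macro name="iso6.doc.date">[[draft.year]]-[[draft.MM]]-[[draft.DD]]</ml:macro>
  <!-- In an external file the replacement may include element content -->
  <ml:macro name="status">This is a <emph>draft</emph> document.</ml:macro>
</ml:collection>
```

A document would then pull these in with <?ml-macro href="macros.xml"?> near the top, in place of (or alongside) individual name/text definitions.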

Using this approach, the specification shown above becomes:

<?xml version="1.0" encoding="utf-8"?>
<?ml-macro name="draft.DD"    text="05"?>
<?ml-macro name="draft.MM"    text="01"?>
<?ml-macro name="draft.day"   text="5"?>
<?ml-macro name="draft.month" text="January"?>
<?ml-macro name="draft.year"  text="2006"?>
<?ml-macro name="iso6.doc.date" text="[[draft.year]]-[[draft.MM]]-[[draft.DD]]"?>
<?ml-macro name="http-ident"  text="http://example.org/TR/NOTE-example"?>
<spec w3c-doctype='note'>
<header>
<title>Example Specification</title>
<version>Version 1.0</version>
<w3c-designation>[[http-ident]]-[[iso6.doc.date]]</w3c-designation>
<w3c-doctype>W3C NOTE</w3c-doctype>
<pubdate>
<day>[[draft.day]]</day>
<month>[[draft.month]]</month>
<year>[[draft.year]]</year>
</pubdate>
<publoc>
  <loc href="[[http-ident]]-[[iso6.doc.date]]">[[http-ident]]-[[iso6.doc.date]]</loc>
</publoc>
<altlocs>
  <loc href="[[http-ident]].XML">XML</loc>
</altlocs>
<latestloc>
  <loc href="[[http-ident]]">[[http-ident]]</loc>
</latestloc>
…

This works and I'm going to start using it. With the addition of XInclude to replace external parsed entities (and some uses of external unparsed entities), this approach seems to satisfy the requirements met by entity expansion. Except, of course, for the fact that it uses a new syntax, requires two passes, and isn't supported in any standard way.
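
For the external parsed entity case, the XInclude replacement might look roughly like this. The file name header.xml and the body content are invented for illustration; the XInclude namespace is the standard one.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Stands in for: <!ENTITY header SYSTEM "header.xml"> in the internal
     subset plus a &header; reference in the content.
     "header.xml" is a hypothetical file name. -->
<spec xmlns:xi="http://www.w3.org/2001/XInclude" w3c-doctype="note">
  <!-- An XInclude-aware processor replaces this element with the
       parsed content of header.xml -->
  <xi:include href="header.xml"/>
  <body>
    <p>Body of the specification.</p>
  </body>
</spec>
```

One difference worth keeping in mind: a resource included as XML has to be a well-formed document with a single root element, whereas an external parsed entity can hold a fragment with several top-level elements.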

On the last point, I hope that when the work of the XML Processing Model Working Group is finished, there will be a standard way to request this kind of processing.

So do I really think we should drop the <!DOCTYPE>? Yeah, probably. Tim's got some pretty good arguments to support his position that it's not only unnecessary, it's actively harmful. But I'm not entirely convinced. I don't think we can drop it yet. Maybe in another few years we can; with a widely deployed pipeline language, I think the stage would be set.

Comments

The StructuredBlogging folks have run into validation issues when including foreign-namespaced XML in a <script> block in XHTML. Might dropping the DOCTYPE sidestep those?

Incidentally, due to the angle brackets in your title, Bloglines does silently drop the <!DOCTYPE> (Phil Ringnalda now includes markup in his titles as a matter of course, waiting for aggregator developers to do the right thing with Atom).

Danny, without the <!DOCTYPE, no parser is going to attempt DTD-based validation, so in that sense it would sidestep the issues. On the other hand, you won't find out if you've got <p>'s inside your <h3>'s either. For data entry, I'd probably build a RELAX NG grammar and validate with that. I certainly would want some sort of validation.

With respect to Bloglines, yeah, I noticed that too. Add it to the pile of rendering bugs in Bloglines, I guess. About once a week now, I'm irritated enough by their bugs to consider switching. But never (yet!) enough to overcome the inertia.

In addition to the broken rendering, I bet my RSS feed is busted as written because I refuse to double-escape the markup. That moves me one step closer to redirecting the RSS feeds to the equivalent Atom feeds and pulling the plug on the RSS.

Oh, the irony!

I read this post in Sage from Planet XML. It was titled "Drop the".

Micah Dubinko adds the post to his bookmarks, under the same title "Drop the", but the funny thing is if you view source there's still a hidden <!DOCTYPE> in the link.

Wouldn't it be nice if in 2006 at least Planet XML could get this stuff right?

Bart: they can’t, not if they consume RSS. Silent data loss. It’s impossible to fix because the spec is not unambiguous enough. They’d have to consume Atom to be able to stick a fork in this issue.

Norm: IMHO processing instructions are exactly the right tool for this scenario. They’re not at all aesthetically offensive; they’re just widely misused. But not here. (Which I am guessing is exactly what you’re thinking.)

If it's a fixed grammar over which you have no control, then a pre-processing expansion stage may be necessary. But for xmlspec, I think the reason we all have so many messy date entities is that the header XML elements are not sufficiently structured. The thing to do is fix that and then update the main xmlspec stylesheet to match, rather than switch the entity processing from the DTD to a new PI-based syntax. For example, if there were elements in xmlspec that allowed you to specify
draft.day="05", draft.month="1", draft.year="2006", draft.class="NOTE" and draft.name="example"
then the xmlspec stylesheet should be able to figure out for itself what the current-location should be (so this needn't be specified in the XML). Similarly, having to specify the month in different places as "1", "01" and "January" is something that the stylesheet could take care of, and if the date is only specified in one place in the XML file there is no need to make an entity for it.

For the general case of a DTD-replacement macro system I'm not sure that PIs are quite up to the job, unfortunately. Or, more exactly, a syntax which has the replacement text as PI content rather than between two PIs is not up to the job. Given that character references aren't recognised as such in PI content (and neither are element start tags), what does text="<x>a <x>" expand to: just the character string, or an x element with a space? I wasn't sure from your description. Either answer has some benefits and some potential confusions....

In case anyone is interested, I modified this script to work under XSLT 1.0 and EXSLT. Norm, thanks very much for the original idea and implementation.

After I modified your script, I thought it would be a good idea to have some public macros available for use. So I put together a set of macro equivalents of the common character entities.

Also, did you mean for that first <xsl:variable/> to be an <xsl:param/>, so that people can pass in a reference to a set of default macros? I modified my version to work that way.