Inventing XML Languages

Volume 9, Issue 9; 17 Jan 2006; last modified 08 Oct 2010

My two cents on the controversy Tim recently stirred up on XML language creation.

Language is by its very nature a communal thing; that is, it expresses never the exact thing but a compromise—that which is common to you, me, and everybody.

T. E. Hulme

Tim Bray's “Don't Invent XML Languages” (and its companion essay, “On XML Language Design”) mostly reflect the content of his presentation at XML 2005, so there wasn't anything in them that surprised me. And basically, I'm inclined to agree with Tim.

However, in the past week or so I've read several essays critical of Tim's position (from bright folks like Dare Obasanjo, Danny Ayers, and probably others). As someone who, at one level or another, makes a living inventing new XML languages, I wonder if I shouldn't be more critical too.

Well, first off, Tim is pretty clearly talking about inventing global standard XML languages; those are the ones that are “boring, political, time-consuming, [and] unglamorous” to create. I don't think he's talking about the XML format you use for configuring your application or the one that your application uses internally as a normal form for some data. Developing those is mostly boring, usually non-political, pretty fast, and unglamorous. But I don't think anyone's trying to dissuade you from inventing them. If you have the choice between storing some application data in XML or something else, I want you to use XML.

Tim goes on to argue for a bit about why it's painful, risky, and expensive to invent a new standard. Been there, done that. He's right.

Finally, there's a list of “the big five”: well-established standards that you'd be better off using than reinventing. DocBook made the list, so on that level I'm a happy camper. (heh!)

It struck me on reading this list that it was a pretty good cross-section of the sorts of things likely to be the targets of reinvention. (It helps, I'm sure, that Tim and I have similar sorts of backgrounds.) I can easily imagine someone writing an authoring tool being tempted to develop a new standard for documentation, someone writing a business package being tempted by a new purchase order standard, or some social software startup trying to create a new format for sending around little bits of information that will be updated regularly. The big five cover those areas, so use them instead.

I think the odds of someone considering a reinvention of XSLT or MathML or SVG or RDF or Topic Maps (or any of a wide range of other specialized vocabularies) are a lot smaller, if for no other reason than that the market for those sorts of languages is a lot smaller. But the same lesson would apply: don't. If there's a standard out there that is already widely accepted and fits your needs reasonably well, try really hard to use it, taking advantage of its extension mechanisms if you can, before you reinvent it. “Not invented here” is never an acceptable reason to reinvent.

Microformats

If I have a bone to pick with Tim, it's his hearty endorsement of microformats, as if putting the angle brackets around the names was the hard part in language design. Designing a microformat is designing a language and is subject to all the same pitfalls. It just happens that the microformats developed so far have been relatively small and developed by small groups of like-minded individuals. Both of those conditions mitigate the problems of language design.

I don't want to come across as some sort of curmudgeon opposed to microformats for opposition's sake, but I do have some concerns.

Must ignore

One of the driving forces behind microformats, as I see it, is the browsers' implementation of “must ignore” semantics on unrecognized markup. This is generally praised as an almost universally good thing, and I have no doubt that it is often a good thing. It was an important stepping stone in the development of the modern web browser.

But let's also recognize that it has forced us into a cul-de-sac. It's frightfully difficult to embed new markup in HTML because it just gets ignored. We have to shoehorn our extensions into existing markup.
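To make the consequence concrete, here is a rough sketch of a text renderer with “must ignore” semantics. It is only a toy, not how any real browser is implemented: the names `KNOWN_TAGS`, `MustIgnoreRenderer`, and `render`, and the `reading` element, are all made up for illustration. Unknown tags and their attributes are dropped, but character data is always kept.

```python
from html.parser import HTMLParser

# Elements this toy renderer "understands" (an arbitrary, hypothetical set).
KNOWN_TAGS = {"p", "span", "abbr", "em", "strong"}


class MustIgnoreRenderer(HTMLParser):
    """Collects the text a must-ignore renderer would show."""

    def __init__(self):
        super().__init__()
        self.text = []     # character data, in document order
        self.ignored = []  # names of the unknown tags that were skipped

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            # Unknown tag: the tag and its attributes vanish silently.
            self.ignored.append(tag)

    def handle_data(self, data):
        # Character data is shown whether or not its parent is known.
        self.text.append(data)


def render(markup):
    """Return (visible text, list of ignored tag names) for a fragment."""
    parser = MustIgnoreRenderer()
    parser.feed(markup)
    parser.close()
    visible = " ".join("".join(parser.text).split())  # collapse whitespace
    return visible, parser.ignored


if __name__ == "__main__":
    # Data carried in attributes of an unknown element disappears entirely.
    print(render('<p>Room temperature: <reading celsius="21"/></p>'))
    # Data carried as element content survives, even in an unknown element.
    print(render('<p>Room temperature: <reading>21</reading> °C</p>'))
```

In the first call the value is lost because it lives only in an attribute of an element the renderer doesn't know; in the second the value is still shown, because only the tags are ignored, not their content.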

Consider the proposal for a “geo” microformat (not because I think there's anything wrong with it, just because I happen to have it open in a browser tab at the moment).

Suppose I wanted to identify the location of the Eiffel Tower. In an ideal world, I could say:

<p>The Eiffel Tower is located at
<geo xmlns="http://example.org/geo" lat="48.8589" long="2.2958"/>.
</p>

But this isn't an ideal world and any browser which didn't use my stylesheet would display that like this:

The Eiffel Tower is located at

Not terribly useful. If instead the browser could be coerced (through some form of “must understand” perhaps) to display the markup it didn't recognize:

The Eiffel Tower is located at <geo lat="48.8589" long="2.2958"/>

it would be at least practical to use the new markup. But that's not the case, so instead we have to resort to markup like this:

<p>The Eiffel Tower is located at 
<span class="geo">
 <abbr class="latitude" title="48.8589">48° 51' 32" N</abbr> 
 <abbr class="longitude" title="2.2958">002° 17' 45" E</abbr>
</span>.</p>

Which is hard to validate and prone to error. I stand by what I said before: if you want to embed data in your documents, embed data. Transforming to a microformat for presentation is a good thing, but I can't recommend microformats for authoring.
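As a minimal sketch of that split (author the data, generate the microformat), the following takes a `geo` element like the one above and produces `span`/`abbr` markup in the style of the geo proposal. The function names `to_dms` and `geo_to_microformat` are my own for this illustration, and the namespace is just the placeholder from the example; a real pipeline would more likely do this in a stylesheet.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

GEO_NS = "http://example.org/geo"  # placeholder namespace from the example


def to_dms(value, positive, negative):
    """Format decimal degrees as degrees, minutes, seconds and hemisphere."""
    hemisphere = positive if value >= 0 else negative
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{degrees}° {minutes:02d}' {seconds:02d}\" {hemisphere}"


def geo_to_microformat(xml_fragment):
    """Turn an authored geo element into presentation markup."""
    geo = ET.fromstring(xml_fragment)
    if geo.tag != f"{{{GEO_NS}}}geo":
        raise ValueError(f"expected a geo element, got {geo.tag}")
    lat = geo.get("lat")
    lon = geo.get("long")
    if lat is None or lon is None:
        raise ValueError("geo element needs both lat and long attributes")
    # The visible text is derived from the same values that go in title.
    lat_text = to_dms(float(lat), "N", "S")
    lon_text = to_dms(float(lon), "E", "W")
    return (
        '<span class="geo">'
        f'<abbr class="latitude" title={quoteattr(lat)}>{escape(lat_text)}</abbr> '
        f'<abbr class="longitude" title={quoteattr(lon)}>{escape(lon_text)}</abbr>'
        "</span>"
    )


if __name__ == "__main__":
    source = '<geo xmlns="http://example.org/geo" lat="48.8589" long="2.2958"/>'
    print(geo_to_microformat(source))
```

Because the human-readable text is computed from the decimal values rather than typed by hand, the two can't drift apart, and the authored element stays simple enough to validate with an ordinary schema.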

I can't recommend inventing new XML languages either, unless you have to. And sometimes you have to.

Comments

Norm, all three of the options you present are suboptimal. HTML must-ignore enables us to do this:
<p>The Eiffel Tower is located at
 <geo xmlns="http://example.org/geo">
  <latitude>48.8589</latitude>° latitude 
  <longitude>2.2958</longitude>° longitude
</geo>.
</p>
This will display correctly in essentially all existing browsers, no stylesheet required. I've been using markup like this for years on Cafe con Leche and Cafe au Lait. It works. It doesn't cause any problems. It's easy to validate. It's easy to process. Just don't stuff content you want humans to read in attribute values and you'll be fine.
—Posted by Elliotte Rusty Harold on 17 Jan 2006 @ 05:24 UTC #

Yes, Elliotte, you're right. I chose the example to illustrate a problem, but I should equally have illustrated the workaround. In fact, I had initially done what you suggest, but changed it to highlight the point I wanted to make about must ignore.

—Posted by Norman Walsh on 17 Jan 2006 @ 05:34 UTC #

The problem of what to do with unknown tags has existed since HTTP 0.9. The early NeXT HTTP browser ignored unknown stuff, with consequences that were noted way back in 1993: http://catless.ncl.ac.uk/Risks/14.75.html#subj2.1

It was probably incidents like these that led to the current policy, one that caused problems the moment the script tag was added. Hence most JavaScript element bodies hide their content in a comment, to stop script-ignorant browsers from displaying it.

It really does need improving. Maybe an attribute that set the not-understood behavior to either "skip" or "display". I don't approve of "fail" on unknown tags, as it would make too much stuff unreadable by odd browsers.

—Posted by Steve Loughran on 17 Jan 2006 @ 06:55 UTC #

The script element is the flip side of what I was complaining about in Norm's examples. The script should have been placed in an attribute value because it's not intended for human readers. Then the problems wouldn't have arisen. Java got this right. It's surprising JavaScript didn't.

Here's the rule: if you want people to read the text, make it element content. If you want them not to see it, put it in an attribute value.

MustIgnore, MustUnderstand, and so forth are fundamentally bad ideas. The author of the document does not get to tell the reader what they do or do not or must or must not understand. The author does not get to specify what processing the reader does. Authorial intent is bogus and irrelevant.

—Posted by Elliotte Rusty Harold on 18 Jan 2006 @ 04:46 UTC #

Elliotte, your solution works fine if both human readers and software are happy with the same representation of the data. But what if you want to show today’s date to your human readers as 平成18年1月18日, while feeding 2006-01-18 to software?

—Posted by Norbert Lindenberg on 18 Jan 2006 @ 09:13 UTC #