Version Identifiers and XML

Volume 7, Issue 213; 15 Dec 2004; last modified 08 Oct 2010

David Orchard says XML blew it. He's talking about XML 1.1, but his beef isn't with the technical changes, it's with the version number.

David has been working on a TAG finding on Versioning XML Languages, so we've talked a lot about many aspects of versioning. (Ostensibly, I'm working on the finding too, but I've been busy on other things and recent technical progress is all due to David.)

But working together doesn't mean we always agree and I take exception to a few of the things David says. Full disclosure: I'm on the XML Core Working Group and I may be taking this all a bit personally.

The first thing I note is a simple, factual error. David says (I'm excerpting bits of his essay; you'll probably want to read it all in context first. My essay started as an email message to David, but he persuaded me that there was value in carrying on the discussion “in public”):

XML 1.1 adds allowed characters - particularly control characters - to the name production.

That's just not true. Some of the C0 control characters (0x01-0x1F) are now allowed (in numeric-escaped form only) in character data where they were not previously allowed. They aren't allowed in names. Additional alphabetic and ideographic characters are allowed in names, however, so his point is still valid.

And, alas, XML 1.1 is not backwards compatible with XML 1.0. Not perfectly anyway. The C1 control characters (0x80-0x9F) were accidentally allowed in character data in XML 1.0. They are now forbidden except in their numeric-escaped forms.
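The two character-data changes can be made concrete with a small sketch. It encodes the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1 as I understand them from the two Recommendations; the function name `literal_allowed` is hypothetical, and the ranges should be checked against the specifications before anyone relies on them.

```python
def literal_allowed(cp: int, version: str) -> bool:
    """Return True if code point `cp` may appear unescaped in character data.

    Hypothetical helper: the ranges are transcribed from the Char and
    RestrictedChar productions and are an assumption of this sketch.
    """
    whitespace = cp in (0x9, 0xA, 0xD)
    ordinary = (0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    if version == "1.0":
        # XML 1.0: tab, LF, CR, and everything from 0x20 up (C1 included).
        return whitespace or ordinary
    if version == "1.1":
        # XML 1.1: restricted characters may only appear as character references.
        restricted = (0x1 <= cp <= 0x8 or cp in (0xB, 0xC)
                      or 0xE <= cp <= 0x1F
                      or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)
        return (whitespace or ordinary) and not restricted
    raise ValueError("unknown XML version: " + version)
```

Under these assumptions, `literal_allowed(0x80, "1.0")` is true and `literal_allowed(0x80, "1.1")` is false, which is the whole of the backwards incompatibility.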

To a great extent, XML 1.1 was called 1.1 because it was hoped that this would help foster adoption, rather than xml 1.01 or xml 2.0. I think it's particularly sad that XML has resorted to this kind of marketing effort in it's version identifiers.

To which I am inclined to reply, “Oh, come on!” None of the technical arguments David is making would change if it was called XML 1.01 (or XML 1.0.1 which was actually proposed).

One can argue that it should have been called 2.0 because of its tiny backwards incompatibility, and that's an entirely valid argument, except that it ignores the fact that all specifications are developed in both a social and a technical context.

Here's a rule that could help: Version identifiers should be rigorously used to identify compatible or incompatible changes.

Yep, that would have made it 2.0. But users have non-technical expectations about version numbers and it's not clear to me that the community would have benefitted from the larger number. Maybe the lesson here is that putting version numbers in the title of your specification muddies the waters.

The whole issue of backwards and forwards compatibility in a vocabulary that has an explicit version number is an interesting one anyway. Suppose this is an instance of the first version of some vocabulary:

<transaction>
  <buy shares="1000">SUNW</buy>
</transaction>

If a second version of this vocabulary adds some optional element, we can say that it's backwards compatible because the preceding instance is still a valid, understandable instance.

But can we say the same thing about this instance?

<transaction version="1.0">
  <buy shares="1000">SUNW</buy>
</transaction>

If the second version mandates “version="2.0"”, then the preceding message isn't really a valid second version instance. It's trivial to transform it into one, and the semantics are exactly what you'd expect, but that's not quite the same thing, is it?
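Here is a minimal sketch of that trivial transformation, using Python's standard `xml.etree.ElementTree`. The function name `upgrade_transaction` is hypothetical, and so is the rule that a second-version instance carries version="2.0"; both exist only to illustrate the point.

```python
import xml.etree.ElementTree as ET


def upgrade_transaction(xml_text: str) -> str:
    """Rewrite a first-version instance as a second-version one.

    Assumes the only mandatory difference between the versions is the
    value of the version attribute on the root element.
    """
    root = ET.fromstring(xml_text)
    # Setting the attribute overwrites version="1.0" or adds it if absent.
    root.set("version", "2.0")
    return ET.tostring(root, encoding="unicode")


v1 = '<transaction version="1.0"><buy shares="1000">SUNW</buy></transaction>'
v2 = upgrade_transaction(v1)
```

Nothing about the content changes, only the label, and yet the document still had to be touched before it counted as a second-version instance.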

It's trivial to transform XML 1.0 documents into XML 1.1 documents too, and the semantics are exactly what you'd expect:

If there's an XML declaration in the version 1.0 document, replace “version="1.0"” with “version="1.1"”. If there isn't an XML declaration, add one with “version="1.1"”.
If the document contains any unescaped C1 control characters, escape them.
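The two steps above can be sketched in a few lines of Python. This is a rough illustration, not a conforming tool: `to_xml11` is a hypothetical name, the input is assumed to be the already-decoded text of a well-formed XML 1.0 document, and the code works on raw text without parsing, so it would wrongly escape characters inside comments, processing instructions, or CDATA sections, where character references are not recognized. It escapes exactly the 0x80-0x9F range named above; a real converter should check the XML 1.1 specification for the full list of restricted characters.

```python
import re

# An XML declaration, if present, must be the very first thing in the document.
_XML_DECL = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])1\.0\1')
# The C1 control range named in the text (0x80-0x9F).
_C1 = re.compile('[\x80-\x9f]')


def to_xml11(doc: str) -> str:
    """Turn the text of an XML 1.0 document into XML 1.1 text (sketch)."""
    if _XML_DECL.match(doc):
        # Step 1a: an explicit declaration exists, so change its version.
        doc = _XML_DECL.sub(r'<?xml version=\g<1>1.1\g<1>', doc, count=1)
    else:
        # Step 1b: the 1.0 declaration was implicit, so add an explicit 1.1 one.
        doc = '<?xml version="1.1"?>\n' + doc
    # Step 2: replace each literal C1 control with a numeric character reference.
    return _C1.sub(lambda m: '&#x%X;' % ord(m.group()), doc)
```

Any other pseudo-attributes in the declaration, such as the encoding, are left untouched.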

Given that you have to touch the document to make it an XML 1.1 document (because XML 1.0 does mandate version 1.0 in the XML declaration, even if the declaration is implicit) and given that no real world documents actually use the C1 control characters (outside of test suites), I think it's a stretch to say XML blew it.

David goes on to explain:

The reason is that the XML 1.0 has very few extensibility points that allow for compatibility, and name characters are not one of these extensibility points. XML 1.0 decided that any extension, like a control character, results in a fault. It does not have any way of dealing with the extensions that doesn't result in a fault. If it had a substitution model for name extensions, then the XML 1.1 names could be understood by XML 1.0 processors.

I don't believe any practical, sensible substitution model could have been devised. The fact that the changes made were not forward compatible is just a fact of life. The fact that they're not backwards compatible is unfortunate. That was probably a mistake: a very minor technical one and an apparently larger political one.

There's an axiom that emerges: Forward compatible extensions can only be done if a substitution model for the extensions exist.

If XML 1.1 (sic) had provided a substitution model, like the must ignore unknowns, then XML 1.1 truly would be compatible with XML 1.0.

David, are you suggesting that “must ignore” would have been the slightest bit reasonable in this context? That suggests that an XML 1.0 processor should treat <aሰb> as if it were <ab>, which strikes me as ludicrous. (That's an Ethiopic ሰ in the first name.)