Do not condemn the judgement of another because it differs from your own. You may both be wrong.
I supported XML 1.1. I thought it was a good thing. Naively, I thought it was going to be relatively straightforward to deploy. I didn’t believe the doom and gloom predictions of some that it would bifurcate the XML standard into incompatible versions. While it is not completely backwards compatible, I didn’t think it was going to be that big a deal. I’ll explain why in a moment.
Whether I was right or not, XML 1.1 is dead. The working group leading RELAX NG through the ISO standardization process has ruled that “an XML [1.1] document…can never be valid against a RELAX NG schema.” I expect the W3C XML Schema working group to conclude similarly that XML 1.1 documents cannot be validated with XML Schema 1.0 or 1.1.
If I can’t validate XML 1.1 documents, I can’t use them. (I can, of course, validate them with XML 1.1 DTDs, but that’s bitter consolation in the twenty-first century.)
I consider myself fairly conservative when it comes to notions of what constitutes well-formedness or validity. Nevertheless, I expected a simple erratum to allow implementors to support XML 1.1 in RELAX NG and W3C XML Schema:
All implementations of this specification must support XML 1.0. Implementations may, at user option, support XML 1.1. An implementation that supports XML 1.1 must …
Now, the interesting question is what must it do? To answer that, we’ll have to look a little more closely at XML 1.1. There are just a few changes in XML 1.1:
- New Text Characters
More Unicode characters are allowed in text. (Text in this context refers not only to element content but also to attribute values and the content of processing instructions and comments: places other than names.) The big feature of XML 1.1 is support for new versions of Unicode. (XML 1.0 is defined on top of Unicode 2.0, which is no longer the current version.) This is really significant. One of the virtues of XML is that it’s been internationalized from the very beginning. It does not discriminate against languages that are less economically important. Without XML 1.1, that’s no longer true. I think that sucks.
The C0 control characters (0x01-0x1F) are allowed if they’re escaped. In XML 1.0, presence of the C0 control characters is a good indicator that the document’s encoding has been incorrectly determined. As a compromise for allowing the C0 controls, the C1 control characters are no longer allowed unless they are escaped. This is the single backwards-incompatible change in XML 1.1.
The NEL character (0x85) is normalized to a linefeed in text. Basically, IBM mainframe newlines get treated like PC and Mac newlines.
- New Name Characters
The current version of Unicode supports more languages than Unicode 2.0. As a result, there are more “name” characters now. XML 1.1 allows authors writing in Ethiopic (and a bunch of other languages) to write tag names (and attribute names, processing instruction targets, etc.) in their native language.
XML 1.1 encourages implementors to check the character normalization of documents. This has no effect on validation.
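The text-character changes listed above can be sketched in a few lines of Python. This is only an illustration, not anything taken from a real parser: the ranges are my transcription of the `Char` and `RestrictedChar` productions and the end-of-line rules in the XML 1.0 and 1.1 Recommendations (check them against the specs before relying on them), and the names `classify` and `normalize_line_ends_xml11` are made up for this example.

```python
import re

# Code point ranges transcribed from the Char productions (an assumption:
# verify against the XML 1.0 and XML 1.1 Recommendations).
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
# XML 1.1 RestrictedChar: legal only as character references such as &#x7;
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """True if code point cp falls inside any (low, high) pair."""
    return any(low <= cp <= high for low, high in ranges)


def classify(cp, version):
    """Return 'literal', 'escaped-only', or 'forbidden' for a code point."""
    if version == "1.0":
        # XML 1.0 has no escaped-only class: a character reference
        # must also match Char.
        return "literal" if in_ranges(cp, XML10_CHAR) else "forbidden"
    if version == "1.1":
        if not in_ranges(cp, XML11_CHAR):
            return "forbidden"
        if in_ranges(cp, XML11_RESTRICTED):
            return "escaped-only"
        return "literal"
    raise ValueError("unknown XML version: " + version)


# XML 1.1 end-of-line handling: CR LF, CR NEL, NEL, LINE SEPARATOR,
# and a lone CR all become a single linefeed.
_XML11_EOL = re.compile("\r\n|\r\x85|\r|\x85|\u2028")


def normalize_line_ends_xml11(text):
    """Apply XML 1.1 line-end normalization to already-decoded text."""
    return _XML11_EOL.sub("\n", text)
```

With these definitions, `classify(0x07, "1.0")` is `'forbidden'` while `classify(0x07, "1.1")` is `'escaped-only'`, and a C1 control such as 0x80 goes from `'literal'` in 1.0 to `'escaped-only'` in 1.1, which is the backwards-incompatible change.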
What bearing does this have on validity? Let’s take a look. Imagine that we have two processors: I and II. Processor I understands XML 1.0 only. Processor II understands XML 1.0 and XML 1.1.
Consider the following documents:
A. Is an XML 1.0 document.
B. Is an XML 1.1 document that uses none of the new features of XML 1.1 (it would be a well-formed XML 1.0 document if it were labelled as 1.0).
C. Is an XML 1.1 document with
D. Is an XML 1.1 document with C0 control characters encoded in it.
E. Is an XML 1.1 document with C1 control characters encoded in it.
F. Is an XML 1.1 document with new text characters.
G. Is an XML 1.1 document with new name characters.
What happens when we validate each of these documents? First, we parse the input documents to build an Infoset. We’ll use an XML 1.0 parser for the 1.0 documents and a 1.1 parser for the 1.1 documents.
Now, a processor that only understands 1.0 might check the [version] property on the Document Information Item and reject the 1.1 documents out of hand. To make our discussion more interesting, let’s assume our parsers don’t provide the [version] property (it’s not required).
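For the record, the “reject out of hand” check that we are assuming away would look something like the sketch below. It works on already-decoded text and ignores byte order marks and encoding detection, which a real parser has to handle; the function names `declared_version` and `processor_one_accepts` are invented for this illustration.

```python
import re

# Matches the version pseudo-attribute at the very start of the document,
# e.g. <?xml version="1.1" encoding="UTF-8"?>
_XML_DECL_VERSION = re.compile(
    r'^<\?xml\s+version\s*=\s*(["\'])(1\.[0-9]+)\1')


def declared_version(document_text):
    """Return the declared XML version, or '1.0' if there is no declaration.

    A document without an XML declaration is treated as XML 1.0.
    """
    match = _XML_DECL_VERSION.match(document_text)
    return match.group(2) if match else "1.0"


def processor_one_accepts(document_text):
    """Hypothetical processor I: only documents labelled 1.0 get through."""
    return declared_version(document_text) == "1.0"
```

A processor built this way never gets as far as looking at characters or names: `processor_one_accepts('<?xml version="1.1"?><doc/>')` is `False` regardless of what the document contains.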
Validation, as far as I can see, produces the following results:
I think documents D and F are technically invalid to processor I. They each contain text characters that are not allowed by XML 1.0. That said, I can’t confirm that either W3C XML Schema or RELAX NG actually requires a processor to validate the characters. The RELAX NG specification says explicitly that attribute values must consist of XML 1.0 characters, but I don’t see anything about characters in elements. I can’t find any mention of it at all in XML Schema Part 1: Structures. (Both specifications say they operate on XML 1.0 documents, so they implicitly forbid the extra characters unless they are amended.)
Document G is clearly invalid to processor I because it has invalid name characters. Both specifications are careful to check this case because they validate names in contexts where the XML parser allows non-name characters.
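To make the name check concrete, here is a rough sketch of what processor II would have to accept. The ranges are my transcription of the `NameStartChar` and `NameChar` productions in the XML 1.1 Recommendation, so verify them against the spec; `is_xml11_name` is a made-up helper name. The XML 1.0 rules are table-driven (Appendix B of that spec) and are not reproduced here.

```python
# Ranges transcribed from the XML 1.1 NameStartChar production
# (an assumption: check against the Recommendation).
NAME_START = [
    (ord(":"), ord(":")), (ord("A"), ord("Z")), (ord("_"), ord("_")),
    (ord("a"), ord("z")), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# NameChar adds these to everything in NameStartChar.
NAME_EXTRA = [
    (ord("-"), ord("-")), (ord("."), ord(".")), (ord("0"), ord("9")),
    (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in(cp, ranges):
    return any(low <= cp <= high for low, high in ranges)


def is_xml11_name(name):
    """True if name matches the XML 1.1 Name production."""
    if not name:
        return False
    if not _in(ord(name[0]), NAME_START):
        return False
    return all(_in(ord(ch), NAME_START) or _in(ord(ch), NAME_EXTRA)
               for ch in name[1:])


# An Ethiopic element name (U+1230 U+120B U+121D) falls in 0x37F-0x1FFF.
ethiopic_name = "\u1230\u120b\u121d"
```

Here `is_xml11_name(ethiopic_name)` is `True`, whereas a processor working from the Unicode 2.0 based tables of XML 1.0 has no entry for those characters and so rejects the name.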
So, now we can answer our earlier question:
… An implementation that supports XML 1.1 must allow a suite of additional characters in content and it must allow a different suite of additional characters in names.
That’s not so hard, is it?
Perhaps the real question is, what are the consequences of allowing processor II to be conformant? I can think of two:
- Reduced Interoperability
Interoperability is important; it would be wrong to reduce interoperability without a compelling reason. I think internationalization is a compelling reason. (The rest of the XML 1.1 changes were either unnecessary or feature creep, IMHO, but they’re harmless.)
XML already has interoperability problems associated with character encodings. Just because my parser understands Shift-JIS or some other locally important encoding doesn’t mean that yours does. The interoperability problems of XML 1.1 seem similar to me. If I’m using XML 1.1 when I don’t have to, I can transcode to XML 1.0, just like I can transcode to UTF-8. If I’m using XML 1.1 because I need to, well, I need to. If your tools can’t support XML 1.1, I can’t use them. That seems reasonable.
- Pipeline Issues
Validation isn’t generally an end in itself. We validate documents because we want to do something else with them, because we want to pass them along to some further stage in a pipeline (transformation, business processes, whatever, whether it’s explicitly a pipeline or not). Downstream processes might have made all sorts of assumptions based on the fact that the input was XML 1.0. Perhaps they’re using BEL characters (0x07) as delimiters; perhaps they’re relying on bit patterns that aren’t legitimate text characters; etc.
Those would be good reasons not to support XML 1.1. I’m not suggesting that you should be compelled to support XML 1.1; I’m not even saying you should support it.
I’m just saying it would be nice if you could support XML 1.1.
But you can’t, at least not until V2.0 of the schema languages. I suspect that’s roughly equivalent to saying “not until hell freezes over.”
Which is a shame, because XML 1.1 was a good thing.