XML 1.1 was a fruitless exercise. We shouldn’t have bothered.

Do not condemn the judgement of another because it differs from your own. You may both be wrong.

Dandemis

I supported XML 1.1. I thought it was a good thing. Naively, I thought it was going to be relatively straightforward to deploy. I didn’t believe the doom and gloom predictions of some that it would bifurcate the XML standard into incompatible versions. While it is not completely backwards compatible, I didn’t think it was going to be that big a deal. I’ll explain why in a moment.

Whether I was right or not, XML 1.1 is dead. The working group leading RELAX NG through the ISO standardization process has ruled that “an XML [1.1] document…can never be valid against a RELAX NG schema.” I expect the W3C XML Schema working group to conclude similarly that XML 1.1 documents cannot be validated with XML Schema 1.0 or 1.1.

Game Over.

If I can’t validate XML 1.1 documents, I can’t use them. (I can, of course, validate them with XML 1.1 DTDs, but that’s bitter consolation in the twenty-first century.)

I consider myself fairly conservative when it comes to notions of what constitutes well-formedness or validity. Nevertheless, I expected a simple erratum to allow implementors to support XML 1.1 in RELAX NG and W3C XML Schema:

All implementations of this specification must support XML 1.0. Implementations may, at user option, support XML 1.1. An implementation that supports XML 1.1 must …

Now, the interesting question is what must it do? To answer that, we’ll have to look a little more closely at XML 1.1. There are just a few changes in XML 1.1:

New Text Characters

More Unicode characters are allowed in text[1]. The big feature of XML 1.1 is support for new versions of Unicode. (XML 1.0 is defined on top of Unicode 2.0 which is no longer the current version.) This is really significant. One of the virtues of XML is that it’s been internationalized from the very beginning. It does not discriminate against languages that are less economically important. Without XML 1.1, that’s no longer true. I think that sucks.

The C0 control characters (0x01-0x1F) are allowed if they’re escaped. In XML 1.0, the presence of C0 control characters is a good indicator that the document’s encoding has been incorrectly determined. As a compromise for allowing the C0 controls, the C1 control characters are no longer allowed unless they are escaped. This is the single backwards-incompatible change in XML 1.1.
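The difference is easy to demonstrate. Here’s a minimal sketch in Python, whose expat-based parser speaks XML 1.0 and so must reject a C0 control even in its escaped form:

```python
import xml.etree.ElementTree as ET

# XML 1.0 forbids U+0001 entirely, even as a character reference;
# XML 1.1 would accept it in the escaped form &#x1;.
# Python's expat-based parser implements XML 1.0, so it rejects it.
doc = '<?xml version="1.0"?><root>&#x1;</root>'
try:
    ET.fromstring(doc)
    print("accepted")
except ET.ParseError as exc:
    print("rejected:", exc)
```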

The NEL character (0x85) is normalized to a linefeed in text. Basically, IBM mainframe newlines get treated like PC and Mac newlines.
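For illustration, a sketch of the XML 1.1 end-of-line handling (per the XML 1.1 spec, CR+LF, CR+NEL, bare CR, bare NEL, and LINE SEPARATOR each normalize to a single linefeed; the function name is mine):

```python
import re

def normalize_newlines(text: str) -> str:
    # XML 1.1 end-of-line handling: each of CRLF, CR+NEL (U+0085),
    # bare NEL, LINE SEPARATOR (U+2028), and bare CR becomes one LF.
    # Order matters: the two-character sequences must match first.
    return re.sub('\r\n|\r\u0085|\u0085|\u2028|\r', '\n', text)

print(repr(normalize_newlines('mainframe\u0085newline')))  # 'mainframe\nnewline'
```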

New Name Characters

The current version of Unicode supports more languages than Unicode 2.0. As a result, there are more “name” characters now. XML 1.1 allows authors writing in Ethiopic (and a bunch of other languages) to write tag names (and attribute names, processing instruction targets, etc.) in their native language.
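As a rough illustration, the XML 1.1 NameStartChar production is just a list of code point ranges; note the broad ranges that sweep in scripts like Ethiopic (U+1200–U+137F), which Unicode 2.0 didn’t yet encode. A sketch:

```python
# Code point ranges of the XML 1.1 NameStartChar production.
NAME_START_RANGES = [
    (0x3A, 0x3A),      # ':'
    (0x41, 0x5A),      # 'A'-'Z'
    (0x5F, 0x5F),      # '_'
    (0x61, 0x7A),      # 'a'-'z'
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),   # sweeps in Ethiopic, among others
    (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

def is_name_start(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)

print(is_name_start(0x1230))  # U+1230 ETHIOPIC SYLLABLE SA -> True
```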

Normalization

XML 1.1 encourages implementors to check the character normalization of documents. This has no effect on validation.

What bearing does this have on validity? Let’s take a look. Imagine that we have two processors: I and II. Processor I understands XML 1.0 only. Processor II understands XML 1.0 and XML 1.1.

Consider the following documents:

  A. Is an XML 1.0 document.

  B. Is an XML 1.1 document that uses none of the new features of XML 1.1 (it would be a well-formed XML 1.0 document if it were labelled as 1.0).

  C. Is an XML 1.1 document with NEL line breaks.

  D. Is an XML 1.1 document with C0 control characters encoded in it.

  E. Is an XML 1.1 document with C1 control characters encoded in it.

  F. Is an XML 1.1 document with new text characters.

  G. Is an XML 1.1 document with new name characters.

What happens when we validate each of these documents? First, we parse the input documents to build an Infoset. We’ll use an XML 1.0 parser for the 1.0 documents and a 1.1 parser for the 1.1 documents.

Now, a processor that only understands 1.0 might check the [version] property on the Document Information Item and reject the 1.1 documents out of hand. To make our discussion more interesting, let’s assume our parsers don’t provide the [version] property (it’s not required).
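Absent an Infoset [version] property, a processor that cared could sniff the XML declaration itself. A hypothetical sketch (the function name and regex are mine, not from any spec):

```python
import re

def declared_version(doc: str) -> str:
    # A document with no XML declaration is, by definition, XML 1.0.
    m = re.match(r'<\?xml\s+version\s*=\s*["\']([^"\']+)["\']', doc)
    return m.group(1) if m else "1.0"

print(declared_version('<?xml version="1.1"?><doc/>'))  # 1.1
print(declared_version('<doc/>'))                       # 1.0
```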

Validation, as far as I can see, produces the following results:

Table 1. Validation of XML 1.0 and XML 1.1 Documents
Proc.\Doc.   A       B       C       D             E       F             G
I            valid   valid   valid   invalid [a]   valid   invalid [a]   invalid
II           valid   valid   valid   valid         valid   valid         valid

[a]But I wouldn’t be surprised if there are implementations that don’t notice.

I think documents D and F are technically invalid to processor I. They each contain text characters that are not allowed by XML 1.0. That said, I can’t confirm that either W3C XML Schema or RELAX NG actually requires a processor to validate the characters. The RELAX NG specification says explicitly that attribute values must consist of XML 1.0 characters, but I don’t see anything about characters in elements. I can’t find any mention of it at all in XML Schema Part 1: Structures. (Both specifications say they operate on XML 1.0 documents, so they implicitly forbid the extra characters unless they are amended.)
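The character-level check itself is trivial; the open question is only whether the schema languages require it. A sketch of the two Char productions:

```python
def is_xml10_char(cp: int) -> bool:
    # XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF]
    #               | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def is_xml11_char(cp: int) -> bool:
    # XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

# A C0 control such as U+0001 is the kind of character that makes
# documents D and F invalid to a strict XML 1.0 validator.
print(is_xml10_char(0x1), is_xml11_char(0x1))  # False True
```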

Document G is clearly invalid to processor I because it has invalid name characters. Both specifications are careful to check this case because they validate names in contexts where the XML parser allows non-name characters.

So, now we can answer our earlier question:

… An implementation that supports XML 1.1 must allow a suite of additional characters in content and it must allow a different suite of additional characters in names.

That’s not so hard, is it?

Perhaps the real question is, what are the consequences of allowing processor II to be conformant? I can think of two:

Reduced Interoperability

Interoperability is important; it would be wrong to reduce interoperability without a compelling reason. I think internationalization is a compelling reason. (The rest of the XML 1.1 changes were either unnecessary or feature creep, IMHO, but they’re harmless.)

XML already has interoperability problems associated with character encodings. Just because my parser understands Shift-JIS or some other locally important encoding doesn’t mean that yours does. The interoperability problems of XML 1.1 seem similar to me. If I’m using XML 1.1 when I don’t have to, I can transcode to XML 1.0, just like I can transcode to UTF-8. If I’m using XML 1.1 because I need to, well, I need to. If your tools can’t support XML 1.1, I can’t use them. That seems reasonable.

Pipeline Issues

Validation isn’t generally an end in itself. We validate documents because we want to do something else with them, because we want to pass them along to some further stage in a pipeline (transformation, business processes, whatever, whether it’s explicitly a pipeline or not). Downstream processes might have made all sorts of assumptions based on the fact that the input was XML 1.0. Perhaps they’re using BEL characters (0x07) as delimiters; perhaps they’re relying on bit patterns that aren’t legitimate text characters; etc.

Those would be good reasons not to support XML 1.1. I’m not suggesting that you should be compelled to support XML 1.1; I’m not even saying you should support it.

I’m just saying it would be nice if you could support XML 1.1.

But you can’t, at least not until V2.0 of the schema languages. I suspect that’s roughly equivalent to saying “not until hell freezes over.”

Which is a shame, because XML 1.1 was a good thing.


[1]Text in this context refers not only to element content but also to attribute values and the content of processing instructions and comments: places other than names.

Comments:

Norm, I'm really shocked at your statement that "More Unicode characters are allowed in text" in XML 1.1. I really thought you knew better than this. I expected the footnote would clarify your point, but it just made it worse. Sadly there has been a lot of FUD spewed on this issue by people trying to justify XML 1.1. The benefits of XML 1.1 are so minuscule and the costs so high, that the only way to justify it is by pretending it solves a problem that doesn't actually exist.

Let's be clear about one thing: XML 1.1 enables NO new genuine characters in XML text. The only new characters added are a few C0 controls. Every single human legible character in Unicode 3.0, 4.0, and any future version is legal in text (element content, attribute values, processing instruction data, and comments) in XML 1.0. There is nothing in XML 1.0 that makes it inadequate for writing Amharic, Burmese, Thaana, Yi, Tengwar, or any other language that can be written with Unicode 3.0, 4.0, or later.

The advances in XML 1.1 are solely about XML names. They have nothing to do with XML text, except for one tiny intersection of DTD validated ID-type attributes, but you don't think DTDs are very relevant so that's not a huge win. No languages are discriminated against by XML text. XML 1.1 might be useful to someone who wants to write their markup (not their text but their markup) in Amharic, Burmese, Mongolian, Cambodian or any of a few other more obscure languages. Anybody who doesn't need to do this has nothing to gain from XML 1.1. And anybody who wants to write a Mongolian web page in XHTML or a Burmese technical manual in DocBook can do it just fine with XML 1.0.

Posted by Elliotte Rusty Harold on 04 Oct 2004 @ 12:45am UTC #

"The rest of the XML 1.1 changes were either unnecessary or feature creep, IMHO, but they’re harmless."

The trouble is, they are not harmless; changing the whitespace rules in what was supposed to be a minor point increase was always going to inflict real pain. In particular, adding characters outside the ASCII range to the whitespace set means that even if you know your doc is in UTF-8 you can't really use standard 8-bit text processing tools very easily on the file (which, after all, is one of the main points of UTF-8).

If 1.1 had stuck to changing the name char rules to being workable with all future unicode versions not just 2.0 it would (or might) have had a better chance of success.

Posted by David Carlisle on 04 Oct 2004 @ 10:21am UTC #

You may consider me duly chastised, Elliotte. And a little red in the face. I shouldn't have gotten that wrong.

You're absolutely right, of course.

Posted by Norman Walsh on 04 Oct 2004 @ 11:07am UTC #

It is not hard to "allow a different suite of additional characters in names" for XML 1.1. But this behaviour cannot be introduced to RELAX NG 1.0 without possibly breaking conformant implementations.

It is certainly possible to create RELAX NG 1.1 to address XML 1.1. This is good for I18N and may be bad for the promotion of RELAX NG. How do you feel?

Posted by MURATA Makoto on 05 Oct 2004 @ 05:52am UTC #

Allowing non-ASCII in element and attribute names seems politically correct, but are there known cases of people actually using non-ASCII element and attribute names with XML 1.0? All widely-used well-known vocabularies have all-ASCII element/attribute names.

With many European languages there's the issue of NFC and NFD representations being different XML names. A bug induced by NFC vs. NFD difference in element/attribute names is not a bug I'd like to spend time tracking down.

Posted by Henri Sivonen on 07 Oct 2004 @ 08:18am UTC #

By the way, I am very sure that a lot of Japanese XML users heavily use Japanese characters for element and attribute names. In fact, an XML project of a Japanese ministry (which I am involved in) will very heavily use Japanese names and will use almost no ASCII names.

So, I am not saying XML 1.1 is useless. But, for RELAX NG to use XML 1.1, we need a new version of RELAX NG.

Posted by MURATA Makoto on 10 Oct 2004 @ 04:06pm UTC #

Now, I may be jumping into the discussion a little late, but I'd like someone to clarify something. What the hell is RELAX NG and why should the advancement of XML to version 1.1 have such a marked impact? I understand that if one used the newly supported Unicode characters in the XML names where they are restricted such as "element type names, attribute names, enumerated attribute values, processing instruction targets, and so on"* the documents would not be backwards compatible with an XML 1.0 processor. The only thing the RELAX NG web site really told me was that it was a schema language for XML. I can understand why the new features of XML 1.1 would not allow it to validate with the current versions of RELAX NG or XML Schema 1.0; however, as the language evolves I see no reason why the schema languages built around it shouldn't evolve as well.

I can really relate to what MURATA Makoto said. I have worked with developers from Japan on a number of occasions and they are very adamant about using their native characters in XML names and anywhere else they can for that matter. Localization is becoming a real issue as more and more people move toward electronic documents. I don't see why anyone should be forced to work outside their native language and character set because the people who designed XML 1.0, XML Schema 1.0, and Relax NG didn't have the foresight to see beyond Unicode 2.0.

Posted by Aaron Winters on 12 Jan 2005 @ 02:09am UTC #

On Aaron Winters' comments: when XML 1.0 was designed, it allowed all characters in data (except for special-purpose characters used for driving teletype printers, such as the backspace control character, and characters with obscure Unicode semantics). However, it restricted the characters allowed in names to a sensible set, based on the version of Unicode available at the time.

As Unicode is updated, there are several approaches possible for updating XML:

1) leave it alone

2) update the detailed lists of allowed characters

3) move to excluding characters that are unwanted and allowing all the others

4) move to use the Unicode properties for each character to decide which are good

The trouble with 1) is that XML needs to have best-practice internationalization: the more central it is to computing, the more that any limitations increase the burdens on people who are left out. On the other hand, there is a time for everything: I would prefer it if the W3C went onto a timed release strategy, and told everyone to expect an updated spec every 5 years and to expect to change to it.

The trouble with 2) is that then the XML WG has to put out updates to XML to track Unicode, and they have better things to do than maintaining XML, apparently.

The trouble with 3) is that you lose some fine-grain ability to detect transmission problems. And who is to say that Unicode may not add an inappropriate character that you have to cater for independently anyway? On the other hand, it lets you work in ranges which may be more efficient for computation. 3) is pretty much the route taken with XML 1.1.

4) is probably the way I would have liked. It takes decisions out of the hands of the XML Working Group. All that needs to happen is a slight change in understanding that a well-formed name means "well-formed according to the Unicode library of the particular platform we are using", which is certainly good enough now that Unicode 4 is so widely deployed.

On top of this comes the question of whether such a change should be enough to force a version-up of the XML version number: this version-up alone is enough to break almost all XML processing software, independent of anything else.

The first thing the XML WG (or indeed, the W3C) should do IMHO is to have a workable version number policy for implementers: such as "A version m.n processor must reject any document with an m greater than its own. A version m.n processor may reject any document with an m less than its own. For documents with the same m as the processor, a version m.n processor must accept all documents with an n less than or equal to its own, and will accept any documents with an n greater than its own unless some other error is found." In other words, allow an XML 1.0 processor to read an XML 1.1 document and only barf when something is actually wrong. At the moment, the version number is a bar to well-managed updates of standards.

Posted by Rick Jelliffe on 17 Jan 2005 @ 08:30am UTC #
Comments on this essay are closed. Thank you, spammers.