The fifth edition of XML 1.0 is now a “proposed edited recommendation”. New editions do little more than incorporate errata, hardly newsworthy. This one is different.

The proposed edited recommendation of Extensible Markup Language (XML) 1.0 (Fifth Edition) is now out for review. The review period is long, lasting until 16 May, because one of the proposed changes is significant.

A couple of weeks ago, I poked a little fun at the SGML specification for introducing new appendixes and new parsing rules as “corrigenda”. Now it's my turn to be on the poking end. The XML Core WG is proposing to change the repertoire of characters allowed in XML names as an “erratum”.

Before the fifth edition, XML 1.0 was explicitly based on Unicode 2.0. As of the fifth edition, it is based on Unicode 5.0.0 or later. This effectively allows not only characters used today, but also characters that will be used tomorrow.
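To make the change concrete, here is a minimal Python sketch of the fifth edition's name rule. The code point ranges are transcribed from the NameStartChar and NameChar productions in the Fifth Edition text, so check them against the spec before relying on them. The function and constant names are hypothetical, chosen only for illustration.

```python
# Ranges transcribed from the Fifth Edition's NameStartChar production.
# Assumption: verify against the published spec before relying on them.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),  # almost all of the supplementary planes
]

# NameChar allows everything in NameStartChar plus these.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),        # middle dot
    (0x300, 0x36F),      # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any (low, high) pair."""
    return any(low <= code_point <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(text):
    """Check a string against the Name production: one start char, then name chars."""
    return (
        bool(text)
        and is_name_start_char(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

With these ranges, `is_name("données")` and `is_name("\u13E3\u13B3\u13A9")` (a name written in Cherokee letters) both return `True`, while `is_name("1abc")` returns `False`. Note that the check never consults a Unicode character database: it is a handful of range comparisons, which is why the rule keeps working as later versions of Unicode assign new characters.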

One of the real strengths of XML from the very beginning was that it required processors to support Unicode. This made XML, and all XML processors, international. But as Unicode has been extended to support languages written in Cherokee, Ethiopic, Khmer, Mongolian, Canadian Syllabics, and other scripts, XML 1.0's explicit use of Unicode 2.0 has prevented it from growing as well. That's a problem that XML must fix if it wants to continue to be regarded as a universal text format.

The working group's first attempt to address this problem, XML 1.1, has been largely unsuccessful. For a variety of reasons, XML 1.1 did more than the minimum needed to declare victory and some of that “more” makes it backwards incompatible with XML 1.0. So it was D.O.A.

The fifth edition does not change the status of any existing XML 1.0 document with respect to well-formedness or validity. Nor does it introduce any of the backwards-incompatible changes introduced in XML 1.1.

It isn't entirely without pain, unfortunately. Even if we imagine that all parsers will eventually be updated to reflect the fifth edition (and it's possible to be optimistic on this point, as the change actually makes parsers smaller and simpler), there will be some period of time in which your (fourth edition) parser might reject my (fifth edition) document.
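One way to find out which side of that divide a given parser sits on is simply to hand it a document that is well-formed only under the fifth edition. The following is a minimal sketch using the expat bindings in Python's standard library. The helper name `parser_accepts` and the sample document are assumptions made up for illustration, and the answer for the second document depends on which edition the underlying parser's name tables follow, so no expected result is claimed for it here.

```python
import xml.parsers.expat


def parser_accepts(document: bytes) -> bool:
    """Return True if this Python's expat parser finds the document well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# An empty element named with Cherokee letters (U+13E3 U+13B3 U+13A9).
# Cherokee was added to Unicode after 2.0, so this name is legal only
# under the fifth edition's rules.
cherokee_doc = "<\u13E3\u13B3\u13A9/>".encode("utf-8")

print("ASCII name accepted:   ", parser_accepts(b"<doc/>"))
print("Cherokee name accepted:", parser_accepts(cherokee_doc))
```

A parser that reports `False` for the second document is not broken; it is applying the name rules of an earlier edition, which is exactly the transition cost described above.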

The XML Core WG is taking the position that the benefits of extending XML 1.0 in this way outweigh the costs imposed by the change. It remains to be seen if the community will agree. Bear in mind that this sort of change isn't entirely unprecedented: we previously decoupled xml:lang attributes from the relevant RFCs and we tinkered with the specific version of Unicode 3 referenced. That said, this is still a much more substantial change.

Personally, I'm concerned about making this large a change as an erratum. But I'm persuaded that our other options, doing nothing or attempting to introduce some other, new version of XML, are worse.

Are you? Tell us.

Comments:

Isn't it a bad idea to refer to what's essentially a moving target?

Let's imagine that implementer X produces a parser based upon Unicode 5.0.0, and it gets *really* popular; it's the dominant parser shipped in OSs, used in lots of code around the world, and so on.

Then, implementer Y ships an XML generation implementation (maybe an editor) based upon Unicode 5.1.0 or something, and they produce documents that contain characters which X's implementation knows nothing about.

Both conform to 5th edition, but it doesn't look like they'll interoperate. I'm hardly a Unicode expert -- what am I missing here?

Posted by Mark Nottingham on 07 Feb 2008 @ 07:12pm UTC #

Norm: Technically, certain well-formed but invalid documents become valid: where the DTD says an attribute is of type NMTOKEN, it couldn't validly contain a character not in Unicode 2.0, whereas under the 5th Edition rules that would (for the most part) be valid.

Mark: X and Y will interoperate, because in 5th Edition (like 1.1 before it), any Unicode code point not explicitly forbidden is permitted, including many that are not assigned. So even if X's parser does not recognize the characters used in a document shipped by Y's software, it knows whether or not they are legal in names.

I've just blogged about the details of which characters are forbidden and why.

Posted by John Cowan on 07 Feb 2008 @ 09:11pm UTC #

Thank you, John. Good point. I was thinking mostly of documents becoming invalid, but that's not what I said.

Thank you also for answering Mark's question before I got to it :-)

Posted by Norman Walsh on 07 Feb 2008 @ 09:35pm UTC #

It's not just XML parsers that need to change, it's any software that implements a spec that references the Name production. XPath (1 and 2), XQuery, XSD, and RELAX NG systems would all need to change. Worse, if any of them (XPath 2, at least) reference undated versions of the XML spec, then the fact that the W3C is planning to make an incompatible change _in place_ at http://www.w3.org/TR/REC-xml/ means that currently conforming XPath parsers that reject Unicode 5 characters in QNames, or XSLT systems that reject these characters in template names, will become non-conforming, even though the software and the XPath/XSLT specifications have not changed.

This is bad.

The change is in fact a good one, but you should call it what it is, XML 1.2 (or 2.0) and deprecate XML 1.1. If the world is not willing to follow you this time, so be it, but don't try to trick them into following you by passing this off as an editorial "edition".

Yes, I know I should moan to the official comment list, not your blog, but it's your blog entry that popped up on my screen first...

David

Posted by David Carlisle on 08 Feb 2008 @ 12:11am UTC #

Yay, I think this is a brilliant solution, so much so that I proposed it last year:

The only part of XML 1.1 that everyone agrees is reasonable is the support for more scripts in name characters, and avoiding explicitly specifying the set of UNICODE characters that can be used in names.

So, why not introduce a fifth edition of XML 1.0 that expands the set of characters that may be used as name characters. This would be backwards compatible, as every well-formed XML 1.0 document would remain legal. It would allow the use of more scripts in XML markup, like Mongolian. Parsers could support it quite easily, and the change would be unlikely to confuse applications.

Then admit that XML 1.1 was an experiment that failed, and has not achieved widespread interoperable implementation, and deprecate it.

(By the way Norm, why can't I use <blockquote> in comments? :)

Posted by Michael Day on 08 Feb 2008 @ 01:29am UTC #

OK, so the risk seems like it's restricted to implementation Z (of, say, the 3rd edition) coming across a 5th edition document and blowing up, correct?

I agree that this doesn't seem like an erratum...

Posted by Mark Nottingham on 08 Feb 2008 @ 01:44am UTC #

Thanks for highlighting this change. Submitted feedback to the working group, but I'm pretty aligned with David Carlisle (and yourself?) on this one.

Posted by James Abley on 10 Feb 2008 @ 12:21am UTC #

All my years of software engineering and quality assurance training tell me this is bad, bad, bad. If you want to change something in a significant way, you should give it a new version number.

And yet: all my years of experience tell me that sometimes it's in everyone's best interests to ignore the rules; don't let process get in the way of giving users what they want.

XML 1.1 failed because the cost of the change was too high for the community as a whole, given that 99% of the community gets zero benefit. This proposal means that the suffering will fall more squarely on the shoulders of those who actually need the change.

So, with deep reluctance, I think I'm not going to stand in front of the tanks.

Posted by Michael Kay on 15 Feb 2008 @ 05:38pm UTC #

This is a blatant abuse of the W3C process, though sadly not without precedent. The XML Core working group got away with this chicanery before when the namespaces spec was revised to map the xmlns prefix to a namespace.

I didn't like that, but this is just beyond the pale. If this goes through, I suspect I will completely lose faith in the W3C as a fair and honest maintainer of standards. Frankly, if we can't rely on the stability of the base specs, then I think it may be time to give up on XML completely. :-(

Posted by Elliotte Rusty Harold on 16 Feb 2008 @ 02:43am UTC #

David, Elliotte:

It's about justice, not about process. I'll use any process, fair or foul, to eliminate the blatant injustice that says people who speak English or French get to use their languages (or the official languages of their countries) to define XML schemas (lower-case "s"), and people who speak Cantonese or Khmer don't, because of an accident of Unicode history. (Romanians do get to use their language, but only if they are willing to misspell it just slightly.) That is what would be unfair and dishonest.

Saying "If the world won't follow you, so be it" is as much as to say "The majority rules, and devil take the minorities." That's a world I'm not willing to live in without trying to change.

As for giving up on XML, where you gonna go? Like democracy, it is a terrible idea, but better than any alternative.

Posted by John Cowan on 16 Feb 2008 @ 11:23pm UTC #

Norm, seems to me the reality check here is that for 98%+ of the world this is a big snooze. I cannot see them changing from their cozy UTF-8 land and XML that uses that explicitly.

If someone is using UTF-xxx that supports this UNICODE 5,+ stuff - they are going to take special means to alert their partners that they need software capable of supporting this.

I mean I cannot see eBay, Amazon, Google and Wal-Mart suddenly sending out UNICODE 5+ content that no one can open in their browsers, right?!?

So I expect this one to pass muster - without too much fuss and mush - but it does make for good headlines and copy material.

Cheers, DW

Posted by David Webber on 18 Feb 2008 @ 05:53am UTC #

The perceived process violation may be illusory, depending on whether the change is regarded as a bug fix or as a new feature. I'll be a bit lengthy here because I think the discussion might remove the procedural issue from the table.

On the fair vs. foul means issue, it should be recognized that a norm once bent to achieve a particular result is no longer a norm. There is a bit of text borrowed by the U.S. Supreme Court in its landmark Tennessee Valley Authority v. Hill Endangered Species Act decision regarding Tellico Dam and the snail darter that eloquently explains the danger of bending behavioral norms to achieve a desired result:

"The law, Roper, the law. I know what's legal, not what's right. And I'll stick to what's legal. . . . I'm not God. The currents and eddies of right and wrong, which you find such plain-sailing, I can't navigate, I'm no voyager. But in the thickets of the law, oh there I'm a forester. . . . What would you do? Cut a great road through the law to get after the Devil? . . . And when the last law was down, and the Devil turned round on you - where would you hide, Roper, the laws all being flat? . . . This country's planted thick with laws from coast to coast - Man's laws, not God's - and if you cut them down . . . d'you really think you could stand upright in the winds that would blow them? . . . Yes, I'd give the Devil benefit of law, for my own safety's sake." R. Bolt, A Man for All Seasons, Act I, p. 147 (Three Plays, Heinemann ed. 1967)."

So I think it important to inquire whether or not the subject procedural norm of a version change being required for major new features is the last word in the particular circumstance presented. I suggest that it is not.

The preparation, adoption, and application of technical specifications by standards development organizations is governed globally by the Agreement on Technical Barriers to Trade. I'll nominate its article 2 section 2.1 as the final word on whether the subject change proposed by XML 1.0 Fifth Edition is a bug fix or a new feature (ignoring the fact that the WTO Appellate Body would actually have the final say):

"Members shall ensure that in respect of technical regulations, products imported from the territory of any Member shall be accorded treatment no less favourable than that accorded to like products of national origin and to like products originating in any other country."

ISO/IEC have interpreted that section and section 2.2 read together as requiring that international standards must exhibit the "strategic characteristic" of "cultural and linguistic adaptability." ISO/IEC JTC 1 Directives, 5th ed., v. 3.0, pg. 11 (PDF). I have thoroughly studied the law in this area and am of the firm opinion that the ISO/IEC interpretation is sound.

I'll omit the analysis that explains the ATBT's direct applicability to W3C recommendations because it is complex. But it might suffice for now to observe that W3C recommendations are ineligible for the status of government technical regulations or government procurement specifications if they have "the effect of creating unnecessary obstacles to international trade." ATBT section 2.2; Agreement on Government Procurement article VI section 1. A cultural and linguistic adaptability barrier would beyond question fit in that category.

The procedural norm under discussion is subservient to substantive international law, which takes precedence to the extent of the inconsistency. A legal exception to the procedural norm of a version change for new features is created by application of superior law to the extent a markup language standard is not culturally and linguistically adaptable.

Law is hierarchical, not flat. If you do not like the result of applying one norm, before transforming the norm into a non-norm, it is prudent to determine whether or not there is a superior norm applicable to the particular circumstance whose application supplies a different result. In this situation, applying the superior norm definitively classifies the proposed change to XML 1.0 as a bug fix rather than a new feature.

Bug fixes are appropriately proclaimed in an erratum rather than requiring a new version of the specification. From a legal standpoint, the only remaining question I see is whether the bug fix proposed is the least trade-restrictive means of repair. See ATBT article 2, supra, sections 2.2 and 2.3. The legal requirement of interoperability also flows from those two sections, so the least trade-restrictive means requirement supplies the guiding light in balancing the requirements of interoperability and cultural and linguistic adaptability that are apparently competing in this particular situation.

The issues presented thus devolve into two questions of mixed law and fact, whether: [i] there is in fact an apple and orange choice that must be made between [a] cultural and linguistic adaptability and [b] backward and forward compatibility; and [ii] if so, the proposed change produces the best apple and orange that can be feasibly achieved.

In other words, I suggest that folks talk about the technical merits of the proposed change and alternatives to it, if any, rather than discussing the procedural issue. Anyone have a better bug fix? :-)

My 2 cents.

Posted by Buck "Marbux" Martin on 18 Feb 2008 @ 08:29am UTC #

"Norm, seems to me the reality check here is that for 98%+ of the world this is a big snooze. I cannot see them changing from their cozy UTF-8 land and XML that uses that explicitly.

"If someone is using UTF-xxx that supports this UNICODE 5,+ stuff - they are going to take special means to alert their partners that they need software capable of supporting this."

Pardon my possible ignorance, but does anything in Unicode 5 require an encoding other than UTF-8? I thought all higher plane characters can be represented just fine in UTF-8?

Posted by Andrew Thompson on 20 Feb 2008 @ 08:33pm UTC #

John,

You argue that it is an issue of justice, but in what sense is justice improved by making the change in place at version 1.0? It just makes systems that claim to implement the same version (of XML or XPath or whatever) incompatible. XML 1.1 is unpopular not because its version is not 1.0 but because it contained incompatible changes, notably the white space rules and C1 characters. You can fix the problems in 1.1 by having a 1.2, not by rewriting history and changing 1.0.

David Webber,

There appears to be some confusion in your comment: the UTF-8 encoding isn't relevant here, since UTF-8 can encode all characters in any version of Unicode. There is no proposed change in encodings, or in the characters allowed in element or attribute content, only a change in the characters allowed in element, attribute, and ID names.

Buck,

Anyone have a better bug fix? :-)

yes, make the same change but call it version 1.2

Posted by David Carlisle on 25 Feb 2008 @ 04:23pm UTC #

The reason XML 1.1 failed is not merely that it's backwardly incompatible with XML 1.0. It's that nobody needs it and nobody wants it. Asserting that there are roughly 15 million people in the world who speak a language not representable in Unicode 2.0 is not the same as proving the existence of a significant number of people who want to write XML element and attribute names (and validated name tokens) in these languages. The lack of a target market is the real reason XML 1.1 failed. The upgrade costs weren't large; but small as they were, they were noticeably larger than the benefits of migrating to pretty much everyone who actually writes XML at a tag level.

The process abuse is a real issue. That much could be cured by renaming this spec XML 1.2, but that still wouldn't make it a success. It would still flop just as badly as 1.1 did.

The only way a new version of XML could successfully introduce this change would be if it provided significant additional benefits to enough users to outweigh the costs of migrating. The working group has consistently avoided making substantive changes that might affect large numbers of developers. They believe that introducing other changes would make it harder to push through the one change they really care about, but they're wrong. Consumers will not accept changes that offer them no benefit unless they're accompanied by changes that actually benefit them. Whether you call this XML 1.1, 1.0 fifth edition, 1.2, 2.0 or 2000, this proposal will continue to fail as long as it offers no reason for the vast majority of XML developers to switch.

P.S. The XML 1.1/5th edition rules are broken irrespective of the cost of conversion. There is no good reason to allow undefined Unicode characters in XML names or to allow musical symbols and other non-alpha-numeric characters in names. The proposed fifth edition rules cure some of the technical objections to XML 1.1, but by no means all of them.

Posted by Elliotte Rusty Harold on 23 May 2008 @ 02:35pm UTC #