XML 2.0? No, seriously.

Volume 11, Issue 23; 20 Feb 2008; last modified 08 Oct 2010

Maybe it's madness to consider XML 2.0 seriously. The cost of deployment would be significant. Simultaneously convincing a critical mass of users to switch without turning the design process into a farce would be very difficult. And yet, the alternatives look a little like madness too.

Design and programming are human activities; forget that and all is lost.

B. Stroustrup

I found three topics on my desk simultaneously last week.

  1. The proposal to amend the character set of XML 1.0 identifiers by erratum.

  2. The proposal to deploy CURIEs, an awkward, confusing extension of the QName concept.

  3. A thread of discussion suggesting that we consider allowing prefix undeclaration in Namespaces in XML 1.0. That's right, 1.0.

We're in an odd place.

XML has been more successful, and in more and more different arenas, than could have been imagined. But…

XML 1.0 is seriously broken in the area of internationalization, one of its key strengths, because it hasn't kept pace with changes to Unicode.

QNames, originally designed as a way of creating qualified element and attribute names, have also been used in more and more different arenas than could have been imagined. Unfortunately, the constraints that make sense for XML element and attribute names don't make sense, and are in fact unacceptable, in many of the other arenas.

And in XML, we learned that it is sometimes useful to be able to take a namespace binding out of scope.

XML 1.1 addressed some of these concerns, but also introduced backwards incompatibilities. Those incompatibilities seemed justified at the time, although they seem so obviously unnecessary and foolish now. In short, we botched our opportunity to fix the problem “right”.

What to do?

I think I could just about (have, even) accept any one of the items on that list above. Fixing the Unicode problem in XML 1.0 by erratum is stretching the definition of erratum to the breaking point, but by itself is probably an acceptable compromise. Adding pseudo-QName identifiers to the world is confusing and ugly, but by itself probably not the worst thing that could be done. And allowing XML 1.0 documents to undeclare namespace prefixes, by itself, seems sensible in retrospect.

But all three? Really?

Perhaps, dare I say it, it is time to consider XML 2.0 instead. Trouble is, if XML 2.0 gets spun up as an open-ended design exercise, it'll be crushed by the second-system effect. And if XML 2.0 gets spun up as “only” a simplification of XML 1.0, it won't get any traction. If XML 2.0 is to be a success, it has to offer enough in the way of new functionality to convince people with successful XML 1.0 deployments (that's everyone, right?) that it's worth switching. At the same time, it has to be about the same size and shape as XML 1.0 when it's done or it'll be perceived as too big, too complicated, too much work.

With that in mind, here are some candidate requirements for XML 2.0.

  1. All well-formed XML 1.0 documents that do not include an internal or external subset shall be well-formed XML 2.0 documents.

    In other words, backwards compatibility for well-formed XML documents! But it's time to move all that DTD stuff off into another specification. Maybe we can even add <!NAMESPACE in XML 2.0 DTDs. If that spec ever gets written.

  2. The XML 2.0 specification shall be no longer than the XML 1.0 specification.

    In other words, you can't add seventy-three new whiz-bang features. You can't do anything that will require more prose to explain than you can remove by taking out DTD syntax.

  3. All XML 2.0 documents shall support XML Namespaces.

    In other words, what most of the XML world already requires. The experiment is over, namespaces won. Like it or not.

  4. XML 2.0 shall define a mapping from QNames to URIs.

    In other words, db:para ≡ (http://docbook.org/ns/docbook, para) ≡ http://docbook.org/ns/docbook#para, by definition. (For xmlns:db="http://docbook.org/ns/docbook"; and we can argue about the precise mapping rules later.)

  5. XML 2.0 shall allow QNames to represent a broader range of values.

    In other words, isbn:1234 is too useful to forbid. But we're still not allowing it as the name of an element or attribute.

  6. XML 2.0 shall provide an unambiguous, context-insensitive lexical form for QNames.

    In other words, it will be possible to represent any XML 2.0 document without any namespace declarations at all. I've given some thought to how I think this might be done.

  7. XML 2.0 shall do away with the requirement that documents can have only a single root element.

    In other words, make document = extParsedEnt. Perhaps this is only a plausible requirement, but the fact is that many tools, like XSLT, are already comfortable with such instances and I'm going to take advantage of it in the next item.

  8. XML 2.0 shall address the problem of named character references.

    In other words, making it possible to write &nbsp; or &Exists; even in documents that don't have any entity declarations. The actual notation wouldn't have to use “&” but it might as well.

    I have in mind a proposal for this:

    <xml:entity name="nbsp" text="&#160;"/>
    <xml:entity name="Exists" text="∃"/>
    <xml:entity href="myentities.xml"/>

    As a matter of simplicity, I'm pretty confident I want to treat these new entities like the old ones, and like CDATA sections, and say that they are purely an authoring convenience; they don't survive parsing. In fact, I'm not even sure the parser has to report those elements, it can consume them as it goes.

    That means you have to have a facility like XSLT 2.0's character maps to put them back at serialization time, if you want them back. Yes, I know this is still an inconvenience for some, but the alternative would require that all XML tools grow support for entity reference objects and that seems inconvenient for far more people.
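
    To make the "authoring convenience" idea concrete, here is a minimal sketch of a preprocessor that consumes xml:entity elements and expands the references they define. Everything in it is an assumption for illustration: the function name expand_entities, the regular expressions, and the decision to work on raw text rather than inside a real parser. The href form is deliberately not handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical syntax from the proposal: <xml:entity name="..." text="..."/>
ENTITY_DECL = re.compile(r'<xml:entity\s+name="([^"]+)"\s+text="([^"]*)"\s*/>\s*')
# A named reference such as &nbsp; (numeric references like &#160; do not match).
NAMED_REF = re.compile(r'&([A-Za-z_][\w.-]*);')
# The five entities every XML parser already knows; leave them alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def expand_entities(source: str) -> str:
    """Drop xml:entity declarations and expand the references they define."""
    table = {}

    def collect(match):
        # Record the declaration and remove it from the output.
        table[match.group(1)] = match.group(2)
        return ""

    body = ENTITY_DECL.sub(collect, source)

    def expand(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        if name not in table:
            raise KeyError(f"undeclared entity: {name}")
        # The replacement text is inserted verbatim, so text="&#160;"
        # leaves a numeric reference for the downstream parser to resolve.
        return table[name]

    return NAMED_REF.sub(expand, body)


source = (
    '<xml:entity name="nbsp" text="&#160;"/>\n'
    '<xml:entity name="Exists" text="∃"/>\n'
    '<p>a&nbsp;b &Exists; &amp;</p>'
)
expanded = expand_entities(source)
paragraph = ET.fromstring(expanded)  # now plain XML 1.0
```

    The output is ordinary XML 1.0 with no trace of the entity names, which is the point: nothing downstream needs an entity reference object, and nothing downstream can put &nbsp; back without something like a character map.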

I think it is possible to address the requirements I've outlined without doing undue violence to existing applications. From an API perspective, I think the worst part will be dealing with QNames as first-class objects. It will mean, for example, that attribute values become lists. In the simple case, a list of one text node, but for attributes that contain QNames (in their context-insensitive format), a list of (text|QName)*.
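
As a rough sketch of what that might look like from an API, the following splits an attribute value into a list of text and QName items. It assumes a {uri}local spelling for the context-insensitive form, which is only one possible notation, and it joins namespace and local name with “#” purely because that matches the db:para example above; the precise mapping rules remain open. The names QName, to_uri, and parse_attribute_value are hypothetical.

```python
import re
from typing import List, NamedTuple, Union


class QName(NamedTuple):
    """A (namespace URI, local name) pair treated as a first-class value."""
    namespace: str
    local: str

    def to_uri(self) -> str:
        # Assumed mapping: namespace + "#" + local name (see requirement 4).
        return f"{self.namespace}#{self.local}"


# Assumed context-insensitive form: {namespace-uri}local-name
CONTEXT_FREE_QNAME = re.compile(r'\{([^{}\s]+)\}([^\s{}]+)')


def parse_attribute_value(value: str) -> List[Union[str, QName]]:
    """Split an attribute value into a list of (text | QName) items."""
    items: List[Union[str, QName]] = []
    position = 0
    for match in CONTEXT_FREE_QNAME.finditer(value):
        if match.start() > position:
            items.append(value[position:match.start()])
        items.append(QName(match.group(1), match.group(2)))
        position = match.end()
    if position < len(value):
        items.append(value[position:])
    return items


simple = parse_attribute_value("plain text")
mixed = parse_attribute_value("see {http://docbook.org/ns/docbook}para here")
```

In the simple case the caller gets a one-item list holding a string. In the mixed case the caller has to be prepared for QName items between the strings, and that is the part existing APIs would feel.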

In my optimistic moments, I imagine that XML 2.0 could thread the needle between insufficient value to motivate transition and so much complexity that it can't possibly succeed. Though whether a committee could thread this particular needle (with this particular camel) is an open question.


9. XML 2.0 shall do away with insignificant whitespace (in content). Pretty-printing by putting the insignificant whitespace inside the tags and having all element content whitespace be significant would be nicer than having to rely on a DTD or xml:space.

Maybe pushing it too far? Or is this just something that most people don't care about?

—Posted by Steven on 20 Feb 2008 @ 04:46 UTC #

1, 2, 3, 6, 7, 9 : yes

4, 5, 8: maybe

7: hmm, this would be a considerable barrier to adoption, especially if mixing of text and QNames is allowed in the attributes

—Posted by Mark on 20 Feb 2008 @ 06:35 UTC #

Doesn't (3) conflict with (1)?

DTDs should definitely be evicted from XML.

XML should embrace the Infoset, it should be a syntax for trees rather than a syntax that implicitly happens to only allow trees. (This would make it much less painful to explain to people that in reality 95% of the XML technologies out there are in fact Infoset technologies. IPath. ISLT. IQuery. SAI. AJAI.)

maybe add XML Base and XInclude, which could be optional, but at least would deserve better support than they have today.

—Posted by dret on 20 Feb 2008 @ 07:06 UTC #

Good catch, I guess 3 does conflict with 1, just a bit. I'm willing to reject both DTDs and documents that have names with colons in them that aren't conformant to Namespaces in XML.

—Posted by Norman Walsh on 20 Feb 2008 @ 07:22 UTC #

Unfortunately, I think that XML 1.0 has already done its damage, so I'm not sure what purpose XML 2.0 would serve. Or, to put it metaphorically, after having drunk a bottle of XML 1.0 and awakened groggy in a dumpster with a few days of beard growth and no wallet, I'm not that eager to see how XML 2.0 tastes...

If anything is to be done other than to move on, less may be more, e.g., going back to the namespaces spec and removing any and all ambiguities.

—Posted by Paul Brown on 20 Feb 2008 @ 07:50 UTC #

See XML Skunk Works for a good base document for this effort. It is XML - DTD + XML Base + XML Infoset + XML Namespaces, and only 41 pages long (vs. 47 pages for XML 1.0). Removing the 4th edition naming rules would probably save another page or two. So that's all good. We could also yank attribute normalization, which Tim has admitted was a mistake. Adding prefix undeclaration and CURIEs would be cheap.

What else would we want? I'd like to see elements within start-tags, where <a href="foo"> is just shorthand for <a <href>foo</href>>, where the href sub-element can contain perfectly general XML. That would require some extensions to SAX, to be sure.

—Posted by John Cowan on 20 Feb 2008 @ 10:25 UTC #

Yes, once you've opened up to the idea that attributes have structured values, then you can go the whole way and let them contain full-blown XML. The logical thing to do is treat elements as if they have some number of XML-content children, exactly one of which is anonymous.

Credit where it's due: James explained that to me in a bar in Bangkok in '99. Blew my mind at the time.

Still, I think I'd file that as one of the seventy-three whiz-bang features we don't do. In XML 2.0, anyway.

—Posted by Norman Walsh on 20 Feb 2008 @ 10:36 UTC #

7, 7, 7 !



No, Seriously. 7.

—Posted by Terris Linenbach on 21 Feb 2008 @ 07:51 UTC #

"Yes, once you've opened up to the idea that attributes have structured values, then you can go the whole way and let them contain full-blown XML."

Hmmm... or json-like...

<foo list=['a', 'b', 'c'] num=5, str="5">...</foo>

etc... Seriously, I think this is stretching it. It would be nice for simple cases maybe, but excessive nesting, and certainly full-blown XML kind of defeats the purpose of having attributes in the first place.

after previewing this comment: ouch, this does look tempting (but then again: don't a lot of bad ideas? ;-) But I'd go for this instead of full-blown XML if I had to choose)

—Posted by Steven on 21 Feb 2008 @ 10:05 UTC #

Hi Norm.

Just to add to the discussion, I am surprised about 5). A QName is the name of an element or attribute. One can use it for other purposes, but I don't really see the point of allowing a QName to start with a digit while still forbidding element names from doing so.

What about allowing prefix:* as a QName? And *:local? XPath uses QNames a lot, as well as other things, even things that look and smell like QName, but those are not.

Maybe those usages that are often used might be defined in a common dictionary, but I wonder if this is really the point of the XML recommendation?

-- Florent Georges

—Posted by Florent Georges on 21 Feb 2008 @ 01:33 UTC #

I always liked the idea that once an XML parser sees a document's first tag, it has a simple, unambiguous way to know when it's reached the end of the document, and 7 would do away with this. (Right?)

I was also going to ask what you'd cut if you were going to add these features without increasing the length of the spec, but I'm glad to see that John Cowan already has some good ideas there.


—Posted by Bob DuCharme on 21 Feb 2008 @ 03:35 UTC #

4. URI + local name to get longer URI. Fine, you do say "and we can argue about the precise mapping rules later" but wouldn't picking any mapping rules other than those of RDF (straight concatenation without extra '#'s or '/'s) cause more confusion than they're worth?

—Posted by Ed Davies on 21 Feb 2008 @ 04:52 UTC #

Blindness error:

Please swap 6 and 7 in my above post (#2)

7: maybe

6: is complexity worth it?

—Posted by Mark on 21 Feb 2008 @ 06:43 UTC #

Some thoughts on 4 and 6:

- in the linked article (6), you use <{uri}name>. While I'm a fan of the James Clark notation and APIs that let you use it (Python ElementTree), if you're having the fixed QName to URI mapping (4) anyway, can't you do away with the special notation? (as you can just use the full URI now?)

- if that is true, <uri> is very close to an unambiguous form not only for QNames, but for URI's in general (which is already a quite common notation, it's even in RFC2396 itself...). I think URI's ARE important enough to get special treatment... (although I'm generally against datatyping/binding in XML, contrary to what you might think from my previous comment). And if allowed for relative URI's, it would make xml:base much more useful...

- there's something nagging me about your QName to URI mapping, CURIE's, namespace prefixes+localnames, relative URI's and xml:base, but I can't quite put my finger on it. It has to do with all coming down to absolute URI's, and that this should be used somehow. Maybe it's just throwing away xml:base and using prefix declarations as a more general/powerful mechanism everywhere, I don't know... Can you go as far as saying that QNames ARE URI's?

—Posted by Steven on 21 Feb 2008 @ 11:56 UTC #

XML 2.0 is like Olympic Games, it comes back every two to four years :)

John: 6 pages is a lot of room in your Skunk! What about adding xml:id? And what about a more pragmatic way to bind stylesheets (XSLT, XQuery, CSS, FO, etc.) and schemas (such as DTD, XSD, RNG, RNC, Schematron, etc.)?

My two euro-cents

—Posted by Innovimax on 22 Feb 2008 @ 06:39 UTC #

<xml:entity name="nbsp" text="&#160;"/>

You describe this as a solution for character references, but if people can do that, they will probably also do this

<xml:entity name="name" text="David Carlisle"/>

In which case you may as well let them do

<xml:entity name="name"><name><given>David</given> <family>carlisle</family></name></xml:entity>

That is, provide a (namespace aware) mechanism for all parsed entities not just character entities and document includes.

—Posted by David Carlisle on 28 Feb 2008 @ 11:12 UTC #

Clarifying that 'namespaces in xml, 1.1' means "version 1.1 of namespaces", not "namespaces in xml 1.1" and thus, that spec can be used with xml 1.0, and thus, namespace undeclaration is already possible, might help. IMHO.

—Posted by Chris Lilley on 08 Apr 2008 @ 03:55 UTC #

I'm with you on #1-3.

Not sure about #4 and #5. Why are you picking that syntax over something like <http://docbook.org/ns/docbook::para>

As for #7, what if they created a compound XML document that consisted of multiple individual XML documents? Each individual root element would be considered a separate document, but they could be contained in a single file?

Re: #8, I don't understand what the advantage would be. Isn't this just a new syntax for entities?

Re: structured attribute values... this looks like it could complicate things needlessly. I would hate to read a document with large amounts of XML in attributes. If you need to stuff structure into attributes couldn't you use wiki-like or bbcode delimiters? Or the CSS-like notation of style elements? In fact this could be standardized and could be referenced from other parts of the XML document... but it would not be treated as ordinary XML.

—Posted by Rob on 17 May 2008 @ 04:51 UTC #

XML 1.0 is broken with respect to internationalization, but not at all seriously. I've never once seen an actual instance of a problem caused by XML's internationalization problems.

There are far more serious issues in XML that could and perhaps should be resolved, mostly revolving around DTDs, the internal DTD subset, and entities. Namespaces and document fragments are also arguably broken.

If one were to resolve these issues in a new version of the spec, then one might as well fix up the trivial internationalization problems too. However XML 1.1 proved that internationalization is not a problem worth fixing in isolation.

—Posted by Elliotte Rusty Harold on 23 May 2008 @ 02:40 UTC #