I think the goal for XML 2.0, if there ever is one, should be to simplify XML in the same way that the goal for XML was to simplify SGML.

XML 2.0. In any of several flavors, it's been the subject of hundreds of messages on xml-dev. Lots of folks have written about it; I've kept track of at least six essays on the topic, going all the way back to 2000:

There are big gaps in that list; surely someone wrote about it in 2001 and 2003. I don't pay that much attention because I'm not convinced that XML 2.0 is a good idea. The complete failure of XML 1.1 doesn't leave me very optimistic, but maybe a big change would be more palatable than an incremental one. Certainly the potential payoff is larger.

But what is that payoff? I mean, what's wrong with XML 1.x?

Depending on your perspective, the answer to that question is probably somewhere between almost nothing and almost everything. I fall more towards the former end of the spectrum, but a lot has changed since 1998.

Change is a big part of the problem. XML 1.0 has some oddities, many the result of SGML legacy, but taken by itself isn't too bad. For better or worse, though, we don't take it by itself anymore, we take it with namespaces and inclusions and a choice of schema languages, a little bit of querying and some transformation, all sometimes wrapped up in a fancy web service. We've built up a big stack:

               WS-*
        XSLT  |  XML Query
       XPath  |  RDF/XML
    RELAX NG  |  XML Schema
  XML Base  |  xml:id  |  XInclude
          XML Namespaces
           XML Infoset
               XML

That sure is an awful lot of…stuff heaped on top of those three little letters. I think the goal for XML 2.0, if there ever is one, should be to simplify XML in the same way that the goal for XML was to simplify SGML.

So, what do I think that would look like?

One simplification we would make is editorial: an XML 2.0 specification would unify XML, XML Namespaces, XML Infoset, XML Base, and xml:id into a single document.

Next, we'd tackle a significant bit of SGML legacy: removing the syntactic privileges afforded DTDs. In XML 2.0, there would be no “<!DOCTYPE>” declaration, no entities (except the built-in entities and their close cousins, numeric character references), no attribute or element types of any kind, and no fixed or default values for attributes. In XML 2.0, documents would be either well-formed, or they wouldn't be XML.

I'd like to be clear: I've got nothing against DTDs. I'd be happy to work on a DTD V2.0 specification that described DTD validation of XML 2.0 documents. You just wouldn't have a <!DOCTYPE> declaration, so you'd have to associate the DTDs with documents in some other way, just like you associate RELAX NG Grammars and W3C XML Schemas in some other way.

Now, I've just screwed all the mathematicians (and other folks) by taking away their named character entities and I can see David Carlisle wincing out there in the audience. Bear with me, I have an answer for that problem this time (unlike last time).

My proposal for solving the entity problem is going to involve namespaces, so let's make some simplifications there, too. A radical simplification would be to simply throw them all out, declare defeat and try to invent something new to solve the naming problems. Or maybe try to convince the world that the naming problem doesn't exist, that the fact that <p> is sometimes TEI and sometimes HTML isn't a problem in practice. I'm not going to start out that radical. I'm just going to try to round off some of namespace's sharper corners.

In XML 2.0, all documents would be namespace aware. Furthermore, the “null namespace,” the namespace in which elements appear if there is no namespace declaration, would have an explicit URI (and could, consequently, be associated with a prefix). This reduces all of the magic of the “null namespace” to simply a question of a default declaration. We could go a step further and simply outlaw the null namespace, but that seems a bit extreme to me.

Ignoring <!DOCTYPE> declarations and a few wrinkles between XML 1.0 and XML 1.1, so far, all well-formed, namespace-aware, XML 1.x documents would be XML 2.0 documents, simply by changing the version in the XML declaration. If the null namespace was outlawed, you'd have to add a namespace declaration to the top of all the documents. That seems cumbersome. On the other hand, the Web Architecture document says that all elements should be in a namespace.

Anyway, for the moment, I'm not going that far.

So that means:

  <?xml version='2.0'?>
  <doc/>

and

  <?xml version='2.0'?>
  <doc xmlns="http://the-uri-for-the-default-namespace/"/>

and

  <?xml version='1.0'?>
  <x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>

are all logically the same document.
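A rough way to see that equivalence with today's tools is to map "no namespace" onto the explicit URI and compare the resulting names. The sketch below uses Python's standard XML 1.0 parser, so the XML declarations are left out, and `NULL_NS` is simply the placeholder URI from the examples above, not a real assigned URI. The helper name `expanded_name` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder URI copied from the examples above; not a real, assigned URI.
NULL_NS = "http://the-uri-for-the-default-namespace/"

def expanded_name(tag: str) -> str:
    """Return the tag in {uri}local form, mapping 'no namespace' to NULL_NS."""
    return tag if tag.startswith("{") else "{%s}%s" % (NULL_NS, tag)

docs = [
    '<doc/>',
    '<doc xmlns="http://the-uri-for-the-default-namespace/"/>',
    '<x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>',
]

# All three root elements collapse to a single expanded name.
names = {expanded_name(ET.fromstring(d).tag) for d in docs}
print(names)
```

Under that assumption the null namespace stops being a special case: it is just one more URI that happens to be the default when nothing is declared.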

That's a bunch of simplification. Now let's tackle a real technical challenge: QNames in content. I think the right answer here is to raise the stature of QNames so that they're first class objects in XML 2.0. XML 2.0 would have Document, Element, Attribute, Processing Instruction, Character, Comment, Namespace, and QName Information Items.

For legacy (and authoring!) convenience, we'd keep the existing QName forms for element and attribute names, but we'd also introduce unambiguous lexical forms for QNames: in XML 2.0, <{uri}name> would be a well-formed serialization of a QName with the namespace name “uri” and the local name “name”.

What does this really mean? The big problem with QNames in content is that the parser can't tell where the QNames are. Consider the following example, where the intent is that “a:localname” is a QName:

  <?xml version="1.0"?>
  <doc xmlns:a="http://example.com/xmlns/a">
  What about the QName a:localname?
  </doc>

An XML 1.0 parser can't actually determine that “a:localname” is a QName. In XML 2.0, we would fix that:

  <?xml version="2.0"?>
  <doc xmlns:a="http://example.com/xmlns/a">
  What about the QName <{http://example.com/xmlns/a}localname>?
  </doc>

The Infoset for this document consists of a Document Information Item containing a single Element Information Item containing 22 Character Information Items followed by a QName Information Item followed by 2 more Character Information Items.
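To make that count concrete, here is a minimal sketch of how a parser might split element content into Character and QName items. No XML 2.0 parser exists, so the regular expression and the `content_items` function are assumptions for illustration only; they recognize just the `<{uri}name>` form.

```python
import re

# Hypothetical lexical form from the proposal: <{uri}local>
QNAME = re.compile(r"<\{([^}]*)\}([^<>\s{}]+)>")

def content_items(text):
    """Split element content into ('char', c) and ('qname', uri, local) items."""
    items, pos = [], 0
    for m in QNAME.finditer(text):
        items.extend(("char", c) for c in text[pos:m.start()])
        items.append(("qname", m.group(1), m.group(2)))
        pos = m.end()
    items.extend(("char", c) for c in text[pos:])
    return items

# The content of <doc> in the example, including its leading and trailing newlines.
content = "\nWhat about the QName <{http://example.com/xmlns/a}localname>?\n"
items = content_items(content)
kinds = [item[0] for item in items]
print(kinds.index("qname"), len(items))
```

The 22 leading characters are the newline after the start tag plus the 21 characters of "What about the QName "; the 2 trailing ones are the question mark and the final newline.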

The “<{uri}name>” form is unambiguous, but it's awfully tedious for the author, so we'd provide a prefix form as well. As a convenience, <:p:name> would be a well-formed serialization of a QName with the namespace name currently bound to the prefix “p” and the local name “name”. So this would be equivalent:

  <?xml version="2.0"?>
  <doc xmlns:a="http://example.com/xmlns/a">
  What about the QName <:a:localname>?
  </doc>

These forms are allowed in element content and attribute values. This means that attribute values don't consist only of Character Information Items, they consist of Character and QName Information Items.

What's gained here is that the QNames in content can be recognized by the parser, so we aren't “hiding” QName values, making general tools blind to which namespace declarations are actually used.
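As a sketch of what a general tool could then do, the snippet below finds both proposed forms in a string, resolves prefixes against the in-scope bindings, and reports which declared namespaces are never used. The regular expression, the `qnames_in` function, and the extra `b` binding are all hypothetical and only illustrate the idea.

```python
import re

# Both hypothetical forms from the proposal: <{uri}local> and <:prefix:local>
QNAME = re.compile(r"<(?:\{([^}]*)\}|:([^:<>\s]+):)([^:<>\s{}]+)>")

def qnames_in(text, bindings):
    """Return (uri, local) for every QName in text, resolving prefixes."""
    found = []
    for m in QNAME.finditer(text):
        uri, prefix, local = m.groups()
        if uri is None:
            uri = bindings[prefix]  # raises KeyError if the prefix is unbound
        found.append((uri, local))
    return found

# In-scope bindings; "b" is an extra, unused declaration added for the demo.
bindings = {
    "a": "http://example.com/xmlns/a",
    "b": "http://example.com/xmlns/b",
}
long_form = "What about the QName <{http://example.com/xmlns/a}localname>?"
short_form = "What about the QName <:a:localname>?"

used = {uri for uri, _ in qnames_in(short_form, bindings)}
unused = set(bindings.values()) - used
print(qnames_in(long_form, bindings) == qnames_in(short_form, bindings), unused)
```

With XML 1.x, a tool cannot compute `unused` safely, because a prefix might be referenced from text that only the application knows how to interpret.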

It's this syntactic form that provides an answer to the character entity problem. Now we can define a namespace with the semantics that QNames in that namespace represent characters. For example http://www.w3.org/2003/entities/iso8879/isonum for the ISO Numeric and Special Graphic characters.

To write an “·” (middle dot) where I don't have a glyph for it, or a convenient way to insert that glyph, I can write <:num:middot> (or <{http://www.w3.org/2003/entities/iso8879/isonum}middot> if I don't have a prefix bound). And because these lexical forms are recognized in both element and attribute values, I can put them anywhere I want. I concede that “<:num:middot>” isn't quite as easy to type as “&middot;”, but it's not a lot harder and I don't think it's more difficult to read.
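One way to picture the processing side: a table keyed by (namespace, local name) that maps character-naming QNames to characters. The table below is a tiny illustrative subset, and `expand_characters` is a made-up helper, not part of any existing library. QNames that do not name a known character are left untouched.

```python
import re

ISONUM = "http://www.w3.org/2003/entities/iso8879/isonum"

# Tiny illustrative subset; a real table would cover the whole entity set.
CHARACTERS = {
    (ISONUM, "middot"): "\u00b7",
    (ISONUM, "copy"): "\u00a9",
}

# Both hypothetical forms from the proposal: <{uri}local> and <:prefix:local>
QNAME = re.compile(r"<(?:\{([^}]*)\}|:([^:<>\s]+):)([^:<>\s{}]+)>")

def expand_characters(text, bindings):
    """Replace QNames that name a known character with that character."""
    def replace(m):
        uri, prefix, local = m.groups()
        if uri is None:
            uri = bindings.get(prefix)
        # Anything that is not a known character name stays as written.
        return CHARACTERS.get((uri, local), m.group(0))
    return QNAME.sub(replace, text)

bindings = {"num": ISONUM}
print(expand_characters("a <:num:middot> b", bindings))
```

Because the lookup is by namespace URI rather than by prefix, the same table serves the prefixed form and the full `<{uri}name>` form alike.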

We could take this even further and allow these QName forms not only in attribute values and element content, but also in “Names”. In other words, this document:

  <?xml version="2.0"?>
  <doc xmlns="http://example.com/xmlns/doc"
       xmlns:a="http://example.com/xmlns/a"
       xmlns:b="http://example.com/xmlns/b">
    <p a:att="value" b:att="value"/>
  </doc>

could be serialized like this:

  <?xml version="2.0"?>
  <<{http://example.com/xmlns/doc}doc>
    xmlns:a="http://example.com/xmlns/a">
    <<{http://example.com/xmlns/doc}p> a:att="value"
     <{http://example.com/xmlns/b}att>="value"/>
  </<{http://example.com/xmlns/doc}doc>>

I wouldn't recommend that serialization and I certainly wouldn't want to author in it, but it would allow applications to serialize any document or document fragment.
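A serializer along those lines is short to sketch. The version below leans on the fact that Python's ElementTree already stores names as `{uri}local`, which happens to be the body of the proposed `<{uri}name>` form. The `name` and `serialize` functions are illustrative assumptions, and the output is hypothetical XML 2.0 syntax that no current parser accepts.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def name(tag):
    """Write a name in the proposed <{uri}local> form; leave bare names alone."""
    return "<%s>" % tag if tag.startswith("{") else tag

def serialize(elem):
    """Serialize an element with every namespaced name fully expanded."""
    attrs = "".join(
        " %s=%s" % (name(k), quoteattr(v)) for k, v in elem.attrib.items()
    )
    children = "".join(serialize(c) + escape(c.tail or "") for c in elem)
    body = escape(elem.text or "") + children
    start = "<" + name(elem.tag) + attrs
    if not body:
        return start + "/>"
    return start + ">" + body + "</" + name(elem.tag) + ">"

source = (
    '<doc xmlns="http://example.com/xmlns/doc"'
    ' xmlns:a="http://example.com/xmlns/a"'
    ' xmlns:b="http://example.com/xmlns/b">'
    '<p a:att="value" b:att="value"/></doc>'
)
out = serialize(ET.fromstring(source))
print(out)
```

Since every name carries its own namespace, the output needs no namespace declarations at all, which is what makes this form suitable for serializing fragments out of context.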

Michael Sperberg-McQueen pointed out that a slight syntactic extension would allow you to specify the prefix as well. This would be handy, for example to deal with the way the XQuery 1.0 and XPath 2.0 Data Model has implemented QNames as triples. I'm not sure this is necessary, but it might be a good thing.

On the whole, I think these proposals are a net simplification. I have some reservations about adding QName Information Items, and particularly about allowing them in attribute values, but I haven't thought of a better solution to the QName mess. And if XML 2.0 is worth doing at all, I think it's only worth doing if it is simpler than XML 1.0 and solves the QName mess.

There's some more work we can do around the margins: clarify the semantics of xml:lang and xml:space attributes, perhaps allow documents to have multiple top-level elements, removing the distinction between documents and external parsed entities (which don't exist anymore), and maybe something about a binary format, depending on how that work plays out.

If you're an XML grease monkey, you can probably think of a few more things, but let your mantra be “simplify”. Repeat after me: no new features.


[1] I've been thinking about this for a while. My thoughts aren't really any more coherent today, but it occurs to me that there's really no better time to publish this than the week before XML 2004 where we'll all be hanging out in the bar looking for things to chat about anyway.

Comments:

What about the QName ? The only thing that smells iffy is the form Norm. How about to keep the parser writers happy? I can see why the first colon is needed, but for aesthetics and authoring habits, keeping qnames a bit like empty elements seems reasonable? I'll look forward to more articles on this.

Posted by Dave Pawson on 11 Nov 2004 @ 04:59am UTC #

I don't like the syntax. It looks too much like an unbalanced start tag. Sure, it's not syntactically ambiguous to a computer, but humans' ability to check the syntax is more important. Since entities are going away anyway, I suggest &{uri}name; and &prefix:name; instead. If that's a dumb idea, I'd be happy to hear why.

Other than that, I think the proposals are great.

Posted by Jeffrey Yasskin on 11 Nov 2004 @ 05:00am UTC #

I avoided the "&" escape character because it's already used for the built in entities (&lt;, etc.) and numeric character references (&#160; etc.). I suppose that &{uri}localname; and &:p:localname; are still possibilities, it just seemed like a lot of overloading on that character.

I was aware that the forms I chose look a lot like start tags. I even toyed with <{uri}localname/> but I decided that made the problem even worse.

I think I could go either way.

Posted by Norman Walsh on 11 Nov 2004 @ 06:09am UTC #

Nice. Some obs.
"In XML 2.0, all documents would be namespace aware. Furthermore, the “null namespace,” the namespace in which elements appear if there is no namespace declaration, would have an explicit URI (and could, consequently, be associated with a prefix). This reduces all of the magic of the “null namespace” to simply a question of a default declaration. We could go a step further and simply outlaw the null namespace, but that seems a bit extreme to me."
xmlns="" would do fine for the null namespace - think of the months you could save not arguing about which scheme to use.
And drop default namespaces too - they suck.
I could live with the 'QClark' syntax, but maybe you should use a closing tag form for consistency. The truth is that allowing macros to work over markup and content is always going to be messy.
What you haven't done is verify that this will allow QNames and namespaces to roundtrip. That's a very important thing to get right this time.

Posted by Bill de hÓra on 11 Nov 2004 @ 09:47am UTC #

The problem with using "" for the null namespace is that it would break the ability (important to some, I'm sure) to undeclare a namespace. As for dropping default namespaces, I could live with that, but I bet lots of folks would object. Lots of documents are only in a single namespace and using names without colons for those cases is appealing.

As for round-tripping, I'm pretty confident my proposal manages that. The parser can recognize all the QNames and it can serialize all the QNames. At least if they're in one of the new syntaxes.

Posted by Norman Walsh on 11 Nov 2004 @ 03:57pm UTC #

I'm honou?red, you want to force the whole world through the pain of a version transition, just to inflict pain on me!

Mike Kay suggested on xml-dev recently (and before) that the {uri}local name ought to be allowed _everywhere_ that Qnames are recognised including element start and end tags, schema declarations, XPath steps etc so that you could write out everything without carrying namespace context if you need to. That would clash with your suggested Qname in content markup as M's suggestion would make that a legal start tag. (But that's just syntax....)

Posted by David Carlisle on 11 Nov 2004 @ 04:18pm UTC #

Whoa: déjà vu [1,2,3]! :-) I agree that the entity syntax is a bit harsh [4], and I think consistency with other entity syntaxes is important. However, your solution for defining all entities seems pretty good. A few possible issues with your idea may include:

• How are the mappings between entities and XML content defined? That is, how is &#xB7; defined to be <{http://www.w3.org/2003/entities/iso8879/isonum}middot>?
• Many XML processors won't necessarily have the ability to access the internet and download the definition files - a way to specify a local copy (public id?) is helpful.
• What happens if the entity can't be resolved for some reason or another? XML 1.x's fatal error behavior is often quite annoying: a fallback mechanism would be nice.

In addition, a possible alternative syntax that doesn't hurt existing XML 1.x processors (and looks nicer, IMHO) looks like [based on 3]:

<animals
    xmlns="http://example.com/animals/"
    xmlns:ent="http://example.com/entity/ns/"
    ent:cls="http://purl.org/stuff/colors/"
    example="%cls:rainbow;"
    >
    <dog
        xmlns="http://purl.org/stuff/dogs/"
        paw="%cls:golden;"
        >
        <nose>%cls:brown;</nose>
    </dog>
</animals>

Of course, that's an ugly hack and requires an additional %pc; escape sequence. However, it could be used to test your approach with existing XML processors via XSLT or some pipe.

Notes:

P.S. Your software barfed on the % symbol - I had to use an entity for it. Odd, since it is valid xml.

[1] http://norman.walsh.name/2004/11/10/xml20#p32
[2] http://dannyayers.com/archives/2004/11/05/exorcising-qnames/
[3] http://slashdot.org/~Quantum Jim/journal/89855
[4] http://norman.walsh.name/2004/11/10/xml20#comment0002

Posted by Jimmy Cerra on 11 Nov 2004 @ 04:37pm UTC #

For character entities, just add to the predefined set. Take everything from HTML 4 and others if needed (such as MathML?). That list really isn't big since the representation can be very simple. Voila. No need to have external references at all, and I don't have to wonder how to say &middot; today.

Of course, this doesn't address the general QName problem, but I'd really rather not put these two issues together.

Everyone with extra needs could just use better editors. Or, as you suggest, specialized applications could still use Elements or QNames with special significance, but there would be no need to confuse them with character data.

Posted by Tom on 11 Nov 2004 @ 09:00pm UTC #

Hello,
There are some nice ideas, but there is a long, long way to go, and the form could become better, so let's wait...
But the good points are :
* merging {XML, XML Infoset, XML Base, XML:id, XML Namespace}
* removing DTD Infoset (fixed attribute value, entity definition, DOCTYPE call)

The brace notation {} looks good, but we need to introduce two new predefined entities ( &lb; { and &rb; } )
But why throw away the entity notation?
&{http...}foo; looks good
And why not change the http dummy protocol?
Why not introduce an xml:// protocol to really make the difference?
But I saw a bug
You wrote this

<?xml version="2.0"?>
<<{http://example.com/xmlns/doc}doc>
xmlns:a="http://example.com/xmlns/a">
<<{http://example.com/xmlns/doc}p> a:att="value"
<{http://example.com/xmlns/b}:att>=”value”/>
<<{http://example.com/xmlns/doc}doc>>

but I think you mean this

<?xml version="2.0"?>
<<{http://example.com/xmlns/doc}doc>
xmlns:a="http://example.com/xmlns/a">
<<{http://example.com/xmlns/doc}p> a:att="value"
<{http://example.com/xmlns/b}:att>=”value”/>
</<{http://example.com/xmlns/doc}doc>>
 ^
 +---- here is the missing slash

Cheers, and keep going

Posted by Xmlizer on 12 Nov 2004 @ 12:18am UTC #

We don't need new built in entities for curly braces because they aren't recognized except immediately after "<" (or perhaps "&"). There's nothing special about { this }.

Posted by Norman Walsh on 12 Nov 2004 @ 07:07am UTC #

Good Luck getting rid of Maths entities!

When people are writing XPaths etc, they will now need to know which part is a Qname and do something extra? Sounds complicated.

And wouldn't it mean that a language with a Qname could never be an XML Schemas simple type, because it has structure?

Everyone also would have to abandon any idea that attributes are always simple types too. This seems an enormous change, in practice. Surely the point of having a layered system of specifications is to allow one layer to change without requiring change in all the other layers?

And would it mean that every existing XML language that uses Qnames somewhere in data values would have to be redefined?

So I wouldn't characterize something that requires a change to the XML Infoset, the PSVI, XML itself, XML Schemas type system and components and syntax, XQuery, XSLT1, XSLT2, etc, as well as changes to many other languages and systems as being minimal.

A goal of XML 2.0 should be that one can replace a current XML parser that generates SAX with an XML 2.0 parser and also generate SAX. That is layering. Everything hinges on building in the Maths and standard entity references into the basic language. Then you can get rid of DOCTYPE declarations. Without that first step, no progress can happen. The people who have trouble with Qnames are a tiny minority: leave that for XML 3.0 please!

May I recommend Schematron's way again? See http://lists.w3.org/Archives/Public/www-tag/2002Jun/0183.html

By the way, XML 1.1 cannot be a failure unless it has no eventual uptake from the people who need its facilities. Have the Ethiopians etc. complained? If the goal of XML 1.1 was to attract people from XML 1.0, then it would have addressed more popular issues. In fact, XML 1.1 was designed to prevent a potential handicap for some underdeveloped countries more than meet a current worldwide demand for Unicode 4. Cheers Rick

Posted by Rick Jelliffe on 12 Nov 2004 @ 10:54am UTC #

For the record: I am not trying to remove character entities out of some perverse desire to make life harder for the people who need them :-)

Posted by Norman Walsh on 12 Nov 2004 @ 02:34pm UTC #

While we are at it, can we please remove NOTATION from the XML syntax...?

Posted by Damian Cugley on 12 Nov 2004 @ 02:46pm UTC #

Is there any reason why the :p:localname syntax couldn't be used to reference entities that contained text and element content? I mean, not just for referencing single characters?

(I realize of course that some standard format would need to be used for declarations that associate the names with content, and processing apps would need to know how to parse it.)

Anyway, I don't personally care so much about the problem of how to use ISO named characters in DTD-free doc instances -- because I think there are already some great authoring tools, like your "XML Unicode" package, that obviate the need for it.

But I would really really like to have a DTD-free way to declare and reference external text/element content -- regardless of what delimiters it may use (that is, whether it's angle brackets as in your proposal or the "looks more like classic SGML entity refs" style somebody else suggested in a comment.)

And as far as the problem that somebody mentioned of how to handle getting stuff from the URI locations if you're not connected to the Net -- that's already solved: Have local copies and make your tools use a URI resolver to remap the URIs to local system paths.

Posted by Michael Smith on 18 Nov 2004 @ 09:09am UTC #

I thought Tim Bray had an elegant solution to this by moving entity processing out of the XML processing and into the encoding with his UTF-8+names proposal.

http://www.tbray.org/ongoing/When/200x/2003/10/17/UTF8-plus

And I second the removal of Notations, though I think that without DTDs they drop out anyway.

--Dethe
Posted by Dethe Elza on 22 Nov 2004 @ 09:14pm UTC #

A good article. Lots of detailed debate ahead though.

A couple of warts-on-a-wart on the QNames-in-content problem:

(a) prefixes are used in content independently of QNames, e.g. XPath uses prefix:* and XSLT has attributes containing a list of prefixes

(b) if QNames-in-content are to be understood at the parser level then you need to be able to distinguish whether absent prefix means null namespace (as in XSLT) or absent prefix means default namespace (as in XML Schema).

Posted by Michael Kay on 03 Dec 2004 @ 06:25pm UTC #

Great article, but I think you are still not addressing what I find to be one of the chief pain-points of namespaces in XML 1.x: they look like attributes, they quack like attributes, some software thinks they are attributes, and some doesn't, they look like part of the content but they aren't so you get this weird effect of them applying to their own start tag, which means you have to do arbitrary buffering just to dispatch a start tag, and it's a messy swamp all the way around.

So, first up against the wall when the revolution comes, for me, is overloading the attribute syntax for namespace declarations. I want ns decls to also be first class syntactic objects, and I want their scoping to be manifest in that syntax.

And then I consider seriously the question of namespaces and the default namespace and non-namespaces vis-a-vis attributes.

World peace to follow as an exercise for the reader.

Posted by Mary Holstege on 03 Dec 2004 @ 06:55pm UTC #

Do you suggest a common URI for the NULL namespace, wherever it is used?

I read the NULL namespace like a database NULL: That we don't know if the element has any semantics, and we don't know what the semantics are, if they exist.

The implication of a URI for the NULL namespace is that my <foo/> and your <foo/> carry the same meaning. I don't think that's what we want from a NULL namespace.

Or should the URI of the NULL namespace be the URI of the document where the NULL namespace is used? E.g., if I use a Team element with NULL namespace in http://heima.olivant.fo/~styrheim/gallery/stadium/stadium.rdf, then you could use the same element in your document, not with the NULL namespace, but with namespace http://heima.olivant.fo/~styrheim/gallery/stadium/stadium.rdf# ?

Posted by Jan Egil Kristiansen on 13 Mar 2005 @ 05:01pm UTC #

Hi, I want to know whether there is any difference in coding structure between XML 1.0 and 2.0. When I run XML code with the version changed to 2.0, I get an error in Internet Explorer. Can you guide me on how to run an XML version 2.0 example? I also want to know which browsers support XML 2.0.

Regards

Posted by Rameshwari on 20 Dec 2005 @ 11:12am UTC #

I'm not sure, but it seems that no such browsers have been released yet. Correct me if I'm wrong.

Posted by Michael Klishin on 22 Dec 2005 @ 06:22pm UTC #

As someone who has published documents using MathML, I'd be very happy to see named entity references for symbols go away. That's why we have Unicode and that is what I used.

The Unicode folks have worked very hard to continue to include more and more symbols for Mathematics and Science. In the end, if you invent some symbol, you'd be better off with some special piece of markup that indicates it is a non-standard character rather than some character reference that isn't really true.

In the end, I don't think the MathML folks are completely screwed, as I'm one of them. We've all just got a transition to make--and we have XML 1.0 for backwards compatibility if we get stuck.

In the end, we need to push the use of unicode and add or change the specifications to meet all our needs. A tall order, but the right direction.

Posted by Alex Milowski on 11 Apr 2006 @ 03:10am UTC #

I think you need &copy; etc. if you want to support non-Unicode encodings, and those are likely to be in use for many years yet.

On the encoding of namespaced tags how about:

<?xml version="2.0"?>
<"http://example.com/xmlns/doc":doc>
<"http://example.com/xmlns/doc":p "http://example.com/xmlns/a":att="value" "http://example.com/xmlns/b":att="value"/>
</"http://example.com/xmlns/doc":doc>

To me that seems much easier to read, and it should be fairly easy to parse as well. I would remove the namespace from the end tag, as it's redundant information anyway.

Posted by Henrik on 19 Jun 2006 @ 08:50pm UTC #