<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       version="pto">
<info>
<title>XML 2.0</title>
<volumenum>7</volumenum>
<issuenum>198</issuenum>
<pubdate>2004-11-10T14:34:53-08:00</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2004</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>I think the goal for XML 2.0, if there ever is one, should be to
simplify XML in the same way that the goal for XML was to simplify
SGML.
</para>
</abstract>
</info>

<para xml:id='p1'>XML 2.0. In any of several flavors, it's been the subject of
hundreds of messages on 
<link xlink:href="http://lists.xml.org/archives/xml-dev/">xml-dev</link>.
Lots of folks have written about it; I've kept
track of at least six essays on the topic, going all the way back to
2000:</para>

<itemizedlist spacing="compact">

<listitem><simpara xml:id='p2'><personname>
<firstname>Edd</firstname><surname>Dumbill</surname>
</personname> summarized recent xml-dev discussion, taking place even
as this essay was knocking around in my “ideas” bucket<footnote>
<para xml:id='p3'>I've been thinking about this for a while. My thoughts aren't
really any more coherent today, but it occurs to me that there's really no
better time to publish this than the week before XML 2004 where we'll
all be hanging out in the bar looking for things to chat about anyway.
</para></footnote>, in
<link xlink:href="http://www.xml.com/pub/a/2004/11/03/deviant.html">How Do
I Hate Thee?</link>
on 03 Nov 2004.</simpara>
</listitem>

<listitem><simpara xml:id='p4'><personname>
<firstname>Derek</firstname><surname>Denny-Brown</surname>
</personname> proposed
<link xlink:href="http://nothing-more.blogspot.com/2004/10/where-xml-goes-astray.html">Where
XML goes astray…</link>
on 12 Oct 2004.</simpara>
</listitem>

<listitem><simpara xml:id='p5'><personname>
<firstname>Liam</firstname><surname>Quin</surname>
</personname> was thinking about the
<link xlink:href="http://www.advogato.org/person/Ankh/diary.html?start=123">future of XML</link>
on 26 Sep 2004.</simpara>
</listitem>

<listitem><simpara xml:id='p6'><personname>
<firstname>Kendall Grant</firstname><surname>Clark</surname>
</personname> asked
<link xlink:href="http://www.xml.com/pub/a/2002/02/20/deviant.html">Can We
Get There From Here?</link> on 20 Feb 2002.</simpara>
</listitem>

<listitem><simpara xml:id='p7'><personname>
<firstname>Tim</firstname><surname>Bray</surname>
</personname> drafted
<link xlink:href="http://www.textuality.com/xml/xmlSW.html">XML-SW</link>
on 10 Feb 2002.</simpara>
</listitem>

<listitem><simpara xml:id='p8'><personname>
<firstname>Simon</firstname><surname>St. Laurent</surname>
</personname> pointed out
<link xlink:href="http://www.simonstl.com/articles/interop/index.html">XML's
Interoperability Problems</link>
in Jun 2000.</simpara>
</listitem>
</itemizedlist>

<para xml:id='p9'>There are big gaps in that list; surely someone wrote about it in
2001 and 2003. I don't pay that much attention because I'm not
convinced that XML 2.0 is a good idea. The
<link xlink:href="/2004/09/30/xml11">complete failure</link> of XML 1.1 doesn't
leave me very optimistic, but maybe a big change would be more palatable
than an incremental one. Certainly the potential payoff is larger.</para>

<para xml:id='p10'>But what is that payoff? I mean, what's wrong with XML 1.x?</para>

<para xml:id='p11'>Depending on your perspective, the answer to that question is
probably somewhere between almost nothing and almost everything. I fall
more towards the former end of the spectrum, but a lot has changed
since 1998.</para>

<para xml:id='p12'>Change is a big part of the problem. XML 1.0 has some oddities,
many the result of SGML legacy, but taken by itself isn't too bad. For
better or worse, though, we don't take it by itself anymore;
we take it with namespaces and inclusions and a choice of schema
languages, a little bit of querying and some transformation, all
sometimes wrapped up in a fancy web service. We've built up a
big stack:</para>

<informaltable frame="none">
<textobject><phrase>Stack o' Specs</phrase></textobject>
<tgroup cols="6" align="center" rowsep="0" colsep="0">
<colspec colname="c1"/>
<colspec colname="c2"/>
<colspec colname="c3"/>
<colspec colname="c4"/>
<colspec colname="c5"/>
<colspec colname="c6"/>
<tbody>
<row>
  <entry></entry>
  <entry namest="c2" nameend="c5">WS-*</entry>
  <entry></entry>
</row>
<row>
  <entry namest="c1" nameend="c3">XSLT</entry>
  <entry namest="c4" nameend="c5">XML&#160;Query</entry>
  <entry></entry>
</row>
<row>
  <entry></entry>
  <entry namest="c2" nameend="c3">XPath</entry>
  <entry namest="c4" nameend="c6">RDF/XML</entry>
</row>
<row>
  <entry></entry>
  <entry namest="c2" nameend="c3">RELAX&#160;NG</entry>
  <entry namest="c4" nameend="c5">XML&#160;Schema</entry>
  <entry></entry>
</row>
<row>
  <entry namest="c1" nameend="c2" align="right">XML&#160;Base</entry>
  <entry namest="c3" nameend="c4">xml:id</entry>
  <entry namest="c5" nameend="c6" align="left">XInclude</entry>
</row>
<row>
  <entry></entry>
  <entry namest="c2" nameend="c5">XML&#160;Namespaces</entry>
  <entry></entry>
</row>
<row>
  <entry></entry>
  <entry></entry>
  <entry namest="c3" nameend="c4">XML&#160;Infoset</entry>
  <entry></entry>
  <entry></entry>
</row>
<row>
  <entry></entry>
  <entry></entry>
  <entry namest="c3" nameend="c4">XML</entry>
  <entry></entry>
  <entry></entry>
</row>
</tbody>
</tgroup>
</informaltable>

<para xml:id='p13'>That sure is an awful lot of…stuff heaped on top of those three little
letters. I think the goal for XML 2.0, if there ever is one, should be to
simplify XML in the same way that the goal for XML was to simplify
SGML.</para>

<para xml:id='p14'>So, what do I think that would look like?</para>

<para xml:id='p15'>One simplification we would make is editorial: an XML 2.0
specification would unify
<link xlink:href="http://www.w3.org/TR/xml11/">XML</link>,
<link xlink:href="http://www.w3.org/TR/xml-names11/">XML Namespaces</link>,
<link xlink:href="http://www.w3.org/TR/xml-infoset/">XML Infoset</link>,
<link xlink:href="http://www.w3.org/TR/xmlbase/">XML Base</link>,
and <link xlink:href="http://www.w3.org/TR/xml-id/">xml:id</link>
into a single document.</para>

<para xml:id='p16'>Next, we'd tackle a significant bit of SGML legacy: 
removing the syntactic privileges afforded DTDs. In XML 2.0, there would
be no “&lt;!DOCTYPE&gt;” declaration, no entities (except the built-in
entities and their close cousins, numeric character references), no
attribute or element types of any kind, and no fixed or default values
for attributes. In XML 2.0, documents would be either well-formed, or
they wouldn't be XML.</para>

<para xml:id='p17'>I'd like to be clear: I've got nothing against DTDs.
I'd be happy to work on a DTD V2.0 specification that described
DTD validation of XML 2.0 documents. You just wouldn't have a
&lt;!DOCTYPE&gt; declaration, so you'd have to associate
the DTDs with documents in some other way, just like you associate RELAX
NG Grammars and W3C XML Schemas in some other way.</para>

<para xml:id='p18'>Now, I've just screwed all the mathematicians (and other folks) by
taking away their named character entities and I can see
<personname><firstname>David</firstname>
<surname>Carlisle</surname></personname> wincing
out there in the audience. Bear with me, I have an answer for that
problem this time (unlike
<link xlink:href="http://www.w3.org/XML/Core/2002/10/charents-20021023">last
time</link>).</para>

<para xml:id='p19'>My proposal for solving the entity problem is going to involve
namespaces, so let's make some simplifications there, too. A radical
simplification would be to simply throw them all out, declare defeat
and try to invent something new to solve the naming problems. Or maybe
try to convince the world that the naming problem doesn't exist, that
the fact that <code>&lt;p&gt;</code> is sometimes TEI and sometimes
HTML <emphasis>isn't</emphasis> a problem in practice. I'm not going
to start out that radical. I'm just going to try to round off some of
the sharper corners of namespaces.</para>

<para xml:id='p20'>In XML 2.0, all documents
would be namespace aware. Furthermore, the “null namespace,” the
namespace in which elements appear if there is no namespace
declaration, would have an explicit URI (and could, consequently, be
associated with a prefix). This reduces all of the magic of the “null
namespace” to simply a question of a default declaration. We could go
a step further and simply outlaw the null namespace, but that seems a
bit extreme to me.</para>

<para xml:id='p21'>Ignoring &lt;!DOCTYPE&gt; declarations and a few wrinkles
between XML 1.0 and XML 1.1, so far, all well-formed, namespace-aware,
XML 1.x documents would be XML 2.0 documents, simply by changing the
version in the XML declaration.
If the null namespace were outlawed, you'd have to add a namespace
declaration to the top of all the documents. That seems cumbersome. On
the other hand, the <link xlink:href="http://www.w3.org/TR/webarch">Web
Architecture</link> document
<link xlink:href="http://www.w3.org/TR/webarch#use-namespaces">says</link> that
all elements should be in a namespace.</para>

<para xml:id='p22'>Anyway, for the moment, I'm not going that far.</para>

<para xml:id='p23'>So that means:</para>

<programlisting><![CDATA[<?xml version='2.0'?>
<doc/>]]></programlisting>

<para xml:id='p24'>and</para>

<programlisting><![CDATA[<?xml version='2.0'?>
<doc xmlns="http://the-uri-for-the-default-namespace/"/>]]></programlisting>

<para xml:id='p25'>and</para>

<programlisting><![CDATA[<?xml version='2.0'?>
<x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>]]></programlisting>

<para xml:id='p26'>are all logically <emphasis>the same document</emphasis>.</para>
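
<para>As a rough check of that equivalence, here is a minimal Python sketch.
It leans on an XML 1.0 parser (the standard library's ElementTree, which
reports names as <code>{uri}local</code>), so the XML declarations are left
off and the proposed null-namespace rule is applied by hand. The
<code>expanded_name</code> helper and the <code>NULL_NS</code> constant are
illustrative names, not part of any specification.</para>

<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

# Placeholder URI taken from the examples above; no real URI has been assigned.
NULL_NS = "http://the-uri-for-the-default-namespace/"

def expanded_name(tag, null_ns=NULL_NS):
    """Return a {uri}local name, filing unqualified names under null_ns."""
    if tag.startswith("{"):
        return tag
    return "{" + null_ns + "}" + tag

documents = [
    "<doc/>",
    '<doc xmlns="http://the-uri-for-the-default-namespace/"/>',
    '<x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>',
]

# Under the proposed rule all three root elements get the same expanded name.
names = [expanded_name(ET.fromstring(d).tag) for d in documents]
print(len(set(names)) == 1)]]></programlisting>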

<para xml:id='p27'>That's a bunch of simplification. Now let's tackle a
real technical challenge: QNames in content. I think the right answer
here is to raise the stature of QNames so that they're first class
objects in XML 2.0. XML 2.0 would have Document, Element,
Attribute, Processing Instruction, Character, Comment, Namespace,
<emphasis>and</emphasis> QName Information Items.</para>

<para xml:id='p28'>For legacy (and authoring!) convenience, we'd keep the
existing QName forms for element and attribute names, but we'd also
introduce unambiguous lexical forms for QNames:
in XML 2.0, <code><![CDATA[<{uri}name>]]></code> would be a well-formed
serialization of a QName with the namespace name “uri” and the local
name “name”.</para>

<para xml:id='p29'>What does this really mean?
The big problem with
<link xlink:href="http://www.w3.org/2001/tag/doc/qnameids.html">QNames in
content</link> is that the parser can't tell where the QNames are.
Consider the following
example, where the intent is that “<code>a:localname</code>” is
a QName:</para>

<programlisting><![CDATA[<?xml version="1.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName a:localname?
</doc>]]></programlisting>

<para xml:id='p30'>An XML 1.0 parser
can't actually determine that “<code>a:localname</code>” <emphasis>is</emphasis>
a QName. In XML 2.0, we would fix that:</para>

<programlisting><![CDATA[<?xml version="2.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName <{http://example.com/xmlns/a}localname>?
</doc>]]></programlisting>

<para xml:id='p31'>The Infoset for this document
consists of a Document Information Item containing a single
Element Information Item containing 22 Character Information Items
followed by a QName Information Item followed by 2 more Character
Information Items.</para>
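
<para>One way to picture that Infoset is a small tokenizer for the element
content. The following Python sketch is only an approximation: the regular
expression stands in for a name grammar that hasn't been written, and the
<code>QName</code> class and <code>information_items</code> function are
hypothetical names. It does reproduce the counts given above for the
example document.</para>

<programlisting language="python"><![CDATA[import re
from typing import NamedTuple

class QName(NamedTuple):
    uri: str
    local: str

# Matches the proposed <{uri}name> form; the exact name grammar is an assumption.
BRACE_QNAME = re.compile(r"<\{([^{}<>]*)\}([^\s<>{}:]+)>")

def information_items(content):
    """Split element content into single characters and QName items."""
    items = []
    pos = 0
    for match in BRACE_QNAME.finditer(content):
        items.extend(content[pos:match.start()])  # one item per character
        items.append(QName(match.group(1), match.group(2)))
        pos = match.end()
    items.extend(content[pos:])
    return items

# The content of <doc> in the example, including both line breaks.
content = "\nWhat about the QName <{http://example.com/xmlns/a}localname>?\n"
items = information_items(content)
before = items.index(QName("http://example.com/xmlns/a", "localname"))
after = len(items) - before - 1
print(before, after)]]></programlisting>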

<para xml:id='p32'>The “<code><![CDATA[<{uri}name>]]></code>” form is unambiguous,
but it's awfully tedious for the author, so we'd
provide a prefix form as well.
As a convenience, <code><![CDATA[<:p:name>]]></code> would
be a well-formed serialization of a QName with the namespace name currently
bound to the prefix “p” and the local name “name”. So this would
be equivalent:</para>

<programlisting><![CDATA[<?xml version="2.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName <:a:localname>?
</doc>]]></programlisting>
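
<para>The equivalence amounts to a lookup in the in-scope namespace
bindings. As a hedged illustration in Python, with a made-up
<code>expand_prefixed</code> function and a plain dictionary standing in for
the bindings, the prefix form can be rewritten into the unambiguous
form:</para>

<programlisting language="python"><![CDATA[import re

# Matches the proposed <:prefix:name> form; the name grammar is an assumption.
PREFIX_QNAME = re.compile(r"<:([^\s<>{}:]+):([^\s<>{}:]+)>")

def expand_prefixed(content, bindings):
    """Rewrite every <:p:name> as <{uri}name> using the in-scope bindings."""
    def replace(match):
        prefix, local = match.group(1), match.group(2)
        # An unbound prefix raises KeyError; a real parser would report an error.
        return "<{" + bindings[prefix] + "}" + local + ">"
    return PREFIX_QNAME.sub(replace, content)

bindings = {"a": "http://example.com/xmlns/a"}
prefixed_form = "What about the QName <:a:localname>?"
expanded_form = "What about the QName <{http://example.com/xmlns/a}localname>?"
print(expand_prefixed(prefixed_form, bindings) == expanded_form)]]></programlisting>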

<para xml:id='p33'>These forms are allowed in element content <emphasis>and</emphasis>
attribute values. This means that attribute values don't consist only
of Character Information Items, they consist of Character and QName
Information Items.</para> 

<para xml:id='p34'>What's gained here is that the QNames in content can be recognized
by the parser, so we aren't “hiding” QName values, making general tools blind
to which namespace declarations are actually used.</para>

<para xml:id='p35'>It's this syntactic form that provides an answer to the character
entity problem. Now we can define a namespace with the semantics that
QNames in that namespace represent characters. For example
<uri>http://www.w3.org/2003/entities/iso8879/isonum</uri> for the 
ISO Numeric
and Special Graphic characters.</para>

<para xml:id='p36'>To write an “·” (middle dot) where I don't have a glyph for it,
or a convenient way to insert that glyph, I can write
<code><![CDATA[<:num:middot>]]></code> (or
<code><![CDATA[<{http://www.w3.org/2003/entities/iso8879/isonum}middot>]]></code>
if I don't have a prefix bound). And because these
lexical forms are recognized in both element and attribute values, I can
put them anywhere I want. I concede that
“<code>&lt;:num:middot&gt;</code>” isn't quite
as easy to type as “<code>&amp;middot;</code>”, but it's not a lot harder
and I don't think it's more difficult to read.
 </para>
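
<para>How an application might honor such a namespace can be sketched in a
few lines of Python. Everything here is an assumption made for illustration:
the <code>CHARACTERS</code> table holds only two of the ISO names, the
regular expression approximates the two lexical forms, and
<code>resolve_characters</code> is a hypothetical function, not a proposed
API.</para>

<programlisting language="python"><![CDATA[import re

ISONUM = "http://www.w3.org/2003/entities/iso8879/isonum"

# Illustrative subset; a real table would cover the whole character set.
CHARACTERS = {
    (ISONUM, "middot"): "\u00b7",
    (ISONUM, "copy"): "\u00a9",
}

# Either <{uri}name> or <:prefix:name>; the name grammar is an assumption.
QNAME = re.compile(
    r"<\{([^{}<>]*)\}([^\s<>{}:]+)>|<:([^\s<>{}:]+):([^\s<>{}:]+)>"
)

def resolve_characters(content, bindings):
    """Replace QNames from a character namespace with their characters."""
    def replace(match):
        if match.group(1) is not None:
            key = (match.group(1), match.group(2))
        else:
            key = (bindings[match.group(3)], match.group(4))
        # QNames that are not in the table are left exactly as written.
        return CHARACTERS.get(key, match.group(0))
    return QNAME.sub(replace, content)

bindings = {"num": ISONUM}
print(resolve_characters("a<:num:middot>b", bindings))]]></programlisting>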

<para xml:id='p37'>We could take this even further and allow these QName forms not
only in attribute values and element content, but also in “Names”. In
other words, this document:</para>

<programlisting><![CDATA[<?xml version="2.0"?>
<doc xmlns="http://example.com/xmlns/doc"
     xmlns:a="http://example.com/xmlns/a"
     xmlns:b="http://example.com/xmlns/b">
  <p a:att="value" b:att="value"/>
</doc>]]></programlisting>

<para xml:id='p38'>Could be serialized like this:</para>

<programlisting><![CDATA[<?xml version="2.0"?>
<<{http://example.com/xmlns/doc}doc>
  xmlns:a="http://example.com/xmlns/a">
  <<{http://example.com/xmlns/doc}p> a:att="value"
   <{http://example.com/xmlns/b}att>="value"/>
</<{http://example.com/xmlns/doc}doc>>]]></programlisting>

<para xml:id='p39'>I wouldn't recommend that serialization and I certainly wouldn't
want to author in it, but it would allow applications to
serialize <emphasis>any</emphasis> document or document fragment.</para>
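
<para>To make that concrete, here is a rough Python sketch of a serializer
that expands every name, so the output needs no prefix bindings at all. It
starts from an XML 1.0 parse with ElementTree, whose <code>{uri}local</code>
tag names happen to match the proposed notation. The <code>qname</code> and
<code>serialize</code> functions are hypothetical, and comments and
processing instructions are ignored.</para>

<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def qname(tag):
    """Wrap ElementTree's {uri}local name in the proposed <...> delimiters."""
    return "<" + tag + ">" if tag.startswith("{") else tag

def serialize(elem):
    """Serialize with every name expanded, so no prefixes are required."""
    attrs = "".join(
        " " + qname(key) + "=" + quoteattr(value)
        for key, value in elem.attrib.items()
    )
    body = escape(elem.text or "")
    for child in elem:
        body += serialize(child) + escape(child.tail or "")
    start = "<" + qname(elem.tag) + attrs
    if not body:
        return start + "/>"
    return start + ">" + body + "</" + qname(elem.tag) + ">"

source = (
    '<doc xmlns="http://example.com/xmlns/doc" '
    'xmlns:b="http://example.com/xmlns/b">'
    '<p b:att="value"/></doc>'
)
print(serialize(ET.fromstring(source)))]]></programlisting>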

<para xml:id='p40'><personname><firstname>Michael</firstname>
<surname>Sperberg-McQueen</surname></personname> pointed out that
a slight syntactic extension would allow you to specify the prefix
as well. This would be handy, for example, to deal with the way
the
<link xlink:href="http://www.w3.org/TR/xpath-datamodel/#qnames-and-notations">XQuery
1.0 and XPath 2.0 Data Model</link> has implemented
<link xlink:href="http://www.w3.org/TR/xpath-datamodel/#qnames-and-notations">QNames
as triples</link>. I'm not sure this is necessary, but it might be
a good thing.</para>

<para xml:id='p41'>On the whole, I think these proposals are a net simplification.
I have some
reservations about adding QName Information Items, and particularly
about allowing them in attribute values, but I haven't thought of a
better solution to the QName mess. And if XML 2.0 is worth doing at
all, I think it's only worth doing if it is simpler than XML 1.0
<emphasis>and</emphasis> solves the QName mess.</para>

<para xml:id='p42'>There's some more work we can do around the margins: clarify the
semantics of <code>xml:lang</code> and <code>xml:space</code>
attributes, perhaps allow documents to have multiple top-level
elements, remove the distinction between documents and external
parsed entities (which don't exist anymore), and <emphasis>maybe</emphasis>
something about a binary format, depending on how that work plays
out.</para>

<para xml:id='p43'>If you're an XML grease monkey, you can probably think of a few
more things, but let your mantra be “simplify”. Repeat after me: no
new features.</para>

</essay>
