<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="pto" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#">
<info>
    
    
    
    
    
    
    
<title>XML 1.1: Dead on Arrival</title><biblioid class="uri">http://norman.walsh.name/2004/09/30/xml11</biblioid>
<volumenum>7</volumenum>
<issuenum>171</issuenum>
<pubdate>2004-09-30T02:40:24+01:00</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2004</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>XML 1.1 was a fruitless exercise. We shouldn’t have bothered.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XML"/>
</info>

<epigraph>
<attribution>
      <personname>
	<firstname/>
<surname>Dandemis</surname>
      </personname>
    </attribution>
<para xml:id="p1">Do not condemn the judgement of another because
it differs from your own. You may both be wrong.</para>
</epigraph>

<para xml:id="p2">I supported XML 1.1. I thought it was a good thing. Naively, I
thought it was going to be relatively straightforward to deploy. I
didn’t believe the doom and gloom predictions of some that it would
bifurcate the XML standard into incompatible versions. While
it is not completely backwards compatible, I didn’t think it was going
to be that big a deal. I’ll explain why in a moment.</para>

<para xml:id="p3">Whether I was right or not, XML 1.1 <emphasis>is dead</emphasis>.
The working group leading RELAX NG through the ISO standardization
process
<link xlink:href="http://lists.oasis-open.org/archives/relax-ng/200409/msg00000.html">has ruled</link>
that “an XML [1.1] document…can never be valid against a RELAX NG schema.”
I expect the W3C XML Schema working group to conclude similarly that XML 1.1
documents cannot be validated with XML Schema 1.0 or 1.1.</para>

<para xml:id="p4">Game Over.</para>

<para xml:id="p5">If I can’t validate XML 1.1 documents, I can’t use them. (I can,
of course, validate them with XML 1.1 DTDs, but that’s bitter
consolation in the twenty-first century.)</para>

<para xml:id="p6">I consider myself fairly conservative when it comes to
notions of what constitutes well-formedness or validity. Nevertheless,
I expected a simple erratum to allow implementors to support XML 1.1 in
RELAX NG and W3C XML Schema:</para>

<blockquote>
<para xml:id="p7">All implementations of this specification must support XML 1.0.
Implementations may, at user option, support XML 1.1. An implementation
that supports XML 1.1 <emphasis>must</emphasis> …</para>
</blockquote>

<para xml:id="p8">Now, the interesting question is: what must it do? To answer that,
we’ll have to look a little more closely at XML 1.1.
There are just
<link xlink:href="http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-xml11">a few
changes</link>
in XML 1.1:</para>

<variablelist>
<?dbfo list-presentation="blocks"?>
<varlistentry>
<term>New Text Characters</term>
<listitem>
<para xml:id="p9">More Unicode characters are allowed in text<footnote>
	    <para xml:id="p10">Text in
this context refers not only to element content but also to
attribute values and the content of processing instructions and comments:
places other than names.</para>
</footnote>. The big feature of
XML 1.1 is support for new versions of Unicode. (XML 1.0 is defined
on top of Unicode 2.0, which is no longer the current version.)
This is really significant. One of the virtues of XML is that it’s
been internationalized from the very beginning. It does not discriminate
against languages that are less economically important. Without XML 1.1,
that’s no longer true. I think that sucks.
</para>
<para xml:id="p11">The C0 control characters (0x01-0x1F) are allowed if they’re
escaped as numeric character references. In XML 1.0, the presence of C0
control characters is a good indicator that the document’s encoding has
been incorrectly determined. As a compromise for allowing the C0 controls,
the C1 control characters are no longer allowed <emphasis>unless</emphasis>
they are escaped. This is the single backwards-incompatible change in XML 1.1.</para>
<para xml:id="p12">The <code>NEL</code> character (0x85) is normalized to a linefeed
in text. Basically, IBM mainframe newlines get treated like PC and Mac
newlines.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>New Name Characters</term>
<listitem>
<para xml:id="p13">The current version of Unicode supports more languages than
Unicode 2.0. As a result, there are more “name” characters now. XML
1.1 allows authors writing in Ethiopic (and a bunch of other
languages) to write tag names (and attribute names, processing instruction
targets, etc.) in their <emphasis>native language</emphasis>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Normalization</term>
<listitem>
<para xml:id="p14">XML 1.1 encourages implementors to check the character normalization
of documents. This has no effect on validation.</para>
</listitem>
</varlistentry>
</variablelist>
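
<para>The control-character and line-ending rules above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions: the
ranges are transcribed from the <code>Char</code> and <code>RestrictedChar</code>
productions of the two Recommendations, the line-ending step follows the
end-of-line handling described in XML 1.1, and the names <code>classify</code>
and <code>normalize_eol_11</code> are invented for this example rather than
taken from any parser.</para>

<programlisting language="python"><![CDATA[
```python
import re


def classify(cp, version):
    """Say how code point `cp` may appear in text for XML `version`.

    "literal"   -> may appear directly in the document
    "escaped"   -> only as a numeric character reference
    "forbidden" -> not allowed at all
    """
    # NUL, surrogates, U+FFFE/U+FFFF and out-of-range values are never allowed.
    if cp < 1 or 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF) or cp > 0x10FFFF:
        return "forbidden"
    c0_control = cp < 0x20 and cp not in (0x9, 0xA, 0xD)
    if version == "1.0":
        # XML 1.0 Char excludes the C0 controls other than tab, LF and CR,
        # and a character reference must also match Char.
        return "forbidden" if c0_control else "literal"
    # XML 1.1 RestrictedChar: C0 controls, plus 0x7F-0x9F except NEL (0x85).
    if c0_control or (0x7F <= cp <= 0x9F and cp != 0x85):
        return "escaped"
    return "literal"


# XML 1.1 line ends: CR LF, CR NEL, NEL, U+2028 and a lone CR all become LF.
_EOL_11 = re.compile("\r\n|\r\x85|\r|\x85|\u2028")


def normalize_eol_11(text):
    """Apply XML 1.1 end-of-line normalization to already-decoded text."""
    return _EOL_11.sub("\n", text)
```
]]></programlisting>

<para>In this sketch a <code>BEL</code> (0x07) is forbidden under 1.0 and
escaped-only under 1.1, while a C1 control such as 0x80 moves from literal
under 1.0 to escaped-only under 1.1, which is the backwards-incompatible
case.</para>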

<para xml:id="p15">What bearing does this have on validity? Let’s take a look.
Imagine that we have two processors: I and II. Processor I understands
XML 1.0 only. Processor II understands XML 1.0 and XML 1.1.</para>

<para xml:id="p16">Consider the following documents:</para>

<orderedlist numeration="upperalpha">
<listitem xml:id="docA">
      <!--A-->
<para xml:id="p17">Is an XML 1.0 document.</para>
</listitem>
<listitem xml:id="docB">
      <!--B-->
<para xml:id="p18">Is an XML 1.1 document that uses none of the new features of XML 1.1
(it would be a well-formed XML 1.0 document if it were labelled as 1.0).</para>
</listitem>
<listitem xml:id="docC">
      <!--C-->
<para xml:id="p19">Is an XML 1.1 document with <code>NEL</code> line breaks.</para>
</listitem>
<listitem xml:id="docD">
      <!--D-->
<para xml:id="p20">Is an XML 1.1 document with C0 control characters encoded in it.</para>
</listitem>
<listitem xml:id="docE">
      <!--E-->
<para xml:id="p21">Is an XML 1.1 document with C1 control characters encoded in it.</para>
</listitem>
<listitem xml:id="docF">
      <!--F-->
<para xml:id="p22">Is an XML 1.1 document with new text characters.</para>
</listitem>
<listitem xml:id="docG">
      <!--G-->
<para xml:id="p23">Is an XML 1.1 document with new name characters.</para>
</listitem>
</orderedlist>

<para xml:id="p24">What happens when we validate each of these documents? First, we
parse the input documents to build an Infoset. We’ll use an XML 1.0
parser for the 1.0 documents and a 1.1 parser for the 1.1
documents.</para>

<para xml:id="p25">Now, a processor that only understands 1.0 might check the
<emphasis role="bold">[version]</emphasis>
property on the Document Information Item and reject the 1.1 documents
out of hand. To make our discussion more interesting, let’s assume our parsers
don’t provide the 
<emphasis role="bold">[version]</emphasis> property (it’s
<link xlink:href="http://www.w3.org/TR/xmlschema-1/#infoset">not required</link>).
</para>

<para xml:id="p26">Validation, as far as I can see, produces the following
results:</para>

<table>
<title>Validation of XML 1.0 and XML 1.1 Documents</title>
<tgroup cols="8" align="center">
<?dbhtml table-summary="Table showing validation results"?>
<thead>
  <row>
    <entry>Proc.\Doc.</entry>
    <entry>A</entry>
    <entry>B</entry>
    <entry>C</entry>
    <entry>D</entry>
    <entry>E</entry>
    <entry>F</entry>
    <entry>G</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>
	    <emphasis role="bold">I</emphasis>
	  </entry>
    <entry><!--A-->valid</entry>
    <entry><!--B-->valid</entry>
    <entry><!--C-->valid</entry>
    <entry>
	    <!--D-->
	    <emphasis role="bold">invalid</emphasis>
	    <footnote xml:id="fnb">
<para xml:id="p27">But I wouldn’t be
surprised if there are implementations that don’t notice.</para>
</footnote>
	  </entry>
    <entry><!--E-->valid</entry>
    <entry>
	    <!--F-->
	    <emphasis role="bold">invalid</emphasis>
	    <footnoteref linkend="fnb"/>
	  </entry>
    <entry>
	    <!--G-->
	    <emphasis role="bold">invalid</emphasis>
	  </entry>
  </row>
  <row>
    <entry>
	    <emphasis role="bold">II</emphasis>
	  </entry>
    <entry><!--A-->valid</entry>
    <entry><!--B-->valid</entry>
    <entry><!--C-->valid</entry>
    <entry><!--D-->valid</entry>
    <entry><!--E-->valid</entry>
    <entry><!--F-->valid</entry>
    <entry><!--G-->valid</entry>
  </row>
</tbody>
</tgroup>
</table>
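
<para>One way to read the table is as a lookup from the features a document
relies on to the verdict of each processor. The sketch below is a hypothetical
model, not a validator: the feature labels and the <code>verdict</code>
function are invented for this example, and the results simply restate the
table.</para>

<programlisting language="python"><![CDATA[
```python
# Features each sample document relies on (labels are hypothetical).
DOCUMENTS = {
    "A": set(),                  # plain XML 1.0
    "B": set(),                  # labelled 1.1, no new features
    "C": {"nel_line_breaks"},    # NEL is already a linefeed after parsing
    "D": {"c0_controls"},
    "E": {"c1_controls"},        # C1 controls are legal XML 1.0 characters
    "F": {"new_text_chars"},
    "G": {"new_name_chars"},
}

# Features that leave characters in the Infoset that XML 1.0 does not allow.
REJECTED_BY_XML10_ONLY = {"c0_controls", "new_text_chars", "new_name_chars"}


def verdict(doc, supports_xml11):
    """Return "valid" or "invalid" for a document letter and a processor."""
    if supports_xml11:
        return "valid"  # processor II accepts every sample document
    if DOCUMENTS[doc] & REJECTED_BY_XML10_ONLY:
        return "invalid"
    return "valid"


for name, supports in (("I", False), ("II", True)):
    print(name, [verdict(d, supports) for d in "ABCDEFG"])
```
]]></programlisting>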

<para xml:id="p28">I think documents <link linkend="docD">D</link> and
<link linkend="docF">F</link> are technically invalid to processor I.
They each contain text characters that are not allowed by XML 1.0.
That said, I can’t confirm that either W3C XML Schema or RELAX NG
actually requires a processor to validate the characters. The
RELAX NG specification says explicitly that attribute values must
consist of XML 1.0 characters, but I don’t see anything about
characters in elements. I can’t find any mention of it at all in
<link xlink:href="http://www.w3.org/TR/xmlschema-1/">XML Schema Part 1:
Structures</link>. (Both specifications say they operate on XML 1.0
documents, so they implicitly forbid the extra characters unless they
are amended.)</para>

<para xml:id="p29">Document <link linkend="docG">G</link> is clearly invalid to
processor I because it has invalid name characters. Both specifications
are careful to check this case because they validate names in
contexts where the XML parser allows non-name characters.</para>
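
<para>A name check of the kind a schema processor has to perform might look
roughly like this for XML 1.1. The ranges are transcribed from the
<code>NameStartChar</code> and <code>NameChar</code> productions of the
XML 1.1 Recommendation; the function name <code>is_name_11</code> is invented
for this sketch. XML 1.0 enumerates its name characters in long tables derived
from Unicode 2.0, which this sketch does not reproduce.</para>

<programlisting language="python"><![CDATA[
```python
# NameStartChar ranges from XML 1.1 (inclusive code points).
NAME_START_11 = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# NameChar is NameStartChar plus "-", ".", digits, U+00B7 and two more ranges.
NAME_CHAR_11 = NAME_START_11 + [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_11(s):
    """True if `s` matches the XML 1.1 Name production."""
    if not s or not _in_ranges(ord(s[0]), NAME_START_11):
        return False
    return all(_in_ranges(ord(ch), NAME_CHAR_11) for ch in s[1:])


# An Ethiopic word (U+1230 U+120B U+121D) is a legal XML 1.1 name.
print(is_name_11("\u1230\u120b\u121d"))
```
]]></programlisting>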

<para xml:id="p30">So, now we can answer our earlier question:</para>

<blockquote>
<para xml:id="p31">… An implementation
that supports XML 1.1 must allow a suite of additional characters in
content and it must allow a different suite of additional characters
in names.
</para>
</blockquote>

<para xml:id="p32">That’s not so hard, is it?</para>

<para xml:id="p33">Perhaps the real question is, what are the consequences of allowing
processor II to be conformant? I can think of two:</para>

<variablelist>
<?dbfo list-presentation="blocks"?>
<varlistentry>
<term>Reduced Interoperability</term>
<listitem>
<para xml:id="p34">Interoperability is <emphasis>important</emphasis>; it would be
wrong to reduce interoperability without a compelling reason. I think
internationalization is a compelling reason. (The rest of the XML 1.1 changes
were either unnecessary or feature creep,
<acronym>IMHO<alt>In My Humble Opinion</alt></acronym>,
but they’re harmless.)
</para>
<para xml:id="p35">XML already has interoperability problems associated with character
encodings. Just because my parser understands Shift-JIS or some other
locally important encoding doesn’t mean that yours does. The
interoperability problems of XML 1.1 seem similar to me. If I’m using
XML 1.1 when I don’t have to, I can transcode to XML 1.0, just like I can
transcode to UTF-8. If I’m using XML 1.1 because I need to, well,
I <emphasis>need</emphasis> to. If your tools can’t support XML 1.1, I can’t
use them. That seems reasonable.</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Pipeline Issues</term>
<listitem>
<para xml:id="p36">Validation isn’t generally an end in itself. We validate
documents because we want to do something else with them, because we
want to pass them along to some further stage in a pipeline
(transformation, business processes, whatever, whether it’s explicitly
a pipeline or not). Downstream processes might have made all sorts of
assumptions based on the fact that the input was XML 1.0. Perhaps
they’re using <code>BEL</code> characters (0x07) as delimiters;
perhaps they’re relying on bit patterns that aren’t legitimate text
characters; etc.</para>
<para xml:id="p37">Those would be good reasons not to support XML 1.1. I’m not suggesting
that you should be <emphasis>compelled</emphasis>
to support XML 1.1; I’m not even saying you
<emphasis>should</emphasis> support it.</para>
</listitem>
</varlistentry>
</variablelist>

<para xml:id="p38">I’m just saying it would be nice if you <emphasis>could</emphasis> support
XML 1.1.</para>

<para xml:id="p39">But you can’t, at least not until V2.0 of the schema languages.
I suspect that’s roughly equivalent to saying “not until hell freezes
over.”</para>

<para xml:id="p40">Which is a shame, because XML 1.1 was a good thing.</para>

</essay>

