<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:gal="http://norman.walsh.name/rdf/gallery#">
<info>
<title>Thoughts on Character Entities</title><biblioid class="uri">http://norman.walsh.name/2003/11/13/charent</biblioid>
<volumenum>6</volumenum>
<issuenum>110</issuenum>
<pubdate>2003-11-13</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2003</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>After much consideration,
I don’t think XML should try to solve the character entity problem.
</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XML"/>
</info>

<epigraph>
<attribution>James Baldwin</attribution>
<para xml:id="p1"><indexterm>
	<primary>Baldwin</primary>
	<secondary>James</secondary>
</indexterm>The price one pays for pursuing any
profession or calling is an intimate knowledge of its ugly
side.</para>
</epigraph>

<para xml:id="p2">I don’t think XML should try to solve the character entity problem.
There, I’ve said it. I wonder where I left my flame-retardant underwear?</para>

<para xml:id="p3">The character entity problem, if you aren’t familiar with it
(and if you aren’t, this essay might be a good one to skip), is the fact
that entity references, often used to name characters that can’t conveniently
be typed, can’t be used in well-formed XML documents that don’t declare them in a DTD.</para>

<para xml:id="p4">The following document is not well-formed, because nothing declares
<literal>eacute</literal> or <literal>egrave</literal>:</para>

<screen>&lt;?xml version='1.0'?&gt;
&lt;doc&gt;
&lt;p&gt;Les &amp;eacute;tudiants sont tr&amp;egrave;s
s&amp;eacute;rieux.&lt;/p&gt;
&lt;/doc&gt;</screen>

<para xml:id="p5">XML has only five predefined entities (<literal>&amp;lt;</literal>,
<literal>&amp;gt;</literal>, <literal>&amp;amp;</literal>,
<literal>&amp;apos;</literal>, and <literal>&amp;quot;</literal>); you
have to declare the rest in the DTD (either the internal or external
subset). The problem is that well-formed documents don’t have to have
a DTD. There are some good reasons that you might want to avoid a DTD (you
may be using another schema language, for example, or you may want to
use your documents in environments like SOAP that forbid the document
type declaration).</para>
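
<para>To make the difference concrete, here is a minimal sketch in Python. It
assumes the standard library’s <literal>xml.etree.ElementTree</literal> as a
stand-in for any conforming parser; the helper name <literal>parses</literal>
and the two constants are invented for the illustration. It feeds the parser
the sentence above twice, once with no declarations and once with the two
entities declared in the internal subset.</para>

<programlisting language="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

# No DTD: eacute and egrave are never declared, so this is not well-formed.
UNDECLARED = "<doc><p>Les &eacute;tudiants sont tr&egrave;s s&eacute;rieux.</p></doc>"

# Same content, with the two entities declared in the internal subset.
DECLARED = (
    "<!DOCTYPE doc [\n"
    '  <!ENTITY eacute "&#233;">\n'
    '  <!ENTITY egrave "&#232;">\n'
    "]>\n" + UNDECLARED
)


def parses(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses(UNDECLARED))
print(parses(DECLARED))
print(ET.fromstring(DECLARED).findtext("p"))
```
]]></programlisting>

<para>The first call should report <literal>False</literal> and the second
<literal>True</literal>. The text is the same in both; only the presence of
the declarations changes the answer.</para>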

<para xml:id="p6">So what can you do? That, as they say, is the sixty-four
dollar question.</para>

<para xml:id="p7">For a long time, the answer was “not much,” and we’ve been
wrestling with this one for years. I think everyone accepts that the
best answer is to just use the characters you want. XML is
Unicode<indexterm>
      <primary>Unicode</primary>
    </indexterm>, so
if you want to say “the students are very serious” in French, say it:
“les étudiants sont très sérieux.”</para>

<para xml:id="p8">But what if you can’t type in those characters? I’ve
<link xlink:href="/2003/10/03/xmlunicode">hacked up</link>
pretty good Unicode support in
<application>emacs</application>, but I still can’t even
pretend to support interesting mathematics. An equation as simple as
this one stymies me:</para>

<gal:graphic rdf:resource="images/epii"/>

<para xml:id="p9">One solution is to use numeric character references. These refer
to characters by their Unicode codepoint, so no declarations are
necessary. They work, but they’re hard to read.
Now, in point of fact,
<link xlink:href="http://groups.yahoo.com/group/emacs-nxml-mode/">emacs-nxml-mode</link>
does a pretty decent job of displaying numeric character references,
which reinforces my feeling that this is really a user-interface
issue.</para>

<gal:graphic rdf:resource="images/emacs"/>

<para xml:id="p10">With a better font, that’d be much more readable. It would still
be ugly, but I wouldn’t have to remember all those codepoints to read
it. Of course, with a better font, I wouldn’t have bothered with the
numeric character references; I would have just stuck in the
characters.</para>

<gal:graphic rdf:resource="images/epii-uc"/>

<section xml:id="s1">
<title>What are the other options?</title>

<para xml:id="p11"><personname>
	<firstname>Tim</firstname>
	<surname>Bray</surname>
</personname> recently
<link xlink:href="http://www.tbray.org/ongoing/When/200x/2003/10/17/UTF8-plus">proposed</link>
a solution that involved a new encoding. On balance, I think the proposal would
introduce so much confusion that it’s probably not the right answer, but it does
have a significant benefit over all the other proposals I’ve seen: it’s a lexical
hack. The more I think about it, the more I think that a standard solution to
this problem should address it at the lexical level.</para>

<para xml:id="p12">By lexical level, I mean that the solution should not intrude into the
XML object model. Whatever solution we arrive at (if we do), “sérieux”
spelled with a literal character and “sérieux” spelled with the new mechanism
should be indistinguishable in the XML object model. (Just as
“s&amp;#233;rieux” is indistinguishable from the literal spelling now.)</para>
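
<para>What “indistinguishable” means here can be checked mechanically. The
following sketch assumes Python’s standard
<literal>xml.etree.ElementTree</literal> as the object model (any other XML
API would do just as well). It parses the word spelled both ways and compares
the results, then shows that turning characters back into numeric references
is purely a serialization choice.</para>

<programlisting language="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

literal = ET.fromstring("<p>s\u00e9rieux</p>")  # typed as a literal character
numeric = ET.fromstring("<p>s&#233;rieux</p>")  # typed as a numeric character reference

# After parsing, nothing records which spelling the author used.
print(literal.text == numeric.text)

# Going the other way is a serialization choice, not a change to the document.
print("s\u00e9rieux".encode("ascii", "xmlcharrefreplace").decode("ascii"))
```
]]></programlisting>

<para>The comparison holds because the reference is resolved by the parser
before the application ever sees the text. That is the property a lexical
solution preserves.</para>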

<para xml:id="p13">If the solution isn’t lexical, then we don’t need a standard one. If the
solution involves some sort of transformation from the “input
document” to the “real document”, then that transformation can just be
another step in the XML processing pipeline and we don’t need to bake
it into XML. (We need an XML processing pipeline, but that’s a
<emphasis>whole</emphasis> other story.)</para>

<para xml:id="p14">Suppose, for example, that we use elements to represent these characters.
Some folks feel strongly that the named character mechanism should have
namespaced names, and elements provide that for free. Then you might
spell “sérieux” as “s&lt;e:eacute/&gt;rieux”.</para>

<para xml:id="p15">Almost any non-trivial operation that you perform on a document
that uses this mechanism is going to have to transform all the named
characters into literal characters before it begins (I can’t imagine
searching or sorting or any other operation on the text that’s not
going to find it hugely inconvenient, if not impossible, to operate on
the text with embedded elements). But if there’s going to be an
explicit transformation, nothing special is needed: just code that
transformation up in XSLT or Java or Perl or Python or your language of
choice and put it in the pipeline.</para>
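
<para>As a rough illustration of such a pipeline step, here is a sketch in
Python using the standard <literal>xml.etree.ElementTree</literal> module.
Everything specific in it is an assumption made for the example: the
namespace URI in <literal>CHAR_NS</literal> is a placeholder, the function
name <literal>flatten_chars</literal> is invented, and the character names
are borrowed from Python’s <literal>html.entities</literal> table rather than
from any agreed vocabulary.</para>

<programlisting language="python"><![CDATA[
```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical namespace for character elements; the essay does not name one.
CHAR_NS = "http://example.com/ns/chars"

SOURCE = f'<p xmlns:e="{CHAR_NS}">Tr<e:egrave/>s s<e:eacute/>rieux.</p>'


def flatten_chars(elem):
    """Replace each child element in CHAR_NS with the character it names.

    Unknown names raise KeyError; a real step would need a policy for them.
    """
    prefix = "{" + CHAR_NS + "}"
    previous = None  # last child kept so far; text is appended to its tail
    for child in list(elem):
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            replacement = chr(name2codepoint[name]) + (child.tail or "")
            if previous is None:
                elem.text = (elem.text or "") + replacement
            else:
                previous.tail = (previous.tail or "") + replacement
            elem.remove(child)
        else:
            flatten_chars(child)
            previous = child
    return elem


p = ET.fromstring(SOURCE)
print("sérieux" in "".join(p.itertext()))  # the word is split across nodes
flatten_chars(p)
print(p.text)
```
]]></programlisting>

<para>Before flattening, a plain search for “sérieux” in the paragraph’s text
fails, because the word is split around an empty element. After flattening,
the paragraph is ordinary text again and every downstream tool can treat it
that way.</para>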

<para xml:id="p16">One of the things that motivated this essay was a recent
proposal that named characters should be transformed by schema
validation. I think that’s a <emphasis>horrible</emphasis> idea. I can
understand how the XML Schema WG got there, but I think that
making the interpretation of named characters depend on validation
is a really bad idea.</para>

<para xml:id="p17">The proposal I saw<footnote>
<para xml:id="p18">From <personname>
	    <firstname>Michael</firstname>
<surname>Sperberg-McQueen</surname>
	  </personname>, alas
in W3C
<link xlink:href="http://lists.w3.org/Archives/Member/w3c-xml-plenary/2003Oct/0000.html">member-only space</link>. In the interest of fairness,
I stole the example
sentence about serious students from Michael’s proposal.</para>
</footnote> suggested that we might spell 
“sérieux” as “s{#e:eacute}rieux” and that validation would provide
the replacement text.</para>

<para xml:id="p19">I don’t know if that’s a good way to spell those characters or not;
<personname>
	<firstname>John</firstname>
	<surname>Cowan</surname>
      </personname>
offered some good arguments for a slightly different spelling. But my
point is that we don’t need anything special to implement this.</para>

<para xml:id="p20">Suppose we decide to talk about serious French students this way:</para>

<screen>&lt;?xml version='1.0'?&gt;
&lt;doc xmlns:h="http://www.w3.org/1999/xhtml"&gt;
&lt;p&gt;Les {#h:eacute}tudiants sont tr{#h:egrave}s
s{#h:eacute}rieux.&lt;/p&gt;
&lt;/doc&gt;</screen>

<para xml:id="p21">I can easily construct an XSLT 2.0 stylesheet that transforms
the input <link xlink:href="unmacro.xsl">into actual characters</link>
and another that transforms the actual characters
<link xlink:href="macro.xsl">back into names</link>.</para>
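
<para>The linked stylesheets are XSLT 2.0 and are not reproduced here. To
give a sense of how little machinery the round trip needs, the following is a
sketch of the same idea in Python, working on plain text strings. It assumes
the <literal>{#prefix:name}</literal> spelling from the example above,
ignores the prefix entirely (a real implementation would resolve it against
the in-scope namespace declarations), and borrows the HTML character names
from Python’s <literal>html.entities</literal> module. The function names
<literal>unmacro</literal> and <literal>macro</literal> merely echo the
stylesheet file names; the code is not derived from those stylesheets.</para>

<programlisting language="python"><![CDATA[
```python
import re
from html.entities import codepoint2name, name2codepoint

# Matches the {#prefix:name} spelling, e.g. {#h:eacute}.
MACRO = re.compile(r"\{#(\w+):(\w+)\}")


def unmacro(text):
    """Expand {#prefix:name} into the named character; leave unknown names alone."""
    def expand(match):
        name = match.group(2)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)
    return MACRO.sub(expand, text)


def macro(text, prefix="h"):
    """Spell every non-ASCII character that has a known name as {#prefix:name}."""
    def name_it(char):
        code = ord(char)
        if code > 127 and code in codepoint2name:
            return "{#" + prefix + ":" + codepoint2name[code] + "}"
        return char
    return "".join(name_it(char) for char in text)


named = "Les {#h:eacute}tudiants sont tr{#h:egrave}s s{#h:eacute}rieux."
literal = unmacro(named)
print(literal)
print(macro(literal) == named)
```
]]></programlisting>

<para>Either function can sit at the front or back of a pipeline, and nothing
else in the toolchain needs to know the macro spelling exists.</para>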

</section>

<section xml:id="s2">
<title>So, what should we do?</title>

<para xml:id="p22">Nothing. Well, nothing standard at the XML level, anyway.
There’s nothing that can practically be done to introduce a new mechanism
at the lexical level in XML. I suppose that some future version of XML could
decide that <emphasis>all</emphasis> the
<link xlink:href="http://www.w3.org/2003/entities/">ISO 8879/9573 entities</link>
were predefined, but that seems unlikely to me.</para>

<para xml:id="p23">And if you aren’t going to solve this at the lexical level, just
solve it as part of your processing pipeline. If there really is a
best way to spell characters at this level, it’ll win by natural
selection and tool vendors will be motivated to make it efficient. If not,
then we shouldn’t be standardizing it anyway, right?</para>
</section>

</essay>

