Thoughts on Character Entities

Volume 6, Issue 110; 13 Nov 2003

After much consideration, I don’t think XML should try to solve the character entity problem.

The price one pays for pursuing any profession or calling is an intimate knowledge of its ugly side.

—James Baldwin

I don’t think XML should try to solve the character entity problem. There, I’ve said it. I wonder where I left my flame-retardant underwear?

The character entity problem, if you aren’t familiar with it (and if you aren’t, this essay might be a good one to skip), is the fact that entity references, often used to name characters that can’t conveniently be typed, aren’t allowed in well-formed XML documents that don’t have a DTD.

The following document is not well-formed:

<?xml version='1.0'?>
<doc>
<p>Les &eacute;tudiants sont tr&egrave;s
s&eacute;rieux.</p>
</doc>

XML has only five predefined entities (&lt;, &amp;, etc.); you have to define the rest in the DTD (either the internal or external subset). The problem is that well-formed documents don’t have to have a DTD. There are some good reasons to avoid a DTD: you may be using another schema language, for example, or you may want to use your documents in environments like SOAP, which forbid the document type declaration.
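To see the problem concretely, here is a minimal sketch in Python using the standard library's ElementTree parser. The helper names `is_well_formed` and `paragraph_text` are made up for this illustration, and the second document simply declares the two entities in the internal subset, which is one of the options described above.

```python
import xml.etree.ElementTree as ET

# The example from this essay: entity references with no DTD to declare them.
NO_DTD = """<?xml version='1.0'?>
<doc>
<p>Les &eacute;tudiants sont tr&egrave;s
s&eacute;rieux.</p>
</doc>"""

# The same document with the two entities declared in the internal subset.
WITH_DTD = """<?xml version='1.0'?>
<!DOCTYPE doc [
  <!ENTITY eacute "&#xE9;">
  <!ENTITY egrave "&#xE8;">
]>
<doc>
<p>Les &eacute;tudiants sont tr&egrave;s
s&eacute;rieux.</p>
</doc>"""


def is_well_formed(text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def paragraph_text(text):
    """Parse the document and return the text of its first <p> element."""
    return ET.fromstring(text).find("p").text


print(is_well_formed(NO_DTD))    # the parser rejects the undeclared entities
print(is_well_formed(WITH_DTD))  # the declarations make the references legal
print(paragraph_text(WITH_DTD))
```

The only difference between the two inputs is the document type declaration, which is exactly the thing some environments won’t let you have.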

So what can you do? That, as they say, is the sixty-four dollar question.

For a long time, the answer was “not much,” and we’ve been wrestling with this one for years. I think everyone accepts that the best answer is to just use the characters you want. XML is Unicode, so if you want to say “the students are very serious” in French, say it: “les étudiants sont très sérieux.”

But what if you can’t type in those characters? I’ve hacked up pretty good Unicode support in emacs, but I still can’t even pretend to support interesting mathematics. An equation as simple as this one stymies me:

One solution is to use numeric character references. These refer to characters by their Unicode codepoint, so no declarations are necessary. They work, but they’re hard to read. Now, in point of fact, emacs-nxml-mode does a pretty decent job of displaying numeric character references, which reinforces my feeling that this is really a user-interface issue.

With a better font, that’d be much more readable. It would still be ugly, but I wouldn’t have to remember all those codepoints to read it. Of course, with a better font, I wouldn’t have bothered with the numeric character references, I would have just stuck in the characters.
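For what it’s worth, the parser really doesn’t care which spelling you use. The sketch below (Python, standard library only) parses the same word written three ways and compares the results; `to_numeric_references` is a hypothetical helper showing how mechanical the conversion is, not part of any library.

```python
import xml.etree.ElementTree as ET

# One word, three spellings: a literal character, a hexadecimal
# character reference, and a decimal character reference.
literal = ET.fromstring("<p>s\u00e9rieux</p>")
hexadecimal = ET.fromstring("<p>s&#xE9;rieux</p>")
decimal = ET.fromstring("<p>s&#233;rieux</p>")

# After parsing, nothing records which spelling was used.
print(literal.text == hexadecimal.text == decimal.text)


def to_numeric_references(text):
    """Rewrite every non-ASCII character as a hexadecimal character reference.

    This works on character data only; it does not escape '<' or '&'.
    """
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


print(to_numeric_references("tr\u00e8s s\u00e9rieux"))
```

No declarations are needed for any of this, which is the appeal; the cost is entirely on the human reading the source.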

What are the other options?

Tim Bray recently proposed a solution that involved a new encoding. On balance, I think the proposal would introduce so much confusion that it’s probably not the right answer, but it does have a significant benefit over all the other proposals I’ve seen: it’s a lexical hack. The more I think about it, the more I think that a standard solution to this problem should address it at the lexical level.

By lexical level, I mean that the solution should not intrude into the XML object model. Whatever solution we arrive at (if we do), “sérieux” spelled with a literal character and “sérieux” spelled with the new mechanism should be indistinguishable in the XML object model. (Just as “sérieux” and “s&#xE9;rieux” are indistinguishable now.)

If it isn’t, then we don’t need a standard solution. If the solution involves some sort of transformation from the “input document” to the “real document”, then that transformation can just be another step on the XML processing pipeline and we don’t need to bake it into XML. (We need an XML processing pipeline, but that’s a whole other story.)

Suppose, for example, that we use elements to represent these characters. Some folks feel strongly that the named character mechanism should have namespaced names, and elements provide that for free. Then you might spell “sérieux” as “s<e:eacute/>rieux”.

Almost any non-trivial operation that you perform on a document that uses this mechanism is going to have to transform all the named characters into literal characters before it begins (I can’t imagine searching or sorting or any other operation on the text that’s not going to find it hugely inconvenient, if not impossible, to operate on the text with embedded elements). But if there’s going to be an explicit transformation, nothing special is needed: just code that transformation up in XSLT or Java or Perl or Python or your language of choice and put it in the pipeline.
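As a rough illustration of how small such a pipeline step can be, here is a sketch in Python. It rests on a few assumptions that are not part of any proposal: the namespace URI `http://example.com/ns/chars` is invented, the element names are taken to be the HTML entity names (looked up in the standard library’s `html.entities` table), and `flatten_named_characters` is a made-up function name.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical namespace for named-character elements; the essay does not
# say what the "e" prefix would actually be bound to.
CHAR_NS = "http://example.com/ns/chars"


def flatten_named_characters(root):
    """Replace each {CHAR_NS}name element with the character it names, in place.

    Raises KeyError for a name that is not in the HTML entity table.
    """
    prefix = "{" + CHAR_NS + "}"
    for parent in list(root.iter()):
        previous = None  # the last child we decided to keep
        for child in list(parent):
            if child.tag.startswith(prefix):
                name = child.tag[len(prefix):]
                # The character plus whatever text followed the element.
                replacement = chr(name2codepoint[name]) + (child.tail or "")
                if previous is None:
                    parent.text = (parent.text or "") + replacement
                else:
                    previous.tail = (previous.tail or "") + replacement
                parent.remove(child)
            else:
                previous = child
    return root


doc = ET.fromstring(
    '<p xmlns:e="http://example.com/ns/chars">'
    "Les <e:eacute/>tudiants sont tr<e:egrave/>s <b>s<e:eacute/>rieux</b>.</p>"
)
flatten_named_characters(doc)
print("".join(doc.itertext()))
```

Once this step has run, every later stage sees ordinary text and never needs to know the elements existed.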

One of the things that motivated this essay was a recent proposal that named characters should be transformed by schema validation. I think that’s a horrible idea. I can understand how the XML Schema WG got there, but I think that making the interpretation of named characters depend on validation is a really bad idea.

The proposal I saw (from Michael Sperberg-McQueen, alas in W3C member-only space; in the interest of fairness, I stole the example sentence about serious students from Michael’s proposal) suggested that we might spell “sérieux” as “s{#e:eacute}rieux” and that validation would provide the replacement text.

I don’t know if that’s a good way to spell those characters or not; John Cowan offered some good arguments for a slightly different spelling. But my point is that we don’t need anything special to implement this.

Suppose we decide to talk about serious French students this way:

<?xml version='1.0'?>
<doc xmlns:h="http://www.w3.org/1999/xhtml">
<p>Les {#h:eacute}tudiants sont tr{#h:egrave}s
s{#h:eacute}rieux.</p>
</doc>

I can easily construct an XSLT 2.0 stylesheet that transforms the input into actual characters and another that transforms the actual characters back into names.
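The same pair of transformations can be sketched in a few lines of Python, as a stand-in for the XSLT 2.0 stylesheets rather than a copy of them. The sketch assumes that the `h` prefix selects the HTML entity names (as the XHTML namespace binding above suggests) and hard-codes that prefix instead of resolving namespace declarations; the function names are invented for this example.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Assumption: "h" always means "HTML entity names". A real tool would resolve
# the prefix against the in-scope namespace declarations instead.
NAMED_CHAR = re.compile(r"\{#h:([A-Za-z][A-Za-z0-9]*)\}")


def names_to_characters(text):
    """Replace {#h:name} with the character it names; leave unknown names alone."""
    def expand(match):
        codepoint = name2codepoint.get(match.group(1))
        return match.group(0) if codepoint is None else chr(codepoint)

    return NAMED_CHAR.sub(expand, text)


def characters_to_names(text):
    """Replace non-ASCII characters that have an HTML name with {#h:name}."""
    def name(ch):
        if ord(ch) > 127 and ord(ch) in codepoint2name:
            return "{#h:" + codepoint2name[ord(ch)] + "}"
        return ch

    return "".join(name(ch) for ch in text)


source = "Les {#h:eacute}tudiants sont tr{#h:egrave}s\ns{#h:eacute}rieux."
expanded = names_to_characters(source)
print(expanded)
print(characters_to_names(expanded) == source)
```

Both directions operate on character data, so they could run over the text nodes of a parsed document at either end of a pipeline.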

So, what should we do?

Nothing. Well, nothing standard at the XML level, anyway. There’s nothing that can practically be done to introduce a new mechanism at the lexical level in XML. I suppose that some future version of XML could decide that all the ISO 8879/9573 entities were predefined, but that seems unlikely to me.

And if you aren’t going to solve this at the lexical level, just solve it as part of your processing pipeline. If there really is a best way to spell characters at this level, it’ll win by natural selection and tool vendors will be motivated to make it efficient. If not, then we shouldn’t be standardizing it anyway, right?

Comments

I'm not sure that input support in emacs (or any other editor) really addresses this. I've been able to type the above using a UK keyboard and its standard iso-accents support since I started using emacs 18 sometime around 1987. That was/is entering latin1 bytes rather than a unicode encoding, but the principle of simple ascii-based keyboarding producing non-ascii characters in the file is hardly new.

I think the main reason that people want "character entities" is for the reason that XML or TeX is "self describing". If I look at your document and see <foo>&pi;</foo> then I know how to produce that on whatever system I have, but if you have used some funky input mechanism so you type pi and a pi character gets inserted, then even if my system shows that as a pi I may not know how to enter it, and you can't tell me as you don't know my system.

Using elements/attributes addresses this problem, but at some considerable cost in the size of the underlying dom/infoset. (Early mathml drafts proposed <mchar name="rightarrow"/> but there was negative comment on what happens to the dom in your browser if every character gets represented by an element node, an attribute node, and a few namespace nodes.)

The problem with dtds is not only that they may not be allowed (soap) but they may not be read at all (mozilla) unless you put it all in the internal subset.

<x>&rightarrow;</x>

is not well formed, but

<!DOCTYPE x SYSTEM "x-rubbish:not-here"> <x>&rightarrow;</x>

is well formed (but presumably not valid). If undefined entities were wellformed, you would have a chance of passing fragments using them through an xml pipeline, and so long as the end application knew what they were supposed to mean, things would work out. As it is, you tend to die with a fatal parse error at the start of your pipeline, which is no fun.

It isn't clear to me if it is too late to change this. I think it could perhaps have gone in xml 1.1 but it didn't and now it may be better to live with the problem rather than try to fix it, but that isn't quite the same thing as saying that there isn't a problem:-)

David