Emacs, XML, Unicode

Volume 6, Issue 90; 29 Sep 2003

Inserting Unicode characters into emacs. Input methods like this greatly reduce the need for entity declarations, the last remaining holdouts from my life with DTDs.

Some people do their laundry in emacs, but I find typing ^C-^X-^W-q-L-TT to add the fabric softener to be a bit cumbersome.

rlr at panix.com

Tim Bray followed up (in a sense) to my essay on moving beyond DTDs with some nifty emacs code for inserting special Unicode characters directly into a buffer.

Input methods like this greatly reduce the need for entity declarations, the last remaining holdouts from my life with DTDs. Who needs:

<!DOCTYPE article [
<!ENTITY euro "&#x20AC;">
]>

And the corresponding “&euro;”, if you can just stuff a “€” into your buffer!

I read Tim’s essay and decided that he was right about some things, like “smart quotes,” but he didn’t do it quite the way I would have. So I banged away for a bit and coded up XML Characters, my own solution.

XML Characters provides four functions:


This function, which I bind to " in nxml-mode, inserts the appropriate double quote. Called after a space, newline, or >, it inserts a left double quote. Called after a double quote, it cycles through the three possible quote styles: left, straight, or right. Called anywhere else, it inserts a right double quote.

Inside a start tag, it always inserts just a vanilla ".
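
The rules above can be sketched in a few lines of Emacs Lisp. This is only an illustration of the behavior described, not the code from XML Characters: the names my-inside-tag-p and my-smart-double-quote are made up for the example, the "inside a tag" test is a crude heuristic (it looks for the nearest angle bracket and will be fooled by a literal > in an attribute value), and I've assumed that the cycle wraps from right back to left.

```elisp
;; Hypothetical sketch; the function names are illustrative only.

(defun my-inside-tag-p ()
  "Return non-nil if point appears to be inside a tag.
Crude heuristic: the nearest angle bracket before point is `<'."
  (save-excursion
    (and (re-search-backward "[<>]" nil t)
         (eq (char-after) ?<))))

(defun my-smart-double-quote ()
  "Insert a context-appropriate double quote at point."
  (interactive)
  (let ((prev (char-before)))
    (cond
     ;; Inside a tag: always a plain straight quote.
     ((my-inside-tag-p) (insert ?\"))
     ;; After a quote: cycle left -> straight -> right -> left.
     ((eq prev ?\u201C) (delete-char -1) (insert ?\"))
     ((eq prev ?\") (delete-char -1) (insert ?\u201D))
     ((eq prev ?\u201D) (delete-char -1) (insert ?\u201C))
     ;; At the start of the buffer or after space, newline, or `>'.
     ((or (null prev) (memq prev '(?\s ?\n ?>))) (insert ?\u201C))
     ;; Anywhere else: a right double quote.
     (t (insert ?\u201D)))))

;; Bind it to the double-quote key once nxml-mode has been loaded.
(with-eval-after-load 'nxml-mode
  (define-key nxml-mode-map "\"" #'my-smart-double-quote))
```

A single-quote version would follow the same pattern with U+2018 and U+2019.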


I bind this to ' in nxml-mode and it does just what you think it does.


This function inserts a named XML character. For example, (insert-xml-char "sect") inserts a section mark (§). The set of names is maintained in a couple of association lists, so you can easily tweak them. Called with no arguments, it pops up a menu, somewhat like Tim’s code.

I bind this to C-t c because I have a pretty extensive Ctrl-T map that I’m used to using.
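
For readers who want the general shape of such a function, here is a minimal sketch. It is not the XML Characters source: my-xml-char-alist, my-insert-xml-char, and my-ctl-t-map are hypothetical names, the table holds only a handful of sample entries, and completing-read stands in for the pop-up menu.

```elisp
;; Hypothetical sketch; names and table contents are illustrative only.

(defvar my-xml-char-alist
  '(("sect" . ?\u00A7)   ; section mark
    ("copy" . ?\u00A9)   ; copyright sign
    ("yen"  . ?\u00A5)   ; yen sign
    ("euro" . ?\u20AC))  ; euro sign
  "Association list mapping character names to characters.")

(defun my-insert-xml-char (&optional name)
  "Insert the character called NAME from `my-xml-char-alist'.
With no NAME, prompt for one with completion."
  (interactive)
  (let* ((name (or name
                   (completing-read "Character name: "
                                    my-xml-char-alist nil t)))
         (entry (assoc name my-xml-char-alist)))
    (if entry
        (insert (cdr entry))
      (error "Unknown character name: %s" name))))

;; A private C-t prefix map.  Note that this replaces the default
;; binding of C-t, which is `transpose-chars'.
(defvar my-ctl-t-map (make-sparse-keymap)
  "Prefix keymap for personal C-t commands.")
(global-set-key (kbd "C-t") my-ctl-t-map)
(define-key my-ctl-t-map (kbd "c") #'my-insert-xml-char)
```

With this loaded, (my-insert-xml-char "sect") inserts a section mark, and adding a character is a matter of adding one pair to the list.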


Where Tim seems content to have a selection of accented characters in a menu, I decided I wanted more complete and uniform access to all the ISO Latin 1 accented characters (plus a few other things; there’s another list, so you’re free to tweak).

For my function, I chose to read two more keystrokes and compose the appropriate character that way. I bind this to C-t e at the moment.

So, for example, I can type C-t e e ' to insert “e acute”. Or C-t e $ y to insert a yen symbol.
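
The two-keystroke idea can be sketched as follows. Again, this is an illustration under assumptions rather than the real code: the names my-xml-compose-alist, my-xml-compose-lookup, and my-insert-composed-char are hypothetical, and only the "e '" and "$ y" pairs come from the examples above; the other table entries are guesses at what such a list might hold.

```elisp
;; Hypothetical sketch; names and most table entries are illustrative.

(defvar my-xml-compose-alist
  '(("e'" . ?\u00E9)   ; e acute (from the example above)
    ("$y" . ?\u00A5)   ; yen sign (from the example above)
    ("e`" . ?\u00E8)   ; e grave (assumed entry)
    ("a'" . ?\u00E1)   ; a acute (assumed entry)
    ("n~" . ?\u00F1))  ; n tilde (assumed entry)
  "Association list mapping two-keystroke sequences to characters.")

(defun my-xml-compose-lookup (first second)
  "Return the character composed from FIRST and SECOND, or nil."
  (cdr (assoc (string first second) my-xml-compose-alist)))

(defun my-insert-composed-char ()
  "Read two keystrokes and insert the character they compose."
  (interactive)
  (let* ((first (read-char "Compose: "))
         (second (read-char (format "Compose: %c" first)))
         (char (my-xml-compose-lookup first second)))
    (if char
        (insert char)
      (user-error "No character for %c%c" first second))))
```

Keeping the lookup in its own function means the table can be checked without typing anything: (my-xml-compose-lookup ?e ?\') returns the code for e acute.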

Thanks, Tim! I didn’t know how much I needed these functions before I wrote them. In the course of writing one essay, I’ve decided I wouldn’t want to live without them.


Thanks! This is incredibly useful.

One problem, though: When I try to use the insert-xml-char function, Emacs segfaults. Any idea on why?

—Posted by Mike Kozlowski on 30 Sep 2003 @ 04:57 UTC #

But entities aren't just useful for inserting characters outside the US-ASCII range. A document might, for example, contain the version of the software tool it describes; with entities I simply change one entity declaration and get the new version number in all fifty places (search and replace could go wrong and confuse version numbers of other tools mentioned). It would be great to have some mechanism for declaring simple constants in XML without having to use DTD doctype declarations with internal subsets.

—Posted by Tobi on 01 Oct 2003 @ 08:45 UTC #

Tobi: XInclude.


—Posted by Mike Kozlowski on 01 Oct 2003 @ 09:51 UTC #

I need to do more work on it, but I've been using a separate step for character entity replacement prior to validation.


Ents uses an XML file which lists all the names and values of character entities, and I run it as a pre-processor before parsing or just directly on docs at the command line. Defining new entities is pretty simple, and one of the nicer aspects of this is that I can run it backwards to produce the entity-fied version instead of char ref version.

But then, since I don't use Emacs, I have to resort to such hackery, right?

—Posted by Simon St.Laurent on 03 Oct 2003 @ 02:08 UTC #

I'm sorry, Mike, I can't imagine why it segfaults. Send me your particulars in email and I'll see if I can help.

—Posted by Norman Walsh on 03 Oct 2003 @ 02:11 UTC #