Escaped Markup: What To Do Instead

Volume 6, Issue 86; 18 Sep 2003

I've argued against escaped markup in several forums: time to stop for a while. Either I've made my points or I haven't; repeating myself won't help. But since a number of people have suggested that I'm not proposing any solutions: here are some solutions. And a challenge; or at least an exercise that I think might be interesting.

Optimism is an occupational hazard of programming: testing is the treatment.

—K. Beck

I've written about this a few times now, enough to warrant a thread (even though I've mostly abandoned threading), and I think I've said just about all I can usefully say.

Apparently I still haven't specified what I think the alternatives are in a clear enough fashion. I'll try to rectify that in this essay.

But first, a quick recap.

I think escaped markup is inherently dangerous and must be outlawed in Atom (substitute your favorite Son-of-RSS name for Atom; I'm agnostic) and all other specifications. In brief:

It moves content that one could reasonably desire to address with XML tools into a realm where those tools do not and cannot operate.
It is, at best, a partial solution to the problem. It fails to address encoding and other internationalization issues.
It encourages naive users to believe that escaped markup is an acceptable solution to the general problem of how to stick markup where a schema says it may not go.

The last point, in particular, makes it dangerous. The first two just make it a nasty kludge.
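To make the first point concrete, here is a minimal sketch in Python (chosen only for illustration). The `content` element, the `find_links` helper, and the example.org URL are all made up for this example; nothing here is taken from any Atom draft. It asks an ordinary XML tool for the links in a piece of content, once with the markup escaped and once with it inline and well-formed.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The same paragraph twice: once as escaped markup, once as real markup.
escaped = (
    "<content>&lt;p&gt;See &lt;a href='http://example.org/'&gt;this&lt;/a&gt;."
    "&lt;/p&gt;</content>"
)
inline = (
    "<content><p xmlns='%s'>See <a href='http://example.org/'>this</a>.</p>"
    "</content>" % XHTML
)

def find_links(xml_text):
    """Return the href of every XHTML <a> element the XML parser can see."""
    root = ET.fromstring(xml_text)
    return [a.get("href") for a in root.iter("{%s}a" % XHTML)]

print(find_links(escaped))  # [] : the parser sees only a string of text
print(find_links(inline))   # ['http://example.org/']
```

In the escaped case the link is still there for a human reader, but to the parser it is just characters. Getting at it means unescaping and parsing a second time, with no guarantee that the result is well-formed.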

And for the record, I strongly object to the insinuation that my opinion on this matter demonstrates ivory tower thinking. I'm desperately worried about the practical ramifications of escaped markup.

So what are the alternatives?

Stick to plain text, don't even try to put any markup in there.

I think that's a marginally acceptable solution for Atom applications that are publishing abstracts and pointers, as most of the feeds I read seem to do.

If the schema for your Atom variant of choice defines the content of an element so that it can only contain text, this is what you must do. That's what it means to have schema constraints.

Allow markup and insist that it be well-formed.

This is arguably the hardest thing to do, but it's not really that hard, is it? For any piece of content that you want to publish in your feed, you have to run it through some utility to make it well formed. I argue that such a transformation is not significantly harder than the transformation needed to properly handle escaping.

If the content you want to syndicate really contains markup that you can't represent in XML (such as document type declarations), I think there are three options: use MIME or some other mechanism to make them proper attachments, leave them on the net somewhere and point to them, or base64 encode them.
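As a rough illustration of the second option, the sketch below uses only Python's standard `html.parser` to turn a sloppy HTML fragment into well-formed XML. It is deliberately naive and is not a substitute for a real cleanup tool such as HTML Tidy. The names `Wellformer`, `make_well_formed`, and `VOID` are hypothetical, the wrapping `div` in the XHTML namespace is an assumption, and the code makes no attempt to check that attribute names are legal XML names or that the nesting is valid XHTML. Comments and document type declarations are simply dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content in HTML (an illustrative subset only).
VOID = {"br", "hr", "img", "input", "meta", "link"}

class Wellformer(HTMLParser):
    """Re-emit HTML events as balanced, properly escaped XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of output markup
        self.open = []   # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        # dict() drops duplicate attributes; valueless ones get name="name".
        attr_text = "".join(
            " %s=%s" % (name, quoteattr(value if value is not None else name))
            for name, value in dict(attrs).items()
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open:
            top = self.open.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # Character references arrive already decoded; escape &, < and >.
        self.out.append(escape(data))

def make_well_formed(html_fragment):
    parser = Wellformer()
    parser.feed(html_fragment)
    parser.close()
    while parser.open:  # close anything still open at the end
        parser.out.append("</%s>" % parser.open.pop())
    body = "".join(parser.out)
    return "<div xmlns='http://www.w3.org/1999/xhtml'>%s</div>" % body

print(make_well_formed("<p>Tom & Jerry<br><b>bold <i>both</b> oops"))
```

The point is only that the cleanup step is mechanical: the same program that would otherwise escape the fragment can balance and escape it properly instead.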

What, some demand, is the gain of base64 encoding? I'll tell you what the gain is: human authors will not be encouraged to write base64 by hand. They will not imagine that trivially escaped markup is the right answer in other problem domains where they want to put markup in fields that the schema constrains to text.

It has no technical gain for the machines (but no significant cost, either), but tremendously improved semantics for end users.
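For what it's worth, the base64 route is only a few lines in most languages. The following Python sketch is an assumption-laden illustration: the `content` element and its `type` and `mode` attributes, and the helper names `wrap` and `unwrap`, are placeholders rather than anything a particular specification mandates. The payload is a fragment with a document type declaration and an ISO 8859-1 byte, neither of which could be pasted into an XML feed as markup.

```python
import base64
import xml.etree.ElementTree as ET

# Raw bytes as they might sit on disk: a doctype, an unclosed <br>,
# and an ISO 8859-1 encoded "e acute" (0xE9).
raw = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<p>caf\xe9<br>'

def wrap(raw_bytes, media_type):
    """Carry arbitrary bytes in a text-only element (hypothetical vocabulary)."""
    el = ET.Element("content", {"type": media_type, "mode": "base64"})
    el.text = base64.b64encode(raw_bytes).decode("ascii")
    return el

def unwrap(el):
    """Recover the original bytes exactly as they were."""
    return base64.b64decode(el.text)

xml_text = ET.tostring(wrap(raw, "text/html; charset=iso-8859-1"), encoding="unicode")
print(xml_text)
print(unwrap(ET.fromstring(xml_text)) == raw)  # True
```

Because the bytes round-trip untouched, the original encoding travels with the payload. Nobody reading the feed source is tempted to edit the payload by hand.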

I'd like to try a little experiment. Here are two documents; neither is well-formed XML, but both display “correctly” in my browser (Mozilla Firebird on Linux):

doc1.htm [sic] is an ISO 8859-1 document.
doc2.html is a UTF-8 document.

Personally, I would syndicate just the abstracts, but I could syndicate the entire contents, if that's what was required.

If you think escaped markup is the answer, what does your feed look like? If you have tools that build your feed automatically, what do they do with these files?

Comments

"Allow markup and insist that it be well-formed.

This is arguably the hardest thing to do, but it's not really that hard, is it?"

I agree that it isn't very hard. RSS is XML and thus must be well-formed (and valid).

Thus people and tools creating RSS (and similar formats) can be expected and required to, and should be able to, make sure that inlined XHTML is well-formed as well.

Tobi

... and it might be a good idea to put XHTML 1 in the XHTML 1 namespace.

Tobi

HTML is a meme. XHTML may someday replace it, but till then, I don't think you can get around it. I like the "leave them on the net somewhere and point to them" solution. That is, SSF.

Why not use namespaces? This is what Jabber/XMPP use. For example:

<funkyxml xmlns="funky:xml">
  <item xmlns="http://www.w3.org/1999/xhtml">
    <h1>Woohoo!</h1>
    <p>This is in HTML.</p>
  </item>
</funkyxml>

Here's a page on how to do this in XHTML. It should also be done with every other XML document too: http://www.w3.org/TR/xhtml1/#well-formed

- Nolan

Over two years later, and I'm sitting looking at a WS-I monitor log.xml file and it contains captured SOAP messages as escaped content .. Grr ..

You know what would help folks like me a lot is if we knew how to base64binary encode/decode in XSLT, but anyway, my version of a SOAP log file will embed captured XML documents as XML.

I'm getting 404 errors when clicking on the links to your example feeds and doc2.html.

Bah. I don't know when that happened. Sorry. Fixed now.

I've also been gently and not-so-gently pushing people away from CDATA and from escaped markup in the XML FAQ for a while as well, and I agree we simply haven't reached the mass market on this. We need to shout LOUDER!

It's a form of Tag Abuse, and as such I think it's time to resuscitate The Society for the Definitive Abolition of Tag Abuse (SDATA). All who grok the problem are welcome!

Your three solutions strike me a bit like a "let them eat cake" kind of thing. If a website allows some anonymous user to type in HTML which is then sanitized and published in an Atom feed, then none of your solutions can be used. I suppose that one could attempt to doctor the input to make it well formed, but I don't know how to do that. So what is the solution from the ivory tower?