Taking a Hard Line

Volume 6, Issue 49; 30 Jun 2003

There's serious debate in the successor-to-RSS world about how to maintain escaped HTML markup in a feed. I'm just appalled. [Updated 22 Aug 2003.]

A man who is “of sound mind” is one who keeps the inner madman under lock and key.

—Paul Valéry

[For a more recent, and perhaps marginally less inflammatory, discussion of some of these issues, see Embedded Markup Considered Harmful on XML.com. --22 Aug 2003]

This is a flame. A rant. A tirade. There, I admit it. Up front. Any offense you take is your own.

The ongoing design effort to build a better syndication format (a successor to RSS in all its versions) has stumbled in some predictable places. On further reflection, feature creep doesn't surprise me: what starts out small and simple rapidly becomes not so small and not so simple. I've served on enough working groups to know that this is an inevitable side-effect of “design by committee”. That's ok. There's still plenty of time to throw most of the cruft away.

But I am astonished by the debate over escaped HTML. Most of the phrases that spring immediately to mind when I consider this debate are unprintable. At least, I'm not going to print them. Not in public. Here's a sanitized summary: “are you freaking nuts?”

If there's any part of RSS that's totally broken, it's the notion that I would publish the item description “this is my description” like this:

This is &lt;em&gt;my&lt;/em&gt; description.

That's so totally absurd that I won't do it. I could generate RSS feeds that way, but I'd rather just stick to plain vanilla text in my abstracts than stoop that low.

I find reading the Wiki pages quite challenging, but as near as I can tell, there are three arguments for allowing escaped HTML: legacy, tools, and content dispatching.

Legacy

You're joking, right? Any legacy data that you have will need to be transformed into the new format. You can fix the botched markup when you do the transformation. John Cowan's done all the heavy lifting already with TagSoup.

I suppose there may be systems out there in which it will be difficult or impossible to clean up the legacy. I'm sympathetic, but not so sympathetic that I'd consider that sufficient argument to carry this ugly kludge forward.

Tools that generate markup that isn't well formed

Yeah. News flash: they're broken. Fix them or use others.

The content isn't interesting to the feed consumer

This argument follows from the fact that many feed consumers just want to pass the content to another application for interpretation. The aggregator, the argument goes, doesn't care about the content and shouldn't have to parse it.

Ok, that sounds reasonable. For about a microsecond.

First off, you've got to parse the content anyway. Whether you parse it as markup or text is irrelevant; it's still going through the parser. Granted, if it's not well formed, it'll make the parser choke, but that's a feature, folks.

Now, once you've got it parsed, if you really want to send it off to some other application as text, that's dead simple. Going from markup back to a serialized form (assuming you can't pass it on as SAX events or some DOM more efficiently anyway) is completely straightforward.

Going the other direction, taking a blob of text, unescaping the markup characters, and reinterpreting it in the face of possible well-formedness errors is much harder. And if you're expected to recover from broken markup (an expectation that I thought was squashed once and for all in 1998), not only is it much harder still, it's a playground for all sorts of miscreants to devise trojan horses and other mischief.
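To make the asymmetry concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The two sample descriptions and the reparse helper are made up for illustration; they don't come from any real feed or aggregator.

```python
import xml.etree.ElementTree as ET

# Hypothetical item descriptions: the same sentence, escaped and as real markup.
ESCAPED = "<description>This is &lt;em&gt;my&lt;/em&gt; description.</description>"
MARKUP = "<description>This is <em>my</em> description.</description>"

escaped = ET.fromstring(ESCAPED)
markup = ET.fromstring(MARKUP)

# Escaped: the parser hands back one opaque string and no child elements.
print(len(escaped), repr(escaped.text))

# Real markup: the structure is visible, and getting text back is a join
# over the serialized children (ET.tostring includes each child's tail text).
print(len(markup), [child.tag for child in markup])
inner = (markup.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in markup
)
print(inner)


def reparse(text):
    """Hypothetical helper: second parse of unescaped text, None on failure."""
    try:
        return ET.fromstring("<wrapper>" + text + "</wrapper>")
    except ET.ParseError:
        return None
```

With real markup the consumer does one parse and an optional serialize. With escaped markup it needs a second parse that can fail on every item, and anything beyond "give up" at that point is guesswork about what the author meant.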

XML tools are commodities these days. Consensus that the successor to RSS will be expressed in XML was achieved fairly quickly. I expect to process feeds with XML tools and I expect those tools to have access to all of the content. I want to be able to examine the markup coming in to see if it's legit. Does it contain script tags? Does it use a namespace that I wish to reject or validate differently? In short, does it pass muster?
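As a rough sketch of what "pass muster" might look like when the content arrives as real markup, again with Python's xml.etree.ElementTree: the allow-list, the banned-element set, and the function names below are illustrative assumptions, not anything the feed format specifies.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative policy only; a real consumer would choose its own.
ALLOWED_NAMESPACES = {XHTML_NS}
BANNED_LOCAL_NAMES = {"script"}


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def problems(content):
    """Return reasons to reject the descendants of the `content` element."""
    found = []
    for el in content.iter():
        if el is content:
            continue  # the wrapper itself belongs to the feed vocabulary
        ns, local = split_tag(el.tag)
        if ns not in ALLOWED_NAMESPACES:
            found.append(f"unexpected namespace {ns!r} on <{local}>")
        elif local.lower() in BANNED_LOCAL_NAMES:
            found.append(f"banned element <{local}>")
    return found


item = ET.fromstring(
    '<description><div xmlns="http://www.w3.org/1999/xhtml">'
    "Hello <script>alert(1)</script></div></description>"
)
print(problems(item))
```

None of this is possible on an escaped blob without first unescaping and reparsing it, which is exactly the step that invites trouble.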

Well formed markup only, please.

[An extension module that supports base64 encoded binary or something, for distributing software patches or other binary content might be reasonable, but that doesn't belong in the base specification.]

Comments

For what it's worth, I'm appalled too. All the talk about escaping, and even double-escaping HTML markup in RSS some time ago on various RSS lists literally made me sick inside, and I had to take a long break from involving myself in anything RSS-related. And now it looks like the same mistakes will be made all over again with Echo.

But to comment more on the technical side of things: As far as I can see, much confusion seems to stem from a general lack of understanding of what entities (and entity references), characters, (numeric) character references, bytes and CDATA sections really *are* and how they work in an ‘XML framework’. They're all just mechanisms for writing ordinary character data (i.e. ‘text’). (And entities can also be used as a simple macro language for inserting frequently used text blocks easily.)

XML already *has* a well thought-out mechanism for mixing various XML vocabularies. It's called XML namespaces, and nothing could be simpler to use. To embed a piece of XHTML (or any other XML-based content language, e.g. MathML) in an XML document, such as an Echo document, you just include the relevant piece directly (or by parsing and serialising, to get rid of any unresolved entity references), and put a namespace declaration on it. Example:

<description>This is my description</description>

(Since the content may span several paragraphs, it'll probably be a good idea to use the XHTML ‘body’ element as a wrapper element.)

While it's extremely easy to write, it's even easier to parse. All XML parsers support namespaces, and you just have to dump the contents of each element (XSLT example: <xsl:value-of select="description"/>) to have a nice, readable plain-text version (not all RSS/Echo clients will have an XHTML and CSS rendering engine included). And if the RSS/Echo client works by building an XHTML document, and then sending this to the default browser, you just have to serialise the ‘description’ element again (an identity transformation in XSLT, and likely a one-liner in most XML tools/parsers).
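For readers who don't use XSLT, the same two operations might look roughly like this in Python's xml.etree.ElementTree. The sample entry, including the ‘body’ wrapper and the paragraph inside it, is an assumed illustration and not a fragment of any real feed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML as the default namespace instead of an invented prefix.
# Note that register_namespace changes module-wide state.
ET.register_namespace("", XHTML_NS)

# Assumed sample: a description holding namespaced XHTML in a body wrapper.
ENTRY = (
    "<description>"
    f'<body xmlns="{XHTML_NS}"><p>This is <em>my</em> description.</p></body>'
    "</description>"
)

description = ET.fromstring(ENTRY)

# Plain-text version: concatenate every text node under the element,
# roughly what <xsl:value-of select="description"/> produces.
plain = "".join(description.itertext())
print(plain)

# Markup version: serialize the XHTML wrapper again for a browser.
body = description.find(f"{{{XHTML_NS}}}body")
xhtml = ET.tostring(body, encoding="unicode")
print(xhtml)
```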

Again, nothing could be simpler.

You write

"And if you're expected to recover from broken markup (an expectation that I thought was squashed once and for all in 1998), not only is it much harder still, it's a playground for all sorts of miscreants to devise trojan horses and other mischief."

but the source of the very same article, while sporting an XML doctype declaration

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

, includes around ten structural errors such as

Again, nothing could be simpler.</div>

, which make it malformed.

Check http://snurl.com/1oze .

Preaching well-formed XML will work much better when written in well-formed XML :)

Tobi

P.S. XHTML, just as RSS, is being parsed, processed, transformed, filtered, indexed, etc.

Ah. Bleh. The article is well formed and, in fact, valid. But the talkback comments are not. I'll have to fiddle my CGI script a bit.

Norm

English is not my native language, so I might miss the meaning, implications, and finer nuances of "Ah." and "Bleh.".

I just thought I'd let you know that the page of the article at http://norman.walsh.name/2003/06/30/hardline is/was invalid (since not well-formed).

I hope you didn't take it as an offense, or nitpicking; the report was meant as helpful feedback.

I very much share both your opinions: 1. Including escaped markup in RSS is not a good idea; this should be done via namespaces. 2. Everything that says it's XML should be well-formed (IMHO, even valid, e.g. in respect to some standard).

Tobi

No offense taken, Tobi, I appreciate the report. Most of the uglier problems were in the feedback comments. I've fixed the CGI script that includes them. I've also fixed a couple of other HTML bugs.

I believe all the pages are valid now.

The RSS feed for my utterly unimportant web site uses plain text descriptions (stripping out all mark-up) for just that reason -- I do not want to get enmeshed in the ambiguity of escaped HTML content.

My attempts to create RSS readers quickly have been thwarted, because I cannot process RSS in XSLT (my tool of choice) because it is escaped, and, often, not well-formed. My friends' RSS feeds are generated by a program that takes the first N characters of the HTML data as title -- often slicing a tag in half.

The debate over allowing (or even *requiring*) HTML to be escaped as CDATA is a symptom of the document vs. data views of XML. Having used the phrase "Echo document", naturally you expect the content to be part of the document. People who say "Echo datastream" think the content of the entry "is just data" and should be escaped...

There is a weird minority who think that doing escaping with CDATA (rather than entities) makes it more OK because a human reading the document will be able to see the mark-up as mark-up, even though the processing application cannot. IMHO, the processing application needs to see the mark-up more than human readers do!