This is a flame. A rant. A tirade. There, I admit it. Up front. Any offense you take is your own.
The ongoing design effort to build a better syndication format (a successor to RSS in all its versions) has stumbled in some predictable places. On further reflection, feature creep doesn't surprise me: what starts out small and simple rapidly becomes not so small and not so simple. I've served on enough working groups to know that this is an inevitable side-effect of “design by committee”. That's ok. There's still plenty of time to throw most of the cruft away.
But I am astonished by the debate over escaped HTML. Most of the phrases that spring immediately to mind when I consider this debate are unprintable. At least, I'm not going to print them. Not in public. Here's a sanitized summary: “are you freaking nuts?”
If there's any part of RSS that's totally broken, it's the notion that I would publish the item description “This is my description.” like this:
&lt;description&gt;This is &amp;lt;em&amp;gt;my&amp;lt;/em&amp;gt; description.&lt;/description&gt;
That's so totally absurd that I won't do it. I could generate RSS feeds that way, but I'd rather just stick to plain vanilla text in my abstracts than stoop that low.
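To make the absurdity concrete, here's a quick sketch (plain Python, nothing feed-specific, not from the spec debate itself) of what that escaping actually does to the content:

```python
from html import escape, unescape

description = "This is <em>my</em> description."

# Escaped-HTML style smuggles the markup through as text,
# so the feed carries entity soup instead of elements:
escaped = escape(description)
print(escaped)   # This is &lt;em&gt;my&lt;/em&gt; description.

# A consumer then has to unescape and re-parse just to get
# the markup back out:
assert unescape(escaped) == description
```

And since that string sits inside an XML document, it gets escaped *again* on serialization, so the wire format is doubly-encoded gibberish.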
I find reading the Wiki pages quite challenging, but as near as I can tell, there are three arguments for allowing escaped HTML: legacy, tools, and content dispatching.
Legacy data
You're joking, right? Any legacy data that you have will need to be transformed into the new format. You can fix the botched markup when you do the transformation. John Cowan's done all the heavy lifting already with TagSoup.
I suppose there may be systems out there in which it will be difficult or impossible to clean up the legacy data. I'm sympathetic, but not so sympathetic that I'd consider that sufficient argument to carry this ugly kludge forward.
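TagSoup itself is a Java library; purely to illustrate the idea, here's a toy Python cleaner (my own crude stand-in, nothing like TagSoup's actual sophistication) that re-serializes sloppy markup with properly nested end tags:

```python
from html.parser import HTMLParser

class ToyCleaner(HTMLParser):
    """Re-serialize sloppy HTML with properly nested end tags.
    A crude stand-in for TagSoup: attributes and entity
    re-escaping are ignored for brevity."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # Closing <b> while <i> is still open? Close <i> first,
        # so the output nests correctly.
        if tag in self.stack:
            while self.stack:
                top = self.stack.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(data)

def clean(fragment):
    cleaner = ToyCleaner()
    cleaner.feed(fragment)
    cleaner.close()
    # Close anything the author left dangling.
    while cleaner.stack:
        cleaner.out.append(f"</{cleaner.stack.pop()}>")
    return "".join(cleaner.out)

print(clean("<b><i>bold-italic</b> just bold"))
# -> <b><i>bold-italic</i></b> just bold
```

Run the botched legacy descriptions through something like this once, at transformation time, and the problem is gone forever.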
Tools that generate markup that isn't well formed
Yeah. News flash: they're broken. Fix them or use others.
The content isn't interesting to the feed consumer
This argument follows from the fact that many feed consumers just want to pass the content to another application for interpretation. The aggregator, the argument goes, doesn't care about the content and shouldn't have to parse it.
Ok, that sounds reasonable. For about a microsecond.
First off, you've got to parse the content anyway. Whether you parse it as markup or text is irrelevant; it's still going through the parser. Granted, if it's not well formed, it'll make the parser choke, but that's a feature, folks.
Now, once you've got it parsed, if you really want to send it off to some other application as text, that's dead simple. Going from markup back to a serialized form (assuming you can't pass it on as SAX events or some DOM more efficiently anyway) is completely straightforward.
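Here's what "dead simple" looks like with Python's standard XML tooling (any XML library would do; this is just an illustration):

```python
import xml.etree.ElementTree as ET

# An item description carried as real markup, not entity soup.
item = ET.fromstring(
    "<description>This is <em>my</em> description.</description>"
)

# Pass it downstream as serialized markup...
as_markup = ET.tostring(item, encoding="unicode")

# ...or flatten it to plain text: one call, no second parse.
as_text = "".join(item.itertext())
print(as_text)   # This is my description.
```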
Going the other direction, taking a blob of text, unescaping the markup characters, and reinterpreting it in the face of possible well-formedness errors is much harder. And if you're expected to recover from broken markup (an expectation that I thought was squashed once and for all in 1998), not only is it much harder still, it's a playground for all sorts of miscreants to devise trojan horses and other mischief.
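A sketch of that other direction, again in Python and with a deliberately botched example of my own: unescape the blob, reinterpret it, and watch a conforming parser choke on the well-formedness error, exactly as it should.

```python
import html
import xml.etree.ElementTree as ET

# An escaped blob as it might arrive in a feed -- note the <b>
# that never gets closed and the consequent mismatched </em>.
escaped = "This is &lt;em&gt;my &lt;b&gt;description.&lt;/em&gt;"

# Step 1: unescape the text back into (alleged) markup.
blob = html.unescape(escaped)

# Step 2: reinterpret it, and discover it was never well formed.
try:
    ET.fromstring(f"<description>{blob}</description>")
except ET.ParseError as err:
    print("parser choked, as it should:", err)
```

Expecting consumers to silently "recover" from that instead of rejecting it is exactly the playground for mischief described above.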
XML tools are commodities these days. Consensus that the successor to RSS will be expressed in XML was achieved fairly quickly. I expect to process feeds with XML tools, and I expect those tools to have access to all of the content. I want to be able to examine the markup coming in to see if it's legit. Does it contain script tags? Does it use a namespace that I wish to reject or validate differently? In short, does it pass muster?
Well formed markup only, please. (An extension module that supports base64-encoded binary content, for distributing software patches or other binary payloads, might be reasonable, but that doesn't belong in the base specification.)
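That kind of inspection is trivial once the content is real markup. As an illustration (the allowed-namespace policy here is entirely made up), a few lines of Python can audit incoming content for script elements or foreign namespaces:

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: only un-namespaced and XHTML elements are welcome.
ALLOWED_NS = {"", "http://www.w3.org/1999/xhtml"}

def audit(content):
    """Return a list of complaints about incoming markup."""
    problems = []
    root = ET.fromstring(content)   # not well formed? It dies here. Good.
    for el in root.iter():
        # ElementTree spells namespaced names as "{uri}localname".
        ns, _, local = el.tag.rpartition("}")
        ns = ns.lstrip("{")
        if ns not in ALLOWED_NS:
            problems.append(f"foreign namespace: {ns}")
        if local == "script":
            problems.append("script element -- rejecting this entry")
    return problems

print(audit('<div xmlns="http://www.w3.org/1999/xhtml">'
            '<p>Hi</p><script>alert(1)</script></div>'))
```

With escaped blobs, none of this is possible until you've already done the dangerous unescape-and-reparse dance.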