Do not condemn the judgement of another because it differs from your own. You may both be wrong.
Opinion is clearly divided on the question of escaped markup. Several people supplied commentary on my XML.com essay, Sam Ruby commented in his blog, and Tim Bray talked about it before I did. There's even a part of the Atom Wiki devoted to it. Some people really want escaped markup.
The bottom line: I'm completely unconvinced, most especially by the lines of argument that suggest that this is a practice that's here to stay, so we just have to live with it, or that end-users and aggregators want it, so we have to support it. First off, the point of doing Son-of-RSS, as I understand it, is to give everyone a clean, unified place to go. This escaping mess isn't clean and it can go. Second, the argument that customers (of one stripe or another) need it is bogus: they don't need it, there are better solutions, they just want it (because that's the way they're used to doing it, or because that's what they think is easiest, or because of something else).
Consulting taught me that customers sometimes want damned foolish things. And I know that sometimes you have to do what the customer wants. But experience suggests that when you let the customer convince you to do something damned foolish, later on you have to explain to the customer, when their experience convinces them it's damned foolish, why they paid you good money to do something so damned foolish. “Because you told me to” is only sometimes an effective explanation.
Having said all that, I did find an unfortunate flaw in the XML.com piece while I was preparing this essay: they retitled it, which I think has caused some confusion. The essay I provided was titled “Escaped Markup Considered Harmful”. Editorial license, I presumed incorrectly, and I'm told the correct title will percolate through the next build. For the benefit of those who only ever see the corrected title, it was published originally on XML.com as “Embedded Markup Considered Harmful”, which suggests something else entirely.
The incorrect title may explain the confusion of several commenters who took me to task for not proposing an alternative to embedded markup. Embedding markup is absolutely fine. It's what we should be doing.
One commenter asks how to store “I really found this book helpful” in his (non-XML) database, for example. The answer is: “I <emph>really</emph> found this book helpful.” Using XML markup to identify semantically or stylistically significant portions of a unit of text is entirely appropriate.
The practice that I'm arguing against is one that only occurs where an XML context has been established. Some RSS variants have “text only” elements and the workaround for sending marked up text in those elements is to escape the markup, like this: “I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.”
That's just wrong.
Just to be perfectly clear, if the database in question had some weird prohibition against angle brackets, I think it'd be ok to store escaped markup in the database. Just so long as you unescaped it before you dropped it into any XML context. The evil of escaped markup in RSS is that RSS is already XML.
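To make the distinction concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `description` wrapper element is only an illustrative choice, not taken from any particular RSS variant. It parses the same sentence once with embedded markup and once with escaped markup, and reports what an XML parser actually sees in each case.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; any element name would behave the same way.
embedded = "<description>I <emph>really</emph> found this book helpful.</description>"
escaped = "<description>I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.</description>"


def describe(xml_text):
    """Return (number of child elements, concatenated text content)."""
    root = ET.fromstring(xml_text)
    return len(list(root)), "".join(root.itertext())


embedded_info = describe(embedded)  # the parser sees a real emph element
escaped_info = describe(escaped)    # the parser sees only a string with angle brackets
print(embedded_info)
print(escaped_info)
```

With embedded markup the parser reports one child element and plain text; with escaped markup it reports no structure at all, just text that happens to contain angle brackets and needs a second parse by somebody, somewhere.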
Jukka-Pekka Keisala offers the “escaped markup is necessary because I don't have control over the original content” argument. But just slapping CDATA sections around the author's content isn't sufficient; at the very least you have to look at every character in order to avoid encoding problems. It's not that much harder to turn the content into well-formed XML.
(In an earlier essay, I suggested TagSoup as a method for converting random markup into well-formed XML. Ted Leung points out that NekoHTML would work as well. I'm sure there are other options, too. For the record, I wasn't trying to advocate any particular answer, just point out that it's a solved problem.)
John Vance asks how to deal with browsers that don't understand XHTML. I'm not sure exactly what the right answer here is because I'm not sure I understand the problem. You must be extracting the escaped markup out of the feed, since a browser that doesn't understand XHTML certainly isn't going to grok the XML feed directly. That means you're parsing the feed. If you're building a DOM, maybe you can just hand that DOM to the browser (the different ways of representing empty end tags and other XMLisms don't manifest themselves in the DOM). If you're handing the browser a stream of text, well, you can surely convert the embedded XHTML into HTML very nearly as easily as you can extract all the CDATA section start and end markers.
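As a rough illustration of how small that conversion can be, the sketch below parses a fragment of embedded markup and serializes it again with the HTML output method of Python's standard `xml.etree.ElementTree`. It assumes the fragment carries no namespace declarations (a namespaced XHTML fragment would need its namespace stripped first), and the sample text is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free fragment standing in for markup embedded in a feed.
xhtml_fragment = "<p>First line<br/>second line with <em>emphasis</em>.</p>"

element = ET.fromstring(xhtml_fragment)

# method="html" writes empty elements such as br without the XML-style
# self-closing slash, which is the form older HTML browsers expect.
html_text = ET.tostring(element, encoding="unicode", method="html")
print(html_text)
```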
Julian Bond chastises me for not providing alternatives and makes several good points.
First off, he suggests that escaping is legitimate because RSS is just using XML as a transport protocol and aggregators can reasonably argue that they have no control over the content. We've heard this argument before, but he goes on to point out that SOAP and xmlrpc faced similar problems. He's absolutely right that this is not a new problem for systems using XML as a transport protocol. I don't know what xmlrpc does, but SOAP has two mechanisms for dealing with markup it can't embed in XML: attachments and base64 encoding. If RSS adopts either of these (to the exclusion of this escaping crap), I'll shut up. The point being that attachments remove the necessity to worry about the problem because the content isn't embedded in XML and base64 encoding makes it very clear that this is a mechanical encapsulation mechanism, not something users should start coding up by hand (in RSS or anywhere else).
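The base64 route can be sketched in a few lines of Python. The `content` element and its `encoding` attribute are hypothetical names chosen for the illustration; they are not SOAP's syntax or anything defined by RSS or Atom. The point is only that the payload is opaque on the wire and is recovered mechanically on the other side.

```python
import base64
import xml.etree.ElementTree as ET

# Content the sender cannot (or will not) turn into well-formed XML.
tag_soup = "<p>Unclosed paragraph with a stray & and <b>bold"

# Sender: encode the bytes and carry them as plain text in the XML.
payload = base64.b64encode(tag_soup.encode("utf-8")).decode("ascii")
item = ET.Element("content", {"encoding": "base64"})  # hypothetical names
item.text = payload
feed_xml = ET.tostring(item, encoding="unicode")

# Receiver: parse the XML, then decode the payload back to the original.
parsed = ET.fromstring(feed_xml)
recovered = base64.b64decode(parsed.text).decode("utf-8")
print(feed_xml)
print(recovered == tag_soup)
```

Nobody is going to be tempted to type that payload by hand, which is rather the point.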
He goes on to say that “[escaping] may be wrong, it may be ugly and it may well be harmful, but dammit, it works.” Uhm. Ok, for some definition of “works”, I suppose. In any event, I think the ugly and harmful parts dramatically overshadow the works part, especially since there are other mechanisms that work without being either ugly or harmful.
Next, Julian ponders who's to blame if one jumps through hoops to make well-formed markup out of tag soup and it goes wrong once in a while. Ignoring for the moment the fact that I can't see how it can go wrong, I'll point out again that just wrapping things in CDATA can go wrong too. There are characters that can't be represented in CDATA and there are encoding issues.
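Two of the ways naive CDATA wrapping fails can be shown with a short sketch. The `naive_cdata_wrap` helper and the `description` element are hypothetical, written only for this illustration. One input contains the `]]>` terminator, which ends the section early; the other contains a form feed, a control character that XML 1.0 does not allow anywhere, CDATA or not.

```python
import xml.etree.ElementTree as ET


def naive_cdata_wrap(content):
    """Hypothetical helper: wrap content in CDATA without inspecting it."""
    return "<description><![CDATA[" + content + "]]></description>"


def parses(xml_text):
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


ok = parses(naive_cdata_wrap("<b>fine</b>"))
# "]]>" inside the content terminates the CDATA section prematurely.
broken_terminator = parses(naive_cdata_wrap("matrix[row[0]]> <b>bold"))
# A form feed (U+000C) is not a legal XML 1.0 character.
broken_control = parses(naive_cdata_wrap("form feed \x0c here"))
print(ok, broken_terminator, broken_control)
```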
Finally, he asks what my problem is given that CDATA is part of XML. I don't have any problem with CDATA. CDATA sections are fine, as long as what you're putting in them is text. What's evil about this escaped markup nonsense is the semantics provided by RSS/Atom, not the use of CDATA to avoid lots of ampersands and semicolons. That's my problem.
Finally, in a brief conversation yesterday, Tim told me that if I looked at Sam's recent slides (though I haven't in fact gone looking for them), I'd see examples of doubly escaped markup. If you don't buy any of my other arguments, surely that's enough to convince you that this way lies madness. Surely.
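For anyone who hasn't seen what double escaping looks like, here is a small sketch using `escape` and `unescape` from Python's standard `xml.sax.saxutils`. The sample sentence is the one used earlier in this essay; the scenario of two systems each escaping "just to be safe" is an assumption made for the illustration.

```python
from xml.sax.saxutils import escape, unescape

markup = "I <emph>really</emph> found this book helpful."

once = escape(markup)  # what a single escaping pass produces
twice = escape(once)   # what happens when a second system escapes it again

print(once)
print(twice)

# The consumer has to guess how many times to unescape. Nothing in the
# data says whether one pass or two is the right number.
after_one_pass = unescape(twice)
after_two_passes = unescape(after_one_pass)
```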
No one has produced a single argument that even begins to persuade me to accept escaped markup. I'm not a hardass, at least I don't think I am, but I'm not letting go on this one. Escaped markup is harmful. Stop it.