Escaped Markup: Still Harmful

Volume 6, Issue 85; 16 Sep 2003

No one has produced a single argument that even begins to persuade me to accept escaped markup.

Do not condemn the judgement of another because it differs from your own. You may both be wrong.

—Dandemis

Opinion is clearly divided on the question of escaped markup. Several people supplied commentary on my XML.com essay, Sam Ruby commented in his blog, and Tim Bray talked about it before I did. There's even a part of the Atom Wiki devoted to it. Some people really want escaped markup.

The bottom line: I'm completely unconvinced, most especially by the lines of argument that suggest this is a practice that's here to stay, so we just have to live with it, or that end-users and aggregators want it, so we have to support it. First off, the point of doing Son-of-RSS, as I understand it, is to give everyone a clean, unified place to go. This escaping mess isn't clean and it can go. Second, the argument that customers (of one stripe or another) need it is bogus: they don't need it, there are better solutions, they just want it (because that's the way they're used to doing it, or because that's what they think is easiest, or because of something else).

Consulting taught me that customers sometimes want damned foolish things. And I know that sometimes you have to do what the customer wants. But experience suggests that when you let the customer convince you to do something damned foolish, later on you have to explain to the customer, when their experience convinces them it's damned foolish, why they paid you good money to do something so damned foolish. “Because you told me to” is only sometimes an effective explanation.

Having said all that, I did find an unfortunate flaw in the XML.com piece while I was preparing this essay: they retitled it, which I think has caused some confusion. The essay I provided was titled “Escaped Markup Considered Harmful”. Editorial license, I presumed incorrectly, and I'm told the correct title will percolate through the next build. For the benefit of those who only ever see the corrected title, it was published originally on XML.com as “Embedded Markup Considered Harmful”, which suggests something else entirely.

The incorrect title may explain the confusion of several commenters who took me to task for not proposing an alternative to embedded markup. Embedding markup is absolutely fine. It's what we should be doing.

Richard Pinneau asks how to store “I really found this book helpful” in his (non-XML) database, for example. The answer is: “I <emph>really</emph> found this book helpful.” Using XML markup to identify semantically or stylistically significant portions of a unit of text is entirely respectable.

The practice that I'm arguing against is one that only occurs where an XML context has been established. Some RSS variants have “text only” elements and the workaround for sending marked-up text in those elements is to escape the markup, like this: “I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.” That's just wrong.
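
To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The description element name and the describe helper are assumptions chosen for illustration; they are not taken from any particular RSS variant. It shows what a parser sees in each case: embedded markup arrives as structure, while escaped markup arrives as an undifferentiated string that merely looks like markup.

```python
import xml.etree.ElementTree as ET

# Embedded markup: the <emph> element is part of the XML document.
embedded = "<description>I <emph>really</emph> found this book helpful.</description>"

# Escaped markup: the same characters, hidden from the parser as plain text.
escaped = "<description>I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.</description>"


def describe(xml_text):
    """Hypothetical helper: return (child element names, leading text) for the root."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root], root.text


embedded_children, embedded_text = describe(embedded)
escaped_children, escaped_text = describe(escaped)

print(embedded_children, repr(embedded_text))
print(escaped_children, repr(escaped_text))
```

In the embedded case the parser reports an emph child; in the escaped case it reports no children at all, and the consumer has to run a second parser over the text to recover the markup.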

Just to be perfectly clear, if the database in question had some weird prohibition against angle brackets, I think it'd be ok to store escaped markup in the database. Just so long as you unescaped it before you dropped it into any XML context. The evil of escaped markup in RSS is that RSS is already XML.

Jukka-Pekka Keisala offers the “escaped markup is necessary because I don't have control over the original content” argument. But just slapping CDATA sections around the author's content isn't sufficient; at the very least you have to look at every character in order to avoid encoding problems. It's not that much harder to turn the content into well-formed XML.
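
A rough sketch of why blind wrapping isn't sufficient, assuming Python's standard xml.etree.ElementTree as the consumer's parser. The naive_cdata_wrap and is_well_formed helpers are hypothetical names for illustration. Two kinds of content break it: anything containing the CDATA terminator, and any character that XML 1.0 does not allow at all.

```python
import xml.etree.ElementTree as ET


def naive_cdata_wrap(content):
    """Hypothetical helper: wrap content in a CDATA section without inspecting it."""
    return "<description><![CDATA[" + content + "]]></description>"


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Ordinary tag soup survives the wrapping.
ok = is_well_formed(naive_cdata_wrap("<b>fine</b> & dandy"))

# Content that happens to contain "]]>" ends the CDATA section early.
broken_by_terminator = is_well_formed(naive_cdata_wrap("if (a[b[0]]>1 && c) { }"))

# A form feed is not a legal XML 1.0 character, inside CDATA or out.
broken_by_control_char = is_well_formed(naive_cdata_wrap("page\x0cbreak"))

print(ok, broken_by_terminator, broken_by_control_char)
```

Avoiding both failures means scanning every character of the content, which is most of the work of producing well-formed XML in the first place.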

(In an earlier essay, I suggested TagSoup as a method for converting random markup into well-formed XML. Ted Leung points out that NekoHTML would work as well. I'm sure there are other options, too. For the record, I wasn't trying to advocate any particular answer, just point out that it's a solved problem.)

John Vance asks how to deal with browsers that don't understand XHTML. I'm not sure exactly what the right answer here is because I'm not sure I understand the problem. You must be extracting the escaped markup out of the feed since a browser that doesn't understand XHTML certainly isn't going to grok the XML feed directly. That means you're parsing the feed. If you're building a DOM, maybe you can just hand that DOM to the browser (the different ways of representing empty-element tags and other XMLisms don't manifest themselves in the DOM). If you're handing the browser a stream of text, well, you can surely convert the embedded XHTML into HTML very nearly as easily as you can extract all the CDATA section start and end markers.

Julian Bond chastises me for not providing alternatives and makes several good points.

First off, he suggests that escaping is legitimate because RSS is just using XML as a transport protocol and aggregators can reasonably argue that they have no control over the content. We've heard this argument before, but he goes on to point out that SOAP and xmlrpc faced similar problems. He's absolutely right that this is not a new problem for systems using XML as a transport protocol. I don't know what xmlrpc does, but SOAP has two mechanisms for dealing with markup it can't embed in XML: attachments and base64 encoding. If RSS adopts either of these (to the exclusion of this escaping crap), I'll shut up. The point being that attachments remove the necessity to worry about the problem because the content isn't embedded in XML and base64 encoding makes it very clear that this is a mechanical encapsulation mechanism, not something users should start coding up by hand (in RSS or anywhere else).

He goes on to say that “[escaping] may be wrong, it may be ugly and it may well be harmful, but dammit, it works.” Uhm. Ok, for some definition of “works”, I suppose. In any event, I think the ugly and harmful parts dramatically overshadow the works part, especially since there are other mechanisms that work without being either ugly or harmful.

Next, Julian ponders who's to blame if one jumps through hoops to make well-formed markup out of tag soup and it goes wrong once in a while. Ignoring for the moment the fact that I can't see how it can go wrong, I'll point out again that just wrapping things in CDATA can go wrong too. There are characters that can't be represented in CDATA and there are encoding issues.

Finally, he asks what my problem is given that CDATA is part of XML. I don't have any problem with CDATA. CDATA sections are fine, as long as what you're putting in them is text. What's evil about this escaped markup nonsense is the semantics provided by RSS/Atom, not the use of CDATA to avoid lots of ampersands and semicolons. That's my problem.

Finally, in a brief conversation yesterday, Tim told me that if I looked at Sam's recent slides (though I haven't in fact gone looking for them), I'd see examples of doubly escaped markup. If you don't buy any of my other arguments, surely that's enough to convince you that this way lies madness. Surely.
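
I haven't seen the slides, so the following is only a sketch of how double escaping can arise, not a reproduction of those examples. It assumes Python's standard html module and a feed pipeline in which two independent stages each escape the content “to be safe”.

```python
from html import escape, unescape

original = "I <emph>really</emph> found this book helpful."

# First stage: the authoring tool escapes the markup.
once = escape(original, quote=False)

# Second stage: the feed generator escapes what it was handed, again.
twice = escape(once, quote=False)

print(once)
print(twice)

# One round of unescaping only gets the consumer back to the escaped form.
after_one_pass = unescape(twice)
after_two_passes = unescape(after_one_pass)
```

Nothing in the string itself says how many rounds of unescaping are correct: a consumer that unescapes once shows raw entity soup, and one that unescapes twice will mangle content that legitimately contained an escaped ampersand.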

No one has produced a single argument that even begins to persuade me to accept escaped markup. I'm not a hardass, at least I don't think I am, but I'm not letting go on this one. Escaped markup is harmful. Stop it.

Comments

Don't give up :)

Escaping is bad. I'll concede that for the sake of argument. But how can the same person who seems to have a pretty serious negative reaction to escaping say that base64 encoding is a good alternative? Both are reversible ways to render a string free of angle brackets and ampersands.

I don't really understand all the buzz around here. I think every XML-thinking person should understand that escaped markup is wrong. Let's take RSS as an example. If one wants to put styling into the <description> tag, the only natural, inarguable decision is to use valid, unescaped XHTML markup. And the possible problems (which I can't imagine myself) with XHTML not being fully equal to HTML are really thousands of times more easily solved than the harm that widespread escaped markup could do in the future.

I realize that your original article is 4 years old... nonetheless, it is showing up on searches and I thought I would share my views about this topic.

Up to about three days ago, I vehemently shared your view about escaped markup in RSS. However, I have recently been implementing a connector that converts RSS and other feed types into an internal form. I have come to realize that there is some validity to the argument that HTML content should be escaped.

Mind you, I am not trying to convince anyone of my position. Indeed, since I very recently swapped my view once, I see no reason why I won’t change my mind again. Furthermore, I have not fully read all the messages related to the original and follow-up article, so it is likely that I am just reiterating what you have already heard.

My realization came when I was working (as a programmer) with an XML DOM and was about to take the content in RSS' description element and place it into an internal data structure. The question I had to answer was "by what means do I take the data from the RSS XML DOM?" There are essentially two ways of doing this.

1) Get the value being stored in the element. This is roughly equivalent to the XSLT value-of. This is also what you would use if the data is escaped. What you get back is the unescaped version of the data.
2) Get the serialized XML structure rooted at the element. This is roughly equivalent to the XSLT copy-of. This is also what you would use if the data is not escaped. It leaves the original markup (structure) alone.
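
The two techniques can be sketched as follows, assuming Python's standard xml.etree.ElementTree rather than XSLT or any particular DOM. The function names value_of and copy_of are borrowed from the XSLT instructions mentioned above purely as labels; they are not library calls.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def value_of(elem):
    """Technique 1: the string value of the element, with entities resolved."""
    return "".join(elem.itertext())


def copy_of(elem):
    """Technique 2: the element's content re-serialized as XML."""
    parts = [escape(elem.text or "")]
    for child in elem:
        # ET.tostring also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


escaped_item = ET.fromstring(
    "<description>I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.</description>"
)
embedded_item = ET.fromstring(
    "<description>I <emph>really</emph> found this book helpful.</description>"
)

print(value_of(escaped_item))   # markup recovered as a string
print(copy_of(embedded_item))   # markup recovered by serialization
print(value_of(embedded_item))  # markup silently dropped
print(copy_of(escaped_item))    # markup still escaped
```

Each technique recovers the marked-up string only when it is matched with the corresponding form of the data; applied to the other form, it either loses the markup or leaves it escaped.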

My first observation is that RSS makes no assumptions about the reader's capabilities. The only guarantee seems to be that the reader will be capable of displaying/handling plain text. I make this observation based on the fact that elements are available to handle non-text items, such as images and links, so they can be handled and processed by an appropriate means. While these elements do not approach the display capabilities provided by HTML, they are clearly included so that the most basic of RSS consumers can make sense of the feed data.

First point: Even plain text data must be “escaped” relative to its true form.

Consider the simple text "this & that". In order for it to be represented in an RSS feed, it must be written in its XML serialized form "this &amp; that". While this is not escaped relative to XML, it is escaped relative to the true data.

Now, ask yourself, "which of the two above techniques must I use to get that to be viewed in a text-only viewer?" The answer is technique 1. You must assume the data is escaped. It is escaped because (a) it must be stored as XML (RSS) and (b) the target data is actually text only. It is important to note that "this &amp; that" is not the data stored in the element. That is how it must be represented in order to generate a well-formed XML document. "this & that" is the actual data. No matter how it is displayed, browser or otherwise, it must appear to the user/consumer as "this & that".
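
As a small illustration of this round trip, here is a sketch assuming Python's standard xml.etree.ElementTree; the description element name follows RSS, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The true data, as the author typed it.
elem = ET.Element("description")
elem.text = "this & that"

# The serialized form that must appear in the feed.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# Reading the value back (technique 1) restores the true data.
recovered = ET.fromstring(serialized).text
print(recovered)
```

The serializer and the parser handle the ampersand between them; neither the author nor the reader ever has to see the entity reference.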

Second point: Data that looks nothing like XML must be “escaped” relative to its true form.

Let’s look at another example that may help illustrate the point more concisely. I think part of the problem is that HTML (or XHTML) is closely related to XML. They look similar and HTML *may* be represented as XML. Therefore, there is a strong desire to do just that. So, let’s consider content that isn’t remotely like XML.

Let’s say that you wanted to store a GIF image directly in XML. You must first encode it in some way that would be representable in the XML. The binary form of a GIF image is simply not XML and never will be. Second, you must un-encode it when it comes time to produce this data.
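
A minimal sketch of that encode and un-encode cycle, assuming base64 as the encoding and Python's standard library. The image_data element name is hypothetical, and the bytes below are a placeholder that starts with a GIF signature, not a complete image.

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder binary data: a GIF signature followed by bytes XML cannot carry as-is.
gif_bytes = b"GIF89a" + bytes(range(8))

# Encode to plain ASCII text so that it can live inside an element.
elem = ET.Element("image_data")
elem.text = base64.b64encode(gif_bytes).decode("ascii")
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# The consumer takes the text out of the XML and decodes it to the true form.
decoded = base64.b64decode(ET.fromstring(serialized).text)
```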

Why should HTML be treated differently? Read on.

Third point: RSS feeds are consumed in contexts outside of HTML.

I understand that there is no shortage of ways to read RSS feeds directly in an HTML context. For example, there are a number of RSS feeds that have stylesheets associated with them that permit web browsers to directly render their content as pretty HTML views.

However, in reality, there is also no shortage of cases where RSS is consumed in non-HTML contexts.

Indeed, the very project on which I am working must do just that. Also, I saw Dave Winer on the Cranky Geeks video podcast say that he was somewhat annoyed that the browser manufacturers were permitting RSS data to be directly readable within their browsers. (He specifically blamed Firefox for this.) I must admit that the comment had puzzled me at the time.

The intention is to be able to share the information in richer contexts than a web page on a web site, not to become the new form of HTML. A simple example: you may host your own RSS feed which, rather than just providing your own content, may consume another feed and blend it in (while preserving the original attribution, of course). You would be quite annoyed if you had to treat the data from your own feed differently than the other.

Fourth point: There is a point in time when data is taken out of the RSS feed, and its XML form, and placed in a context where that data can be interpreted in its true form.

Knowing that the consumers of your RSS feed are going to be viewing the data within a browser, using your stylesheet to format it nicely, is a wonderful convenience. Not only can you store the data in XML, but the data hardly leaves the XML context while it is being consumed. (It actually does, but that fact is somewhat obscured.)

However, in most other cases where RSS is consumed, the data must explicitly be removed from the XML structure of the feed so that it can be further processed.

In the first point, I illustrated how data must be removed from the RSS feed and un-escaped to get the data to its true form when the data is text-only. In truth, the only time un-escaping is not necessary is the one case where the data is going to continue to be used as XML… within the very same XML context. In all other cases, the data must be removed from the RSS feed and converted to its true form before it can be consumed by something that understands its true form.

Fifth point: Not escaping HTML is nothing more than a convenience. Sometimes.

There are a number of conveniences that are afforded by representing HTML as XHTML and not escaping it.

a) If you are manually writing the HTML, you don’t have to type the CDATA section markers or otherwise escape the markup. (You also don’t have to worry about using the CDATA ending sequence.)

I’ll give you the fact that this is a convenience. In practice, however, it does not bother me to have to type the CDATA section markers around my content any more than it bothers me to type any of the other XML markup in the RSS feed.

b) If you are consuming the RSS feed, you can directly manipulate the (X)HTML without moving it out of the XML first.

So what? Usually the only thing I am going to do to the HTML is display it anyway. (My XML DOM doesn’t help me do that.) Any other kind of manipulation makes assumptions about the structure of the data under the RSS defined element, which is processing outside of RSS proper anyway.

In my particular case, I have to extract the content from the feed and put it someplace else where it will later be displayed by a different system. Luckily that system can display HTML. However, at no time while processing the feed itself do I care that it is HTML nor do I manipulate the HTML in any way.

You can choose to represent the HTML as XHTML, but in the end, it really doesn’t buy you much of anything.

In fact, choosing to represent the content as unescaped XHTML means that I must treat the data differently than I would if that data were text only. Instead of getting the text data out of its XML form as I did in the first point using technique 1, I must serialize the data into an XML document fragment using technique 2. This creates an exceptional processing situation.

Worse, it is ambiguous. I can not tell which way to access the data to get the data to its true form. Text... technique 1. unencoded XHTML... technique 2. How do I know which form the data is in?

Sixth point: RSS does not provide a means to describe the true form of the data.

Atom does this. They have two attributes which are added to the element which describe the two dimensions involved. The "type" attribute is used to describe the true form of the data... text, html, etc. The "mode" attribute describes how the data is encoded in XML... "escaped" means to use technique 1 above to get to the data... otherwise, technique 2 is used to get the serialized form... or the DOM tree is used directly. This also permits Atom feeds to offer different versions of the content to explicitly support a number of different forms for the content.
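
Going only by the description above, a consumer could use the "mode" attribute to pick the technique. The sketch below assumes Python's standard xml.etree.ElementTree; the true_form helper, the sample content elements, and the attribute values other than "escaped" are illustrative assumptions, not quotations from the Atom specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def true_form(elem):
    """Hypothetical helper: recover the content string according to its mode."""
    if elem.get("mode") == "escaped":
        # Technique 1: take the string value; the parser has resolved the entities.
        return "".join(elem.itertext())
    # Technique 2: re-serialize the child nodes as markup.
    parts = [escape(elem.text or "")]
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts)


escaped_entry = ET.fromstring(
    '<content type="text/html" mode="escaped">I &lt;em&gt;really&lt;/em&gt; liked it.</content>'
)
inline_entry = ET.fromstring(
    '<content type="application/xhtml+xml" mode="xml">I <em>really</em> liked it.</content>'
)

print(true_form(escaped_entry))
print(true_form(inline_entry))
```

With the encoding declared on the element, both entries yield the same string and the consumer no longer has to guess which technique applies.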

However, RSS (2.0) does not have these attributes or anything like them. I realize that you may have been lobbying for the addition of such attributes in RSS. I agree that if the intentions of the RSS creators were to permit support for content other than text only, such attributes would be very welcome.

However, I don’t think, for the reasons outlined above as well as others, that this was ever the intention. Though, if you look at the definitions, they have specifically stated that escaped html is permitted. Obviously, the lobbying didn’t take.

I find myself unpersuaded by what strikes me as a sort of XML-imperialism in this attitude.

Let's start with Web Services, the ultimate XML transport.

The contract of a SOAP service is to transport a Unicode String. Any Unicode string. A string in some programming language could include non-XML-1.0 characters, it could contain low-grade HTML.

The protocol happens to use XML to organize the transport. It's none of your or my business how it chooses to deal with a message that happens to contain content in some cousin of HTML. Except for the non-1.0-character issue, escaping (& or CDATA) is as good as anything else.

At the level of RSS and such, an XML document schema might have a slot that contains any sort of textual content. It might be 'TeX'. It might be 'Runoff'. It might, horrors, even be HTML. Why is it some sort of special evil to treat the HTML 'just a string'? I don't much like 'slippery slope' arguments, but this argument seems to lead to a view that the only legitimate content of an XML document is more XML. If the XML schema type is 'string', then, by gum, it's a STRING. A fragment of XML or messy HTML is no better or worse of a STRING than anything else.

A critical principle of software design is protocol layering. You seem to be insisting that there can never be a legitimate protocol/module boundary between an XML document and some other protocol that sits on top of it in a stack.

Given a nice, clean, modular protocol that happens to be conveying a string via XML, you seem to be happy to transport anything except a string that happens to contain markup that uses < > style tags.

If the claim here were merely, 'if you want to have a slot in your protocol for marked-up text, consider simply using XHTML in-line.' OK, I'd consider it. And there's no excuse for non-conforming wacky CDATA processing. So, indeed, if someone wants to shove arbitrary HTML around, and preserve it character-for-character in XML 1.0, they're in the land of attachments. Short of that, however, I'm not persuaded.

This debate is long since over and I lost. But it was never about arbitrary strings; if you've got some random sequence of characters that happens to include markup characters, escape them to your heart's content.

It was about sending escaped HTML goo instead of reasonable, well-formed XHTML.

But it doesn't matter four years later. We got what we got and that includes escaped markup.