<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?xml-stylesheet type='text/xsl' href='/style/atom-comments.xsl'?>
<feed xmlns='http://www.w3.org/2005/Atom'>
<title>norman.walsh.name: Comments on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup/comments.atom</id>
<updated>2007-09-23T20:44:57Z</updated>

<entry>
<title>Comment 0001 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0001'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0001</id>
<published>2003-09-16T22:03:18Z</published>
<updated>2003-09-16T22:03:18Z</updated>
<author>
  <name>Tobi</name>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>Don't give up :)</p></div></content>
</entry>

<entry>
<title>Comment 0002 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0002'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0002</id>
<published>2003-09-24T00:26:21Z</published>
<updated>2003-09-24T00:26:21Z</updated>
<author>
  <name>Patrick Lioi</name>
  <uri>http://patrick.lioi.net</uri>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>Escaping is bad.  I'll concede that for the sake of argument.  But how can the same person who seems to have a pretty serious negative reaction to escaping say that base64 encoding is a good alternative?  Both are reversible ways to render a string free of angle brackets and ampersands.</p></div></content>
</entry>

<entry xmlns:foaf='http://xmlns.com/foaf/0.1/'>
<title>Comment 0003 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0003'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0003</id>
<published>2003-11-15T17:26:39Z</published>
<updated>2003-11-15T17:26:39Z</updated>
<author>
  <name>Martynas Jusevicius</name>
  <foaf:mbox_sha1sum>6b6ddd843fb3236effc641edcb8c67c3db3470b1</foaf:mbox_sha1sum>
  <uri>http://www.xml.lt</uri>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>I don't really understand all the buzz around this.
I think every XML-thinking person should understand that escaped markup is wrong.
Let's take RSS as an example.
If one wants to put styling into the &lt;description&gt; tag, the ONLY natural and inarguable decision is to use valid, unescaped XHTML markup.
And the possible problems (which I can't imagine myself) with XHTML not being fully equal to HTML are thousands of times easier to solve than the harm that widespread escaped markup could do in the future.</p></div></content>
</entry>

<entry xmlns:foaf='http://xmlns.com/foaf/0.1/'>
<title>Comment 0004 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0004'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0004</id>
<published>2007-07-24T20:56:37Z</published>
<updated>2007-07-24T20:56:37Z</updated>
<author>
  <name>Vincent Sgro</name>
  <foaf:mbox_sha1sum>58c04ab4c120ec83397407e17f05b00dc203a0bc</foaf:mbox_sha1sum>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>I realize that your original article is 4 years old... nonetheless, it is showing up on searches and I thought I would share my views about this topic.

</p><p>Up to about three days ago, I vehemently shared your view about escaped markup in RSS. However, I have recently been implementing a connector that converts RSS and other feed types into an internal form.  I have come to realize that there is some validity to the argument that HTML content should be escaped.

</p><p>Mind you, I am not trying to convince anyone of my position.  Indeed, since I very recently swapped my view once, I see no reason why I won&#8217;t change my mind again.  Furthermore, I have not fully read all the messages related to the original and follow-up article, so it is likely that I am just reiterating what you have already heard.

</p><p>My realization came when I was working (as a programmer) with an XML DOM and was about to take the content in the RSS description element and place it into an internal data structure.  The question I had to answer was "by what means do I take the data from the RSS XML DOM?"  There are essentially two ways of doing this.

</p><p><b>1)</b> Get the value being stored in the element.  This is roughly equivalent to the XSLT value-of.  This is also what you would use if the data is escaped.  What you get back is the unescaped version of the data.

<br></br><b>2)</b> Get the serialized XML structure rooted at the element.  This is roughly equivalent to the XSLT copy-of.  This is also what you would use if the data is not escaped.  It leaves the original markup (structure) alone.
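</p><p>A minimal sketch of the two techniques, assuming Python's standard xml.etree.ElementTree module. The sample description elements and the function names are made up for illustration and do not come from any real feed or from the connector described here.
</p><pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragments, invented for this illustration.
# ESCAPED carries the markup as escaped text; INLINE carries it as child elements.
ESCAPED = "<description>&lt;p&gt;this &amp;amp; that&lt;/p&gt;</description>"
INLINE = "<description><p>this &amp; that</p></description>"


def text_value(xml_source: str) -> str:
    """Technique 1: the string value of the element (roughly XSLT value-of)."""
    element = ET.fromstring(xml_source)
    return "".join(element.itertext())


def serialized_children(xml_source: str) -> str:
    """Technique 2: serialize the child markup (roughly XSLT copy-of on the children)."""
    element = ET.fromstring(xml_source)
    return "".join(ET.tostring(child, encoding="unicode") for child in element)
```
]]></pre><p>With these made-up inputs, text_value on the escaped form and serialized_children on the inline form return the same markup string. Swapping them either drops the markup or returns an empty string, so the caller has to know in advance which technique applies.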

</p><p>My first observation is that RSS makes no assumptions about the reader's capabilities.  The only guarantee seems to be that the reader will be capable of displaying/handling plain text.  I make this observation based on the fact that elements are available to handle non-text items, such as images and links, so they can be handled and processed by an appropriate means.  While these elements do not approach the display capabilities provided by HTML, they are clearly included so that the most basic of RSS consumers can make sense of the feed data.

</p><p><b>First point:</b> Even plain text data must be &#8220;escaped&#8221; relative to its true form.

</p><p>Consider the simple text "this &amp; that".  In order for it to be represented in an RSS feed, it must be written in its XML serialized form "this &amp;amp; that".  While this is not escaped relative to XML, it is escaped relative to the true data.

</p><p>Now, ask yourself, "which of the two above techniques must I use to get that to be viewed in a text-only viewer?"  The answer is technique 1.  You must assume the data is escaped.  It is escaped because (a) it must be stored as XML (RSS) and (b) the target data is actually text only.  It is important to note that "this &amp;amp; that" is not the data stored in the element.  That is how it must be represented in order to generate a well-formed XML document.  "this &amp; that" is the actual data.  No matter how it is displayed, browser or otherwise, it must appear to the user/consumer as "this &amp; that".
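</p><p>The round trip for plain text can be sketched as follows, assuming Python's standard xml.sax.saxutils helpers. The two function names are hypothetical and exist only for this illustration.
</p><pre><![CDATA[
```python
from xml.sax.saxutils import escape, unescape


def to_feed_text(true_text: str) -> str:
    """Escape plain text so it can be written inside an XML element."""
    return escape(true_text)


def from_feed_text(serialized: str) -> str:
    """Undo the escaping, as an XML parser does when asked for the text value."""
    return unescape(serialized)
```
]]></pre><p>Writing the text into the feed applies to_feed_text, and technique 1 amounts to from_feed_text, so the consumer sees the original characters again.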

</p><p><b>Second point:</b> Data that looks nothing like XML must be &#8220;escaped&#8221; relative to its true form.

</p><p>Let&#8217;s look at another example that may help illustrate the point more concisely.  I think part of the problem is that HTML (or XHTML) is closely related to XML.  They look similar and HTML *may* be represented as XML.  Therefore, there is a strong desire to do just that.  So, let&#8217;s consider content that isn&#8217;t remotely like XML.

</p><p>Let&#8217;s say that you wanted to store a GIF image directly in XML.  You must first encode it in some way that is representable in XML.  The binary form of a GIF image is simply not XML and never will be.  Second, you must decode it when it comes time to produce this data.

</p><p>Why should HTML be treated differently?  Read on.

</p><p><b>Third point:</b>  RSS feeds are consumed in contexts outside of HTML.

</p><p>I understand that there is no shortage of ways to read RSS feeds directly in an HTML context.  For example, there are a number of RSS feeds that have stylesheets associated with them that permit web browsers to directly render their content as pretty HTML views.

</p><p>However, in reality, there is also no shortage of cases where RSS is consumed in non-HTML contexts.

</p><p>Indeed, the very project on which I am working must do just that.  Also, I saw Dave Winer on the Cranky Geeks video podcast say that he was somewhat annoyed that the browser manufacturers were permitting RSS data to be directly readable within their browsers.  (He specifically blamed Firefox for this.)  I must admit that the comment had puzzled me at the time.

</p><p>The intention is to be able to share the information in richer contexts than a web page on a web site, not to become the new form of HTML.  As a simple example, you might host your own RSS feed which, rather than just providing your own content, also consumes another feed and blends it in (while preserving the original attribution, of course).  You would be quite annoyed if you had to treat the data from your own feed differently than the other.

</p><p><b>Fourth point:</b> There is a point in time when data is taken out of the RSS feed, and its XML form, and placed in a context where that data can be interpreted in its true form.

</p><p>Knowing that the consumers of your RSS feed are going to be viewing the data within a browser, using your stylesheet to format it nicely, is a wonderful convenience.  Not only can you store the data in XML, but the data hardly leaves the XML context while it is being consumed.  (It actually does, but that fact is somewhat obscured.)

</p><p>However, in most other cases where RSS is consumed, the data must explicitly be removed from the XML structure of the feed so that it can be further processed.

</p><p>In the first point, I illustrated how data must be removed from the RSS feed and un-escaped to get the data to its true form when the data is text-only.  In truth, the only time un-escaping is not necessary is the one case where the data is going to continue to be used as XML&#8230; within the very same XML context.  In all other cases, the data must be removed from the RSS feed and converted to its true form before it can be consumed by something that understands its true form.

</p><p><b>Fifth point:</b> Not escaping HTML is nothing more than a convenience.  Sometimes.

</p><p>There are a number of conveniences that are afforded by representing HTML as XHTML and not escaping it.

</p><p><b>a)</b> If you are manually writing the HTML, you don&#8217;t have to type the CDATA section markers or otherwise escape the markup.  (You also don&#8217;t have to worry about the CDATA ending sequence appearing in your content.)

</p><p>I&#8217;ll grant that this is a convenience.  In practice, however, it does not bother me to have to type the CDATA section around my content any more than it bothers me to type any of the other XML markup in the RSS feed.

</p><p><b>b)</b> If you are consuming the RSS feed, you can directly manipulate the (X)HTML without moving it out of the XML first.

</p><p>So what?  Usually the only thing I am going to do to the HTML is display it anyway.  (My XML DOM doesn&#8217;t help me do that.)  Any other kind of manipulation makes assumptions about the structure of the data under the RSS defined element, which is processing outside of RSS proper anyway.

</p><p>In my particular case, I have to extract the content from the feed and put it someplace else where it will later be displayed by a different system.  Luckily that system can display HTML.  However, at no time while processing the feed itself do I care that it is HTML nor do I manipulate the HTML in any way.

</p><p>You can choose to represent the HTML as XHTML, but in the end, it really doesn&#8217;t buy you much of anything.

</p><p>In fact, choosing to represent the content as unescaped XHTML means that I must treat the data differently than I would if that data were text only.  Instead of getting the text data out of its XML form as I did in the first point using technique 1, I must serialize the data into an XML document fragment using technique 2.  This creates an exceptional processing situation.

</p><p>Worse, it is ambiguous.  I cannot tell which way to access the data to get it to its true form.  Text... technique 1.  Unescaped XHTML... technique 2.  How do I know which form the data is in?

</p><p><b>Sixth point:</b> RSS does not provide a means to describe the true form of the data.

</p><p>Atom does this.  The Atom 0.3 draft has two attributes on the element which describe the two dimensions involved.  The "type" attribute is used to describe the true form of the data... text, html, etc.  The "mode" attribute describes how the data is encoded in XML... "escaped" means to use technique 1 above to get to the data... otherwise, technique 2 is used to get the serialized form... or the DOM tree is used directly.  This also permits Atom feeds to offer different versions of the content to explicitly support a number of different forms for the content.
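</p><p>A hedged sketch of how a consumer might use such a declaration. It assumes Atom 1.0 (RFC 4287), where a single "type" attribute with the values text, html, or xhtml covers both dimensions, and Python's standard xml.etree.ElementTree module. The function name extract is hypothetical, and other media types are deliberately not handled.
</p><pre><![CDATA[
```python
import xml.etree.ElementTree as ET

# Qualified name of the wrapper div that Atom 1.0 requires for type="xhtml".
XHTML_DIV = "{http://www.w3.org/1999/xhtml}div"


def extract(content: ET.Element):
    """Return (true_form, payload) for an Atom 1.0 content element."""
    kind = content.get("type", "text")  # Atom 1.0 defaults to "text"
    if kind == "xhtml":
        # Inline markup: hand back the wrapper div subtree (technique 2).
        return "xhtml", content.find(XHTML_DIV)
    if kind in ("text", "html"):
        # Escaped content: take the string value (technique 1).
        return kind, "".join(content.itertext())
    # Other media types are out of scope for this sketch.
    raise ValueError(f"unsupported content type: {kind}")
```
]]></pre><p>Because the feed states the form, the consumer no longer has to guess which technique to apply.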

</p><p>However, RSS (2.0) does not have these attributes or anything like them.  I realize that you may have been lobbying for the addition of such attributes in RSS.  I agree that if the intentions of the RSS creators were to permit support for content other than text only, such attributes would be very welcome.

</p><p>However, I don&#8217;t think, for the reasons outlined above as well as others, that this was ever the intention.  Though, if you look at the RSS element definitions, they specifically state that escaped HTML is permitted.  Obviously, the lobbying didn&#8217;t take.</p></div></content>
</entry>

<entry xmlns:foaf='http://xmlns.com/foaf/0.1/'>
<title>Comment 0005 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0005'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0005</id>
<published>2007-09-23T20:24:24Z</published>
<updated>2007-09-23T20:24:24Z</updated>
<author>
  <name>benson margulies</name>
  <foaf:mbox_sha1sum>aa55bbbaff04ec6a05c8fd84a7306d83bea59605</foaf:mbox_sha1sum>
  <uri>http://www.basistech.com</uri>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>I find myself unpersuaded by what strikes me as the XML-imperialism of this attitude.
</p><p>
Let's start with Web Services, the ultimate XML transport.
</p><p>
The contract of a SOAP service is to transport a Unicode string. Any Unicode string. A string in some programming language could include non-XML-1.0 characters; it could contain low-grade HTML.
</p><p>
The protocol happens to use XML to organize the transport. It's none of your or my business how it chooses to deal with a message that happens to contain content in some cousin of HTML. Except for the non-1.0-character issue, escaping (&amp;amp; or CDATA) is as good as anything else.
</p><p>
At the level of RSS and such, an XML document schema might have a slot that contains any sort of textual content. It might be 'TeX'. It might be 'Runoff'. It might, horrors, even be HTML. Why is it some sort of special evil to treat the HTML as 'just a string'? I don't much like 'slippery slope' arguments, but this argument seems to lead to a view that the only legitimate content of an XML document is more XML. If the XML schema type is 'string', then, by gum, it's a STRING. A fragment of XML or messy HTML is no better or worse a STRING than anything else.
</p><p>
A critical principle of software design is protocol layering. You seem to be insisting that there can never be a legitimate protocol/module boundary between an XML document and some other protocol that sits on top of it in a stack.
</p><p>
Given a nice, clean, modular protocol that happens to be conveying a string via XML, you seem to be happy to transport anything except a string that happens to contain markup that uses &lt; &gt; style tags.
</p><p>
If the claim here were merely 'if you want to have a slot in your protocol for marked-up text, consider simply using XHTML in-line', then OK, I'd consider it. And there's no excuse for non-conforming wacky CDATA processing. So, indeed, if someone wants to shove arbitrary HTML around, and preserve it character-for-character in XML 1.0, they're in the land of attachments. Short of that, however, I'm not persuaded.</p></div></content>
</entry>

<entry xmlns:foaf='http://xmlns.com/foaf/0.1/'>
<title>Comment 0006 on /2003/09/16/escmarkup</title>
<link rel='alternate' type='text/html' href='http://norman.walsh.name/2003/09/16/escmarkup#comment0006'/>
<id>http://norman.walsh.name/2003/09/16/escmarkup#comment0006</id>
<published>2007-09-23T20:44:54Z</published>
<updated>2007-09-23T20:44:54Z</updated>
<author>
  <name>Norman Walsh</name>
  <foaf:mbox_sha1sum>9f5c771a25733700b2f96af4f8e6f35c9b0ad327</foaf:mbox_sha1sum>
  <uri>http://norman.walsh.name/</uri>
</author>
<content type='xhtml'><div xmlns="http://www.w3.org/1999/xhtml"><p>This debate is long since over and I lost. But it was never about arbitrary strings; if you've got some random sequence of characters that happens to include markup characters, escape them to your heart's content.
</p><p>
It was about sending escaped HTML goo instead of reasonable, well-formed XHTML.
</p><p>
But it doesn't matter four years later. We got what we got and that includes escaped markup.</p></div></content>
</entry>

</feed>
