<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       version="pto">
<info>
<title>Escaped Markup: Still Harmful</title>
<volumenum>6</volumenum>
<issuenum>85</issuenum>
<pubdate>2003-09-16</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2003</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>No one has produced a single argument that even begins to
persuade me to accept escaped markup.
</para>
</abstract>
</info>

<epigraph>
<attribution>Dandemis</attribution>
<para xml:id='p1'>Do not condemn the judgement of another because
it differs from your own. You may both be wrong.</para>
</epigraph>

<para xml:id='p2'>Opinion is clearly divided on the question of escaped markup. Several
people
<link xlink:href="http://www.xml.com/pub/a/2003/08/20/embedded.html#thread">supplied
commentary</link>
on my
<link xlink:href="http://www.xml.com/">XML.com</link>
<link xlink:href="http://www.xml.com/pub/a/2003/08/20/embedded.html">essay</link>,
Sam Ruby <link xlink:href="http://www.intertwingly.net/blog/1571.html">commented</link>
in his blog, and Tim Bray
<link xlink:href="http://www.tbray.org/ongoing/When/200x/2003/06/28/Learning">talked
about it</link> before I did. There's even a
<link xlink:href="http://www.intertwingly.net/wiki/pie/EscapedHtmlDiscussion">part
of</link> the
<link xlink:href="http://www.intertwingly.net/wiki/pie/">Atom Wiki</link>
devoted to it. Some people really want escaped markup.</para>

<para xml:id='p3'>The bottom line: I'm completely unconvinced, most especially by
the lines of argument that suggest this is a practice that's here to
stay, so we just have to live with it or that end-users and
aggregators want it, so we have to support it. First off, the point of
doing Son-of-RSS, as I understand it, is to give everyone a clean,
unified place to go. This escaping mess isn't clean and it can
go. Second, the argument that customers (of one stripe or another)
need it is bogus: they don't need it, there are better
solutions, they just want it (because
that's the way they're used to doing it, or because that's what they
think is easiest, or because of something else).</para>

<para xml:id='p4'>Consulting taught me that customers sometimes want damned
foolish things. And I know that sometimes you have to do what the
customer wants. But experience suggests that when you let the customer
convince you to do something damned foolish, later on you have to
explain to the customer, when their experience convinces them it's
damned foolish, why they paid you good money to do something so damned
foolish. <quote>Because you told me to</quote> is only sometimes an
effective explanation.</para>

<para xml:id='p5'>Having said all that, I did find an unfortunate flaw in the XML.com
piece while I was preparing this essay: they
<emphasis>retitled</emphasis> it, which I
think has caused some confusion. The essay I provided was titled
<quote><emphasis>Escaped</emphasis> Markup Considered Harmful</quote>.
Editorial license, I presumed incorrectly, and I'm told the correct title
will percolate through the next build.<footnote><para xml:id='p6'>For the benefit
of those who only ever see the corrected title, it was published originally
on XML.com as <quote>Embedded Markup Considered Harmful</quote>, which
suggests something else entirely.</para>
</footnote>
</para>

<para xml:id='p7'>The incorrect title may explain the confusion of several
commenters who took me to task for not proposing an alternative to
embedded markup. Embedding markup is <emphasis>absolutely
fine</emphasis>. It's what we should be doing.</para>

<para xml:id='p8'><personname><firstname>Richard</firstname>
<surname>Pinneau</surname></personname>
<link xlink:href="http://www.xml.com/cs/user/view/cs_msg/1455">asks</link> how to
store <quote>I <emphasis>really</emphasis> found this book
helpful</quote> in his (non-XML) database, for example.
The answer is: <quote><literal>I &lt;emph&gt;really&lt;/emph&gt; found
this book helpful.</literal></quote> Using XML markup to identify semantically or
stylistically significant portions of a unit of text is entirely
respectable.</para>

<para xml:id='p9'>The practice that I'm arguing against is one that only occurs
where an XML context has been established. Some RSS variants have
<quote>text only</quote> elements and the workaround for sending marked
up text in those elements is to escape the markup, like this:
<quote><literal>I &amp;lt;emph&amp;gt;really&amp;lt;/emph&amp;gt;
found this book helpful.</literal></quote>
That's just <emphasis>wrong</emphasis>.</para>
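
<para>To make the distinction concrete, here is a sketch of the same
sentence carried both ways. The <literal>item</literal> and
<literal>description</literal> element names are assumptions borrowed
from common RSS usage, and <literal>emph</literal> is only the
illustrative element from the example sentence; none of this is a
proposal for a particular format.</para>

<programlisting><![CDATA[<!-- Escaped: the parser sees one string that happens to contain angle brackets -->
<item>
  <description>I &lt;emph&gt;really&lt;/emph&gt; found this book helpful.</description>
</item>

<!-- Embedded: the parser sees text, an emph element, and more text -->
<item>
  <description>I <emph>really</emph> found this book helpful.</description>
</item>]]></programlisting>

<para>In the first form, a consumer has to guess that the string is
markup and parse it a second time; in the second, the markup is
already part of the document.</para>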

<para xml:id='p10'>Just to be perfectly clear, if the database in question had some
weird prohibition against angle brackets, I think it'd be ok
to store escaped markup in the database. Just so long as you
unescaped it before you dropped it into any XML context. The evil of
escaped markup in RSS is that RSS is already XML.</para>

<para xml:id='p11'><personname><firstname>Jukka-Pekka</firstname>
<surname>Keisala</surname></personname>
<link xlink:href="http://www.xml.com/cs/user/view/cs_msg/1437">offers</link>
the <quote>escaped
markup is necessary because I don't have control over the original
content</quote> argument. But just slapping CDATA sections around the author's
content <emphasis>isn't sufficient</emphasis>; at the very least you have to
look at every character in order to avoid encoding problems. It's not that
much harder to turn the content into well-formed XML.</para>
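
<para>A minimal sketch of how wrapping alone can fail, assuming a
hypothetical <literal>description</literal> element and made-up author
content that happens to contain the CDATA end delimiter:</para>

<programlisting>&lt;!-- Not well-formed: the section ends at the first ]]&gt;, inside the author's text --&gt;
&lt;description&gt;&lt;![CDATA[Check that a[b[0]]&gt; 1 first.]]&gt;&lt;/description&gt;</programlisting>

<para>The section closes after <literal>a[b[0</literal>, so the rest
is parsed as ordinary content, where the second
<literal>]]&gt;</literal> is an error. Catching that means scanning
the content, which is the point: the wrapper doesn't save you from
looking at what you're wrapping.</para>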

<para xml:id='p12'>(In an
<link xlink:href="/2003/06/30/hardline">earlier essay</link>, I suggested
<link xlink:href="http://www.ccil.org/~cowan/XML/tagsoup"><application>TagSoup</application></link>
as a method for converting random markup
into well-formed XML. <personname><firstname>Ted</firstname>
<surname>Leung</surname></personname>
<link xlink:href="http://www.sauria.com/blog/2003/06/30#302">points out</link>
that
<link xlink:href="http://www.apache.org/~andyc/neko/doc/html/"><application>NekoHTML</application></link>
would work as well. I'm sure there are other options, too. For the
record, I wasn't trying to advocate any particular answer, just point
out that it's a solved problem.)</para>

<para xml:id='p13'><personname><firstname>John</firstname><surname>Vance</surname></personname>
<link xlink:href="http://www.xml.com/cs/user/view/cs_msg/1422">asks</link>
how to deal with browsers that don't understand XHTML. I'm not sure
exactly what the right answer here is because I'm not sure I
understand the problem. You must be extracting the escaped markup out
of the feed since a browser that doesn't understand XHTML certainly
isn't going to grok the XML feed directly. That means you're parsing
the feed. If you're building a DOM, maybe you can just hand that DOM
to the browser (the different ways of representing empty-element tags and
other XMLisms don't manifest themselves in the DOM). If you're handing
the browser a stream of text, well, you can surely convert the embedded
XHTML into HTML very nearly as easily as you can extract all the CDATA
section start and end markers.</para>

<para xml:id='p14'><personname><firstname>Julian</firstname><surname>Bond</surname></personname>
<link xlink:href="http://www.xml.com/cs/user/view/cs_msg/1414">chastises</link>
me for not providing alternatives and makes several good points.</para>

<para xml:id='p15'>First off, he suggests that escaping is legitimate because RSS
is just using XML as a transport protocol and aggregators can
reasonably argue that they have no control over the content. We've heard
this argument before, but he goes
on to point out that SOAP and xmlrpc faced similar problems. He's
absolutely right that this is not a new problem for systems using XML
as a transport protocol. I don't know what xmlrpc does, but SOAP has
two mechanisms for dealing with markup it can't embed in XML:
attachments and base64 encoding. If RSS adopts either of these (to the
exclusion of this escaping crap), I'll shut up. The point
being that attachments remove the necessity to worry about the problem
because the content isn't embedded in XML and base64 encoding makes it
very clear that this is a mechanical encapsulation mechanism, not
something users should start coding up by hand (in RSS or anywhere
else).</para>
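
<para>As a sketch of the base64 option, assuming a hypothetical
<literal>content</literal> element typed with the XML Schema
<literal>base64Binary</literal> datatype (the element name and the
payload, <literal>&lt;b&gt;hi&lt;/b&gt;</literal>, are made up for
illustration, and the namespace declarations are omitted):</para>

<programlisting><![CDATA[<!-- The payload is opaque to the XML parser, and nobody would type it by hand -->
<content xsi:type="xsd:base64Binary">PGI+aGk8L2I+</content>]]></programlisting>

<para>Nothing about that string invites a reader to treat it as
markup, which is exactly what makes it an honest encapsulation.</para>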

<para xml:id='p16'>He goes on to say that <quote>[escaping] may be wrong, it may be
ugly and it may well be harmful, but dammit, it works.</quote> Uhm.
Ok, for some definition of <quote>works</quote>, I suppose.
In any event, I think the ugly and harmful parts dramatically
overshadow the works part, especially since there are other mechanisms
that work without being either ugly or harmful.</para>

<para xml:id='p17'>Next, Julian ponders who's to blame if one jumps through hoops
to make well-formed markup out of tag soup and it goes wrong once in a
while. Ignoring for the moment the fact that I can't see how it can go
wrong, I'll point out again that just wrapping things in CDATA can go
wrong too. There are characters that can't be represented in CDATA and
there are encoding issues.</para>

<para xml:id='p18'>Lastly, he asks what my problem is, given that CDATA is part of
XML. I don't have any problem with CDATA. CDATA sections are fine, as
long as what you're putting in them is text. What's evil about this
escaped markup nonsense is the semantics provided by RSS/Atom, not the
use of CDATA to avoid lots of ampersands and semicolons. That's my problem.
</para>

<para xml:id='p19'>Finally, in a brief conversation yesterday, Tim told me that if
I looked at Sam's recent slides (though I haven't in fact gone looking
for them), I'd see examples of <emphasis>doubly</emphasis> escaped
markup. If you don't buy any of my other arguments, surely that's
enough to convince you that this way lies madness. Surely.</para>
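
<para>The following is only a sketch of what doubly escaped markup
looks like, not something taken from the slides; it assumes a
hypothetical <literal>description</literal> element and reuses the
example sentence from earlier in this essay:</para>

<programlisting><![CDATA[<!-- Each ampersand from the first round of escaping has been escaped again -->
<description>I &amp;lt;emph&amp;gt;really&amp;lt;/emph&amp;gt; found this book helpful.</description>]]></programlisting>

<para>The parser undoes one layer; a consumer still has to unescape
the result once more and then parse it as markup, and nothing in the
feed says how many layers to expect.</para>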

<para xml:id='p20'>No one has produced a single argument that even begins to
persuade me to accept escaped markup. I'm not a hardass, at least I
don't think I am, but I'm not letting go on this one. Escaped markup
<emphasis>is harmful</emphasis>. Stop it.</para>

</essay>



