<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       version="pto">
<info>
<title>Taking a Hard Line</title>
<volumenum>6</volumenum>
<issuenum>49</issuenum>
<pubdate>2003-06-30</pubdate>
<date>$Date: 2005-12-13 10:31:01 -0500 (Tue, 13 Dec 2005) $</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2003</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>There's serious debate in the successor-to-RSS world about how
to maintain escaped HTML markup in a feed. I'm just appalled.
[Updated 22 Aug 2003.]</para>
</abstract>
</info>
<epigraph>
<attribution>Paul Valéry</attribution>
<!-- FIXME: cwm barfs if I put the accented "e" in the indexterm! -->
<para xml:id='p1'><indexterm><primary>Valery</primary><secondary>Paul</secondary></indexterm>A
man who is <quote>of sound mind</quote> is one who
keeps the inner madman under lock and key.
</para>
</epigraph>

<para xml:id='p2'>[For a more recent, and perhaps marginally less inflammatory,
discussion of some of these issues, see
<link xlink:href="http://www.xml.com/pub/a/2003/08/20/embedded.html">Embedded
Markup Considered Harmful</link> on
<link xlink:href="http://www.xml.com/">XML.com</link>. --22 Aug 2003]</para>

<para xml:id='p3'>This is a flame. A rant. A tirade. There, I admit it. Up front.
Any offense you take is your own.</para>

<para xml:id='p4'>The ongoing <link xlink:href="http://www.intertwingly.net/wiki/pie/RoadMap">design
effort</link> to build a better syndication format (a successor to
RSS in all its versions) has stumbled in some predictable places.
On further reflection,
<link xlink:href="../26/echostatic">feature creep</link> doesn't surprise me:
what
starts out small and simple rapidly becomes not so small and not
so simple. I've served on enough working groups to know that this is
an inevitable side-effect of <quote>design by committee</quote>.
That's ok. There's still plenty of time to throw most of the cruft
away.</para>

<para xml:id='p5'>But I am astonished by the debate over escaped HTML. Most of the
phrases that spring immediately to mind when I consider this debate
are unprintable. At least, I'm not going to print them. Not in public.
Here's a sanitized summary: <quote>are you freaking
nuts?</quote></para>

<para xml:id='p6'>If there's any part of RSS that's totally broken, it's the notion that
I would publish the item description <quote>this is <emphasis>my</emphasis>
description</quote> like this:</para>

<screen>This is &amp;lt;em&amp;gt;my&amp;lt;/em&amp;gt; description.</screen>

<para xml:id='p7'>That's so totally absurd that I won't do it. I could generate
RSS feeds that way, but I'd rather just stick to plain vanilla text in
my abstracts than stoop that low.</para>

<para xml:id='p8'>I find reading
<link xlink:href="http://intertwingly.net/wiki/pie/EscapedHtmlDiscussion">the Wiki
pages</link> quite challenging, but as near as I can tell, there are
three arguments for allowing escaped HTML: legacy, tools, and content dispatching.</para>

<section xml:id="legacy">
<title>Legacy</title>

<para xml:id='p9'>You're joking, right? Any legacy data that you have will need to be transformed
into the new format. You can fix the botched markup when you do the transformation.
<personname><firstname>John</firstname><surname>Cowan</surname></personname>'s done
all the heavy lifting already with <link xlink:href="http://www.ccil.org/~cowan/XML/tagsoup"><application>TagSoup</application></link>.</para>
<para xml:id='p10'>I suppose there may be systems out there in which it will be difficult
or impossible to clean up the legacy data. I'm sympathetic, but not so sympathetic that
I'd consider that sufficient argument to carry this ugly kludge forward.</para>
</section>

<section xml:id="tools">
<title>Tools that generate markup that isn't well formed</title>

<para xml:id='p11'>Yeah. News flash: they're broken. Fix them
or use others.</para>

</section>

<section xml:id="feed-content">
<title>The content isn't interesting to the feed consumer</title>

<para xml:id='p12'>This argument follows from the fact that many feed consumers
just want to pass the content to another
application for interpretation. The aggregator, the argument goes,
doesn't care about the content and shouldn't have to parse it.</para>

<para xml:id='p13'>Ok, that sounds reasonable. For about a microsecond.</para>

<para xml:id='p14'>First off, you've got to parse the content anyway. Whether you
parse it as markup or text is irrelevant; it's still going through the
parser. Granted, if it's not well formed, it'll make the parser choke,
but that's a feature, folks.</para>

<para xml:id='p15'>Now, once you've got it parsed, if you really want to send it off to
some other application as text, that's dead simple. Going from markup back to
a serialized form (assuming you can't pass it on more efficiently as SAX events
or a DOM tree anyway) is completely straightforward.</para>
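
<para>Here's a minimal sketch of that round trip in Python, using only the
standard library. The feed fragment and its element names are made up for
illustration; they aren't taken from any draft of the new format. The parser
rejects markup that isn't well formed, and turning the parsed description back
into text takes only a few lines.</para>

<programlisting language="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment; the element names are illustrative only.
FEED_ITEM = (
    "<item><description>This is <em>my</em> description."
    "</description></item>"
)


def serialize_children(element):
    """Return an element's content (text plus child markup) as a string."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def description_as_text(feed_xml):
    """Parse a feed item and hand back its description as serialized markup."""
    # fromstring() raises ET.ParseError if the input is not well formed.
    item = ET.fromstring(feed_xml)
    return serialize_children(item.find("description"))


if __name__ == "__main__":
    print(description_as_text(FEED_ITEM))
```
]]></programlisting>

<para>An aggregator that only wants to forward the description could pass the
returned string along untouched. An unclosed <tag>em</tag> in the same
fragment would raise a parse error instead of slipping through.</para>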

<para xml:id='p16'>Going the other direction, taking a blob of text, unescaping the markup
characters, and reinterpreting it in the face of possible well-formedness errors is
much harder. And if you're expected to recover from broken markup (an expectation
that I thought was squashed once and for all
<link xlink:href="http://www.w3.org/TR/1998/REC-xml-19980210#sec-documents">in 1998</link>),
not only is it much harder still, it's a playground for all sorts of miscreants
to devise trojan horses and other mischief.</para>

<para xml:id='p17'>XML tools are commodities these days. Consensus that the
successor to RSS will be expressed in XML was achieved fairly quickly.
I expect to process feeds with XML tools and I expect those tools to
have access to <emphasis>all</emphasis> of the content. I want to be
able to examine the markup coming in to see if it's legit. Does it
contain <tag>script</tag> tags? Does it use a namespace that I
wish to reject or validate differently? In short, does it pass
muster?</para>
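
<para>Checks like these are short once the content is real markup. The
following is a sketch only, again in Python with the standard library: the
rejected namespace URI and the <code>passes_muster</code> name are invented
for the example, and a real policy would look at much more (attributes, for
a start).</para>

<programlisting language="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical policy: this namespace URI is made up for illustration.
REJECTED_NAMESPACES = {"http://example.com/ns/untrusted"}


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def passes_muster(content_xml):
    """Return False if the content has a script element or a rejected namespace."""
    # fromstring() raises ET.ParseError if the input is not well formed.
    root = ET.fromstring(content_xml)
    for element in root.iter():
        namespace, local = split_tag(element.tag)
        if local.lower() == "script":
            return False
        if namespace in REJECTED_NAMESPACES:
            return False
    return True


if __name__ == "__main__":
    print(passes_muster("<content>This is <em>my</em> description.</content>"))
    print(passes_muster("<content><script>alert(1)</script></content>"))
```
]]></programlisting>

<para>None of this is possible without a second, more forgiving parser when
the markup arrives as an escaped blob of text.</para>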

<para xml:id='p18'>Well-formed markup only<footnote>
<para xml:id='p19'>An extension module that supports base64-encoded binary data,
for distributing software patches or other binary content, might be reasonable,
but that doesn't belong in the base specification.</para></footnote>,
please.</para>

</section>
</essay>

