<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xml:lang="en"
       version='5.0'>
<info>
<title>Thinking about HTML5</title>
<volumenum>11</volumenum>
<issuenum>13</issuenum>
<pubdate>2008-01-22T19:23:15-05:00</pubdate>
<date>$Date$</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2008</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>HTML 5 is big. Big in a lot of different ways. I'm trying to
understand some of them. Let the random mutterings begin…</para>
</abstract>
</info>

<epigraph>
<attribution><personname><firstname>B.</firstname>
<surname>Reid</surname></personname></attribution>
<para xml:id='p2'>Computer Science is the first engineering
discipline in which the complexity of the objects created is limited
solely by the skill of the creator, and not by the strength of raw
materials.
</para>
</epigraph>

<para xml:id='p1'>I've been thinking about HTML 5 for a while now. I
haven't been making very much progress. It would help if I knew to
what ends I was hoping to make said progress, but that's one of the
things to think about, I guess. Instead of silent contemplation, I'm
going to try to scribble down some of my thoughts as I go.</para>

<para xml:id='p3'>The genesis of this essay was some thinking about validity,
well-formedness, markup minimization, and parsing.</para>

<para xml:id='p4'>The design space for markup, especially markup that will be
authored by hand (directly or indirectly), is pretty big. It's interesting
to compare how
<wikipedia page="Standard_Generalized_Markup_Language">SGML</wikipedia>,
<wikipedia>XML</wikipedia>, and
<wikipedia>HTML 5</wikipedia> fit in that space.</para>

<mediaobject role="flickr"><!--Party like it's 1994-->
  <imageobject xlink:href="http://www.flickr.com/photos/ndw/2213425394/">
    <imagedata fileref="http://farm3.static.flickr.com/2184/2213425394_e16ef07c1b.jpg"/>
  </imageobject>
</mediaobject>

<para xml:id='p5'>SGML was designed with ease of authoring in mind, at least to
the extent that minimizing how much markup one had to type made
authoring easier. Consider the following <emphasis>valid</emphasis>
<link xlink:href="examples/shortcuts.sgm">SGML document</link>:</para>

<programlisting><![CDATA[<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter final>Party like it's 1994</title>
<p>"Old school" baby
<glist>
shortref,datatag
<def>SGML features for markup minimization
</chapter>]]></programlisting>

<para xml:id='p6'>It has the same interpretation as
<link xlink:href="examples/longcuts.sgm">this document</link>:</para>

<programlisting><![CDATA[<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter status="final"><title>Party like it's 1994</title>
<p><q>Old school</q> baby</p>
<glist>
<term>shortref</term><term>datatag</term>
<def><p>SGML features for markup minimization</p></def>
</glist>
</chapter>]]></programlisting>

<para xml:id='p7'>Because SGML required
(pre-<link xlink:href="http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm">corrigendum</link><footnote><para xml:id='p8'>For a wonderful giggle, consider
the changes proposed in that document and then
<link xlink:href="http://www.onelook.com/?w=corrigendum&amp;ls=a">look up</link> the
definition of “corrigendum”.</para></footnote>)
all documents to be valid, this flexibility was possible, but it came
at a terrible price: SGML parsers were fiendishly hard to implement correctly.
</para>

<para xml:id='p9'>In the SGML world, those typing conveniences go hand-in-hand with
validity. The information
<link xlink:href="examples/shortcuts.dtd">in the DTD</link>:
</para>

<programlisting><![CDATA[<!ELEMENT chapter - - (title, (p|glist)+)>
<!ATTLIST chapter
          status (draft|final) #IMPLIED
>

<!ELEMENT title   o o (#PCDATA)*>

<!ELEMENT p       o o (#PCDATA|q)*>

<!ELEMENT glist   - o ((term+,def)+)>
<!ELEMENT term    o o (#PCDATA)*>
<!ELEMENT def     - o (p+)>

<!ELEMENT q       - o (#PCDATA)*>

<!ENTITY  MAPQS STARTTAG "q">
<!ENTITY  MAPQE ENDTAG "q">
<!ENTITY  MAPTS STARTTAG "term">

<!SHORTREF MAP-INQ '"' MAPQE>
<!SHORTREF MAP-INP '"' MAPQS>
<!SHORTREF MAP-INT "," MAPTS>

<!USEMAP MAP-INQ q>
<!USEMAP MAP-INP p>
<!USEMAP MAP-INT term>]]></programlisting>

<para xml:id='p10'>combined with the fact that the document is known (indeed, required) to be
valid, guarantees that there's only one interpretation of the characters in
the document.</para>

<para xml:id='p11'>XML was designed with ease of parsing in mind. In particular, it
relaxed the validity constraint and obviated the need for a
<wikipedia page="Document_Type_Definition">DTD</wikipedia>. Without
a DTD, the parser doesn't know the vocabulary, so it's impossible to know
where implied markup boundaries should go. XML therefore doesn't allow any.</para>
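
<para>As a rough illustration, here is a minimal sketch in Python (my
choice of language for the example, nothing in the argument depends on it).
It hands two strings to the standard library's
<literal>xml.etree.ElementTree</literal> parser, with no DTD anywhere in
sight: a fully tagged chapter like the one above, and the minimized SGML
form. The helper names <literal>is_well_formed</literal> and
<literal>element_names</literal> are made up for this sketch.</para>

<programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

# A fully tagged chapter, with no DOCTYPE: every boundary is explicit.
EXPLICIT = """<chapter status="final"><title>Party like it's 1994</title>
<p><q>Old school</q> baby</p>
<glist>
<term>shortref</term><term>datatag</term>
<def><p>SGML features for markup minimization</p></def>
</glist>
</chapter>"""

# The minimized SGML form: omitted tags, shortrefs, and a bare attribute value.
MINIMIZED = """<chapter final>Party like it's 1994</title>
<p>"Old school" baby
<glist>
shortref,datatag
<def>SGML features for markup minimization
</chapter>"""


def is_well_formed(text):
    """Return True if the XML parser accepts text (no DTD is consulted)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def element_names(text):
    """List the element names in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]


if __name__ == "__main__":
    print(is_well_formed(EXPLICIT))
    print(is_well_formed(MINIMIZED))
    print(element_names(EXPLICIT))]]></programlisting>

<para>The explicit form parses without any knowledge of the vocabulary; the
minimized form is rejected at the first tag, because an XML parser has no
DTD from which to work out what <literal>final</literal> means or where the
missing tags belong.</para>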

<para xml:id='p12'>SGML and XML are both “meta markup languages”.
They have no defined vocabulary. SGML includes a mechanism that allows
users to invent their own tag vocabularies; XML has several such mechanisms.
</para>

<para xml:id='p13'>HTML 5, in contrast, is explicitly a single vocabulary (or
perhaps a small family of vocabularies). As such, it would be much
less interesting were it not for two facts: first, it is a revision
of the single most important vocabulary on the planet, and second, it
is neither SGML nor XML.</para>

<para xml:id='p14'>One of the two “authoring formats” described by the HTML 5
specification is a custom one. The other is XML, but in fact both are
described as just concrete syntaxes for “an abstract language for
describing documents and applications”, which is what is really being
defined.</para>

<para xml:id='p15'>The goal of the custom parser, as I understand it, is that it
imposes an unambiguous <emphasis>HTML 5 interpretation</emphasis> on
any random stream of characters.</para>

<para xml:id='p16'>While that offers some apparent benefits to end users (they
don't, for example, have to remember to type quotes around their
attribute values), I harbor some reservations about whether
this strategy will be a good thing for the broader markup community in
the long run.</para>
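
<para>To see the difference in temperament, here is a small sketch in
Python. A loud caveat: the standard library's
<literal>html.parser</literal> is <emphasis>not</emphasis> an implementation
of the HTML 5 parsing algorithm. It is only a forgiving tokenizer that
reports events and builds no tree, so treat it as a stand-in for the general
idea of accepting whatever the author typed. The input string and the names
<literal>EventCollector</literal>, <literal>lenient_events</literal>, and
<literal>xml_accepts</literal> are invented for this example.</para>

<programlisting language="python"><![CDATA[from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical input: an unquoted attribute value and no end tags at all.
SLOPPY = "<p class=note>Unquoted attribute<p>and no end tags"


class EventCollector(HTMLParser):
    """Record the start tags and text that a lenient parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("text", data))


def lenient_events(text):
    """Feed text to the forgiving parser and return the events it saw."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()
    return collector.events


def xml_accepts(text):
    """Return True if a strict XML parser accepts the same text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for event in lenient_events(SLOPPY):
        print(event)
    print(xml_accepts(SLOPPY))]]></programlisting>

<para>The forgiving parser reports two <literal>p</literal> start tags and
reads <literal>note</literal> as the value of <literal>class</literal>; the
XML parser refuses the very same characters outright.</para>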

<para xml:id='p17'>I suppose to the extent that HTML 5 is the one and only markup
vocabulary that you will ever need, it'll be a good thing. And, you
know, for <emphasis>a whole lot</emphasis> of people, that's probably
the case.</para>

</essay>
