Thinking about HTML5

Volume 11, Issue 13; 22 Jan 2008; last modified 08 Oct 2010

HTML 5 is big. Big in a lot of different ways. I'm trying to understand some of them. Let the random mutterings begin…

Computer Science is the first engineering discipline in which the complexity of the objects created is limited solely by the skill of the creator, and not by the strength of raw materials.

—B. Reid

I've been thinking about HTML 5 for a while now. I haven't been making very much progress. It would help if I knew to what ends I was hoping to make said progress, but that's one of the things to think about, I guess. Instead of silent contemplation, I'm going to try to scribble down some of my thoughts as I go.

The genesis of this essay was some thinking about validity, well-formedness, markup minimization, and parsing.

The design space for markup, especially markup that will be authored by hand (directly or indirectly), is pretty big. It's interesting to compare how SGML, XML, and HTML 5 fit in that space.

SGML was designed with ease of authoring in mind, at least to the extent that minimizing how much markup one had to type made authoring easier. Consider the following valid SGML document:

<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter final>Party like it's 1987</title>
<p>"Old school" baby
<glist>
shortref,datatag
<def>SGML features for markup minimization
</chapter>

It has the same interpretation as this document:

<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter status="final"><title>Party like it's 1987</title>
<p><q>Old school</q> baby</p>
<glist>
<term>shortref</term><term>datatag</term>
<def><p>SGML features for markup minimization</p></def>
</glist>
</chapter>

Because SGML required (pre-corrigendum) all documents to be valid, this flexibility came at a terrible price. SGML parsers were fiendishly hard to implement correctly. (For a wonderful giggle, consider the changes proposed in the corrigendum and then look up the definition of “corrigendum”.)

In the SGML world, those typing conveniences go hand-in-hand with validity. The information in the DTD:

<!ELEMENT chapter - - (title, (p|glist)+)>
<!ATTLIST chapter
          status (draft|final) #IMPLIED
>

<!ELEMENT title   o o (#PCDATA)*>

<!ELEMENT p       o o (#PCDATA|q)*>

<!ELEMENT glist   - o ((term+,def)+)>
<!ELEMENT term    o o (#PCDATA)*>
<!ELEMENT def     - o (p+)>

<!ELEMENT q       - o (#PCDATA)*>

<!ENTITY  MAPQS STARTTAG "q">
<!ENTITY  MAPQE ENDTAG "q">
<!ENTITY  MAPTS STARTTAG "term">

<!SHORTREF MAP-INQ '"' MAPQE>
<!SHORTREF MAP-INP '"' MAPQS>
<!SHORTREF MAP-INT "," MAPTS>

<!USEMAP MAP-INQ q>
<!USEMAP MAP-INP p>
<!USEMAP MAP-INT term>

combined with the fact that the document is known/required to be valid, guarantees that there's only one interpretation of the characters in the document.
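To make that concrete, here is a toy sketch in Python of how a parser can use the declarations above to infer omitted end tags. It is not how a real SGML parser works: it reduces each content model to a flat set of permitted children, ignores ordering and the shortref maps entirely, and the names `ALLOWED`, `END_OMISSIBLE`, and `open_element` are invented for illustration.

```python
# Children each element may contain, flattened from the DTD's content models.
# (Assumption: sequence and repetition are ignored; only membership is kept.)
ALLOWED = {
    "chapter": {"title", "p", "glist"},
    "title": set(),
    "p": {"q"},
    "glist": {"term", "def"},
    "term": set(),
    "def": {"p"},
    "q": set(),
}

# Elements whose end tag the DTD marks as omissible (second flag is "o").
END_OMISSIBLE = {"title", "p", "glist", "term", "def", "q"}


def open_element(stack, name):
    """Open `name`, closing any open elements that cannot contain it.

    `stack` is the list of currently open elements, innermost last.
    Returns the list of end tags that were implied, in closing order.
    """
    implied = []
    while stack and name not in ALLOWED[stack[-1]]:
        top = stack[-1]
        if top not in END_OMISSIBLE:
            raise ValueError(f"<{name}> is not allowed inside <{top}>")
        implied.append(stack.pop())
    stack.append(name)
    return implied


# In the minimized document, <glist> arrives while <p> is still open.
stack = ["chapter", "p"]
implied_by_glist = open_element(stack, "glist")

# Later, <def> arrives while a <term> is still open inside <glist>.
stack2 = ["chapter", "glist", "term"]
implied_by_def = open_element(stack2, "def")
```

Because `p` cannot contain `glist` and its end tag is omissible, the only valid reading is that the paragraph ended just before the list began. Without the DTD, neither fact is available.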

XML was designed with ease of parsing in mind. In particular, it relaxed the validity constraint and obviated the need for a DTD. Without a DTD, a parser doesn't know the vocabulary, so it's impossible to know where implied markup boundaries should go, and you can't have any.
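A quick way to see the difference is to hand both forms of the chapter to an XML parser with no DTD at all. The sketch below uses Python's standard `xml.etree.ElementTree`; the DOCTYPE lines are left out, and `is_well_formed` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# The fully tagged form of the chapter, without its DOCTYPE declaration.
NORMALIZED = """<chapter status="final"><title>Party like it's 1987</title>
<p><q>Old school</q> baby</p>
<glist>
<term>shortref</term><term>datatag</term>
<def><p>SGML features for markup minimization</p></def>
</glist>
</chapter>"""

# The minimized SGML form: valueless attribute, omitted tags, shortrefs.
MINIMIZED = """<chapter final>Party like it's 1987</title>
<p>"Old school" baby
<glist>
shortref,datatag
<def>SGML features for markup minimization
</chapter>"""


def is_well_formed(text):
    """Return True if `text` parses as XML without consulting any DTD."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


normalized_ok = is_well_formed(NORMALIZED)
minimized_ok = is_well_formed(MINIMIZED)
```

The fully tagged document parses on its own. The minimized one is rejected outright, because the parser has no vocabulary from which to infer the missing markup.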

SGML and XML are both “meta markup languages”. They have no defined vocabulary. SGML includes a mechanism that allows users to invent their own tag vocabularies; XML has several such mechanisms.

HTML 5, in contrast, is explicitly a single vocabulary (or perhaps a small family of vocabularies). As such, it would be much less interesting were it not for two facts: first, it is a revision of the single most important vocabulary on the planet and second, it is neither SGML nor XML.

One of the two “authoring formats” described by the HTML 5 specification is a custom one. The other is XML, but in fact both are described as just concrete syntaxes for “an abstract language for describing documents and applications” which is what is really being defined.

The goal of the custom parser, as I understand it, is to impose an unambiguous HTML 5 interpretation on any random stream of characters.
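For a rough feel of what "never reject the input" looks like, here is a small sketch using Python's standard `html.parser`. To be clear about the assumptions: `html.parser` is a lenient tokenizer, not an implementation of the HTML 5 parsing algorithm, and it builds no tree. It only shows sloppy input being turned into events instead of an error. `EventCollector` and `tokenize` are names invented for this example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record every start tag, end tag, and text run the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(text):
    """Feed `text` to the lenient tokenizer and return the recorded events."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()  # flush any text still buffered at end of input
    return collector.events


# An unquoted attribute value and two elements that are never closed.
events = tokenize("<p class=note>Unquoted <b>and unclosed")
```

An XML parser would stop at the unquoted attribute value. Here the same characters produce a `p` start tag with `class` set to `note`, and no error is raised for the missing end tags.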

While that offers some apparent benefits to end users (they don't, for example, have to remember to type quotes around their attribute values), I harbor some reservations about whether or not this strategy will be a good thing for the broader markup community in the long run.

I suppose to the extent that HTML 5 is the one and only markup vocabulary that you will ever need, it'll be a good thing. And, you know, for a whole lot of people, that's probably the case.

Comments

Few will dispute that HTML5 will be adopted very differently than HTML. HTML was adopted by everyone at the same time because the W3 was the next big thing, browsers were crude and at first they all introduced their own "extensions". People (by which I mean me) authored HTML without even knowing what a doctype declaration was, and books that TAUGHT "web design" ignored closing tags and semantics. That led people to author invalid documents in ignorance.

When W3C recommends their next version of (x)HTML, regardless of whether well-formed, valid XML is required, the early adopters will be the highly technical people who author compliant XML by habit. The plebs will probably never learn the intricacies of HTML5. For them, HTML4 is good enough, and besides that the trend has been for sites to provide their own bracket-based markup, along the lines of wiki markup or bbcode.

So I wouldn't worry about messy HTML5 polluting the web... even if it is technically allowed I don't think many will use it.