HTML 5 is big. Big in a lot of different ways. I'm trying to understand some of them. Let the random mutterings begin…
Computer Science is the first engineering discipline in which the complexity of the objects created is limited solely by the skill of the creator, and not by the strength of raw materials.
I've been thinking about HTML 5 for a while now. I haven't been making very much progress. It would help if I knew to what ends I was hoping to make said progress, but that's one of the things to think about, I guess. Instead of silent contemplation, I'm going to try to scribble down some of my thoughts as I go.
The genesis of this essay was some thinking about validity, well-formedness, markup minimization, and parsing.
SGML was designed with ease of authoring in mind, at least to the extent that minimizing how much markup one has to type makes authoring easier. Consider the following valid SGML document:
<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter final>Party like it's 1994</title>
<p>"Old school" baby
<glist>
shortref,datatag
<def>SGML features for markup minimization
</chapter>
It has the same interpretation as this document:
<!DOCTYPE chapter SYSTEM "shortcuts.dtd">
<chapter status="final"><title>Party like it's 1994</title>
<p><q>Old school</q> baby</p>
<glist>
<term>shortref</term><term>datatag</term>
<def><p>SGML features for markup minimization</p></def>
</glist>
</chapter>
In the SGML world, those typing conveniences go hand-in-hand with validity. The information in the DTD:
<!ELEMENT chapter - - (title, (p|glist)+)>
<!ATTLIST chapter
 status (draft|final) #IMPLIED
>

<!ELEMENT title o o (#PCDATA)*>

<!ELEMENT p o o (#PCDATA|q)*>

<!ELEMENT glist - o ((term+,def)+)>
<!ELEMENT term o o (#PCDATA)*>
<!ELEMENT def - o (p+)>

<!ELEMENT q - o (#PCDATA)*>

<!ENTITY MAPQS STARTTAG "q">
<!ENTITY MAPQE ENDTAG "q">
<!ENTITY MAPTS STARTTAG "term">

<!SHORTREF MAP-INQ '"' MAPQE>
<!SHORTREF MAP-INP '"' MAPQS>
<!SHORTREF MAP-INT "," MAPTS>

<!USEMAP MAP-INQ q>
<!USEMAP MAP-INP p>
<!USEMAP MAP-INT term>
combined with the fact that the document is known (indeed, required) to be valid, guarantees that there's only one interpretation of the characters in the document.
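To make that a little more concrete, here is a rough sketch in Python of how declared structure lets a parser put omitted end tags back. It is nothing like a real SGML parser: the two tables are my own hand-simplified summary of the DTD (they ignore ordering, repetition, and required elements), the function name is made up, and it only handles omitted end tags, not omitted start tags or short references.

```python
# Hypothetical, hand-simplified summary of shortcuts.dtd: which child
# elements each element may contain, and whose end tag may be omitted.
# Real SGML content models also constrain order and repetition.
ALLOWED_CHILDREN = {
    "chapter": {"title", "p", "glist"},
    "title": set(),
    "p": {"q"},
    "glist": {"term", "def"},
    "term": set(),
    "def": {"p"},
    "q": set(),
}
END_TAG_OMISSIBLE = {"title", "p", "glist", "term", "def", "q"}


def infer_end_tags(tokens):
    """Return the tag sequence with implied end tags made explicit.

    tokens is a list of ("start", name) or ("end", name) pairs.
    """
    stack, events = [], []

    def close_implied():
        # Close the innermost open element, if the DTD summary allows it.
        name = stack[-1]
        if name not in END_TAG_OMISSIBLE:
            raise ValueError(f"end tag for <{name}> may not be omitted")
        stack.pop()
        events.append(f"</{name}>")

    for kind, name in tokens:
        if kind == "start":
            # An element that cannot contain the new tag must have ended.
            while stack and name not in ALLOWED_CHILDREN[stack[-1]]:
                close_implied()
            stack.append(name)
            events.append(f"<{name}>")
        else:
            # An explicit end tag implicitly closes everything inside it.
            while stack and stack[-1] != name:
                close_implied()
            if not stack:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
            events.append(f"</{name}>")
    return events


# The tag skeleton of the example, with start tags written out
# and every end tag except </chapter> omitted.
tokens = [
    ("start", "chapter"), ("start", "title"), ("start", "p"),
    ("start", "glist"), ("start", "term"), ("start", "term"),
    ("start", "def"), ("start", "p"), ("end", "chapter"),
]
print("".join(infer_end_tags(tokens)))
```

The point is only that every decision the loop makes comes from the tables. Take the tables away and there is nothing to decide with.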
XML was designed with ease of parsing in mind. In particular, it relaxed the validity constraint and obviated the need for a DTD. Without a DTD, the parser doesn't know the vocabulary, so it's impossible to know where implied markup boundaries should go, and you can't have any.
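A quick way to see the trade is to hand both styles to an XML parser that has no DTD at all. This is a small sketch using Python's standard `xml.etree.ElementTree`; the two fragments are my own cut-down versions of the glossary list above, and `is_well_formed` is just a helper name I made up.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Report whether text parses as XML, with no DTD involved."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Every element boundary is spelled out, so no vocabulary is needed.
explicit = (
    "<glist><term>shortref</term><term>datatag</term>"
    "<def><p>SGML features for markup minimization</p></def></glist>"
)

# The same content with end tags left out, SGML style.
minimized = (
    "<glist><term>shortref<term>datatag"
    "<def><p>SGML features for markup minimization</glist>"
)

print(is_well_formed(explicit))
print(is_well_formed(minimized))
```

The parser never learns what a `term` or a `def` is. It accepts the first fragment because the boundaries are all there, and it rejects the second because it has no way to guess them.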
SGML and XML are both “meta markup languages”. They have no defined vocabulary. SGML includes a mechanism that allows users to invent their own tag vocabularies; XML has several such mechanisms.
HTML 5, in contrast, is explicitly a single vocabulary (or perhaps a small family of vocabularies). As such, it would be much less interesting were it not for two facts: first, it is a revision of the single most important vocabulary on the planet, and second, it is neither SGML nor XML.
One of the two “authoring formats” described by the HTML 5 specification is a custom one. The other is XML, but in fact both are described as just concrete syntaxes for “an abstract language for describing documents and applications” which is what is really being defined.
The goal of the custom parser, as I understand it, is that it imposes an unambiguous HTML 5 interpretation on any random stream of characters.
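For a feel of what "any random stream of characters" means in practice, here is a small sketch that feeds the same sloppy input to a forgiving HTML tokenizer and to an XML parser. A loud caveat: Python's standard `html.parser` is not an implementation of the HTML 5 parsing algorithm. It only reports tags and text as it sees them and builds no tree, so treat it as a stand-in for the general attitude, not for the specification. The class name and the sample input are mine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class EventCollector(HTMLParser):
    """Record the tags and text that the lenient tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


# Unquoted attribute value, and the paragraph is never closed.
sloppy = "<p class=note>Unquoted, and never closed"

collector = EventCollector()
collector.feed(sloppy)
collector.close()
print(collector.events)

# The same characters are simply not XML.
try:
    ET.fromstring(sloppy)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False
print(xml_accepts)
```

The tokenizer reports a `p` start tag with a `class` attribute and carries on. The XML parser refuses the input outright.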
While that offers some apparent benefits to end users (they don't, for example, have to remember to type quotes around their attribute values), I harbor some reservations about whether or not this strategy will be a good thing for the broader markup community in the long run.
I suppose to the extent that HTML 5 is the one and only markup vocabulary that you will ever need, it'll be a good thing. And, you know, for a whole lot of people, that's probably the case.