HTML 5 is big. Big in a lot of different ways. I'm trying to understand some of them. Let the random mutterings begin…

Computer Science is the first engineering discipline in which the complexity of the objects created is limited solely by the skill of the creator, and not by the strength of raw materials.

B. Reid

I've been thinking about HTML 5 for a while now. I haven't been making very much progress. It would help if I knew to what ends I was hoping to make said progress, but that's one of the things to think about, I guess. Instead of silent contemplation, I'm going to try to scribble down some of my thoughts as I go.

The genesis of this essay was some thinking about validity, well-formedness, markup minimization, and parsing.

The design space for markup, especially markup that will be authored by hand (directly or indirectly), is pretty big. It's interesting to compare how SGML, XML, and HTML 5 fit in that space.

[Photo]

Party like it's 1994

SGML was designed with ease of authoring in mind, at least to the extent that minimizing how much markup one had to type was an aid to authoring. Consider the following valid SGML document:

  <!DOCTYPE chapter SYSTEM "shortcuts.dtd">
  <chapter final>Party like it's 1994</title>
  <p>"Old school" baby
  <glist>
  shortref,datatag
  <def>SGML features for markup minimization
  </chapter>

It has the same interpretation as this document:

  <!DOCTYPE chapter SYSTEM "shortcuts.dtd">
  <chapter status="final"><title>Party like it's 1994</title>
  <p><q>Old school</q> baby</p>
  <glist>
  <term>shortref</term><term>datatag</term>
  <def><p>SGML features for markup minimization</p></def>
  </glist>
  </chapter>

Because SGML required (pre-corrigendum[1]) all documents to be valid, this flexibility came at a terrible price: SGML parsers were fiendishly hard to implement correctly.

In the SGML world, those typing conveniences go hand-in-hand with validity. The information in the DTD:

  <!ELEMENT chapter - - (title, (p|glist)+)>
  <!ATTLIST chapter
            status (draft|final) #IMPLIED
  >

  <!ELEMENT title   o o (#PCDATA)*>

  <!ELEMENT p       o o (#PCDATA|q)*>

  <!ELEMENT glist   - o ((term+,def)+)>
  <!ELEMENT term    o o (#PCDATA)*>
  <!ELEMENT def     - o (p+)>

  <!ELEMENT q       - o (#PCDATA)*>

  <!ENTITY  MAPQS STARTTAG "q">
  <!ENTITY  MAPQE ENDTAG "q">
  <!ENTITY  MAPTS STARTTAG "term">

  <!SHORTREF MAP-INQ '"' MAPQE>
  <!SHORTREF MAP-INP '"' MAPQS>
  <!SHORTREF MAP-INT "," MAPTS>

  <!USEMAP MAP-INQ q>
  <!USEMAP MAP-INP p>
  <!USEMAP MAP-INT term>

combined with the fact that the document is known (indeed, required) to be valid, guarantees that there's only one interpretation of the characters in the document.
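To make that a little more concrete, here's a rough Python sketch of how knowing the vocabulary lets a parser work out where omitted end tags belong. It's a toy, not an SGML parser: the names `ALLOWED`, `END_OMISSIBLE`, and `imply_end_tags` are mine, the content models from the DTD are flattened into unordered sets of permitted children, and start-tag omission, SHORTREF maps, and attribute minimization are ignored entirely.

```python
# Toy reduction of the DTD above. Assumption: each content model is
# flattened to the set of element names it may contain, ignoring order
# and repetition, which real SGML content models do not ignore.
ALLOWED = {
    "chapter": {"title", "p", "glist"},
    "title": set(),
    "p": {"q"},
    "glist": {"term", "def"},
    "term": set(),
    "def": {"p"},
    "q": set(),
}

# Elements whose declaration has "o" in the end-tag position.
END_OMISSIBLE = {"title", "p", "glist", "term", "def", "q"}


def imply_end_tags(events):
    """Insert the end tags that the (toy) DTD says must be implied.

    events is a list of ("start", name), ("end", name) or ("text", data).
    Returns a new list with every implied ("end", name) made explicit.
    """
    stack, out = [], []

    def close_until(condition, culprit):
        # Pop open elements until the top of the stack satisfies condition.
        while stack and not condition(stack[-1]):
            if stack[-1] not in END_OMISSIBLE:
                raise ValueError(f"{culprit} is not allowed inside <{stack[-1]}>")
            out.append(("end", stack.pop()))

    for kind, value in events:
        if kind == "start":
            # Close anything that cannot contain the new element.
            close_until(lambda top: value in ALLOWED[top], f"<{value}>")
            stack.append(value)
        elif kind == "end":
            # Close anything still open inside the element being ended.
            close_until(lambda top: top == value, f"</{value}>")
            if not stack:
                raise ValueError(f"unmatched </{value}>")
            stack.pop()
        out.append((kind, value))
    return out
```

Fed the start tags of the second document with only the `title` and `chapter` end tags supplied, the sketch puts the end tags for `p`, `term`, `def`, and `glist` where that document has them. Without the table there would be nothing to decide with, which is the point.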

XML was designed with ease of parsing in mind. In particular, it relaxed the validity constraint and obviated the need for a DTD. Without a DTD, the parser doesn't know the vocabulary, so it's impossible to know where implied markup boundaries should go, and you can't have any.
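A quick way to see the consequence, sketched with Python's standard `xml.etree.ElementTree` (the helper name `is_well_formed` and the two sample strings are mine): the fully tagged document parses with no DTD in sight, while the same document with one end tag left out is simply rejected.

```python
import xml.etree.ElementTree as ET

# Every element explicitly opened and closed: no DTD required.
explicit = (
    "<chapter><title>Party like it's 1994</title>"
    "<p>Old school baby</p></chapter>"
)

# The same text with the </p> end tag omitted, SGML style.
minimized = (
    "<chapter><title>Party like it's 1994</title>"
    "<p>Old school baby</chapter>"
)


def is_well_formed(text):
    """Return True if an XML parser accepts text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(explicit))
print(is_well_formed(minimized))
```

The parser has no way to guess that `</chapter>` was meant to close the paragraph as well, so it doesn't try.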

SGML and XML are both “meta markup languages”. They have no defined vocabulary. SGML includes a mechanism that allows users to invent their own tag vocabularies; XML has several such mechanisms.

HTML 5, in contrast, is explicitly a single vocabulary (or perhaps a small family of vocabularies). As such, it would be much less interesting were it not for two facts: first, it is a revision of the single most important vocabulary on the planet, and second, it is neither SGML nor XML.

One of the two “authoring formats” described by the HTML 5 specification is a custom one. The other is XML, but in fact both are described as just concrete syntaxes for “an abstract language for describing documents and applications” which is what is really being defined.

The goal of the custom parser, as I understand it, is that it imposes an unambiguous HTML 5 interpretation on any random stream of characters.

While that offers some apparent benefits to end users (they don't, for example, have to remember to type quotes around their attribute values), I harbor some reservations about whether or not this strategy will be a good thing for the broader markup community in the long run.
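For a feel of what that tolerance looks like from the outside, here's a small sketch using Python's standard `html.parser`. To be clear about the assumptions: that module is not an implementation of the HTML 5 parsing algorithm, it only tokenizes and builds no tree, and the `EventCollector` class and `tokenize` helper are names I made up. It does share the relevant attitude, though: unquoted attribute values and missing end tags are accepted without complaint.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record the tags and text the tolerant tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def tokenize(text):
    """Run text through the collector and return the recorded events."""
    collector = EventCollector()
    collector.feed(text)
    collector.close()
    return collector.events


# Unquoted attribute value, and neither element is ever closed.
for event in tokenize("<p class=intro>Old school<b>baby"):
    print(event)
```

No exception is raised and no end-tag events are reported; deciding what the missing end tags mean is left to whatever consumes the events. The HTML 5 specification goes further and defines that part too.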

I suppose to the extent that HTML 5 is the one and only markup vocabulary that you will ever need, it'll be a good thing. And, you know, for a whole lot of people, that's probably the case.


[1] For a wonderful giggle, consider the changes proposed in that document and then look up the definition of “corrigendum”.