After considering a set of use cases, the HTML/XML Task Force has decided to issue a report of its findings. This is not that report. These are some notes, musings, and experimental prose.
[Nothing in this essay can be construed as the consensus opinion of the Task Force. These are Norm's opinions only. And not always his most politically correct opinions, either. He won't deny saying them, but he wouldn't dream of saying them with his Task Force chair hat on, either. —ed.]
It is difficult to articulate a concise problem statement that describes exactly what “XML and HTML divergence” is or why it matters. [That's never a good start —ed.] I suspect that the root of the problem is that the communities involved have very little in common and have been pushed, by events and circumstances, into positions that very often feel adversarial. Socially, the situation is simply a disaster. The W3C and the WHATWG get along spectacularly badly. The HTML community and the XML community struggle with absolutely fundamental disagreements. The whole body of discourse now seems like a poisonous morass punctuated by angry words, bruised feelings, and intolerable personality conflicts.
All of which is a bit ironic when you consider how much shared history there appears to be between the two technologies.
But perhaps the appearance of a shared history contributes to the problem, because it doesn't really exist in practice. HTML was inspired by SGML and eventually described as an SGML application but in practice it was never implemented that way by popular browsers and never understood that way by most users (or even most developers). XML was a simplification of SGML, but imposed its own costs; at first in the rigidity of its error handling and then later through its own growth in complexity.
We find ourselves now in a place where there are these two languages that have a similar looking surface syntax but very little actual common ground. Where HTML parsers accept more or less any sequence of characters and turn it into a tree, XML parsers impose strict constraints on the sequences of characters that can be interpreted as a tree. Where XML parsers build a tree directly from the sequence of characters presented, HTML parsers may inject new elements or otherwise modify the tree based on context. Where HTML has no element-name extensibility mechanism and reserves all element names for its own future use, XML has namespaces and is explicitly designed to allow disconnected communities to develop extensions that won't collide if the two communities share documents in the future.
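The first of those contrasts is easy to see in a few lines. What follows is a minimal sketch, not a definitive comparison: it assumes Python's standard library, where `xml.etree.ElementTree` is a conforming XML parser but `html.parser` is only a lenient tokenizer, not a full HTML5 parser. It therefore shows the difference in error handling, not the tree rewriting that real HTML parsers perform. The names `SOUP`, `xml_accepts`, and `TagCollector` are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Tag soup: unquoted attribute, <br> with no slash, <p> never closed.
SOUP = "<p class=note>Hello<br>world"


def xml_accepts(text: str) -> bool:
    """Return True if the XML parser considers the text well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records the start tags reported by the lenient HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SOUP)
collector.close()

print("XML parser accepts the soup:", xml_accepts(SOUP))
print("HTML tokenizer saw start tags:", collector.tags)
print("XML parser accepts the tidy version:",
      xml_accepts('<p class="note">Hello<br/>world</p>'))
```

The same characters are a fatal error to one parser and ordinary input to the other, and that gap is most of what the rest of this essay is about.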
So the problem statement boils down to something like this: we see the development of two divergent markup languages; the markup landscape has been forked. That imposes a duplication of effort in tooling and reduced interoperability for users. That's a bad problem. Fix it.
The first interesting observation that can be made here is that some people simply disagree that a problem exists. If you present someone with a problem and they disagree that it's a problem, there are a few possibilities:
They're too stupid or naive to see the problem. I don't think that's the case here. There are lots of smart folks on all sides.
They're motivated by other forces to pretend the problem doesn't exist. I don't think that's the case either. In my darker moments, I'm sometimes suspicious that a small number of niche implementors have a rather disproportionate influence and that makes me wonder about external motivations, but only if I'm in a black mood to begin with.
They just don't understand the problem. Or they understand the problem statement perfectly but disagree that it's an actual problem.
If it's the last, then one way to build a shared understanding of the problem is to examine real world use cases and examine possible solutions.
The task force set out to examine a number of use cases. They boiled down to a smaller number than I would have at first expected:
- How can an XML toolchain be used to consume HTML?
If we take “HTML” to mean the wild and wooly world of HTML as she is writ on the web, then using an XML parser is simply out of the question. However, HTML5 parsers produce a tree and some of those parsers expose the interface to that tree as a sequence of events, such as SAX events, from which an XML parser can build its data structures. In short: if you want to consume HTML, you need an HTML parser, and if you have an HTML parser that exposes the tree in a way that your XML process can ingest, you're golden.
In a much narrower sense, if you have tight control over the HTML, then you might get away with polyglot markup.
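The "HTML parser in front, XML tools behind" arrangement can be sketched roughly as follows. This is an illustration under loud assumptions: Python's `html.parser` stands in for the HTML parser even though it does not implement the HTML5 tree-construction rules (a real pipeline would want a true HTML5 parser), and the `HtmlToTree` adapter, its `finish` method, and the sample markup are all hypothetical. The point is only the shape: parser events go in, and an ordinary `ElementTree` that XML tooling can query and serialize comes out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML void elements never take an end tag, so they are closed immediately.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Hypothetical adapter: HTML tokenizer events in, ElementTree out."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.stack = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return  # ignore stray end tags rather than failing
        # Close any elements left open inside the one being closed.
        while self.stack:
            current = self.stack.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.stack:  # drop text that falls outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()  # flush any text the tokenizer is still buffering
        while self.stack:
            self.builder.end(self.stack.pop())
        return self.builder.close()


parser = HtmlToTree()
parser.feed("<div class=intro><p>Hello<br>world</div>")
root = parser.finish()

# From here on it is plain XML tooling: serialize, re-parse, query.
xml_text = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(xml_text)
print(xml_text)
```

Once the tree exists, nothing downstream needs to know that the input was never well-formed.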
- How can an HTML toolchain be used to consume XML?
This question is the natural analog of the first, but it's not clear that any such toolchains exist today, or are likely to exist in the future. So this is a somewhat speculative use case.
The HTML parser will process the document that contains XML markup, but it will build a tree according to the HTML rules. That might work in some cases, but it's not likely to work in general. But there are lots of XML tools for doing transformations, so it may be possible to transform the XML into something more HTML-like. Alternatively, a simplified subset of XML could be developed that would increase the chance that such documents would produce rational results when parsed with an HTML parser.
- How can islands of HTML be embedded in XML?
If you want the XML parser to accept the HTML, then you have to either make it well-formed (i.e., use the XHTML serialization of HTML) or escape it (i.e., &lt;p&gt;…). Alternatively, you could rely on a more robust packaging infrastructure, but that's not quite in the spirit of embedding. Given that HTML has an XML serialization and the natural precondition for this problem is that you have an XML environment, it's pretty straightforward.
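Both embedding options can be tried with nothing more than a standard XML library. The sketch below assumes Python's `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; the wrapper elements `entry` and `content` are hypothetical names chosen for the example, not taken from any particular vocabulary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The same paragraph twice: once as tag soup, once in the XHTML serialization.
HTML_SOUP = "<p>Fish &amp; chips<br>today"
XHTML = '<p xmlns="' + XHTML_NS + '">Fish &amp; chips<br/>today</p>'

# Option 1: escape the HTML, so the XML parser sees only character data.
escaped_doc = "<entry><content>" + escape(HTML_SOUP) + "</content></entry>"
entry = ET.fromstring(escaped_doc)
recovered = entry.find("content").text  # the original string, markup and all

# Option 2: embed well-formed XHTML, so the markup becomes part of the tree.
embedded_doc = "<entry><content>" + XHTML + "</content></entry>"
entry2 = ET.fromstring(embedded_doc)
paragraph = entry2.find("content/{" + XHTML_NS + "}p")

print(recovered)
print(paragraph.text)
```

The trade-off is visible in what comes back: the escaped form returns an opaque string that still needs an HTML parser, while the XHTML form is immediately addressable by XML tools.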
- How can islands of XML be embedded in HTML?
<html>
<head>
<title>Document with some XML Data</title>
</head>
<body>
<p>This is an HTML paragraph.</p>
<script type="application/xml">
<status xmlns="http://example.com/ns/status-markup">
  <title>Some XML</title>
  <p xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
     geo:lat="18.88579" geo:long="2.2948">Not an HTML paragraph
     about the <link href="http://en.wikipedia.org/wiki/Eiffel_tower">Eiffel
     Tower</link>.</p>
</status>
</script>
<p>This is another HTML paragraph.</p>
</body>
</html>
That XML fragment is in the HTML, albeit with the markup escaped so you'll have to re-parse it with some XML API in the user agent to get at it. From the use case perspective, this is clearly an XML island embedded in HTML for some definition of “XML” and “embedded”. (Henri Sivonen put a live demo of the technique online.)
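The re-parsing step can be sketched outside a browser as well. The example below is an assumption-laden illustration, not the technique from the live demo: it uses Python's `html.parser`, which treats the contents of `script` as raw text, to pull the island out, and then hands that text to `xml.etree.ElementTree`. The class name `IslandFinder` is hypothetical and the page is a shortened version of the one above. In a user agent the analogous step would be to read the script element's text and pass it to an XML API such as `DOMParser`.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

PAGE = """<html><head><title>Document with some XML Data</title></head>
<body>
<p>This is an HTML paragraph.</p>
<script type="application/xml">
<status xmlns="http://example.com/ns/status-markup">
  <title>Some XML</title>
</status>
</script>
<p>This is another HTML paragraph.</p>
</body></html>"""

STATUS_NS = "http://example.com/ns/status-markup"


class IslandFinder(HTMLParser):
    """Collects the raw text of every <script type="application/xml">."""

    def __init__(self):
        super().__init__()
        self.in_island = False
        self.islands = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/xml":
            self.in_island = True
            self.islands.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_island = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self.in_island:
            self.islands[-1] += data


finder = IslandFinder()
finder.feed(PAGE)
finder.close()

# The island is just text to the HTML side; an XML parser makes it a tree.
island = ET.fromstring(finder.islands[0].strip())
title = island.find("{" + STATUS_NS + "}title").text
print(island.tag, title)
```

Note that the `title` inside the island never collides with the HTML `title`, because the HTML parser never looks inside the script element and the XML parser sees it in its own namespace.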
- Can XML be made more forgiving?
I'm putting two related use cases under this umbrella. The first is about string concatenation. It observes that most HTML is produced by string concatenation and that HTML is fairly easy to produce this way. XML is considerably more difficult to produce this way, mostly because it's much less forgiving about errors. In fact, it's pretty much entirely unforgiving about errors of any sort: markup errors, unexpected Unicode characters in various contexts, etc.
It's possible to imagine a reformulation of XML that is much more forgiving about errors. Anne van Kesteren outlined one such approach in XML5.
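To make the string concatenation problem concrete, here is a small sketch of XML as it stands today, not of XML5. It assumes Python's standard library (`xml.sax.saxutils.escape` and `quoteattr` are real functions); the `item` element, its `href` attribute, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def is_well_formed(text: str) -> bool:
    """Return True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def naive_item(title: str, url: str) -> str:
    # Plain concatenation: fine for HTML in practice, fragile for XML.
    return '<item href="' + url + '">' + title + "</item>"


def careful_item(title: str, url: str) -> str:
    # quoteattr adds the surrounding quotes and escapes the value;
    # escape handles &, < and > in character data.
    return "<item href=" + quoteattr(url) + ">" + escape(title) + "</item>"


title = "Tom & Jerry <live>"
url = "http://example.com/?a=1&b=2"

print(is_well_formed(naive_item(title, url)))
print(is_well_formed(careful_item(title, url)))

# Escaping is not the whole story: some characters are simply not allowed.
print(is_well_formed("<item>page\x0cbreak</item>"))

# The careful version round-trips the original values.
item = ET.fromstring(careful_item(title, url))
```

An HTML parser would have made something of all three strings; the XML parser rejects two of them outright, which is exactly the behavior a more forgiving reformulation would have to change.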
The second asks much the same question but about application/xhtml+xml content in particular. Even if XML remained unchanged for the general case, could we imagine a future where users were encouraged to author in XML by making the specific case of application/xhtml+xml less draconian? Because this deals with only a specific document type, we might imagine front ends that do cleanup for that content type and only that content type.
There was another use case specifically about XForms, but it hasn't yet been fleshed out. I think there's some feeling that it's a specific example of the XML embedded in HTML use case. However, it's clearly more complex than a simple embedding. And on the other hand, the argument has been made that XForms is not a use case, it's a solution: the use case is more sophisticated forms processing.
So where does that leave us? More or less where we started, I'm afraid, though perhaps a little wiser. At least when the task force report comes out, there will be a document that describes some common use cases and how they might be approached using the technology as it stands.
I'm disappointed by the state of affairs that exists now, but the world would be a much different place in many more important ways if it could be bent arbitrarily to resolve my disappointments.
The HTML community is full of bright, energetic folks who believe they know the problem that needs to be solved and how to solve it. I don't think there's any evidence to suggest that they're choosing solutions from a position of ignorance.
The suggestion that HTML needs some form of distributed extensibility mechanism is dead.
I'm frankly astonished by the position that the HTML community has taken. The idea that any community too small or too weak to successfully lobby a standing committee to have their markup made part of the one, global, all-encompassing markup standard for the world wide web should simply have to do without is absurd to me.
Yes, I appreciate that extensibility diminishes interoperability. The fact that there are thousands of Perl modules out there means that the Perl I write may not be compatible with the Perl you have on your system, but by and large we muddle through without a central authority on what constitutes a valid Perl program.
I also appreciate that HTML has extensibility mechanisms and that for many kinds of information those mechanisms are sufficient. But I wouldn't be prepared to assert that they are sufficient for everyone: that among the chemists and horticulturists, astronomers and geographers, architects and historians, doctors and lawyers, nowhere in the vast sea of specialists does there exist any important information that can't be represented without new structure.
It's pretty clear what the immediate future holds, but I'm going to take an “I told you so” token and squirrel it away. The future is longer than the past.
Yes, I know it's slightly inflammatory to call web browsers a niche, but it's in aid of a point: in the grand scheme of things there's a lot you can do with markup other than format it and display it for human eyes.