Thinking about the HTML and XML

Volume 14, Issue 6; 10 Feb 2011

After considering a set of use cases, the HTML/XML Task Force has decided to issue a report of its findings. This is not that report. These are some notes, musings, and experimental prose.

[Nothing in this essay can be construed as the consensus opinion of the Task Force. These are Norm's opinions only. And not always his most politically correct opinions, either. He won't deny saying them, but he wouldn't dream of saying them with his Task Force chair hat on, either. —ed.]

Problem Statement

It is difficult to articulate a concise problem statement that describes exactly what “XML and HTML divergence” is or why it matters. [That's never a good start —ed.] I suspect that the root of the problem is that the communities involved have very little in common and have been pushed, by events and circumstances, into positions that very often feel adversarial. Socially, the situation is simply a disaster. The W3C and the WHATWG get along spectacularly badly. The HTML community and the XML community struggle with absolutely fundamental disagreements. The whole body of discourse now seems like a poisonous morass punctuated by angry words, bruised feelings, and intolerable personality conflicts.

All of which is a bit ironic when you consider how much shared history there appears to be between the two technologies.

But perhaps the appearance of a shared history contributes to the problem, because it doesn't really exist in practice. HTML was inspired by SGML and eventually described as an SGML application but in practice it was never implemented that way by popular browsers and never understood that way by most users (or even most developers). XML was a simplification of SGML, but imposed its own costs; at first in the rigidity of its error handling and then later through its own growth in complexity.

We find ourselves now in a place where there are these two languages that have a similar looking surface syntax but very little actual common ground. Where HTML parsers accept more or less any sequence of characters and turn it into a tree, XML parsers impose strict constraints on the sequences of characters that can be interpreted as a tree. Where XML parsers build a tree directly from the sequence of characters presented, HTML parsers may inject new elements or otherwise modify the tree based on context. Where HTML has no element-name extensibility mechanism and reserves all element names for its own future use, XML has namespaces and is explicitly designed to allow disconnected communities to develop extensions that won't collide if the two communities share documents in the future.
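
The difference in error handling is easy to see in a few lines of code. What follows is a minimal sketch, not a demonstration of real browser behaviour: it assumes Python's standard library, where `xml.etree.ElementTree` stands in for "an XML parser" and `html.parser.HTMLParser` stands in for "an HTML parser". The latter is only a lenient tokenizer, not a conforming HTML5 parser, so it shows the tolerance but not the tree repairs described above. The names `SLOPPY`, `TagLogger`, and `xml_accepts` are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup of the kind found all over the web: unclosed <p> and <b> elements.
SLOPPY = "<p>First paragraph<p>Second <b>bold</p>"


class TagLogger(HTMLParser):
    """Record start and end tags as the lenient HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def xml_accepts(text):
    """Return True if the XML parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


logger = TagLogger()
logger.feed(SLOPPY)
logger.close()

print(xml_accepts(SLOPPY))   # the XML parser refuses the input outright
print(logger.events)         # the HTML tokenizer reports every tag it saw
```

The XML side has exactly two outcomes, a tree or an error. The HTML side always produces something, and it is then up to the tree builder (absent from this sketch) to decide what that something means.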

So the problem statement boils down to something like this: we see the development of two divergent markup languages; the markup landscape has been forked. That imposes a duplication of effort in tooling and reduced interoperability for users. That's a bad problem. Fix it.

The first interesting observation that can be made here is that some people simply disagree that a problem exists. If you present someone with a problem and they disagree that it's a problem, there are a few possibilities:

They're too stupid or naive to see the problem. I don't think that's the case here. There are lots of smart folks on all sides.
They're motivated by other forces to pretend the problem doesn't exist. I don't think that's the case either. In my darker moments, I'm sometimes suspicious that a small number of niche implementors have a rather disproportionate influence and that makes me wonder about external motivations, but only if I'm in a black mood to begin with. (Yes, I know it's slightly inflammatory to call web browsers a niche, but it's in aid of a point: in the grand scheme of things there's a lot you can do with markup other than format it and display it for human eyes.)
They just don't understand the problem. Or they understand the problem statement perfectly but disagree that it's an actual problem.

If it's the last, then one way to build a shared understanding of the problem is to examine real world use cases and examine possible solutions.

Use Cases

The task force set out to examine a number of use cases. They boiled down to a smaller number than I would have expected at first:

1. How can an XML toolchain be used to consume HTML?
2. How can an HTML toolchain be used to consume XML?
3. How can islands of HTML be embedded in XML?
4. How can islands of XML be embedded in HTML?
5. Can XML be made more forgiving?

1.	How can an XML toolchain be used to consume HTML?
	If we take “HTML” to mean the wild and wooly world of HTML as she is writ on the web, then using an XML parser is simply out of the question. However, HTML5 parsers produce a tree and some of those parsers expose the interface to that tree as a sequence of events, such as SAX events, from which an XML parser can build its data structures. In short: if you want to consume HTML, you need an HTML parser, and if you have an HTML parser that exposes the tree in a way that your XML process can ingest, you're golden. In a much narrower sense, if you have tight control over the HTML, then you might get away with polyglot markup.
2.	How can an HTML toolchain be used to consume XML?
	This question is the natural analog of the first, but it's not clear that any such toolchains exist today, or are likely to exist in the future. So this is a somewhat speculative use case. The HTML parser will process the document that contains XML markup, but it will build a tree according to the HTML rules. That might work in some cases, but it's not likely to work in general. But there are lots of XML tools for doing transformations, so it may be possible to transform the XML into something more HTML-like. Alternatively, a simplified subset of XML could be developed that would increase the chance that such documents would produce rational results when parsed with an HTML parser.
3.	How can islands of HTML be embedded in XML?
	If you want the XML parser to accept the HTML, then you have to either make it well-formed (i.e., use the XHTML serialization of HTML) or escape it (e.g., `&lt;p&gt;…`). Alternatively, you could rely on a more robust packaging infrastructure, but that's not quite in the spirit of embedding. Given that HTML has an XML serialization and the natural precondition for this problem is that you have an XML environment, it's pretty straightforward.
4.	How can islands of XML be embedded in HTML?
	If what you mean by embedding is the ability to insert XML directly into the document so that the parser will build the tree you'd expect from it, then you can't. HTML's parsing rules are going to treat the XML as (invalid) HTML and make assumptions about namespaces and which elements are empty. If what you need is simply the ability to put some XML in your document where JavaScript can get at it, then there's the `script` element: `<html> <head> <title>Document with some XML Data</title> </head> <body> <p>This is an HTML paragraph.</p> <script type="application/xml"> <status xmlns="http://example.com/ns/status-markup"> <title>Some XML</title> <p xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" geo:lat="18.88579" geo:long="2.2948">Not an HTML paragraph about the <link href="http://en.wikipedia.org/wiki/Eiffel_tower">Eiffel Tower</link>.</p> </status> </script> <p>This is another HTML paragraph.</p> </body> </html>` That XML fragment is in the HTML, albeit with the markup escaped so you'll have to re-parse it with some XML API in the user agent to get at it. From the use case perspective, this is clearly an XML island embedded in HTML for some definition of “XML” and “embedded”. (Henri Sivonen put a live demo of the technique online.)
5.	Can XML be made more forgiving?
	I'm putting two related use cases under this umbrella. The first is about string concatenation. It observes that most HTML is produced by string concatenation and that HTML is fairly easy to produce this way. XML is considerably more difficult to produce this way, mostly because it's much less forgiving about errors. In fact, it's pretty much entirely unforgiving about errors of any sort: markup errors, unexpected Unicode characters in various contexts, etc. It's possible to imagine a reformulation of XML that is much more forgiving about errors. Anne van Kesteren outlined one such approach in XML5. The second asks much the same question but about `application/xhtml+xml` content in particular. Even if XML remained unchanged for the general case, could we imagine a future where users were encouraged to author in XML by making the specific case of `application/xhtml+xml` less draconian? Because this deals with only a specific document type, we might imagine front ends that do cleanup for that content type and only that content type.
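
To make the fourth use case concrete, the `script` technique can be sketched outside a browser as well. This is an illustration under stated assumptions rather than what a user agent actually does: it uses Python's standard library in place of the browser's JavaScript and XML APIs, the page is a trimmed version of the example above, and `XmlIslandFinder` and `extract_islands` are hypothetical names. The point it shows is the two-step process: the HTML side hands over the island as plain text, and a separate XML parse turns that text into a tree.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A trimmed version of the page from use case 4.
PAGE = """<html>
<head><title>Document with some XML Data</title></head>
<body>
<p>This is an HTML paragraph.</p>
<script type="application/xml">
<status xmlns="http://example.com/ns/status-markup">
  <title>Some XML</title>
</status>
</script>
<p>This is another HTML paragraph.</p>
</body>
</html>"""


class XmlIslandFinder(HTMLParser):
    """Collect the raw text of every <script type="application/xml"> element."""

    def __init__(self):
        super().__init__()
        self.islands = []
        self._buffer = None  # None while we are outside an XML island

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/xml":
            self._buffer = []

    def handle_data(self, data):
        # Script content arrives as plain text; it is never parsed as HTML.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.islands.append("".join(self._buffer))
            self._buffer = None


def extract_islands(html_text):
    """Return one parsed XML root element per island found in html_text."""
    finder = XmlIslandFinder()
    finder.feed(html_text)
    finder.close()
    # Second step: re-parse each island with a real XML parser.
    return [ET.fromstring(text.strip()) for text in finder.islands]


islands = extract_islands(PAGE)
root = islands[0]
ns = {"s": "http://example.com/ns/status-markup"}
print(root.tag)
print(root.find("s:title", ns).text)
```

Because the second step is an ordinary XML parse, the island keeps its namespaces and its own `title` element, neither of which would survive being parsed as HTML. The price is that a malformed island fails with a parse error instead of being silently repaired.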

There was another use case specifically about XForms, but it hasn't yet been fleshed out. I think there's some feeling that it's a specific example of the XML embedded in HTML use case. However, it's clearly more complex than a simple embedding. And on the other hand, the argument has been made that XForms is not a use case, it's a solution: the use case is more sophisticated forms processing.

Conclusions

So where does that leave us? More or less where we started, I'm afraid, though perhaps a little wiser. At least when the task force report comes out, there will be a document that describes some common use cases and how they might be approached using the technology as it stands.

I'm disappointed by the state of affairs that exists now, but the world would be a much different place in many more important ways if it could be bent arbitrarily to resolve my disappointments.

The HTML community is full of bright, energetic folks who believe they know the problem that needs to be solved and how to solve it. I don't think there's any evidence to suggest that they're choosing solutions from a position of ignorance.

The suggestion that HTML needs some form of distributed extensibility mechanism is dead.

I'm frankly astonished by the position that the HTML community has taken. The idea that any community too small or too weak to successfully lobby a standing committee in order to have their markup made part of the one, global all-encompassing markup standard for the world wide web is absurd to me.

Yes, I appreciate that extensibility diminishes interoperability. The fact that there are thousands of Perl modules out there means that the Perl I write may not be compatible with the Perl you have on your system, but by and large we muddle through without a central authority on what constitutes a valid Perl program.

I also appreciate that HTML has extensibility mechanisms and that for many kinds of information those mechanisms are sufficient. But I wouldn't be prepared to assert that they were sufficient for everyone: the chemists and horticulturists, astronomers and geographers, architects and historians, doctors and lawyers, that nowhere in the vast sea of specialists does there exist any important information that can't be represented without any new structure.

It's pretty clear what the immediate future holds, but I'm going to take an “I told you so” token and squirrel it away. The future is longer than the past.

Comments

What I'm trying to understand is why W3C put distributed extensibility at the core of its work for over a decade, and yet we end up in 2011 with a W3C HTML WG decision, posted by well-known HTML fan Sam Ruby, that found the case presented by W3C TAG (and others) for distributed extensibility in HTML was "weak" and supported by "virtually nothing" in terms of evidence or use cases.

If the W3C is back where it started after ~13 years of work, is being "a little wiser" an acceptable outcome for an organisation charged with leading the web to its full potential?

If it is, how is this new wisdom going to be applied going forward? If it isn't, why did this process continue for so long without remedial action, and how can the W3C avoid situations like this arising in the future? Either way, what can the W3C do now to mitigate the "social disaster" it has created, if anything?

ISTM the only thing that has changed is that it is now widely recognised that HTML will continue to develop along one path, and XML/RDF will continue to develop along another path. If one of those development paths wants to feature in the other development path, it should probably do so by integrating rather than attempting to overwrite. There is nothing blocking further development of XML/RDF technologies (i.e. the Semantic Web, capital S, capital W). That's one advantage the WHATWG did not have when getting started.

The best way to protest is to build something useful. Widely useful, if the aim is widespread adoption. In my view, the XML/RDF community should set up an XWHATWG and just get on with it. It won't be short of volunteers because the W3C director, all past/present members of W3C TAG, most of WAI-PF and a small but not insignificant number of web developers will almost certainly join.

TL;DR - arguments about the future of the Web will now be decided on merit via healthy competition, not by authority. And always have been anyway.

Fan of HTML? Guilty as charged. I'll also point out that I'm a fan of distributed extensibility. I also serve my weblog as well formed XHTML.

"The W3C and the WHATWG get along spectacularly badly"? I think that this statement misses the larger dynamic. HTML is being evolved by those that chose to participate.

The system ate my comment yesterday. Let's try again.

Even if it's in the darker moments only and with knowledge that it's inflammatory, calling Web browsers a "niche" is really an indication of the fundamental mindset problem that allowed XML to diverge from what was already on the Web (HTML) in the first place. Even though serious browser engines are few in number, they are what people experience the Web through. Treating them as a "niche" thing means treating the Web as a niche thing compared to, say, aerospace documentation workflows (XSL), enterprise integration message passing (SOAP) or multi-database dataset merging (RDF). That the Web shouldn't be a niche thing for a Web consortium should be obvious.

I guess limiting the "niche" talk to darker moments only is progress, but I had hoped we'd be over that already--even for darker moments.

"The idea that any community too small or too weak to successfully lobby a standing committee in order to have their markup made part of the one, global all-encompassing markup standard for the world wide web is absurd to me."

I'm pretty sure I know what you're getting at but something's gone wrong with the structure of this sentence. It's important enough that I think a rework would be worthwhile.

May I suggest a quote to historically anchor the origins of this problem? http://groups.google.com/group/alt.hypertext/browse_thread/thread/7824e490ea164c06/395f282a67a1916c?#395f282a67a1916c "[...] It's just a question of generating plain text or SGML (ugh! but standard) mark-up", TimBL, Aug 6 1991, in news:alt.hypertext. Would be nice to have things looking happier before the 20th anniversary of that post...

Re "niche implementors": while browsers aren't the Web (we also have search engines, CMS systems, and various other consumers), "niche" seems unfair and dismissive. I'd rather say "privileged implementors".

One thing you've not mentioned here is the rise of Javascript / JSON, which has eaten away at Web developers' enthusiasm for using XML in many contexts. How does .js/.json fit into this picture?

There's a danger (already instantiated) of talking past each other here with browsers as "niche". It's absolutely true that "there's a lot you can do with markup other than format it and display it for human eyes". It's also the case that the Web is much more than documents and viewers.

In particular it's worth noting that browsers are a lot more than just viewers these days. There's the Javascript/JSON danbri mentions, which (especially around Ajax) enables them to act as much more general purpose agents. When you look at that in the context of API systems like Facebook, there's a lot more going on. Having said that, and while Henri's right about the importance of browsers because of the number of people using them, browsers are still in essence just a front-end for one kind of connector to the Web, they aren't the Web.

So I do think you're right to suggest that a small number of implementers have a disproportionate influence.

The status quo has evolved from (literally) a very browser-oriented view of the Web, and now there's an oligopoly calling the shots. This is potentially a big problem because the development of specifications will favour this single view of the Web, and collaterally handicap development of different approaches.

Basically I think it's a huge mistake to base decisions around markup etc. on the assumption that the future Web will look just like today's Web when the one certainty is that it will change. In particular the decision on distributed extensibility was a really bad one which I'm sure will cause unnecessary problems further down the line (I have now joined the HTML WG, but alas didn't get around to it until it was too late on that issue).

While I haven't a clue what the future holds for XML, I don't think a fairly root-level fork with the HTML path is necessarily a bad thing. HTML might well now have over-specialised itself as a browser language and in a few years' time start looking anachronistic, another Gopher. Similarly I'm not sure things like loosening XML error handling would be a good move - seems less like playing to its strengths than (dare I say it) HTML envy. Dunno, my feeling would be to leave HTML to the user-facing end of pipelines, and do the machine-machine comms (where messy humans aren't likely to get in the way) with XML. And ne'er the twain shall meet.