HTML+XML at XML Prague
Here are the slides that I presented at XML Prague 2011.
Below are the slides that I presented at XML Prague this morning, with a few snippets of commentary before each one. I'm not sure how valuable they are, but… You may get more mileage out of just reading the draft report.
I don't have conclusions yet, so I'll ask for help instead.
- History
- Use Cases
- Conclusions
- Ask for your help
I don't claim to be speaking for anyone. I'll try not to say inflammatory things, but my biases may show.
Facts are facts. But any opinions expressed are the opinions only of myself and may or may not reflect the opinions of anybody else with whom I may or may not have discussed the issues at hand.
In particular, I do not claim that what follows is necessarily the consensus opinion of the HTML/XML Task Force.
In the beginning, there was SGML.

HTML came from SGML. Sort of. Although it was specified as an SGML application, it was never broadly implemented that way.

XML did really come from SGML. You can parse XML with an SGML parser if you fiddle the SGML declaration in the right way.

The next logical step was to combine HTML and XML.

That might have formed the basis of HTML5.

But it didn't.

Alternatively, we could have adjusted XML to better meet the HTML use cases.

Alas, the window of opportunity for this plan has passed. Maybe it never really existed.

What we have instead is divergent evolution. One of the ideas that arose at least indirectly out of the task force discussions was a proposal for MicroXML. I don't know what might come of that.

If you think XML and HTML are both important, this is…unfortunate.
Perhaps the biggest challenge that faces the W3C's technical work on the Web is the growing chasm between HTML and XML.
- TAG Issue-67: HTML and XML Divergence
  - Tag soup
  - Namespaces
  - Syntactic differences (quoted attribute values)
  - DOM differences (tbody insertion)
  - Distributed extensibility
Of these problems, the DOM differences and distributed extensibility are the most troubling. Here's a valid HTML5 document (I think) that's also a well-formed XML document.
No bonus points to this audience for guessing what the XML DOM looks like.

(Using the HTML5 Live DOM Viewer)
But what about the HTML5 DOM? As you can see, the HTML5 parser injects a required tbody element. So you can't get the same DOM even if you use polyglot markup.
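For the XML half of that comparison, here is a minimal sketch in Python. The document in `DOC` is a hypothetical stand-in, not the one from the slide, and `child_element_names` is a helper name made up for illustration. The Python standard library has no HTML5 tree builder, so only the XML side is shown; an HTML5 parser given the same markup would report a tbody between the table and the tr.

```python
import xml.dom.minidom

# Hypothetical stand-in document: well-formed XML, table without a tbody.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Table</title></head>'
    '<body><table><tr><td>cell</td></tr></table></body>'
    '</html>'
)


def child_element_names(xml_text, parent_name):
    """Return the names of the element children of the first <parent_name>."""
    dom = xml.dom.minidom.parseString(xml_text)
    parent = dom.getElementsByTagName(parent_name)[0]
    return [n.tagName for n in parent.childNodes
            if n.nodeType == n.ELEMENT_NODE]


# The XML parser builds exactly what the markup says: tr directly under table.
print(child_element_names(DOC, "table"))  # prints ['tr']
```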

The other big problem is distributed extensibility. The HTML5 WG has decided it won't have any. That has implications for groups, both inside and outside the W3C, that might want to create extensions.
In fairness, the distributed extensibility mechanism that XML provides, namespaces, is not well loved. Me, I like them just fine, but there always has to be one weirdo in the group.
It's also worth observing that there are in-principle objections to extensibility because it affects interoperability. I can see those arguments. I don't agree with them, but I can see them.
- In practice
  - SVG in W3C and HTML WG
  - RDFa in W3C not in HTML WG
  - FBML not in W3C
- In practice
  - Namespaces
- In principle
Against this history and background, the W3C Technical Architecture Group formed an HTML/XML Task Force.
We looked at several use cases, starting with processing HTML with an XML toolchain.
How can an XML toolchain be used to consume HTML?
- Author HTML5 with polyglot markup.
- Add an HTML5 parser to the front end of your toolchain.
  - Doesn't solve the pernicious “document.write” problem or other script-related problems.
  - But short of running a JavaScript engine on the content, nothing is likely to solve those problems.
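To make the first option concrete, here is a small sketch in Python of why authoring style matters to an XML toolchain. Both sample strings and the `parses_as_xml` helper are invented for illustration; the second sample is only a polyglot-style fragment, not a complete polyglot document.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the same content written two ways.
TAG_SOUP = "<p>Unquoted <a href=report.html>attribute<br>and unclosed elements"
POLYGLOT = ('<p>Quoted <a href="report.html">attribute</a><br/>'
            'and closed elements</p>')

print(parses_as_xml(TAG_SOUP))  # prints False
print(parses_as_xml(POLYGLOT))  # prints True
```

Anything that fails this check needs the second option: a real HTML5 parser in front of the XML tools.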
 
And its logical converse, processing XML with an HTML toolchain.
How can an HTML5 toolchain be used to consume XML?
- HTML5 tools won't be designed to deal with arbitrary element names in arbitrary namespaces.
- Transforming to HTML5 is probably the best route.
  - Even a partial transformation to remove namespaces, PIs, etc. might prove valuable.
- It's probably best not to encourage users to imagine this will be broadly successful.
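A partial transformation of that kind might look something like the following sketch. It is an assumption about what "remove namespaces, PIs, etc." could mean in practice, not something the task force specified; `strip_namespaces` and `SAMPLE` are names made up here. Note that dropping namespaces can make two differently qualified names collide, which this sketch does not guard against.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(xml_text):
    """Parse XML, drop namespace qualification from names, and reserialize."""
    # ElementTree's default parser already discards PIs and comments.
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Qualified names are stored as "{namespace-uri}local".
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
        for name in list(elem.attrib):
            if name.startswith("{"):
                elem.attrib[name.split("}", 1)[1]] = elem.attrib.pop(name)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input with a namespace and two processing instructions.
SAMPLE = (
    '<?xml-stylesheet href="a.css"?>'
    '<gpx xmlns="http://www.topografix.com/GPX/1/1"><?app hint?>'
    '<wpt lat="50.077484" lon="14.443800"><name>001</name></wpt>'
    '</gpx>'
)

print(strip_namespaces(SAMPLE))
```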
Embedding HTML in XML.
How can islands of HTML be embedded in XML?
- Use the XML serialization of HTML5.
- Escape the markup.
- Rely on more sophisticated multipart-message handling systems.
Regardless, some care may be necessary. How are the HTML islands going to be processed? By “clipping” them out and processing them with an HTML5 tool, or by passing the whole DOM to the tool?
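Here is a rough sketch of the "escape the markup" option combined with the "clipping" approach, in Python. The `item` and `body` element names, the `type` attribute, and both function names are hypothetical; the point is only that the island travels as text and comes back out byte for byte, ready to hand to an HTML5 tool.

```python
import xml.etree.ElementTree as ET


def wrap_html_island(html_text):
    """Embed HTML as escaped text inside a hypothetical <item> element."""
    item = ET.Element("item")
    body = ET.SubElement(item, "body", {"type": "html"})
    body.text = html_text  # the serializer escapes <, > and & for us
    return ET.tostring(item, encoding="unicode")


def clip_html_island(xml_text):
    """Clip the island back out as a string for an HTML5 tool to process."""
    return ET.fromstring(xml_text).find("body").text


# Deliberately not well-formed XML: bare ampersand, unclosed elements.
ISLAND = "<p>Tea & <b>biscuits<p>unclosed"

wrapped = wrap_html_island(ISLAND)
print(wrapped)
print(clip_html_island(wrapped) == ISLAND)  # prints True
```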
And its logical converse.
How can islands of XML be embedded in HTML?
- The HTML5 parser interprets unfamiliar markup as an error and corrects for it.
- Correction can include changing the order and nesting of elements.
- Practically speaking: you can't embed a “naked” island of XML in HTML5.
You can clothe it in script. Yeah, the name's a bit of a shame, but for legacy reasons…
Putting clothes on your XML
<script type="application/xml">
  <data>
    <title>Your XML</title>
    <gpx xmlns="http://www.topografix.com/GPX/1/1">
      <wpt lat="50.077484" lon="14.443800">
        <ele>200.08</ele>
        <time>2007-01-06T17:33:04Z</time>
        <name>001</name>
        <sym>Restaurant</sym>
      </wpt>
    </gpx>
  </data>
</script>
Note that the content of a script element is CDATA. All those things that look like elements are actually plain text, not elements.
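Getting the island back out means finding the script element, taking its text, and handing that text to an XML parser. A minimal sketch in Python, assuming the `application/xml` type used above; the `DataBlockExtractor` class, the `extract_xml_islands` function, and the shortened `PAGE` sample are all made up for illustration. Python's `html.parser` is not a full HTML5 parser, but it does treat script content as raw text, which is all this needs.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class DataBlockExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/xml">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a matching script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/xml":
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_xml_islands(html_text):
    """Return each XML island in the page as a parsed element tree root."""
    extractor = DataBlockExtractor()
    extractor.feed(html_text)
    extractor.close()
    return [ET.fromstring(block.strip()) for block in extractor.blocks]


# Hypothetical page wrapping a shortened version of the island.
PAGE = """<!DOCTYPE html>
<html><head><title>Island</title></head><body>
<script type="application/xml">
  <data>
    <title>Your XML</title>
    <gpx xmlns="http://www.topografix.com/GPX/1/1">
      <wpt lat="50.077484" lon="14.443800"><name>001</name></wpt>
    </gpx>
  </data>
</script>
</body></html>"""

islands = extract_xml_islands(PAGE)
print(islands[0].find("title").text)  # prints Your XML
```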
Belt and suspenders
  <data>
    <title>Your XML</title>
    <gpx xmlns="http://www.topografix.com/GPX/1/1">
      <wpt lat="50.077484" lon="14.443800">
        <ele>200.08</ele>
        <time>2007-01-06T17:33:04Z</time>
        <name>001</name>
        <sym>Restaurant</sym>
      </wpt>
    </gpx>
  </data>
Finally, it's important to note that just because HTML5 doesn't have distributed extensibility mechanisms doesn't mean that it doesn't have any extensibility. It has a bunch of extensibility mechanisms; maybe you should just use those.
Use the HTML5 extensibility mechanisms
<div class="data">
  <h1 class="title">Your XML</h1>
  <div class="gpx">
    <div class="wpt"
         data-lat="50.077484" data-lon="14.443800">
      <span class="ele">200.08</span>
      <span class="time">2007-01-06T17:33:04Z</span>
      <span class="name">001</span>
      <span class="sym">Restaurant</span>
    </div>
  </div>
</div>
One of the things that pushed HTML and XML apart is that XML's error handling is arguably inappropriate in a web context. We could try to fix those things.
How can XML be made easier to use?
- Guaranteeing that you'll get well-formed XML out of naive attempts to generate it with “print” statements is tricky.
- Rules could be devised for providing some degree of markup minimization/error correction in XML.
- It's possible to consider other simplifications as well, for namespaces, for example.
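The “print” statement problem from the first bullet is easy to reproduce. Here is a small sketch in Python; the function names and the sample value are invented for illustration. The naive version works until the data contains a character that XML treats as markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def naive_title(value):
    """Build markup by string concatenation, as a "print" statement would."""
    return "<title>" + value + "</title>"


def escaped_title(value):
    """Same concatenation, but with &, < and > escaped first."""
    return "<title>" + escape(value) + "</title>"


# Hypothetical value containing characters that XML treats as markup.
VALUE = "Belt & suspenders <draft>"

print(is_well_formed(naive_title(VALUE)))    # prints False
print(is_well_formed(escaped_title(VALUE)))  # prints True
```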
The Task Force talked about XForms but failed to craft a use case on which we could all agree.
- Is XForms a use case or a specific solution to the use case of better form controls?
- Is XForms different in some substantial way from the general “embedding XML in HTML” use case?
Help!
- Review the Task Force Report
- Talk to the communities you know about the use cases they have.
- Report use cases that you think are not met.
If you're as depressed as I am, remember that the future is longer than the past. Just because things are bad today doesn't mean they can't be made better in the future.

- Join the mailing list
- Read the draft report
Hope that was useful.
Comments
I'm not sure what your arrows mean. I would have expected the line between HTML and HTML5 to be solid rather than dotted, because HTML5 is designed to be (almost) fully backward-compatible with HTML.
Yeah, that's fair. I changed that back and forth several times while I was working on the graphics. In the end, I think I chose a dashed line because it feels like the (re)specification of the parsing algorithm makes it a less direct descendant.
But maybe it should have been solid.