XML v.next

Volume 14, Issue 14; 04 Apr 2011

If you weren't paying attention when XML was being designed, it may seem odd that it carries so much SGML heritage on its back. Surely, it could have been made even simpler if it didn't have to be compatible with SGML. And it really didn't, did it?

I enjoyed reading John Cowan’s poster for MicroXML at XML Prague. As I stood there, thinking about the constraints and eavesdropping on conversations about them (the common sentiment, predictably, was that it was just right except for this one thing, where the one thing differed for almost everyone), an idea began to form.

By dinner time the idea was firmly established. If we're thinking of creating an even-simpler-still angle bracket markup language in the spirit of XML, is backwards compatibility with XML really a requirement? Or is XML 1.0 just our SGML?

The question of whether XML really had to be compatible with SGML is impossible to answer. What can be said for sure is that at the time, it seemed absolutely necessary. SGML was a large, successful system. XML was an alternative to SGML. If the SGML community didn't/couldn't/wouldn't migrate to XML, it felt like XML would be DOA.

I'm going to go out on a limb and suggest that syntactic compatibility with XML is not a requirement. I believe the real constraint is that it must be possible to process these new documents with existing XML tools. Concretely, I think these new documents must have a data model that is compatible with the XDM. (I'd be happy to declare that their data model is the XDM and be done with it, but that's probably not necessary.)

This is the important bit, so I'll repeat it: I think this new language can be syntactically incompatible, but it must produce a data model that is completely compatible with the XDM. (n.b. this is not reflexive, it's ok if there are some XDMs that can't be represented in the new language.)

The overwhelming benefit of this approach is that all the existing validation, transformation, and other XML processing tools will “just work” with these new documents (after they're parsed). It's possible that we can develop simpler technologies for these new documents, but we don't have to start over with the whole stack.

This is like the question, “how do you process HTML5 with XML tools?” The answer is you stick a new parser on the front and then process the tree. A couple of the ideas I outline below require extensions to the XDM, but not big ones, and I think they're things that are already in sort of a gray area. XSLT and XQuery, for example, can already produce XDMs that no XML parser can produce.

If there's even a snowball's chance in hell of successfully deploying a successor to XML that solves substantially the same problems as XML, it had better be damned compelling. It better really be simpler. It better really shake off the complexities that chafe XML's detractors (and even some of its champions).

MicroXML, as described by James and John and others, is not intended as a successor to XML. I think I understand the rationale for that point of view, but I don't think that's how most people are thinking about it. If there's something simpler, and it gets any adoption at all, that's what people will want to use. It will be a de facto successor, even if it isn't explicitly established as one.

If the cost of simplification is only a new, presumably also simpler, parser at the front end [only!? —ed], I think that might be a cost that could be borne. The alternative, I fear, is going to be a language that's not substantially, obviously, compellingly simpler and consequently one that cannot succeed.

Here are some of the ideas rolling around in my head. I doubt they're all good ones, but at least a few of them “feel” pretty good. In case that's just angle bracket overload and sleep deprivation from Prague, I won't assert which ones. They're in an ordered list only for labelling purposes, not because there's anything intrinsic about the order.

Remove the restriction that a document can have only a single root node.
Drop the DOCTYPE. I'm not trying to make something compatible with HTML5, a point I'll come back to below, so I don't see any point in allowing the empty declaration.
Discard comments in favor of an xml:comment element. Discard processing instructions in favor of an xml:pi target="name" element (or xml:processing-instruction, if you really prefer). In theory, this extends the data model because these elements could have structure; to avoid that, we say that their value is their string-value irrespective of what they contain. (Authors get benefits anyway, since they'll nest properly.)

Note that these new elements don't have any impact on validation, they are a syntactic device; they become comments and processing instructions in the data model. They lose their “elementness” when parsed, so you can't/don't have to validate them.

I think I'd keep the XML declaration as it is. It's not a PI anyway, despite all appearances to the contrary. And it's useful for character encoding detection.
Allow attribute values to be repeated, so that lists can be represented without microparsing. (Credit to Jason Hunter for this one.) Note that this isn't a departure from the XDM either. This element:
```
<phrase condition="secret" condition="expert">...</phrase>
```
has a single “condition” attribute whose value is a sequence of two strings, ("secret", "expert").
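A minimal sketch of how a parser for such a language might collect repeated attributes into sequences. The function name `parse_attributes` and the regular expression are illustrative assumptions, not part of any proposal, and the pattern only handles double-quoted values:

```python
import re
from typing import Dict, List

# Assumption: attributes are written name="value" with double quotes only.
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')

def parse_attributes(start_tag: str) -> Dict[str, List[str]]:
    """Map each attribute name to the list of its values, in document order."""
    attrs: Dict[str, List[str]] = {}
    for name, value in ATTR_RE.findall(start_tag):
        # A repeated name extends the sequence instead of raising an error.
        attrs.setdefault(name, []).append(value)
    return attrs

print(parse_attributes('<phrase condition="secret" condition="expert">'))
```

An attribute that appears once simply yields a one-item sequence, so consumers never need to special-case the repeated form.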
Support xml:id and xml:base. It's hugely tempting to allow multiple ID values on a single element, but I'll have to look more closely at how the XDM deals with IDs before I'd be willing to commit to that. (I'm not short-changing xml:link [shouldn't you? --ed], but I think it can remain a separate specification.)
Perhaps the hardest question: what to do about namespaces? One radical proposal is to do away with them, but I can't support that. I'm prepared to be persuaded by any number of simplification proposals, but I'll start by outlining my own.

I think the biggest problems with namespaces are the fact that they use a silly pseudo-attribute syntax and they nest. You wind up with declarations scattered willy nilly across documents and every element has to carry a potentially different set of in-scope namespaces. You can never really be sure when you're looking at an element that you know what's in scope without scanning all its ancestors.

So let's fix that. Introduce namespaces, globally, only at the top of a document, with element syntax (in the XDM, these declarations appear on (all the) root elements):
```
<xml:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/"/>
<xml:ns prefix="xlink" uri="http://www.w3.org/1999/xlink"/>
<xml:ns prefix="" uri="http://docbook.org/ns/docbook"/>
<book>...</book>
```
In theory, this complicates some use cases, but I've used namespaces a lot and I've rarely taken advantage of the ability to redeclare prefixes part way through a document and I doubt I've ever been in a situation where I needed to do that.

In some “cut-and-paste” scenarios, it may be necessary to do a little fixup, but I'm not confident that those scenarios arise often enough to justify the cost. Plus the cost doesn't really fix the problem. Grabbing elements in a text editor and pasting them into a new document doesn't magically carry over their in-scope namespaces. And if you've got a tool smart enough to carry them over, can't it be smart enough to fix up the declarations at the top?

(Some folks would like to replace URIs with something more like Java package names. I think that ship has sailed, and I don't really agree anyway. I like URIs.)
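A rough sketch of what document-global declarations buy a processor: one pass over the top of the document yields a single prefix map, and resolving an element name never requires walking ancestors. The helper names and the fixed `prefix`-then-`uri` attribute order are assumptions made for illustration:

```python
import re
from typing import Dict, Tuple

# Assumption: declarations look exactly like <xml:ns prefix="..." uri="..."/>.
NS_DECL_RE = re.compile(r'<xml:ns\s+prefix="([^"]*)"\s+uri="([^"]*)"\s*/>')

def read_global_namespaces(document: str) -> Dict[str, str]:
    """Build the single, document-wide prefix-to-URI map."""
    return {prefix: uri for prefix, uri in NS_DECL_RE.findall(document)}

def expand_element_name(qname: str, namespaces: Dict[str, str]) -> Tuple[str, str]:
    """Resolve an element's lexical name to (namespace URI, local name)."""
    prefix, _, local = qname.rpartition(":")
    if prefix == "":
        # Unprefixed elements take the default namespace, if one is declared.
        return namespaces.get("", ""), local
    if prefix not in namespaces:
        raise KeyError(f"undeclared namespace prefix: {prefix}")
    return namespaces[prefix], local

doc = (
    '<xml:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/"/>\n'
    '<xml:ns prefix="" uri="http://docbook.org/ns/docbook"/>\n'
    '<book>...</book>'
)
namespaces = read_global_namespaces(doc)
print(expand_element_name("book", namespaces))
print(expand_element_name("dc:title", namespaces))
```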
Introduce a lexical syntax for expanded names that doesn't require a prefix. We could use what the XPath folks are thinking of for expressions, "namespace-uri":local-name. You can use that form for elements and attribute names, literally:
```
<"http://docbook.org/ns/docbook":book>...
```
You must use the same form for the start tag and the end tag. I'd be tempted to go a step further and reuse or introduce a “markup start character” for the purpose, but I'm not sure it's necessary.
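To show that the prefix-free form is not hard to recognize, here is a small sketch that pulls the namespace URI and local name out of a start or end tag. The pattern is a guess at a plausible grammar (no escaping of quotes inside the URI, a simplified notion of name characters), not a specification:

```python
import re
from typing import Optional, Tuple

# Assumption: <"uri":local ...> for start tags and </"uri":local> for end tags.
EXPANDED_TAG_RE = re.compile(r'<(/?)"([^"]*)":([\w.-]+)')

def parse_expanded_tag(tag: str) -> Optional[Tuple[bool, str, str]]:
    """Return (is_end_tag, namespace_uri, local_name), or None for other tags."""
    match = EXPANDED_TAG_RE.match(tag)
    if match is None:
        return None
    return (match.group(1) == "/", match.group(2), match.group(3))

print(parse_expanded_tag('<"http://docbook.org/ns/docbook":book>'))
print(parse_expanded_tag('</"http://docbook.org/ns/docbook":book>'))
```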
Introduce some sort of error correction. This is a slippery slope and it's not clear how far down we can go without losing our footing entirely. Allowing users to omit the quote marks around attribute values that don't contain spaces seems easy (and it will make at least one user very happy ☺).

We could also allow “&” and “<” to be their literal selves if they're not followed by a name character.

We could say that a closing tag closes any open tags necessary to make the tree balanced.

I'm not sure it's possible to go much further. Any error correction algorithm has to be consistent and schema-independent so a lot of HTML-style fixup isn't possible. (The HTML parser knows a lot about the elements it sees, the same isn't true for XML in general; I don't think I'd want to introduce two different flavors of error-correction, depending on whether or not the schema is known.)
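The “closing tag closes any open tags” rule is schema-independent and can be sketched with nothing more than a stack of open element names. The function name is hypothetical, and ignoring an end tag that matches no open element is my assumption; the text above doesn't say what should happen in that case:

```python
from typing import List

def apply_end_tag(open_elements: List[str], name: str) -> List[str]:
    """Pop open elements until `name` is closed; return what was closed.

    `open_elements` is the stack of currently open element names,
    outermost first. It is modified in place.
    """
    if name not in open_elements:
        # Assumption: a stray end tag is ignored rather than treated as fatal.
        return []
    closed: List[str] = []
    while open_elements:
        top = open_elements.pop()
        closed.append(top)
        if top == name:
            break
    return closed

stack = ["doc", "para", "emph"]
print(apply_end_tag(stack, "para"))  # closes emph implicitly, then para
print(stack)
```

Because the rule looks only at element names, it behaves identically whether or not a schema is known, which is the consistency requirement described above.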
Allow text content and attribute values to contain any sequence of Unicode characters, including NUL. If we feel really uncomfortable with the fact that this makes encoding detection harder, we could say that any document that uses control characters must assert it in the XML declaration and must assert the encoding: “<?xml encoding="utf-8;binary"?>”.

This would open the door to the possibility of adding explicit support for text and binary content. We could introduce “xml:text” and “xml:binary”. Each has a content-type and a boundary attribute. The boundary is arbitrary but must occur immediately before the closing tag. The boundary is not part of the content but provides an extensible mechanism for assuring that the boundary can always be found. (And that the string “</xml:text>” can occur inside a text block.)

This has the added advantage of removing the need for “<![CDATA[”:
```
<xml:text content-type="text/plain"><random>In XML 1.0 this might
have been a CDATA section & “]]>” would
not have been allowed.</xml:text>
```
I didn't need to specify a boundary, but I could have:
```
<xml:text content-type="text/plain" boundary="EOT"><random>In XML 1.0 this might
have been a CDATA section & “]]>” would
not have been allowed.EOT</xml:text>
```
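A sketch of how a parser might find the end of such a block: it looks for the boundary string immediately followed by the closing tag and treats everything before that as raw content. The function name and signature are assumptions; `start` is taken to be the offset just past the start tag:

```python
from typing import Tuple

def read_text_block(source: str, start: int, boundary: str = "") -> Tuple[str, int]:
    """Return (content, offset just past the closing tag) for an xml:text block."""
    # The boundary must occur immediately before the closing tag.
    terminator = boundary + "</xml:text>"
    end = source.find(terminator, start)
    if end == -1:
        raise ValueError("unterminated xml:text block")
    # The boundary itself is not part of the content.
    return source[start:end], end + len(terminator)

src = "a </xml:text> bEOT</xml:text>rest"
content, next_offset = read_text_block(src, 0, "EOT")
print(repr(content))
print(repr(src[next_offset:]))
```

With an empty boundary the content cannot contain the closing tag itself; choosing a boundary is what lets “</xml:text>” appear inside the block.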

Of course, the danger in doing this sort of exercise is that engineers are good at thinking up clever features. Add enough clever features and it won't be simpler, just differently complex.

What about enties?

Short of grandfathering in all the MathML entity names, the HTML5 solution, nothing about what I've proposed here attempts to address the problem of declaring names for characters.

The reason that DTDs can define entities (and W3C XML Schema and RELAX NG cannot) is that entities require either the ability to leave unexpanded entities in the data model or some way to interact with the parser.

Some data models support unexpanded entity declarations (and start/end boundaries), but in practice very few tools do. The XDM doesn't, so I'm not going to try.

But what I'm outlining here is a language that needs a new parser, so we have more freedom. We could introduce an xml:macro facility, for example, but I'm not sure it's a good idea.

What about HTML5?

One of the motivations for simplifying XML is to make it more compatible with HTML5. I understand the appeal, but I'm not sure there's any long term benefit to be gained.

No parser for a simplified XML will ever be able to successfully parse the vast majority of strings that an HTML5 parser will accept. So if you need to read HTML5, you need an HTML5 parser.
If you want to write XML documents that can be parsed by an HTML5 parser, you already can: there's even a spec for that. If you're willing to live within the constraints of XML, you're already 90% of the way there. I can easily imagine a syntax checker/converter that reads “ordinary” XML and, where possible, makes it “polyglot”. Or an editing mode that restricts you to polyglot constructs.
Having a simplified XML won't help address the problems caused by the fact that the HTML5 parser infers structure that no XML parser will. Just making it simplified doesn't make it the same as HTML.
Having a simplified XML won't help address the problem of embedding islands of XML in HTML. You just can't do that without wrapping the XML in a script at which point it's all just CDATA anyway.

XML is just fine as it is. Yes, it has some warts and odd complexities but so do most things. Does anyone actually believe that the intersection of HTML, CSS, and JavaScript in the landscape of modern web browsers is intrinsically less complex than XML?

No, I'm slowly coming [being dragged, kicking and screaming, would be more accurate —ed] to the realization that users get pissed off about some kinds of complexity and not others. When XML was developed, making it easy to parse was a definite goal. It was something the desperate Perl hacker should be able to cook up quickly. That was a visceral reaction to SGML which was so difficult to parse, it's possible that only a small handful of fully conformant SGML parsers ever existed.

I have on more than one occasion muttered something unkind when someone [Tim —ed] harped on about quoted attribute values. Surely, I'd moan, that's not the important bit! But maybe it is. Or at least maybe it's way more annoying to more people than I imagine. I'll grant that it's the kind of complexity that seems arbitrary rather than necessary which may be what's annoying about it.

Conclusions

HTML has taught users that markup doesn't have to be perfect. You can be a little bit sloppy and it comes out ok. The XML community can process those documents by putting an HTML5 parser on the front end. We can produce those documents by writing our XML according to the polyglot pattern.

Some (perhaps many) HTML users will never need more than HTML offers. But some will want to author or process documents with a richer structure than HTML, or to use XML tools like XSLT and XProc that require XML authoring. We can reduce the barrier to entry for these folks by making an XML-datamodel-compatible but slightly-more-forgiving language.

The goal would be that such documents are parsed and then treated just exactly like they'd come from their XML equivalents. So the overall cost to the XML ecosystem is quite small.

Is this worth doing? Does the proverbial snowball have a chance?

I dunno, but I think I feel better about this direction than I do about trying to create a simple-but-syntactically-compatible subset of XML. It's interesting to compare my current thinking with three years ago.

Comments

Thank you very much for summarizing your ideas. It is indeed interesting to add functionalities and simplify in the same way. This is clearly what I'm expecting for the next version of XML. So, what's next? Shouldn't there now be a Work Group to fully specify this?

@Alain: before creating a Working Group, it would be better to just write code and experiment. Issues are better addressed when there is working code for testing the idea. Specifically interoperability issues with deployed content and interoperability issues between the new pieces of code.

I'm still on the fence over whether XMLvNext is needed, because as you say XML is just fine as it is. But I do like pretty much all your suggestions here.

The only one that makes me scratch my head is "namespace-uri":local-name - why not just the URI of namespace-uri/local-name or namespace-uri#local-name?

(Familiar? Confusion with Turtle/N3 might be a feature :)

My current feelings on this are in the same direction, but go a little further. So it’s a good question: what is the purpose of a slimmed-down XML other than scratching an engineering itch? I fear the moment for this kind of exercise passed years ago, as we now have a reasonable choice of XML tools for most environments -- which we'd likely end up using for microXML in most "real world" projects in any event.

There are also political/procedural hurdles to developing a new version of XML. Perhaps one way to do this would be to revise SGML to get an alternative profiling (assuming we view XML as the first profile of SGML). But even then I think getting support for this kind of activity would be tricky -- what is the "market justification"? A cleaned-up syntax, and easy parser authoring, might be nice; but these are not going to be enough to get sufficient traction, I suspect. I think a couple of things which could justify a new SGML/XML would be the goal of having something that could describe native HTML 5 markup, and/or which could deal with more complex documents by supporting, say, overlapping markup. The key would be to have something which had some clear value by clearly targeting a different problem space to XML, otherwise the question always is: why not just use XML while - okay - having to put up with a few of its foibles, but while getting the benefit of all of the great big XML tool ecosystem?

Danny: why not just the URI of namespace-uri/local-name or namespace-uri#local-name?

The XPath 3 convention that Norm is suggesting be copied allows for cases where the namespace-uri itself has a #, or is empty.

Your suggestions look fine, but I think they are somewhat lacking. I think people are not (only) complaining about the syntactic complexity needed to obtain an XDM instance. I think they are unhappy with the total number of moving parts/complexity in XML, and the XDM itself is part of the problem (too many node types, data types, etc.). What about just having elements and attributes, and just the fundamental types number, string, and date, and having all other validation/typing be orthogonal?

You comment that using attribute syntax for namespace declarations was "silly". Is it clear that using element syntax is better? Since the proposal is not syntactically compatible anyway why not have a new specific syntactic construct for this?

Same applies to comments. You note that using element syntax authors gain the feature that comments nest, however they lose (the often more useful) feature that you can comment out bad markup and so make the total document well formed. A new syntactic construct could be devised that would allow nesting but be able to contain non well formed content.

Martin: fair point. If we were really starting from scratch we could do it all differently. But I don't really think we can start from scratch and rebuild the entire stack.

David: the motivation for element syntax was to remove syntactic constructs. That in itself was a simplification. Commenting out bad markup is a good point, maybe xml:comment should work like xml:text and allow arbitrary content.

Mostly good stuff, I think. I'm still looking for the killer feature and core messaging for XML v.next, but maybe it's as simple as "the forgiving XML". Which leads me to think about

Along the lines of what David Carlisle was saying - it's not hard to parse comment syntax (maybe reduced to ""?), but it's a lot harder for authors to write "". But comments should nest, for sure.

So where do we go from here? I like Karl Dubost's idea that writing code is a good place to start. I happen to have a spare XML parser lying around that could be easily adapted - in fact, I'm pretty sure it'll be about 50% of the size with these modifications. I also have an XQuery implementation to plug the new parser into. Wanna give it a try?

David: the motivation for element syntax

Hmm, not convinced. From an XPath/XDM perspective the main complication of the current namespace syntax is that xmlns:x="jhg" looks like an attribute but isn't selected by @*; in the proposal that gets replaced by <xml:ns/>, which looks like an element but isn't matched by *. So I'm not sure that that is so much of a win, except that the namespace model itself is simplified by restricting declarations to the top level. But that restriction of course has knock-on complications in other parts of the stack, notably XInclude, which would presumably need a more aggressive namespace fixup, or live with xincluded documents not being able to be serialised to this syntax.

As the chief MicroXML pusher, I of course have a lot to say about this, so I'll try to keep it short. We have different purposes in mind, because you are trying to provide a mildly alternative syntax for the XDM, and I want to stay compatible with the syntax (data models come and go, but text is forever) and ditch the XDM/Infoset.

Consequently, to my mind the most important thing about MicroXML is its very simple data model: there are only elements, and elements have only three properties: a string name, a map mapping attribute names to their values, and a content sequence consisting of other elements and/or strings. I contend that that's what you need to do SGML-ish attributed trees, and all that you need. That's also my response to Alex Brown: XML provides too rich a model, and all new as well as old applications pay the cost of dealing with that model, if only to ignore most of it.
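The three-property model described here is small enough to write down directly. The following is only an illustrative sketch of that description (a name, an attribute map, a content sequence of elements and strings); the class and function names are assumptions and are not taken from any MicroXML draft:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Element:
    name: str                                            # a string name
    attributes: Dict[str, str] = field(default_factory=dict)
    content: List[Union["Element", str]] = field(default_factory=list)

def string_value(node: Union[Element, str]) -> str:
    """Concatenate all text beneath a node, in document order."""
    if isinstance(node, str):
        return node
    return "".join(string_value(child) for child in node.content)

para = Element("p", {"class": "x"}, ["Hello ", Element("em", content=["world"]), "!"])
print(string_value(para))
```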

Now a new model alone wouldn't require a new language definition. You can easily write a routine that takes SAX events and generates this sort of element tree, and people often do, just by discarding the rest of the cruft. But by stripping the syntax to represent that model and nothing else (with just a tiny bit more), we can provide simpler, smaller, faster parsers. That's unambiguously a Good Thing. The tiny bit more is comments, which are traditional, and a very limited sort of PIs (not in the model), which allow you to specify an inline stylesheet with xml-stylesheet or inline validation with xml-model.

Similarly, the thing that matters about HTML5 to a MicroXML world is not its unmatched syntactic complications, but the semi-arbitrary limitations on its valid subset. If you are interested in generating valid HTML5, you can generate well-formed MicroXML and then use ordinary schema validation to ensure you are using only HTML5-compliant elements and attributes. Allowing the dummy DOCTYPE is no different from disallowing empty <br> elements or newlines immediately after a <pre> element in this respect.

I won't go point by point, but I'll just say that in a world where 70% of the documents on the Web are in UTF-8 (or its ASCII subset) and rising, 20% in Windows-1252/Latin-1/Latin-9 and falling, and only 10% in all other encodings put together (with UTF-16 down below a tenth of a percent), character encoding looks to me like a feature that will soon be obsolete.

John Cowan said "data models come and go, but text is forever". I humbly disagree.

I'll point out that text itself has a data model. We won't understand any text 100 years from now unless we retain a knowledge of UTF-8 encoding and the Unicode character set. Similarly we won't understand any XML unless we understand the data model it represents.

I believe in data longevity, but I rather think that the data lasts only as long as the data model - and that the data model can indeed be subject to multiple representations.

What about <! as a comment start sequence? (Until EOL?) Then a <!DOCTYPE would be no problem (and the construct would be useful in many more places).

John Snelson: How do you think we read printed books or Roman inscriptions or the Rosetta Stone? They don't have a data model. I grant that if we forget Unicode (may my right hand lose its cunning, just saying), digital text will be just ones and zeros, but given Unicode I can't imagine anyone being unable to reconstruct some meaning for reasonably well-designed markup. (Not including MSO.) After all, the effective meaning of a text, any text, is in the change to the recipient's behavior, not the author's head.

Would be nice to hear you expand on the "entities" section (your typo = "enties").

Internal / external / parsed / unparsed. It would seem that XInclude would be the obvious replacement in some cases but not all (but perhaps external unparsed (ie: notation) might be better suited to xml:link or similar anyway).

Your remarks deserve more serious consideration than I can give them here. But I can't resist one topic. You write “Allowing users to omit the quote marks around attribute values that don't contain spaces seems easy”. Agreed, with the emphasis on seems.

The string “"hi>” doesn't have spaces. So it doesn't need to be quoted, right? And the string “foo bar=baz” doesn't have any blanks, either (that's a U+00A0 between the o and the b). So it's OK, right?

Seriously: It's not in fact hard to have simple rules allowing unquoted attributes that make sense. It's not trivial, though, in my experience; I've never seen anyone get a plausible, coherent formulation in less than four tries (including a group of people at the W3C Tech Plenary which included Michael Kay and some others of the smartest people I know personally).

SGML had simple rules for that purpose. But none of the HTML parsers actually followed those rules, apparently because they couldn't be bothered. Why ought we to expect a better or different fate for a new and different set of rules for saying when values must be quoted and when they don't have to be quoted?

How about dropping attributes entirely, is there a good reason why they can not be replaced by subelements?

Sorry for the tangential question but can someone give me an example of an XDM that an XML parser cannot produce? Thanks.

Lots of good stuff to consider. I'll probably write a follow-up essay in a day or two.

Jakub: There's no technical reason, but some things just "feel" like attributes. If you make them all subelements then you have to worry about order and the meaning of whitespace between them and how they interact with the rest of the content model. (If they're subelements in an element with mixed content, can text occur before them? Whitespace? Is it significant?)

stand: Any data model that consists of only text nodes, a data model which has a mixture of top-level text nodes and elements, a data model with two top-level elements, ...