If you weren't paying attention when XML was being designed, it may seem odd that it carries so much SGML heritage on its back. Surely, it could have been made even simpler if it didn't have to be compatible with SGML. And it really didn't, did it?
I enjoyed reading John Cowan’s poster for MicroXML at XML Prague. As I stood there, thinking about the constraints and eavesdropping on conversations about them, an idea began to form.
By dinner time the idea was firmly established. If we're thinking of creating an even-simpler-still angle bracket markup language in the spirit of XML, is backwards compatibility with XML really a requirement? Or is XML 1.0 just our SGML?
I'm going to go out on a limb and suggest that syntactic compatibility with XML is not a requirement. I believe the real constraint is that it must be possible to process these new documents with existing XML tools. Concretely, I think these new documents must have a data model that is compatible with the XDM. (I'd be happy to declare that their data model is the XDM and be done with it, but that's probably not necessary.)
This is the important bit, so I'll repeat it: I think this new language can be syntactically incompatible, but it must produce a data model that is completely compatible with the XDM. (n.b. this is not reflexive, it's ok if there are some XDMs that can't be represented in the new language.)
The overwhelming benefit of this approach is that all the existing validation, transformation, and other XML processing tools will “just work” with these new documents (after they're parsed). It's possible that we can develop simpler technologies for these new documents, but we don't have to start over with the whole stack.
This is like the question, “how do you process HTML5 with XML tools?” The answer is you stick a new parser on the front and then process the tree. A couple of the ideas I outline below require extensions to the XDM, but not big ones, and I think they're things that are already in sort of a gray area. XSLT and XQuery, for example, can already produce XDMs that no XML parser can produce.
If there's even a snowball's chance in hell of successfully deploying a successor to XML that solves substantially the same problems as XML, it had better be damned compelling. It better really be simpler. It better really shake off the complexities that chafe XML's detractors (and even some of its champions).
If the cost of simplification is only a new, presumably also simpler, parser at the front end [only!? —ed], I think that might be a cost that could be borne. The alternative, I fear, is going to be a language that's not substantially, obviously, compellingly simpler and consequently one that cannot succeed.
Here are some of the ideas rolling around in my head. I doubt they're all good ones, but at least a few of them “feel” pretty good. In case that's just angle bracket overload and sleep deprivation from Prague, I won't assert which ones. They're in an ordered list only for labelling purposes, not because there's anything intrinsic about the order.
Remove the restriction that a document can have only a single root node.
Drop the DOCTYPE. I'm not trying to make something compatible with HTML5, a point I'll come back to below, so I don't see any point in allowing the empty declaration.
Discard comments in favor of an <xml:comment> element. Discard processing instructions in favor of an <xml:pi target="name"> element (or <xml:processing-instruction>, if you really prefer). In theory, this extends the data model because these elements could have structure; to avoid that, we say that their value is their string-value irrespective of what they contain. (Authors get benefits anyway, since they'll nest properly.)
Note that these new elements don't have any impact on validation; they are a syntactic device. They become comments and processing instructions in the data model. They lose their “elementness” when parsed, so you can't/don't have to validate them.
I think I'd keep the XML declaration as it is. It's not a PI anyway, despite all appearances to the contrary. And it's useful for character encoding detection.
Allow attribute values to be repeated, so that lists can be represented without microparsing. Note that this isn't a departure from the XDM either. This element:
<phrase condition="secret" condition="expert">...</phrase>
has a single “condition” attribute whose value is a sequence of two strings, “secret” and “expert”.
… xml:base. It's hugely tempting to allow multiple ID values on a single element, but I'll have to look more closely at how the XDM deals with IDs before I'd be willing to commit to that. (I'm not short-changing xml:link [shouldn't you? --ed], but I think it can remain a separate specification.)
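To make the repeated-attribute idea above concrete, here is a minimal Python sketch of how a new parser might collect attributes. The function name parse_attributes and the regular expression are hypothetical, and the sketch assumes double-quoted values only; it is an illustration, not a proposal for the actual parsing rules.

```python
import re

# Hypothetical pattern: a name, "=", and a double-quoted value.
ATTR_RE = re.compile(r'([A-Za-z_][\w.:-]*)\s*=\s*"([^"]*)"')

def parse_attributes(start_tag):
    """Collect the attributes of a start tag.

    Every attribute maps to a list of strings, so a repeated
    attribute becomes a sequence instead of an error.
    """
    attrs = {}
    for name, value in ATTR_RE.findall(start_tag):
        attrs.setdefault(name, []).append(value)
    return attrs

attrs = parse_attributes('<phrase condition="secret" condition="expert">')
print(attrs)
```

A consumer that wants the old single-string behaviour could still join the list with spaces.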
Perhaps the hardest question: what to do about namespaces? One radical proposal is to do away with them, but I can't support that. I'm prepared to be persuaded by any number of simplification proposals, but I'll start by outlining my own.
I think the biggest problems with namespaces are the fact that they use a silly pseudo-attribute syntax and they nest. You wind up with declarations scattered willy-nilly across documents and every element has to carry a potentially different set of in-scope namespaces. You can never really be sure when you're looking at an element that you know what's in scope without scanning all its ancestors.
So let's fix that. Introduce namespaces, globally, only at the top of a document, with element syntax (in the XDM, these declarations appear on (all the) root elements):
<xml:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/"/>
<xml:ns prefix="xlink" uri="http://www.w3.org/1999/xlink"/>
<xml:ns prefix="" uri="http://docbook.org/ns/docbook"/>
<book>...</book>
In theory, this complicates some use cases, but I've used namespaces a lot and I've rarely taken advantage of the ability to redeclare prefixes part way through a document and I doubt I've ever been in a situation where I needed to do that.
In some “cut-and-paste” scenarios, it may be necessary to do a little fixup, but I'm not confident that those scenarios arise often enough to justify the cost. Plus the cost doesn't really fix the problem. Grabbing elements in a text editor and pasting them into a new document doesn't magically carry over their in-scope namespaces. And if you've got a tool smart enough to carry them over, can't it be smart enough to fix up the declarations at the top?
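One consequence of global, top-of-document declarations is that prefix resolution no longer needs an ancestor walk: a single table serves the whole document. A rough Python sketch, assuming the xml:ns syntax shown above with prefix before uri (read_namespaces and expand are hypothetical names):

```python
import re

# Hypothetical pattern for one top-level declaration, attributes in the
# order used in the example above (prefix first, then uri).
NS_DECL_RE = re.compile(r'\s*<xml:ns\s+prefix="([^"]*)"\s+uri="([^"]*)"\s*/>')

def read_namespaces(document):
    """Read leading xml:ns declarations.

    Returns (bindings, rest), where bindings maps prefix -> URI and
    rest is the document text after the declarations.
    """
    bindings = {}
    pos = 0
    while True:
        match = NS_DECL_RE.match(document, pos)
        if match is None:
            break
        bindings[match.group(1)] = match.group(2)
        pos = match.end()
    return bindings, document[pos:].lstrip()

def expand(qname, bindings):
    """Resolve a possibly prefixed name to (namespace URI, local name)."""
    prefix, _, local = qname.rpartition(":")
    if prefix == "":
        # Unprefixed names use the default binding, if there is one.
        return (bindings.get(""), local)
    return (bindings[prefix], local)  # KeyError for an undeclared prefix

sample = (
    '<xml:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/"/>\n'
    '<xml:ns prefix="" uri="http://docbook.org/ns/docbook"/>\n'
    '<book>...</book>'
)
bindings, rest = read_namespaces(sample)
print(expand("dc:title", bindings), expand("book", bindings))
```

Because the table never changes after the declarations, every element can share it.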
(Some folks would like to replace URIs with something more like Java package names. I think that ship has sailed, and I don't really agree anyway. I like URIs.)
Introduce a lexical syntax for expanded names that doesn't require a prefix. We could use what the XPath folks are thinking of for expressions, "namespace-uri":local-name. You can use that form for element and attribute names, literally:
You must use the same form for the start tag and the end tag. I'd be tempted to go a step further and reuse or introduce a “markup start character” for the purpose, but I'm not sure it's necessary.
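As an illustration of the "namespace-uri":local-name form, here is a small Python sketch that splits such a name into its parts. The function name, the local-name pattern, and the sample name are assumptions made for the example; the real name grammar would be whatever the language specifies.

```python
import re

# Hypothetical grammar: a double-quoted URI, a colon, then a local name.
EXPANDED_NAME_RE = re.compile(r'"([^"]*)":([A-Za-z_][\w.-]*)')

def parse_expanded_name(text):
    """Split '"namespace-uri":local-name' into (uri, local_name)."""
    match = EXPANDED_NAME_RE.fullmatch(text)
    if match is None:
        raise ValueError("not an expanded name: %r" % text)
    return match.group(1), match.group(2)

name = parse_expanded_name('"http://docbook.org/ns/docbook":para')
print(name)
```

The quoted URI makes the split unambiguous even though URIs themselves contain colons.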
Introduce some sort of error correction. This is a slippery slope and it's not clear how far down we can go without losing our footing entirely. Allowing users to omit the quote marks around attribute values that don't contain spaces seems easy (and it will make at least one user very happy ☺).
We could also allow “&” and “<” to be their literal selves if they're not followed by a name character.
We could say that a closing tag closes any open tags necessary to make the tree balanced.
I'm not sure it's possible to go much further. Any error correction algorithm has to be consistent and schema-independent, so a lot of HTML-style fixup isn't possible. (The HTML parser knows a lot about the elements it sees; the same isn't true for XML in general. I don't think I'd want to introduce two different flavors of error-correction, depending on whether or not the schema is known.)
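Two of the corrections above are schema-independent and easy to sketch in Python. The function names are hypothetical. Treating “#” after “&” and “/”, “!”, “?” after “<” as markup starters is my assumption, not something the text specifies.

```python
import re

def normalize_literals(text):
    """Escape '&' and '<' that cannot start a reference or a tag.

    A character followed by a name-start character (or '#' for '&',
    or '/', '!', '?' for '<') is left alone; anything else is
    treated as a literal and escaped.
    """
    text = re.sub(r'&(?![A-Za-z_:#])', '&amp;', text)
    text = re.sub(r'<(?![A-Za-z_:/!?])', '&lt;', text)
    return text

def close_to(open_elements, name):
    """Handle an end tag by closing every element opened after `name`.

    `open_elements` is the stack of open element names (innermost
    last) and is modified in place. Returns the names that were
    closed, innermost first.
    """
    if name not in open_elements:
        raise ValueError("no open element named %r" % name)
    closed = []
    while open_elements:
        top = open_elements.pop()
        closed.append(top)
        if top == name:
            break
    return closed

print(normalize_literals("a & b < c"))
stack = ["doc", "para", "emph"]
print(close_to(stack, "para"), stack)
```

Neither function needs to know anything about the vocabulary, which is the property the paragraph above asks for.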
Allow text content and attribute values to contain any sequence of Unicode characters, including NUL. If we feel really uncomfortable with the fact that this makes encoding detection harder, we could say that any document that uses control characters must assert it in the XML declaration and must assert the encoding: “
This would open the door to the possibility of adding explicit support for text and binary content. We could introduce “<xml:text>” and “<xml:binary>”. Each has a boundary attribute. The boundary is arbitrary but must occur immediately before the closing tag. The boundary is not part of the content but provides an extensible mechanism for assuring that the boundary can always be found. (And that the string “</xml:text>” can occur inside a text block.)
This has the added advantage of removing the need for CDATA sections:
<xml:text content-type="text/plain"><random>In XML 1.0 this might
have been a CDATA section & “]]>” would
not have been allowed.</xml:text>
I didn't need to specify a boundary, but I could have:
<xml:text content-type="text/plain" boundary="EOT"><random>In XML 1.0 this might
have been a CDATA section & “]]>” would
not have been allowed.EOT</xml:text>
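The boundary rule is simple enough to sketch: the parser looks for the boundary string immediately followed by the closing tag and takes everything before it verbatim. This Python sketch uses a hypothetical read_text_block function and assumes the start tag contains no “>” inside attribute values.

```python
import re

START_RE = re.compile(r'<xml:text\b([^>]*)>')
BOUNDARY_RE = re.compile(r'\bboundary="([^"]*)"')

def read_text_block(source):
    """Return the raw content of an xml:text element at the start of source.

    The content ends at the first occurrence of the boundary string
    (empty if no boundary attribute is given) immediately followed
    by the closing tag. Nothing inside the content is interpreted.
    """
    start = START_RE.match(source)
    if start is None:
        raise ValueError("source does not begin with <xml:text>")
    found = BOUNDARY_RE.search(start.group(1))
    boundary = found.group(1) if found else ""
    terminator = boundary + "</xml:text>"
    end = source.find(terminator, start.end())
    if end == -1:
        raise ValueError("terminator %r not found" % terminator)
    return source[start.end():end]

block = '<xml:text boundary="EOT">a </xml:text> b ]]> & c.EOT</xml:text>'
print(read_text_block(block))
```

With a boundary, even the literal string “</xml:text>” can appear inside the content, as the sample shows.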
Of course, the danger in doing this sort of exercise is that engineers are good at thinking up clever features. Add enough clever features and it won't be simpler, just differently complex.
What about entities?
Short of grandfathering in all the MathML entity names, the HTML5 solution, nothing about what I've proposed here attempts to address the problem of declaring names for characters.
The reason that DTDs can define entities (and W3C XML Schema and RELAX NG cannot) is that entities require either the ability to leave unexpanded entities in the data model or some way to interact with the parser.
Some data models support unexpanded entity declarations (and start/end boundaries), but in practice very few tools do. The XDM doesn't, so I'm not going to try.
But what I'm outlining here is a language that needs a new parser, so we have more freedom. We could introduce an <xml:macro> facility, for example, but I'm not sure it's a
What about HTML5?
One of the motivations for simplifying XML is to make it more compatible with HTML5. I understand the appeal, but I'm not sure there's any long term benefit to be gained.
No parser for a simplified XML will ever be able to successfully parse the vast majority of strings that an HTML5 parser will accept. So if you need to read HTML5, you need an HTML5 parser.
If you want to write XML documents that can be parsed by an HTML5 parser, you already can: there's even a spec for that. If you're willing to live within the constraints of XML, you're already 90% of the way there. I can easily imagine a syntax checker/converter that reads “ordinary” XML and, where possible, makes it “polyglot”. Or an editing mode that restricts you to polyglot constructs.
Having a simplified XML won't help address the problems caused by the fact that the HTML5 parser infers structure that no XML parser will. Just making it simplified doesn't make it the same as HTML.
Having a simplified XML won't help address the problem of embedding islands of XML in HTML. You just can't do that without wrapping the XML in a <script>, at which point it's all just CDATA anyway.
No, I'm slowly coming [being dragged, kicking and screaming, would be more accurate —ed] to the realization that users get pissed off about some kinds of complexity and not others. When XML was developed, making it easy to parse was a definite goal. It was something the desperate Perl hacker should be able to cook up quickly. That was a visceral reaction to SGML, which was so difficult to parse that it's possible only a small handful of fully conformant SGML parsers ever existed.
I have on more than one occasion muttered something unkind when someone [Tim —ed] harped on about quoted attribute values. Surely, I'd moan, that's not the important bit! But maybe it is. Or at least maybe it's way more annoying to more people than I imagine. I'll grant that it's the kind of complexity that seems arbitrary rather than necessary, which may be what's annoying about it.
HTML has taught users that markup doesn't have to be perfect. You can be a little bit sloppy and it comes out ok. The XML community can process those documents by putting an HTML5 parser on the front end. We can produce those documents by writing our XML according to the polyglot pattern.
Some (perhaps many) HTML users will never need more than HTML offers. But some will want to author or process documents with a richer structure than HTML, or to use XML tools like XSLT and XProc that require XML authoring. We can reduce the barrier to entry for these folks by making an XML-datamodel-compatible but slightly-more-forgiving language.
The goal would be that such documents are parsed and then treated just exactly like they'd come from their XML equivalents. So the overall cost to the XML ecosystem is quite small.
Is this worth doing? Does the proverbial snowball have a chance?
I dunno, but I think I feel better about this direction than I do about trying to create a simple-but-syntactically-compatible subset of XML. It's interesting to compare my current thinking with three years ago.