Explaining identifiers in XML

Volume 8, Issue 18; 11 Feb 2005; last modified 08 Oct 2010

Scenes from a possible future in which Norm tries to buy groceries and explain XML at the same time.

[Update: David Megginson pointed out that I may not have given enough context for this essay. The xml:id Candidate Recommendation draft has just been published, but a storm of controversy has erupted. It started in a couple of threads on the public xml:id comments list in February and eventually spilled over onto the TAG’s discussion list where a formal request has been made for the TAG to take up the issue. This morning, Elliotte Rusty Harold noted the CR draft's publication and voiced his support for “xmlid” as a way to resolve the controversy (he's not alone, though I don't yet detect a majority of support for that resolution). In any event, below, I ponder how that solution might play out…]

So I'm standing in the checkout line at the local supermarket and this guy walks up to me, “hey,” he says, “you're Norm, aren't you? We met once before and you were telling me about this markup stuff. I've been giving it a try and it seems pretty great, but I have a few questions, can you help me?”

“I can try. What would you like to know?”

“Well,” he says, “I've been writing this paper and it's almost entirely in English, but I managed to work the phrase ‘C'est la vie’ into it. Should I put markup around that to indicate that it's in French?”

“Yeah, that's probably a good idea.”

“Ok, how do I do that?”

“It's easy, you put an element around it, maybe ‘phrase’ or ‘span’ or whatever seems appropriate in your vocabulary, and then you set the XML colon lang attribute to ‘fr’ on that element.”
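For the curious, here is roughly what that looks like from a program's point of view. This is only a sketch using Python's standard ElementTree module; the ‘para’ and ‘phrase’ element names and the sample sentence are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical document; the element names are invented for this example.
DOC = """<para>As they say, <phrase xml:lang="fr">C'est la vie</phrase>.</para>"""

root = ET.fromstring(DOC)
phrase = root.find("phrase")

# ElementTree exposes namespaced attributes in {uri}local form.
lang = phrase.get(f"{{{XML_NS}}}lang")
print(lang, "->", phrase.text)
```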

“Cool. I've also got some poetry. Most of the paragraphs in my document get reformatted by the rendering application, but I want to make sure that the line breaks don't move in my poetry. Can I do that?”

“Well, that really should be part of the semantics of your ‘poem’ element or whatever you're using to mark up your poetry, but if you want to make sure every processor knows that white space is significant in your poems, you can set the XML colon space attribute to ‘preserve’ to do that.”
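A small sketch of how an application might honor that attribute, again in Python. The whitespace-collapsing behavior stands in for whatever reformatting a rendering application does, and the helper name `display_text` is my own invention. For brevity it only looks at the element itself, although the attribute is meant to apply to descendants as well.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical poem; the line break and indentation are significant.
POEM = """<poem xml:space="preserve">Roses are red,
  Violets are blue.</poem>"""


def display_text(elem):
    """Return the element's text, collapsing runs of white space
    unless the element itself carries xml:space="preserve"."""
    text = elem.text or ""
    if elem.get(f"{{{XML_NS}}}space") == "preserve":
        return text
    return " ".join(text.split())


poem = ET.fromstring(POEM)
para = ET.fromstring("<para>Roses are red,\n  Violets are blue.</para>")

print(repr(display_text(poem)))  # line break and indent kept
print(repr(display_text(para)))  # reflowed onto one line
```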

“Use XML colon space, ok I think I can remember that. It looks like the guy in front of you is having trouble finding his check book, can I ask you a couple more questions?”

“Sure.”

“Thanks. I've been learning to use XInclude. Why did it take five years to finish that spec?”

“I don't want to talk about it.”

“Oh. Sorry. That wasn't really my question anyway, I was just curious. What I really wanted to say was that I've noticed that when XInclude merges files together, it puts an XML colon base attribute on the parts it pulls in. That seems to indicate what the path to the original document was, is that right?”

“Basically, yes.”

“Neat. So can I use that myself? One part of my paper has a whole bunch of pictures and it's really tedious to type the great big long URI for each one of them. Can I put my own XML colon base attribute in there and then just use relative URIs for all the graphics?”

“Yep, that ought to work as long as the XML colon base is on some element that's an ancestor of all the image elements. And as long as it makes sense for that to be the base URI for all the relative URIs under it.”
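Here is a minimal sketch of that resolution, assuming a made-up ‘paper’ document with ‘image’ elements that carry an ‘href’ attribute, and example.org URIs that exist only for illustration. It walks the tree, carries the base URI in scope down to each element, and resolves every relative reference against it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_BASE = f"{{{XML_NS}}}base"

# Hypothetical document: element names, attribute names, and URIs are invented.
DOC = """<paper>
  <section xml:base="http://example.org/a/very/long/path/figures/">
    <image href="fig1.png"/>
    <image href="detail/fig2.png"/>
  </section>
  <image href="cover.png"/>
</paper>"""


def resolve_images(elem, base, found=None):
    """Collect absolute image URIs, honoring xml:base on ancestors."""
    if found is None:
        found = []
    # An xml:base value may itself be relative to the base already in scope.
    base = urljoin(base, elem.get(XML_BASE, ""))
    if elem.tag == "image":
        found.append(urljoin(base, elem.get("href")))
    for child in elem:
        resolve_images(child, base, found)
    return found


# The second argument plays the role of the URI the document was loaded from.
urls = resolve_images(ET.fromstring(DOC), "http://example.org/paper.xml")
for url in urls:
    print(url)
```

The image outside the section has no xml:base ancestor, so it resolves against the document's own URI instead.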

“Sweet. Ok, one last question. I want to put my documents on the web and I'd like folks to be able to point into them with anchors like they can with HTML. Can I do that?”

“Well, there's a sort of technically correct answer to that question and a practically correct answer. The technically correct answer is ‘no’ because we're still waiting for a couple of specs to get finished. Right now there's no official fragment identifier syntax for XML documents; that part after the hash is the fragment identifier, but don't fret too much about that right now. It seems pretty clear that this is all going to work itself out and you'll be able to use ‘hash ID’ to point to the element with that ID in your document. So, yeah, you can, just put IDs on the things you want to be able to point to.”

“I'm not using a DTD or anything, so I don't really have any IDs. How do I do that?”

“That's what the XML ID attribute is for. Put an XML ID on each element you want to be able to point to and make sure that it has a unique value in your document. Oh, and that the value is a ‘name’ in the XML sense, without a colon.”
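Those two rules (unique values, each one a name without a colon) are easy to check mechanically. The sketch below is an assumption-laden illustration, not anything from the spec: `index_ids` is a hypothetical helper, the name check is an ASCII-only approximation of the real production, and the attribute name is a parameter, since its exact spelling is the point of the rest of this story. The sample uses the spelling from the current CR draft.

```python
import re
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# ASCII-only approximation of a colon-free XML name (an NCName).
# The real production allows many more Unicode characters.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def index_ids(root, attr):
    """Map each ID value to its element, rejecting bad or duplicate IDs."""
    ids = {}
    for elem in root.iter():
        value = elem.get(attr)
        if value is None:
            continue
        if not NCNAME.match(value):
            raise ValueError(f"not a colon-free name: {value!r}")
        if value in ids:
            raise ValueError(f"duplicate ID: {value!r}")
        ids[value] = elem
    return ids


# Hypothetical document; element names are invented for this example.
DOC = """<paper>
  <section xml:id="intro"><title>Introduction</title></section>
  <section xml:id="results"><title>Results</title></section>
</paper>"""

ids = index_ids(ET.fromstring(DOC), f"{{{XML_NS}}}id")
target = ids["results"]  # the element a "#results" fragment would point at
print(target.find("title").text)
```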

“You said ‘XML ID’ but you meant ‘XML colon ID’, right?”

“No, that one doesn't have a colon.”

“What? Why not?”

“That's a long story. Basically, there was a bug in another spec and everyone decided that it was too expensive to fix the bug because there was so much deployed software that used it. Plus a lot of it was security-related software and changing that stuff is really hard.”

“But that seems really confusing. Wouldn't it have made more sense to fix the bug and keep everything consistent? Besides, isn't that like the story Michael Sperberg-McQueen was telling me the other day about the ‘creat’ system call in Unix? They figured out just about as soon as they started using it that they ought to have spelled it correctly, ‘create’, but they decided there was too much legacy software that relied on it. There were a grand total of six machines in the world running Unix at the time. There's never going to be less legacy than there is today. And isn't this a horrible precedent to set?”

“What can I say,” I said with a shrug, “that's what the XML community decided they wanted to do.”

“But that's so confusing. I thought I could use any unqualified names I wanted. I know they suck, but I thought we were supposed to use namespaces to identify things.”

“Well, yeah, but it turns out that the W3C grabbed all the names and all the prefixes that begin with ‘x’, ‘m’, ‘l’ in any case, so the name ‘xmlid’ was never actually available for you to use.”
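That reservation is simple enough to state in code. A tiny sketch, with a helper name (`is_reserved`) of my own choosing:

```python
def is_reserved(name):
    """True if a name or prefix begins with x, m, l in any mix of case."""
    return name[:3].lower() == "xml"


for candidate in ["xmlid", "XMLbase", "xml:lang", "id", "phrase"]:
    print(candidate, is_reserved(candidate))
```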

“Ok, but this is so confusing, does that mean I can use XML lang, XML space, and XML base, without the colons, if I want?”

“No, those are reserved too, but they don't mean anything. So you have to use the colon for those.”

“This sucks. All because there was a bug in another spec? Was it really a bug, or was it maybe not a bug?”

“Everyone pretty much seems to agree that it was a bug. Anyway, wanna hear the best part? Now, if we agree some other attribute should apply to all of XML, we can put a colon in it if it ‘inherits’ but not if it doesn't. So you'll just have to remember which is which. Or maybe we'll never be able to use the XML namespace again, who knows?”

“But wait, XML colon base doesn't really inherit. Why is that ok?”

“Oh, it's not. So the other spec is still potentially broken in some subtle ways.”

“But in that case, you didn't really gain anything by using XML ID without the colon, did you? I mean, it's all inconsistent and confusing now, but the software that has the bug still has a bug and has to be fixed anyway, right?”

“As far as I can see, yeah. Maybe I'm wrong though.”

“Don't you guys care about users at all?”

“I do my best, pal, that's all I can say.”

The cashier interrupted us at that point, “will that be paper or plastic?” he asked, and we parted company. Last I saw, he was walking back towards the aisle where they keep the aspirin. Me, I'm going next door to the liquor store.

Comments

For the benefit of non-W3C-insiders, what are you trying to tell us here Norm? The CR that was just released uses xml:id, not xmlid, but I'm guessing that's not what we're going to see in the REC after all.

The W3C needs to get its act together when it comes to evolving their various specs. I had a rant pending about the fact that another addition was being made to the XML namespace without a consideration for the downstream breakage, but it seems the W3C has come to their senses. Instead of whining about this, the W3C crowd should figure out what their versioning and extensibility story is going to be instead of pretending they can ignore backwards compatibility when deploying specs.

Thanks Norm for writing this funny story. Exactly my feeling. C14N is already broken for xml:base (since it does not inherit), so let's fix it.

The longer the spell, the more uncertain the results.

Hi Michael,

The W3C Quality Assurance Working Group is working on that. It's indeed very important not only to produce a good technical specification, but also to spend time on the analysis of normative references: how each referenced specification will affect the specification the WG is producing, through its technical requirements or its own conformance model.

The Good Practice is “Do systematic reviews of normative references and their implications.”

http://www.w3.org/TR/qaframe-spec/#ref-define-practice

If such analysis was made by WGs when they create specifications, that would be a way of controlling the risks of incompatible dependencies.

But we also have to accept that the quality process (publication rules, requirements of interoperability and implementations, QA, etc.) adds a bit of time to the whole thing. Some W3C Members think that there is not enough process, some think that there is too much. It's always a question of reaching consensus.

Producing a good specification takes time at many levels. It's not an easy task.

I'd like to think that xml:id is solving a real problem and that not having xml:id (or xmlid) would leave that problem unsolved (and therefore be a bad thing) but I'm having a real problem constructing any currently working examples of breakage.

Currently I serve a lot of xml files, and I use a lot of fragment identifiers pointing into them.
http://www.w3.org/TR/MathML2/chapter1.xml#id.1.2.2
for example.

That file has an id attribute of id.1.2.2 that is of type ID, but it would have made no difference if it were of type CDATA; the link would still have worked. I always style the served XML with XSL (see your recent essay on CSS:-) and if you do that, the identifiers that matter to the fragment syntax are (whatever any spec says) the identifiers in the generated (X)HTML, not the identifiers in the XML file. Being able to reliably refer to ids in the XML file would be useful for some machine-to-machine translation scenarios and semantic web type statements, but in the use case in your "story", linking into human-oriented documents, they just don't seem to be relevant at all.

That said, the current storm over xml:id just seems really strange. Surely it is clear that the sets of names referred to by a namespace are all the same (names matching NCName), so all this talk of adding names to namespaces is bogus. It seems that the people who wanted the original namespace spec to say that a namespace was defined by (and uniquely associated with) a schema will just never accept that it does not say that, and we have the 3-namespaces for html debate every other year.