XInclude, xml:base, and validation

Volume 8, Issue 47; 01 Apr 2005; last modified 08 Oct 2010

It turns out that there's a nasty interaction between XInclude, xml:base, and validation. Update: I was wrong. The interaction is real, but it didn't go unnoticed.

Testing can show the presence of errors, but not their absence.

—E. Dijkstra

I doubt this is news; it's drifted across my radar a couple of times in the past month or two. It turns out there's a nasty interaction between XInclude, xml:base, and validation.

Consider the following documents:

Example 1. http://norman.walsh.name/2005/04/01/examples/enbook.xml


<?xml version="1.0" encoding="UTF-8"?>
<book xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng">
<title>My Book</title>
;
</book>

Example 2. http://norman.walsh.name/2005/04/01/examples/chapters/chap01.xml


<?xml version="1.0" encoding="UTF-8"?>
<chapter xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng">
<title>My Chapter</title>
<para>This is my chapter.</para>
<mediaobject>
<imageobject>
<imagedata fileref="picture.png"/>
</imageobject>
</mediaobject>
</chapter>

Is the book valid? Well, let's see. Entities are expanded by the parser, so that book is structurally equivalent to this document:


<?xml version="1.0" encoding="UTF-8"?>
<book xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng">
<title>My Book</title>
<chapter version="ipa">
<title>My Chapter</title>
<para>This is my chapter.</para>
<mediaobject>
<imageobject>
<imagedata fileref="picture.png"/>
</imageobject>
</mediaobject>
</chapter>
</book>

And that document is valid DocBook NG “IPA”. So the answer is yes.

Next, if I tell you that the fileref attribute is resolved against the current base URI, can you tell me the URI of that graphic? Did you get http://norman.walsh.name/2005/04/01/examples/chapters/picture.png? I knew you could. The point is that expanding an entity preserves the base URI of that entity.
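For readers who have not used external parsed entities, here is a minimal sketch of how a book can pull in the chapter through an entity. The entity name chap01 and the internal DTD subset are assumptions made for illustration; they are not necessarily the declarations used in Example 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical internal subset: "chap01" is an assumed entity name. -->
<!DOCTYPE book [
<!ENTITY chap01 SYSTEM "chapters/chap01.xml">
]>
<book xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng">
<title>My Book</title>
<!-- The parser replaces this reference with the content of chapters/chap01.xml. -->
&chap01;
</book>
```

Because the replacement text keeps the base URI of the entity it came from, picture.png still resolves against the chapters/ directory, and nothing has to be added to the tree to make that happen.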

Now let's drag this document into the twenty-first century.

Example 3. http://norman.walsh.name/2005/04/01/examples/xibook.xml


<?xml version="1.0" encoding="UTF-8"?>
<book xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng" xmlns:xi="http://www.w3.org/2001/XInclude">
<title>My Book</title>
<xi:include href="chapters/chap01.xml"/>
</book>

Is the book valid? No, because the DocBook NG schema doesn't allow XInclude elements. Oh, you meant after XInclude processing. (I'll resist a long rant about processing models by reference.) Well, let's see. After XInclude expansion, we'll get:


<?xml version="1.0" encoding="UTF-8"?>
<book xml:lang="en" version="ipa" xmlns="http://docbook.org/docbook-ng" xmlns:xi="http://www.w3.org/2001/XInclude">
<title>My Book</title>
<chapter version="ipa" xml:base="chapters/chap01.xml">
<title>My Chapter</title>
<para>This is my chapter.</para>
<mediaobject>
<imageobject>
<imagedata fileref="picture.png"/>
</imageobject>
</mediaobject>
</chapter>
</book>

So the answer is, “it depends”. Specifically, it depends on whether or not DocBook NG “IPA” allows xml:base (you did notice the extra xml:base attribute, didn't you?) to appear on chapter. It does, so the answer is yes.

This time, I'm sure you can tell me that the URI of that graphic is http://norman.walsh.name/2005/04/01/examples/chapters/picture.png, because the base URI is explicit.

The problem is that lots and lots of schemas out there, maybe some that you're responsible for, don't allow xml:base to appear anywhere. And XInclude is fundamentally incompatible with all those schemas in the presence of validation.

Ugh.

In the short term, I think there's only one answer: update your schemas to allow xml:base either (a) everywhere or (b) everywhere you want XInclude to be allowed. I urge you to put it everywhere, as your users are likely to want to do things you never imagined.
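As a minimal sketch of option (a), assuming the schema is written in RELAX NG (XML syntax), a shared pattern can carry xml:base and be referenced from every element. The pattern name common.attributes and the trimmed-down chapter definition are hypothetical; they are not taken from the DocBook NG schema.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://docbook.org/docbook-ng"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

  <!-- Hypothetical shared pattern: xml:base is optional wherever this is referenced. -->
  <define name="common.attributes">
    <optional>
      <attribute name="xml:base">
        <data type="anyURI"/>
      </attribute>
    </optional>
  </define>

  <start>
    <ref name="chapter"/>
  </start>

  <!-- Illustrative elements only; a real schema would reference the pattern from every element. -->
  <define name="chapter">
    <element name="chapter">
      <ref name="common.attributes"/>
      <element name="title">
        <ref name="common.attributes"/>
        <text/>
      </element>
    </element>
  </define>
</grammar>
```

Keeping the attribute in one named pattern means that a later decision to allow other xml: attributes everywhere is a one-line change rather than an edit to every element definition.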

Longer term, I've heard a couple of possible solutions. One is to change (W3C XML) schema validation so that attributes in the xml: namespace are silently allowed everywhere (just like attributes from the http://www.w3.org/2001/XMLSchema-instance namespace). That isn't a very attractive answer to me; I think it's a flaw in W3C XML Schema that I can't control where the schema instance attributes can occur. But it would allow you to use XInclude with all those schemas that you have no power to update.
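Absent such a change, a W3C XML Schema has to name the attribute explicitly. A minimal sketch, assuming you are able to edit the schema: import the declarations for the xml: namespace and reference xml:base from each complex type. The target namespace, type name, and content model below are hypothetical; the schemaLocation points at the schema that the W3C publishes for the xml: namespace.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.org/mydoc"
           targetNamespace="http://example.org/mydoc"
           elementFormDefault="qualified">

  <!-- Pull in the declarations for xml:base, xml:lang, and the other xml: attributes. -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>

  <!-- Hypothetical, trimmed-down chapter type. -->
  <xs:complexType name="chapterType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
    <!-- Attribute use is optional by default, so documents without xml:base stay valid. -->
    <xs:attribute ref="xml:base"/>
  </xs:complexType>

  <xs:element name="chapter" type="my:chapterType"/>
</xs:schema>
```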

Another possibility, though I haven't heard it suggested with any seriousness, would be to update XInclude so that it doesn't add xml:base attributes to the included document. The sad thing is, that attribute is a bit of a “belt and suspenders” approach to the base URI. The included document has a base URI property in the Infoset and that property has the correct value. We didn't have to add the attribute. Except maybe we did, because without the attribute, the correct base URI wouldn't survive serialization (as might occur, for example, if you packaged the document up and shipped it off to some web service in a SOAP envelope).

In any event, even if we could remove it, removing it would be backwards incompatible so it's awfully painful to do that. And it won't magically fix all the running code out there. And it will seriously inconvenience folks who need to ship documents around after XInclude expansion.

Updated, 04 Apr 2005.

I originally said, I think what pains me most about this situation is that XInclude was in development for just over five years. It went through eleven drafts (evolving at an average rate of just over 37 words a day, if you count all the words in all the public drafts), including three Candidate Recommendations.

I went on to ask why no one noticed until after XInclude became a Recommendation. And I was just wrong. The problem was noticed, and the decisions taken were deliberate. This is a good thing. It means the process didn't fail in a spectacular fashion. It's embarrassing to screw up in public, but the embarrassment at hand is insignificant (though much more acutely personal) compared to what I originally feared.

The informality of this medium allows me to write quickly, sometimes too quickly. I've said stupid things before, I'm bound to say them again, though this incident is likely to make me a little more careful.

In the original version of this essay, I went on to compare the length of XInclude (8,563 “words” according to my word counting script) with the length of the XSL/XML Query family of specifications (clocking in at 505,779, just over a half million). The point of my comparison is no longer relevant, but I'll leave the numbers.

P.S. I am still not kidding, though perhaps I wish I could say I had been.

Comments

I for one would certainly like xml:base, xml:id and xml:lang to be allowed everywhere. I do think that this should be restricted to known 'xml:foo' attributes and shouldn't apply to all of them by default.

Is the fundamental problem here one of treating xml:base as an attribute at all?

Should it not have been decided to treat xml:base specially, in the same way that namespace declarations are? That would have avoided the Infoset problem of possible contradiction between the xml:base attribute and the [base URI] as well as this validation problem.

Also, “...because without the attribute, the correct base URI wouldn't survive serialization...”. Hmm, that's not really serialization if it's throwing away some of the data. This is the mirror image argument for the xml:base attribute to be special: it should be inserted when serializing, in which case it shouldn't be present as an attribute in the Infoset.

I'm exploring the edge of what I understand here so I'd be pleased to see some discussion of this.

Are you sure nobody noticed this? I could swear I had users complaining to me about this before the spec went to REC, and I suspect the working group heard some of this too, though I'd have to search through the archives to make sure. OK. Found it. Peter McCracken raised this issue with the working group on June 12, 2003, and it was discussed. After this date, XInclude was sent back to working draft status for other reasons. So the problem is not that the working group didn't know about this issue. The problem (if any) is the choice they made. If they had decided differently, they could have easily changed this.

Personally, I don't consider this all that big a deal. Schema-validity is vastly overrated. I routinely add markup to my documents that is not accounted for by the schemas, and as long as you don't blindly throw away all invalid documents, everything pretty much works. I think the working group's decision here is the right one. If you really need validity and XInclude, then you need to update your schemas/DTDs to support xml:base everywhere. It's probably a good idea to do that anyway.

I stand corrected. It wasn't unknown. I'll have to think some more about what I think this means. Bill de hÓra questions the process. I'm not sure at the moment if it's the process, or my memory (or something else) that I think is the problem.

I agree with Eliotte, it's in no way new, and the issue had been raised and discussed. One of the changes (from memory) that we did end up with is that the xml:base is added only if necessary, i.e. if adding an xml:base is needed to get URI-Reference expansion right within the included subtrees. In practice it means that if the included resource is in the same directory/folder/whatever as the resource including it, adding the xml:base is not necessary, and for example libxml2 will not add it. This is a publishing limitation, but it makes it possible to cope with the validation problem until the various DTDs/Schemas have been updated with the extra xml:base.

Daniel

There is a general process problem here that bedevils most XML standards efforts. You get a group of volunteers together, and then put together what you can with limited resources. The problem is that so much of what you do is theoretical, because no-one is testing along the way.

Where I think Java got it so *right* was that conformance was always part of the standard, so there was always a reference implementation and a set of conformance tests. I think that is where all XML standards have to get to, if they are going to avoid this type of problem as a general consequence.

There *are* XInclude implementations, of course, but I'm not sure they are so widely used. I suspect XQuery 1.0 and XSLT/XPath 2.0 are getting much more user testing before going to recommendation (thanks to Mike Kay and others). So I don't think you can apply the XInclude statistics directly to XQuery/XSLT/XPath.

However, it *would* help if the W3C & OASIS got out of the paper standard business, and moved more to being in the "standard + reference implementation + conformance" business. It's harder to get resources for, tougher to get to the finish of, but makes for *much* better results. C++ showed us how wrong things could get with a paper standard, Java showed us a much better process. XML shouldn't ignore that example.

Cheers, Tony.

Daniel's recollection is incorrect. This was discussed but the original decision stood. XInclude processors are required to preserve the full base URIs of the elements they include. This means that any content included from a different document is required to have an xml:base attribute. It is not enough that the relative URIs resolve correctly. According to the spec, the full base URI must be preserved; not merely the URI of the parent directory. libxml is nonconformant in this respect; though I'm not sure if anyone will notice or care since the practical impact of this nonconformance seems small.