text/plain, RFC 5147, XInclude, and XML Calabash

Volume 15, Issue 16; 21 Apr 2012

Useful non-conformance: on the intersection of text/plain documents, RFC 5147, XInclude, and XML Calabash.

I believe I implemented this back in October, but appear never to have written about it. It came in handy (yet again) while writing my Balisage paper, so here it is.

In technical documentation, you're often writing about plain text artifacts that have significant semantics: program listings, shell scripts, configuration files, etc.; in the XML world: stylesheets, schemas, etc. Cutting and pasting these files into your documentation is always a mistake: if you're documenting a real system, these files always change over time and keeping them up-to-date that way is a nightmare. Also, embedding them often makes it very difficult to check for errors.

Incorporating whole files as plain text is straightforward:

<xi:include href="examples/schema.rnc" parse="text"/>

If your authors can be persuaded to write some fairly awkward prose (“See lines 24-35 in Listing 3.12 on page 92.”), this is almost enough. But note that it's still a nightmare to maintain because lines 24-35 might be different after some bug is fixed.

What you really want is the ability to incorporate small sections of larger files in individual examples. (“The foobar function from interesting.cpp is shown in Listing 4.9; its salient features are...”)

If your build environment is managed by developers willing to work a little, it might be possible to set up the system so that the real files can be automatically broken into chunks that authors can incorporate with XInclude. But it's a lot of work and there are lots of folks working without such generous developers.

What you really want is a way for authors to point to a file and extract from it just the relevant lines, without involving already overworked developers. In an XML context, you want a fragment identifier for text/plain documents.

Enter RFC 5147. RFC 5147 provides just that, a fragment identifier scheme for plain text documents. It allows you to refer to both ranges of characters within a plain text document and ranges of lines. For extra bonus points, it even supports integrity check mechanisms for identifying cases when the file may have changed, making the fragment identifiers unreliable.
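To make the line scheme concrete, here is a minimal Python sketch of selecting a range of lines. It is not XML Calabash's implementation: the function name select_lines is hypothetical, only the line= scheme is handled, integrity checks are ignored, and it assumes the RFC's convention that positions count the gaps between lines starting at zero (so line=1,3 selects the second and third lines).

```python
def select_lines(text, fragment):
    """Return the lines addressed by an RFC 5147-style 'line=' fragment.

    Hypothetical helper: handles only 'line=' and ignores integrity checks.
    """
    scheme = fragment.split(";", 1)[0]           # drop ';length=...' and friends
    if not scheme.startswith("line="):
        raise ValueError("this sketch only handles the line= scheme")
    start_s, comma, end_s = scheme[len("line="):].partition(",")

    # Note: splitlines() also breaks on a few characters besides CR and LF.
    lines = text.splitlines(keepends=True)

    start = int(start_s) if start_s else 0       # 'line=,N' starts at the top
    if comma:
        end = int(end_s) if end_s else len(lines)  # 'line=N,' runs to the end
    else:
        end = start                              # bare position: empty selection
    return "".join(lines[start:end])


sample = "alpha\nbeta\ngamma\ndelta\n"
print(select_lines(sample, "line=1,3"), end="")  # prints the beta and gamma lines
```

Slicing between zero-based positions maps directly onto Python's list slicing, which is why the sketch needs no off-by-one adjustment.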

Great! So I can use it in XInclude today to...why are you shaking your head?

Here we run afoul of standardization. Folks, like myself, who write standards, struggle often with interoperability and the future. Interoperability, the point of standardization, encourages us to tighten every screw we can, to nail down every edge case we can think of so that different implementations will interpret the specification in the same, interoperable way. And yet, we never know the future. We can but gaze at cloudy crystal balls in an attempt to imagine how needs might change.

How does this relate to the problem at hand? Well, for (good, valid) reasons I won't go into now, XInclude separated the fragment identifier out of the URI into a separate attribute called xpointer. For good, valid reasons, XPointers are explicitly about addressing into XML, and, for good, valid reasons, XInclude says the xpointer attribute is only allowed when parsing XML.

All of which means we have no way of using the RFC 5147 fragment identifier scheme in XInclude.

Except, [expletive deleted, -ed] that!

In XML Calabash, I decided to allow the xpointer attribute to use RFC 5147 fragment identifiers when parse="text". That means in, for example, my Balisage paper, I can write things like this:

<xi:include href="examples/schema.rnc" parse="text" xpointer="text(line=20,44;length=1032)"/>

This extracts just a portion of the text file. XML Calabash only supports the “length” integrity constraint (the length of the whole file, in bytes), but that works fine. If I edit examples/schema.rnc then the length is likely to change and this XPointer will fail. (When this XPointer fails, my pipeline will fail, and I'll know that there's something I have to address: I won't silently get the wrong text on lines 20-44.)
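The fail-loudly behaviour can be sketched in a few lines of Python. This is an illustration only, not XML Calabash's code: check_length is a hypothetical helper, it takes the fragment without the text(...) wrapper, and it assumes the length value is the size of the whole file in bytes.

```python
def check_length(data: bytes, fragment: str) -> None:
    """Raise ValueError if a 'length=' integrity check does not match.

    Hypothetical helper; assumes 'length' is the byte size of the whole file.
    """
    for check in fragment.split(";")[1:]:        # parts after the line/char scheme
        name, _, value = check.partition("=")
        if name == "length":
            expected = int(value.split(",")[0])  # ignore an optional ',charset'
            if len(data) != expected:
                raise ValueError(
                    f"file is {len(data)} bytes, fragment expects {expected}"
                )


original = b"start = element doc { text }\n"
fragment = f"line=0,1;length={len(original)}"
check_length(original, fragment)                 # passes silently

edited = original + b"# a later edit\n"
try:
    check_length(edited, fragment)
except ValueError as err:
    print("integrity check failed:", err)
```

Raising an exception rather than returning the (possibly wrong) lines is the point: a stale reference stops the build instead of producing a misleading listing.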

As a nod to standards compliance (I do write standards after all), you must enable the xpointer-on-text extension for this to work. This extension makes XML Calabash non-conformant (in precisely the way you want it to be non-conformant).

As a further nod to standards compliance (I do write standards after all!), I've persuaded the XML Core Working Group at the W3C to include this issue in the XInclude 1.1 Requirements and Use Cases. Your feedback is encouraged.

Share and enjoy!

Comments

Thanks Norm. Makes lots of sense. Just a check: 'length=1032' is the count, in characters, of the selected content? What of line ending buggeration? Same issue as you mention? A hint to go check it?

My add-on would be 'put a marker on line 23' for the 'host' document to refer to that content. So that I can discuss the ... second for loop, not the first?

No, length is the length of the whole file, in bytes.

The other integrity check described by RFC 5147 is an MD5 checksum. But that's harder to compute and might require reading the file twice, so I don't (yet) support that one. There's some discussion of line endings in 5147, but I haven't really gone out of my way to investigate that. Bug reports welcome. I just assume that Java will read the lines correctly.

i think that makes a lot of sense, and it would be great to see XInclude changed to support text fragment identifiers. the question is whether an attribute called "xpointer" is the best place to put non-xpointer fragment identifiers. something like "fragment" or "fragid" might be more appropriate (but would introduce a breaking change). just in case you have it available, could you please point to the discussion about why to separate the fragment identifier in the first place?

I have abused XInclude and XPointer even further to bring in data from other sources on the fly:

<xi:include href="contact.py" xpointer="addresses.vcf?surname:Smith"
            parse="text" accept-language="en"/>

This is only used by my tools; there might be more flexible alternatives.

If contact.py is executable, it will be executed with two arguments, lang=en and xptr=addresses.vcf?surname:Smith. This in turn will generate a valid DocBook representation of the person tag in the correct language. In this case it will parse the vCard file and search for surname:Smith. Once that is found, it takes as much data as possible into the DocBook representation, which replaces the include in the original document. If it does not return valid DocBook, it currently dies with an exception. Once the DocBook is validated, the stylesheets are applied.

I have used it for other simple things that generate quite a lot of DocBook output, for example simple tables.

I don't see how this gets around this limitation: "[...]it's still a nightmare to maintain because lines 24-35 might be different after some bug is fixed."

Doesn't that just mean you will have to update the XInclude reference rather than updating the text content ("lines 24-35")? I don't see how that could be automated, so perhaps I'm missing something.