I was chatting with some folks at XML Summer School this year when the topic turned to XInclude. I described how fragment identifiers can be used to extract portions of a text document. Someone remarked that it sounded useful, but that it would be even more useful if you could search for lines to include.
That did sound useful! And so on the long flight home, I took a stab at designing and implementing a fragment identifier for that purpose. Here’s what I have so far (in a lazy, pseudo EBNF):
search      = "search=" startSearch? ("," endSearch?)? (";" searchOpt?)?
startSearch = searchExpr (";" startOpt?)?
endSearch   = searchExpr (";" endOpt?)?
searchExpr  = ([0-9]+)? (.) (.*?) \2
startOpt    = "from" | "after" | "trim"
endOpt      = "to" | "before" | "trim"
searchOpt   = "strip" | RFC 5147 integrity checks
The core of the syntax is the searchExpr. A search expression is an optional number, followed by any quote character, followed by a string delimited by a second occurrence of the quote character. The number allows you to find a specific occurrence of the string. The expression “3/abcde/” finds the third line that contains the string “abcde”. So does “3xabcdex”. If you leave the occurrence number out, it defaults to 1: “/marker text/” finds the first line that contains the string “marker text”.
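To make the search expression concrete, here is a minimal Python sketch of how one might be parsed and matched. It is not the XML Calabash code; the names parse_search_expr and find_line are made up for illustration, and the regular expression is simply the searchExpr production from the grammar above written without the spaces.

```python
import re

# Optional occurrence count, any single quote character, then the search
# text up to a second occurrence of that same quote character.
SEARCH_EXPR = re.compile(r"([0-9]+)?(.)(.*?)\2")


def parse_search_expr(expr):
    """Return (occurrence, text) for an expression such as '3/abcde/'."""
    match = SEARCH_EXPR.fullmatch(expr)
    if match is None:
        raise ValueError(f"not a search expression: {expr!r}")
    # The occurrence number defaults to 1 when it is left out.
    occurrence = int(match.group(1)) if match.group(1) else 1
    return occurrence, match.group(3)


def find_line(lines, expr):
    """Return the index of the matching line, or None if there is none."""
    occurrence, text = parse_search_expr(expr)
    seen = 0
    for index, line in enumerate(lines):
        if text in line:  # plain substring test, not a regular expression
            seen += 1
            if seen == occurrence:
                return index
    return None
```

With this sketch, parse_search_expr("3/abcde/") and parse_search_expr("3xabcdex") both yield the pair (3, "abcde"), which is the equivalence described above.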
If you don’t specify a start expression, inclusion starts at the beginning of the file. If you don’t specify an end expression, all of the file after the starting match is included. It’s an error if the starting expression is specified and it never matches.
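The defaulting rules can be sketched in a few lines of Python. This is an illustration rather than the actual implementation: select_range and nth_match are hypothetical names, and two behaviours here are my assumptions because the description above doesn’t settle them. The end expression is searched for only after the starting match, and an end expression that never matches includes everything to the end of the file.

```python
def nth_match(lines, text, occurrence=1, first=0):
    """Index of the nth line at or after `first` that contains `text`."""
    seen = 0
    for index in range(first, len(lines)):
        if text in lines[index]:
            seen += 1
            if seen == occurrence:
                return index
    return None


def select_range(lines, start=None, end=None):
    """Select lines; `start` and `end` are (occurrence, text) pairs or None."""
    if start is None:
        # No start expression: inclusion starts at the beginning of the file.
        first = 0
        search_from = 0
    else:
        first = nth_match(lines, start[1], start[0])
        if first is None:
            # A start expression that never matches is an error.
            raise LookupError("start expression never matched")
        # Assumption: the end search begins on the line after the start match.
        search_from = first + 1

    last = len(lines) - 1  # no end expression: include the rest of the file
    if end is not None:
        found = nth_match(lines, end[1], end[0], search_from)
        # Assumption: an unmatched end expression also runs to end of file.
        if found is not None:
            last = found
    return lines[first:last + 1]
```

Both matched lines are included here, which corresponds to the default options described next.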
After that, it’s just a matter of a few useful options. On search expressions, “from” and “to”, the default values, specify that the matched line is included. The values “after” and “before” specify that the matched line is not included. The value “trim” specifies not only that the matched line is not included, but that any leading (in the case of start) or trailing (in the case of end) lines that consist entirely of whitespace are trimmed away.
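Here is a rough Python sketch of how those options might be applied once the start and end lines have been found. The function name apply_options and its parameters are hypothetical; first and last are assumed to be the indexes of the matched start and end lines.

```python
def apply_options(lines, first, last, start_opt="from", end_opt="to"):
    """Return the selected lines, honouring the start and end options."""
    if start_opt not in ("from", "after", "trim"):
        raise ValueError(f"unknown start option: {start_opt!r}")
    if end_opt not in ("to", "before", "trim"):
        raise ValueError(f"unknown end option: {end_opt!r}")

    # "from" and "to" keep the matched line; every other value drops it.
    low = first if start_opt == "from" else first + 1
    high = last if end_opt == "to" else last - 1
    selected = lines[low:high + 1]

    # "trim" also removes whitespace-only lines next to the dropped marker.
    if start_opt == "trim":
        while selected and not selected[0].strip():
            selected.pop(0)
    if end_opt == "trim":
        while selected and not selected[-1].strip():
            selected.pop()
    return selected
```

For a file whose first and last lines are markers separated from the code by blank lines, using “trim” on both ends leaves just the code.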
The top-level search option “strip” specifies that whitespace stripping should be performed on the start of each included line. The smallest indent value is determined and that number of whitespace characters is removed from the beginning of each line. The other top-level search options are the RFC 5147 integrity check options.
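The stripping step might look something like the following Python sketch. The name strip_indent is made up, lines are assumed to carry no line terminators, and two details are assumptions on my part: blank lines are ignored when working out the smallest indent, and a tab counts as a single whitespace character.

```python
def strip_indent(lines):
    """Remove the smallest common indent from the start of each line."""
    # Measure leading whitespace on every line that has some content.
    indents = [len(line) - len(line.lstrip()) for line in lines if line.strip()]
    if not indents:
        return list(lines)
    smallest = min(indents)
    # Drop that many characters from each line; whitespace-only lines
    # shorter than the indent simply become empty.
    return [line[smallest:] for line in lines]
```

Relative indentation is preserved, so an included snippet keeps its shape but sits flush against the left margin.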
That’s it. I implemented it in the XInclude step in
XML Calabash (version 1.1.12). To use it, specify
I most often XInclude text files in order to incorporate code snippets into documentation. With this fragment identifier scheme, I can include ranges based on parts of the code, or if it’s more convenient, insert markers into the code in comments. The trimming and stripping options avoid spurious whitespace caused by the markers or the way the code is indented.
What about regular expressions? Yes, the search expression could be a regular expression. But that would be more complicated, more difficult to read, and it doesn’t seem like it would provide much additional value. This isn’t about pointing into utterly random, rapidly fluctuating texts. It’s about pulling lines you’ve chosen from a file you know that probably doesn’t change that dramatically or that frequently.
In any event, comments and suggestions most humbly solicited.
I did all of this on a plane with no internet. Maybe it’s all been done before. If this is a wheel reinvented, I’ll just scrap it and implement the other wheel. But it kept me awake for a few hours, so that’s a win.
Should this really be a fragment identifier?
Fifteen years ago, I thought I understood how the open web was going to be built. It was going to follow the “small pieces, loosely connected” model. I spent a lot of time thinking about (and discussing) where various functionalities should be specified. We had URIs to identify resources, HTTP to interact with them, and media types (with fragment identifiers) to describe representations of those resources.
This was very flexible, but not always completely straightforward. It isn’t a perfect abstraction and sometimes simple pragmatism and the quest for a kind of platonic ideal didn’t align very well. But that web was about sharing information, publishing resources, in a distributed, declarative way across a global network. It was important that different communities could invent new formats, which would need new identifier schemes, and that those formats could be used in composition with other tools.
I suppose if textual inclusion was something that you were going to define for the web today, you’d do it as a web component:
<x-text-filter start="start here" start-count="2" end="ending" trim-end strip>http://document/uri.txt</x-text-filter>
If we assume, perhaps quite reasonably, that no one else is ever going to implement this fragment identifier scheme, we could argue that I shouldn’t bother with one. I could just as easily (more easily, even, to the extent that I wouldn’t have to parse a microsyntax in an attribute value) define my own norm:text-filter element and do that.
But for the moment, I’ve done it as a fragment identifier.