Searching text/plain documents with a fragid

Volume 19, Issue 19; 30 Sep 2016

It’s very handy to be able to include portions of a text/plain resource with an RFC 5147 fragment identifier, but sometimes it would be even more handy if the start and end could be located dynamically.

I was chatting with some folks at XML Summer School this year when the topic turned to XInclude. I described how RFC 5147 fragment identifiers can be used to extract portions of a text/plain document. Nic remarked that it sounded useful, but that it would be even more useful if you could search for lines to include.

That did sound useful! And so on the long flight home, I took a stab at designing and implementing a fragment identifier for that purpose. Here’s what I have so far (in a lazy pseudo-EBNF):

search      = "search=" startSearch? ("," endSearch?)? (";" searchOpt?)?
startSearch = searchExpr (";" startOpt?)?
endSearch   = searchExpr (";" endOpt?)?
searchExpr  = ([0-9]+)? (.) (.*?) \2
startOpt    = "from" | "after" | "trim"
endOpt      = "to" | "before" | "trim"
searchOpt   = "strip" | RFC 5147 integrity checks

The core of the syntax is the searchExpr. A search expression is an optional number, followed by any quote character, followed by a string delimited by a second occurrence of the quote character. The number allows you to find a specific occurrence of the string.

The expression “3/abcde/” finds the third line that contains the string “abcde”. So do “3#abcde#” and “3xabcdex”. If you leave the occurrence number out, it defaults to 1: “/marker text/” finds the first line that contains the string “marker text”.
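As a rough illustration of how a search expression might be taken apart, here is a minimal Python sketch. It is not the XML Calabash implementation; the function names parse_search_expr and find_line are hypothetical, and it assumes a "match" means the line contains the string as a plain substring.

```python
import re

# Mirrors the pseudo-EBNF production: ([0-9]+)? (.) (.*?) \2
SEARCH_EXPR = re.compile(r"([0-9]+)?(.)(.*?)\2")


def parse_search_expr(expr):
    """Split an expression such as '3/abcde/' into (occurrence, text)."""
    m = SEARCH_EXPR.fullmatch(expr)
    if m is None:
        raise ValueError(f"not a search expression: {expr!r}")
    # A missing occurrence number defaults to 1.
    occurrence = int(m.group(1)) if m.group(1) else 1
    return occurrence, m.group(3)


def find_line(lines, expr):
    """Return the index of the Nth line containing the text, or None."""
    occurrence, text = parse_search_expr(expr)
    seen = 0
    for index, line in enumerate(lines):
        if text in line:  # assumption: plain substring test
            seen += 1
            if seen == occurrence:
                return index
    return None
```

Because the quote character is simply whatever follows the optional number, “3/abcde/”, “3#abcde#”, and “3xabcdex” all parse to the same pair.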

If you don’t specify a start expression, inclusion starts at the beginning of the file. If you don’t specify an end expression, all of the file after the starting match is included. It’s an error if the starting expression is specified and it never matches.

After that, it’s just a matter of a few useful options. On search expressions, “from” and “to”, the default values, specify that the matched line is included. The values “after” and “before” specify that the matched line is not included. The value “trim” specifies not only that the matched line is not included, but also that any leading (in the case of start) or trailing (in the case of end) lines that consist entirely of whitespace are trimmed away.
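The start and end options can be sketched in Python along these lines. This is only a sketch under stated assumptions, not the actual implementation: the names nth_match and select_lines are hypothetical, the end expression is assumed to be searched for only after the starting match, and an end expression that never matches is assumed to mean "through the end of the file" (the text above only says that a start expression that never matches is an error).

```python
def nth_match(lines, text, occurrence=1, begin=0):
    """Index of the Nth line at or after `begin` that contains `text`."""
    seen = 0
    for index in range(begin, len(lines)):
        if text in lines[index]:
            seen += 1
            if seen == occurrence:
                return index
    return None


def select_lines(lines, start=None, end=None, start_opt="from", end_opt="to"):
    """`start` and `end` are (text, occurrence) pairs, or None if omitted."""
    first = 0        # no start expression: begin at the top of the file
    search_from = 0
    if start is not None:
        hit = nth_match(lines, start[0], start[1])
        if hit is None:
            raise ValueError("start expression never matched")
        # "from" keeps the matched line; "after" and "trim" drop it.
        first = hit if start_opt == "from" else hit + 1
        search_from = hit + 1  # assumption: end is sought after the start match

    last = len(lines)  # exclusive; no end expression: run to the end
    if end is not None:
        hit = nth_match(lines, end[0], end[1], begin=search_from)
        if hit is not None:  # assumption: an unmatched end is not an error
            # "to" keeps the matched line; "before" and "trim" drop it.
            last = hit + 1 if end_opt == "to" else hit

    selected = lines[first:last]
    if start_opt == "trim":  # drop leading whitespace-only lines
        while selected and not selected[0].strip():
            selected.pop(0)
    if end_opt == "trim":    # drop trailing whitespace-only lines
        while selected and not selected[-1].strip():
            selected.pop()
    return selected
```

With marker comments on their own lines, “trim” on both ends returns just the code between the markers, without the blank lines that tend to sit next to them.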

The top-level search option “strip” specifies that whitespace stripping should be performed at the start of each included line: the smallest indent among the included lines is determined, and that number of whitespace characters is removed from the beginning of each line. The other top-level search options are the RFC 5147 integrity check options.
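A minimal Python sketch of the “strip” behaviour might look like the following. The name strip_indent is hypothetical, and one detail is an assumption the description does not settle: whitespace-only lines are ignored when working out the smallest indent, so a blank line in the middle of a snippet does not force the indent to zero.

```python
def strip_indent(lines):
    """Remove the smallest common indent from the start of each line."""
    # Assumption: whitespace-only lines do not count toward the minimum.
    indents = [len(line) - len(line.lstrip()) for line in lines if line.strip()]
    smallest = min(indents, default=0)
    # Slicing is safe for lines shorter than the indent (they become empty).
    return [line[smallest:] for line in lines]
```

For example, a block that is uniformly indented by four spaces comes out flush left, while lines indented further keep their extra indentation relative to the rest.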

That’s it. I implemented it in the XInclude step in XML Calabash (version 1.1.12). To use it, specify parse="text" and fragid="search=…".

I most often XInclude text files in order to incorporate code snippets into documentation. With this fragment identifier scheme, I can include ranges based on parts of the code, or if it’s more convenient, insert markers into the code in comments. The trimming and stripping options avoid spurious whitespace caused by the markers or the way the code is indented.

What about regular expressions? Yes, the search expression could be a regular expression. But that would be more complicated, more difficult to read, and it doesn’t seem like it would provide much additional value. This isn’t about pointing into utterly random, rapidly fluctuating texts. It’s about pulling lines you’ve chosen from a file you know that probably doesn’t change that dramatically or that frequently.

In any event, comments and suggestions most humbly solicited.

Caveat

I did all of this on a plane with no internet. Maybe it’s all been done before. If this is a wheel reinvented, I’ll just scrap it and implement the other wheel. But it kept me awake for a few hours, so that’s a win.

Should this really be a fragment identifier?

Fifteen years ago, I thought I understood how the open web was going to be built. It was going to follow the “small pieces, loosely connected” model. I spent a lot of time thinking about (and discussing) where various functionalities should be specified. We had URIs to identify resources, HTTP to interact with them, and media types (with fragment identifiers) to describe representations of those resources.

This was very flexible, but not always completely straightforward. It isn’t a perfect abstraction and sometimes simple pragmatism and the quest for a kind of platonic ideal didn’t align very well. But that web was about sharing information, publishing resources, in a distributed, declarative way across a global network. It was important that different communities could invent new formats, which would need new identifier schemes, and that those formats could be used in composition with other tools.

Today’s web is about building JavaScript applications that run in a web browser. It doesn’t need any of those things. The only important formats are HTML and JSON. (Yes, yes, and image formats and video formats and CSS and a few other things, but by and large those stand off on the side.) There are some vestigial uses of fragment identifiers and media types, but not because the modern web would have chosen to do things that way. There’s no particular interest in supporting new formats in a declarative way.

I suppose if textual inclusion was something that you were going to define for the web today, you’d do it as a web component:

<x-text-filter start="start here" start-count="2"
               end="ending" trim-end strip>http://document/uri.txt</x-text-filter>

(I assume this is the sort of thing you can do with a web component.) The component would include a JavaScript implementation that you’d have to run to understand what it does.

If we assume, perhaps quite reasonably, that no one else is ever going to implement this fragment identifier scheme, we could argue that I shouldn’t bother with one. I could just as easily (more easily, even, to the extent that I wouldn’t have to parse a microsyntax in an attribute value) define my own norm:text-filter element and do that.

But for the moment, I’ve done it as a fragment identifier.

Comments

Use of 'lines' makes me twitchy Norm? Lines when? When laid out in my browser? In the source XML? in the presented html at the server? All a bit fluffy?

This is a scheme for dealing with text/plain files; we're talking about lines (delimited by newlines; Mac, Unix, or PC style) as returned in the text/plain document. Nothing to do with what's displayed in the browser.

See also https://tools.ietf.org/html/rfc5147#section-1.1

https://github.com/dret/I-D/blob/master/Published/text-fragment/ has the history of all versions, and as you can see, we even had regexes somewhere along the way. those of course had the big problem of picking the right regex dialect to use. in the end (as often) simplicity won and RFC 5147 ended up being just positions and ranges. if at any point there is enough momentum to extend RFC 5147, then nothing would keep such a revision to include more powerful fragment identifiers, such as the ones you have come up with.

in terms of fragment identification vs. structured syntax for something like fragment-based inclusion: the difference is that if i have fragment identifiers, i can pass around URI references saying "there's a mistake on this line: ...#line=42", and that will be self-describing because of the well-defined fragment identifier semantics. no such self-describing reference is possible if the semantics are bound to a processing model in some document type.