Human Readable Resource Identifiers
Dealing with the things that you type that look mostly like URIs but aren't.
It's a curious thing about our industry: not only do we not learn from our mistakes, we also don't learn from our successes.
There are lots of places where we expect authors to type URI values. Left to their own devices, authors type these identifiers in a “human readable” form; that is, they may contain spaces, punctuation characters, non-ASCII text, etc.
Consider the current state of play in the XML specifications:
- Although we think of, and casually describe, XML system identifiers as URIs (or, more accurately, IRIs), both XML 1.0 and XML 1.1 describe system identifiers as strings “meant to be converted to URI reference(s)”. Converted, in this case, mostly means percent-encoding various characters not allowed in URIs.
  Historically, this was a necessary compromise with SGML where system identifiers are just strings that the, uhm, system can use to identify an entity. Given the intentionally open-ended definition of system identifiers in SGML, there were bound to be legacy identifiers that contained spaces and non-ASCII characters and all sorts of stuff.
  It was also done in recognition of the fact that human authors often use invalid characters in identifiers. Consider the number of HTML documents that have spaces in href attributes. Users are used to browsers doing the right thing and it was reasonable to make sure XML processors would do the same right thing.
- XLink 1.0 goes to considerable trouble to define special processing for xlink:href attributes. In this case, the analogy with href attributes in HTML is perfect.
- XML Base copies the XLink text for encoding and escaping the xml:base attribute value. Again, for the same reasons.
- XML Schema Part 2, in discussion of the lexical space of xsd:anyURI values, appeals directly to the XLink 1.0 text.
- XInclude uses a reference to the XML 1.1 processing to accomplish the same task for its href attribute.
(Those are just the specifications I could think of off the top of my head that make reference to this special processing for “human readable” resource identifiers; there may be others.)
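To make the "special processing" concrete, here is a minimal sketch of the kind of conversion these specifications describe. It is not the normative text of any of them. It assumes the disallowed set is ASCII control characters, space, a handful of delimiters, and all non-ASCII characters, roughly following the XML 1.0 description, and it leaves "#", "%", "[" and "]" alone. The names escape_identifier and DISALLOWED_ASCII are made up for illustration.

```python
# Assumed set of disallowed printable ASCII characters (an illustration,
# not copied from any one specification). "#", "%", "[" and "]" are
# deliberately absent, so fragments and existing escapes pass through.
DISALLOWED_ASCII = set('<>"{}|\\^` ')


def escape_identifier(text: str) -> str:
    """Percent-encode the characters assumed to be disallowed in a URI."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 or code > 0x7E or ch in DISALLOWED_ASCII:
            # Convert the character to UTF-8 and escape each byte as %HH.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_identifier("my file.xml"))  # my%20file.xml
print(escape_identifier("résumé.xml"))   # r%C3%A9sum%C3%A9.xml
```

Because "%" is left untouched, an identifier that has already been escaped comes out unchanged, so applying the function twice gives the same result as applying it once.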
Many of these documents were written before, or while, the IRI specification was being written. When it came time to consider, yet again, the same text in the context of XLink 1.1, after IRIs were defined, the fact that IRIs don't allow spaces meant we couldn't just excise it all; we would have to craft it again.
The fact that it's copied and referenced all over the place gave us pause. For one thing, it meant we had to be extra careful. For another, any sober reflection on the situation is bound to conclude that XLink is just the wrong place for this text.
Having specs totally unrelated to XLink pointing into it just for a standard description of how to deal with invalid, but entirely expected, characters in URI values doesn't make any sense.
What we decided to do instead was attempt to publish Human Readable Resource Identifiers (HRRIs) as an RFC.
The text is short and straightforward and will likely be of value outside of the XML context. And given that URIs and IRIs are defined by RFCs, that seems like the right place for this text.
The first Internet Draft of HRRIs has now been published.
Comments most welcome and appreciated, of course. The best place to send them is www-xml-linking-comments@w3.org.
Comments
Oh no, not another *R? syntax! Seriously, it's a great pity that these worthwhile extensions didn't get included in IRIs. I hope everybody has a really good think about whether there's anything else that needs to go in before this becomes an RFC.
Yes! In fact we have recently implemented a similar set of rules in our XForms engine. Now we would have a "spec" to go by. The use case has to do with the XForms submission construct, which can be used to send an HTTP request. This can be used to send some XQuery to eXist on the URI. The URI is provided in the action attribute of xforms:submission and it would be convenient to write something like:
action="/exist/rest/db/mycollection?_query=element count { count(/*) }"
This would not be a valid URI, but it is a valid HRRI which can be converted to a URI following the set of rules you and Richard proposed. Is this in line with the use cases you have in mind?
Alex