Dealing with the things that you type that look mostly like URIs but aren't.
It's a curious thing about our industry: not only do we not learn from our mistakes, we also don't learn from our successes.
There are lots of places where we expect authors to type URI values. Left to their own devices, authors type these identifiers in a “human readable” form; that is, they may contain spaces, punctuation characters, non-ASCII text, etc.
Consider the current state of play in the XML specifications:
Although we think of, and casually describe, XML system identifiers as URIs (or, more accurately, IRIs), both XML 1.0 and XML 1.1 describe system identifiers as strings “meant to be converted to URI reference(s)”. Converted, in this case, meaning mostly percent-encoding various characters not allowed in URIs.
Historically, this was a necessary compromise with SGML where system identifiers are just strings that the, uhm, system can use to identify an entity. Given the intentionally open-ended definition of system identifiers in SGML, there were bound to be legacy identifiers that contained spaces and non-ASCII characters and all sorts of stuff.
It was also done in recognition of the fact that human authors often use invalid characters in identifiers. Consider the number of HTML documents that have spaces in href attributes. Users are used to browsers doing the right thing and it was reasonable to make sure XML processors would do the same right thing.
(Those are just the specifications I could think of off the top of my head that make reference to this special processing for “human readable” resource identifiers; there may be others.)
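To make the "conversion" concrete, here is a minimal Python sketch of the kind of percent-encoding described above. It is not taken from any of the specifications: the function name `to_uri_reference` and the exact set of escaped characters (control characters, space, a handful of delimiter and "unwise" characters, and everything outside ASCII) are assumptions chosen for illustration. Each escaped character is written as its UTF-8 bytes in `%HH` form, and an existing `%` is left alone so that identifiers that are already escaped pass through unchanged.

```python
# Illustrative set of ASCII characters to escape (an assumption for this
# sketch, not a quotation from any specification).
_ESCAPED_ASCII = set('<>"{}|\\^`')


def to_uri_reference(identifier: str) -> str:
    """Percent-encode characters that a URI reference cannot carry as-is."""
    out = []
    for ch in identifier:
        code = ord(ch)
        # Control characters and space (<= 0x20), DEL and all non-ASCII
        # characters (>= 0x7F), plus the delimiters listed above.
        if code <= 0x20 or code >= 0x7F or ch in _ESCAPED_ASCII:
            # Escape every byte of the character's UTF-8 encoding.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(to_uri_reference("my file.xml"))
print(to_uri_reference("résumé.xml"))
```

Under these assumptions, a space becomes `%20` and each `é` becomes `%C3%A9`, while characters that are already legal in a URI reference are left untouched.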
Many of these documents were written before, or while, the IRI specification was being written. When it came time to consider, yet again, the same text in the context of XLink 1.1, after IRIs were defined, the fact that IRIs don't allow spaces meant we couldn't just excise it all; we would have to craft it again.
The fact that it's copied and referenced all over the place gave us pause. For one thing, it meant we had to be extra careful. For another, any sober reflection on the situation is bound to conclude that XLink is just the wrong place for this text.
Having specs totally unrelated to XLink pointing into it just for a standard description of how to deal with invalid, but entirely expected, characters in URI values doesn't make any sense.
What we decided to do instead was attempt to publish Human Readable Resource Identifiers (HRRIs) as an RFC.
The text is short and straightforward and will likely be of value outside of the XML context. And given that URIs and IRIs are defined by RFCs, that seems like the right place for this text.
The first Internet Draft of HRRIs has now been published.
Comments most welcome and appreciated, of course. The best place to send them is email@example.com.