Man is not logical and his intellectual history is a record of mental reserves and compromises. He hangs on to what he can in his old beliefs even when he is compelled to surrender their logical basis.
Before the web, there was SGML. SGML identifies external subsets, external parsed and unparsed entities, notations, and perhaps a few other things I’ve forgotten about, with external identifiers. External identifiers have two parts: a public identifier and a system identifier. The public identifier is “a name” and the system identifier is “a location”.
Historically, system identifiers weren’t URIs and what was a reasonable identifier in one system might have been unintelligible in another. Public identifiers provided a hook for interoperability. Both systems could find the external identifier associated with this document type declaration:
<!DOCTYPE book PUBLIC "-//Owner//DTD Name//EN" "c:\:/name.dtd">
because they had the name if they didn’t understand the location. In fact, in SGML, the system identifier was entirely optional:
<!DOCTYPE book PUBLIC "-//Owner//DTD Name//EN">
because implementations made use of the fact that they could map from the name, the public identifier, to the appropriate local representation.
External identifiers survived into XML 1.0, but to conform to the evolving architecture of the web, XML made the system identifier mandatory.
Over the course of more than 10 years working with SGML and XML documents, the presence of names in external identifiers has saved many hours, perhaps many hundreds of hours, of my time. I consider that positive value.
As XML developed, I tried, unsuccessfully, to extend the notion of names and identifiers into the new technologies that were developing (stylesheets, schemas, etc.). With Paul Grosso and John Cowan, I wrote RFC 3151, A URN Namespace for Public Identifiers, in order to preserve public identifiers in a URI-only world.
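To make the idea concrete, here is a rough sketch of how a public identifier might be transcribed into a `urn:publicid:` URN. The transcription table is a summary of the RFC 3151 rules as I understand them, and the function name `public_id_to_urn` is purely illustrative; check the RFC itself before relying on the details.

```python
import re

# Transcription rules, summarized from RFC 3151 (treat as an assumption).
# "//" and "::" are listed first so they match before single "/" and ":".
_TRANSCRIPTIONS = [
    ("//", ":"),
    ("::", ";"),
    ("%", "%25"),
    ("+", "%2B"),
    (":", "%3A"),
    ("/", "%2F"),
    (";", "%3B"),
    ("'", "%27"),
    ("?", "%3F"),
    ("#", "%23"),
    (" ", "+"),
]
_PATTERN = re.compile("|".join(re.escape(src) for src, _ in _TRANSCRIPTIONS))
_LOOKUP = dict(_TRANSCRIPTIONS)


def public_id_to_urn(public_id: str) -> str:
    """Transcribe a public identifier into a urn:publicid: URN (illustrative helper)."""
    # Trim leading/trailing whitespace and collapse internal runs to one space.
    normalized = " ".join(public_id.split())
    # Replace each special sequence in a single left-to-right pass.
    body = _PATTERN.sub(lambda m: _LOOKUP[m.group(0)], normalized)
    return "urn:publicid:" + body


print(public_id_to_urn("-//Owner//DTD Name//EN"))
# urn:publicid:-:Owner:DTD+Name:EN
```

The name survives intact inside the URN, so a resolver that understands the `publicid` namespace can still map it to a local copy.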
I’ve argued my case in many forums. Most recently, this came up in a thread on the Atom mailing list. I have always been in the minority, though I have sometimes been encouraged by like-minded colleagues.
That document says, in part:
So: I’ve got a new resource that I want to identify. Given my public commitment to the WebArch document, I feel that I ought not to violate its tenets. That means I want to use a URI, I want to provide a representation, I don’t want to create multiple URIs, and I don’t want to use a new scheme.
The WebArch document expresses an explicit bias towards HTTP. There’s a whole set of infrastructure built around HTTP that makes it a pretty compelling protocol if you’re going to serve up a representation.
That means I’m going to identify my document with an HTTP URI and only an HTTP URI. That URI becomes both its name and its address, if you like (or even if you don’t).
All Is (Not Quite) Lost
I’ve lost my names. Presented with a document, I will be forced to figure out what representation to use to process it based only on its single URI.
Remember my document interchange scenario? That’s where folks send me documents to process. It still happens, so what do I do with this document:
On the web, maybe that’s easy: I just go off and get the resource. At this point, the infrastructure that I mentioned earlier comes into play. Perhaps some intermediate cache will return the representation, perhaps the server will tell us the document has moved and another GET will be issued, and so on. But what if I’m not connected?
I get some significant relief from XML Catalogs, developed by the Entity Resolution Technical Committee at OASIS. XML Catalogs provide for XML what SOCATs provide for SGML. In particular, they allow me to map external identifiers and URIs to local representations. So I can use this entry to map the URI:
<uri name="http://example.org/path/to/book.xsd" uri="/my/local/path/to/book.xsd"/>
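As a minimal sketch of what a catalog-aware resolver does with entries like that one, the following reads a small catalog and looks identifiers up by exact match. The `public` entry, its local path, and the helper names `load_catalog`, `resolve_public`, and `resolve_uri` are assumptions made up for illustration; a real resolver also handles features such as rewriting, delegation, and catalog chaining that this ignores.

```python
import xml.etree.ElementTree as ET

# Namespace used by OASIS XML Catalogs documents.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# A tiny catalog. The <uri> entry is the one from the text; the <public>
# entry and its local path are hypothetical.
CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Owner//DTD Name//EN" uri="/my/local/path/to/name.dtd"/>
  <uri name="http://example.org/path/to/book.xsd" uri="/my/local/path/to/book.xsd"/>
</catalog>
"""


def load_catalog(text):
    """Return (public_map, uri_map) built from the catalog's entries."""
    root = ET.fromstring(text)
    public_map, uri_map = {}, {}
    for entry in root.iter():
        if entry.tag == f"{{{CATALOG_NS}}}public":
            public_map[entry.get("publicId")] = entry.get("uri")
        elif entry.tag == f"{{{CATALOG_NS}}}uri":
            uri_map[entry.get("name")] = entry.get("uri")
    return public_map, uri_map


def resolve_public(public_map, public_id, system_id=None):
    """Prefer the catalog's mapping for the name; else fall back to the location."""
    return public_map.get(public_id, system_id)


def resolve_uri(uri_map, uri):
    """Return the local representation if one is mapped, else the URI unchanged."""
    return uri_map.get(uri, uri)


public_map, uri_map = load_catalog(CATALOG)
print(resolve_uri(uri_map, "http://example.org/path/to/book.xsd"))
# /my/local/path/to/book.xsd
```

The lookup is an exact string match on the identifier, so a URI with no entry in the catalog simply falls through unchanged.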
Alas, it’s not a total win. What about documents like this:
If I don’t have book.xsd in the same relative location as the sender, I lose. And in this case:
I just lose outright, although in this case I could argue that the author is at fault: he’s given a different URI to the same resource, bifurcating the web. But if caches or resolvers of some sort aren’t widely deployed, authors will do this, because they don’t have a practical alternative, and I lose.
I live on Planet Web too. I pour a fair amount of my intellectual effort into understanding and expanding that planet (even if that metaphor doesn’t scan very well). I don’t have to like all of the consequences of choosing to live on that planet, but having made that choice, it makes little sense to carp about its basic principles.
I hereby abandon argument about the useful distinction between names and addresses. Do what WebArch says. Give resources one URI. Provide representations for your resources. Choose a URI scheme that has useful retrieval semantics. That probably means HTTP. To the extent that the consequences of doing what WebArch says are painful, let’s work on fixing the pain.