On identifiers

Volume 9, Issue 84; 05 Sep 2006; last modified 08 Oct 2010

More thoughts on identifiers. Names, that is, and addresses, of course.

[This essay is effectively part of a conversation. You're more likely to find it understandable if you've read my earlier piece, Names and addresses, and Stuart Weibel’s follow-up essays, starting with On Identifiers, Scholarship, and Spitoons. —ed]

Stuart Weibel responded at length to my latest posting about names and addresses. Happily, we seem to be largely in agreement.

The most significant point of disagreement, I think, is over the value of what Stuart calls “pure identifiers”. That is, identifiers that are intentionally and explicitly decoupled from any resolution mechanism.

Part of this disagreement clearly stems from different understandings about what it means to use an http URI. Using an http URI does not require deployment of a web server or obligate the user to provide representations for the identifiers created. It's definitely desirable and useful to do so, but it isn't required. That's simply a fact.

Arguments against http URIs based on the cost or inconvenience of maintaining web infrastructure to support access to those URIs don't hold water. I accept that there are some issues of user expectation here, but I don't find those issues sufficient to warrant the invention or use of “pure identifiers”.

In particular, I observe that not deploying a web server for your http URIs today doesn't preclude you from doing so tomorrow. So an organization might decide that some of its identifiers had gained such broad use that there was value in supporting access to them.

That general observation aside, Stuart makes some specific arguments in favor of pure identifiers:

His first argument is that they are valuable for things that are conceptual, in particular, to identify things that are language independent and have different meanings in different cultural contexts. He gives, as an example, two books with the same Dewey Decimal number, 959.7043:

Vietnamese War, 1961-1975 DDC/22/eng//959.7043
(English language version of DDC 22)

American War, 1961-1975 DDC/22/vie//959.7043
(Vietnamese language version of DDC 22)

and observes that “the distance between the abstractions rendered in two languages is greater than mere translation.”

Unfortunately, I think his further argument that “what this identifier should resolve to is complicated and context dependent” arises because he's changed the question in mid-stream. “Which book does the user want?” isn't the right question. The right question is: what does “959.7043” identify?

As near as I can tell, the answer to that question is entirely independent of language and cultural context: the Dewey Decimal number 959.7043 identifies the concept of the Vietnam War. For that concept, I assert that http://xmlns.com/wordnet/1.6/Vietnam_war is an equally good identifier. Superior, in fact, because if I do a GET on that resource I find that it's about “a prolonged war (1954-1975) between the communist armies of North Vietnam who were supported by the Chinese and the non-communist armies of South Vietnam who were supported by the United States” whereas I had to rely on Google and heuristics to determine that that's what 959.7043 is most likely about.

Stuart next argues for pure identifiers for legacy assets. He gives ISBN numbers as an example, but I think the isbn: scheme is an historical accident as much as anything else. Given a legacy identifier, “12345”, there are a few ways to imagine making a URI out of it. One is “legacy-identifier-scheme:12345”. Another is “http://legacy-identifier.org/12345”. I don't see any advantage of the former over the latter.

(With respect to ISBN numbers in particular, I observe that they identify a non-intuitive resource. That resource is the set of all books that have ever had that ISBN number. At least, that's the only interpretation of an ISBN number that makes sense to me.)

Finally, Stuart argues that there are business cases for late-resolution-binding of identifier resolution. Perhaps. Unfortunately, I don't think I really understand the examples given.

One thread seems to be about access control, the argument apparently being that newscheme:something is better because the resolution mechanism for that URI can manage whether or not the user has authority to access that resource. But authority is an entirely orthogonal issue. There are several ways to limit access to http://example.com/something.
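For illustration, here is a minimal sketch of one such way, using Python's standard http.server with made-up credentials: the server challenges for HTTP Basic authentication before handing out a representation. A real deployment would more likely rely on an existing web server's access-control features, but the point is that the http URI itself is untouched by the policy.

```python
# Minimal sketch: protect an http URI with Basic authentication.
# The credentials and the resource text below are hypothetical.
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

USERS = {"alice": "s3cret"}  # hypothetical username/password

class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        auth = self.headers.get("Authorization", "")
        if auth.startswith("Basic "):
            user, _, password = base64.b64decode(auth[6:]).decode().partition(":")
            if USERS.get(user) == password:
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"a representation of /something\n")
                return
        # Not authorized: challenge the client rather than serve the resource.
        self.send_response(401)
        self.send_header("WWW-Authenticate", 'Basic realm="example"')
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ProtectedHandler).serve_forever()
```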

There may be circumstances under which there are compelling reasons not to use http URIs, but no such circumstances have yet been convincingly articulated to me.

Comments

OK, Norm, how would you deal with this practical example:

How would you identify a traditional productivity document (text, presentation, etc.)?

So the document can move, and you cannot rely on any stable location.

I was thinking this might be a good case for something like a urn uuid, at least as a fallback.

I guess you would say, though, that even if I don't put the document on the web at a stable location, I can still give it a stable HTTP URI; like, I dunno, "http://purl.org/net/me/documents/some_presentation"?

I've not at all arrived at a confident position on all this, so welcome the conversation you kicked off.

—Posted by Bruce on 05 Sep 2006 @ 09:50 UTC #

Norm, I agree with the arguments you put forth, but let me put forth a few considerations that I haven't seen addressed that could support a new scheme. Personally I don't think they amount to much in comparison with the arguments for http URIs, but I wanted to bring them up.

1) Bruce's post. I'm struggling with a good way to organize my personal photo collection.

2) A new scheme could define semantic information in the identifier. Or is the goal of an identifier only to be unique? For example, take a supposed dd: scheme for Dewey Decimal numbers. Compare dd:DDC/22/eng//959.7043 to http://dd.org/DDC/22/eng//959.7043. When we see the dd:, we know that the rest of the parameters each have a specific meaning, whereas the http: example doesn't.

2b) Is there a standards body through which organizations could standardize the meaning of an http: URI? Or a schema with which anyone could define the semantics of their http: URIs? If so, 2) could be resolved by writing a schema/standard to specify what the semantic meaning of an http://dd.org/ URI is.

3) Closely related to 2: the handling of schemes. Say you have an application that takes URIs as input and does something with them. With different schemes you can have one handler per scheme. What would the best way be to handle that scenario with http: scheme URIs? Simply prefix-matching the URI?
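One way to picture the two options, as a hypothetical sketch in Python (the dd: scheme, the dd.org authority, and the handler functions are all made up):

```python
# Hypothetical sketch: dispatching on a scheme vs. prefix-matching http URIs.
from urllib.parse import urlsplit

def handle_dewey(uri):          # made-up handler for Dewey Decimal identifiers
    print("Dewey Decimal identifier:", uri)

def handle_generic_http(uri):   # made-up fallback for ordinary http URIs
    print("plain http URI:", uri)

# With a dedicated scheme, dispatch is a simple lookup on the scheme.
SCHEME_HANDLERS = {"dd": handle_dewey, "http": handle_generic_http}

# With http-only URIs, the application has to recognize prefixes instead,
# e.g. everything under the (hypothetical) http://dd.org/ authority.
PREFIX_HANDLERS = {"http://dd.org/": handle_dewey}

def dispatch(uri):
    for prefix, handler in PREFIX_HANDLERS.items():
        if uri.startswith(prefix):
            return handler(uri)
    return SCHEME_HANDLERS[urlsplit(uri).scheme](uri)

dispatch("dd:DDC/22/eng//959.7043")
dispatch("http://dd.org/DDC/22/eng//959.7043")
```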

—Posted by Dave Brondsema on 05 Sep 2006 @ 11:17 UTC #

I agree with the basic argument of these essays, http URI make good names. But I take issue with some of the assumptions you make about names. For example, that good URI are, or should be, if used properly, unambiguous. I believe that I have proved, in my post Problems Identifying Information, that, as names, some ambiguous URI are powerful and useful. Furthermore, that post shows that the W3C itself, in spite of what is stated in its Architecture document, utilizes the power of ambiguous URI in its basic operation, that of creating and promoting technology standards.

Also, in my post Anatomy of a Reference, I question the basic idea that just "having a URI" for something is sufficient to identify that something. I believe it takes more, as I suggest in that post.

Finally, it is often assumed that the uniqueness of an HTTP URI somehow prevents ambiguity. But this is not the case, as I show in Ambiguity and Identity.

My point in all this is to say, yes, HTTP URI make good names, but names do not function the way the W3C says URI should.

—Posted by John Black on 06 Sep 2006 @ 12:17 UTC #

Norm, you showed two ways to express a legacy identifier, 12345, in a URI:

1) legacy-identifier-scheme:12345

2) http://legacy-identifier.org/12345

I see an advantage to the first. An application (not just a human at a browser) can have a much easier time of knowing how to process the first. Syntax, constraints, matching rules, normalization, ordering logic (for concepts like versions), and special features such as built in check digits or certain crypto properties can be explicitly conveyed to an application using the first approach.
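To make the check-digit point concrete, here is a small sketch of the kind of rule a scheme-aware application could enforce; it validates an ISBN-10 check digit, and the ISBN shown is just the usual textbook example.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Validate an ISBN-10 check digit (weights 10 down to 1, modulo 11)."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        if char == "X" and position == 9:   # 'X' stands for 10 in the last place
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += value * (10 - position)
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))  # True for this well-known example
```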

With the second approach there's no governing scheme. I suppose there's a governing authority (legacy-identifier.org), but there's no defined means for an application (or application developer) to learn the syntax, constraints, rules, etc. of identifiers minted under legacy-identifier.org. If an application encounters "http://w3.org/12345", how is it supposed to know the constraints, rules, and special features imposed by w3.org? Or to know that it's just a regular URI with no additional constraints beyond those of a vanilla URI? I don't think tribal knowledge is sufficient. I also assert that the guess-and-infer approaches used by humans at browsers are inadequate for programmatic processing of identifiers.

What would you think about a new top level DNS domain (like maybe ".spec"), under which all the sub-domains have specifications describing the characteristics of the URIs? So instead of "http://xri.net/=marty" we could have "http://xri.spec/=marty". Then an application could recognize that this is not just a generic URI, but that there is a specification for how to deal with it.

It would then be logical to define the common means to learn the characteristics of such identifiers; e.g., issue a GET to "http://legacy-identifier.spec/" to retrieve the legacy-identifier's specification. Also, wouldn't it be great if that specification could be programmatically meaningful? Then applications could dynamically learn to deal with new types of identifiers they hadn't previously known about.

Http: URIs currently rely on guess-and-infer efforts to try to figure out what they mean, and I consider that a compelling reason for non-http URIs. It's the only compelling reason I know of. If an approach like a ".spec" top level domain (or some other approach) can provide concise programmatic interpretations for http: URIs, then I think I'd hop to the "http: only" side of the fence.

Other discussions point out benefits of "http: only", but only when coupled with something like a ".spec" TLD do I think they outweigh the advantages of separate schemes.

—Posted by Marty on 06 Sep 2006 @ 03:36 UTC #

"How would you identify a traditional productivity document (text, presentation, etc.)?

So the document can move, and you cannot rely on any stable location."

Well, when you say the document can move and you cannot rely on any stable location, this implies that you want to retrieve the document.

The problem of retrieving a document that does not have a stable location is not solved by identifiers without any resolution mechanism: if you cannot resolve an identifier, you certainly cannot retrieve what it identifies.

Instead, what you want is a resolution mechanism that can return information about the status of an identifier, as http does, and resolution that is loosely coupled to the actual location of things, which you get by having a server with access to the resources that keeps track of which resource is identified by which identifier.

This allows other servers to maintain archives of identifiers over time, and so on: all the services built on top of http.

—Posted by bryan on 06 Sep 2006 @ 07:26 UTC #

Bryan -- yeah, that's what I was thinking. In that case, then, using a uuid encoded as a urn would be perfectly reasonable, right? Same for the photograph case.

I was thinking about this recently as part of the OpenDocument metadata discussions. What if all your documents had URIs, and applications even automatically added smart relational metadata (a isVersionOf b, x isPartOf y, and so forth)?

So my initial thought was to just say that, as a baseline, all documents ought to get a uuid. That doesn't mean they couldn't instead get an http uri; just that you don't want to force users to have to manage this.

—Posted by Bruce D'Arcus on 06 Sep 2006 @ 01:28 UTC #

Seems like you've got your work cut out for you, Norm. That'll teach you to keep opening this can of worms 8-)

—Posted by Mark Baker on 06 Sep 2006 @ 01:41 UTC #

Snicker Thanks, Mark. It's going to take me a few days, at least, to get back to this.

With respect to the question of metadata for office documents and photographs, I think the first step is to put the metadata in the documents themselves. That solves the question of what URI you need in that metadata because a local document reference always works.

If you really need to have that metadata externally, then I think the simplest thing to do is use the file: URI of the document. If your document infrastructure is capable of scanning a repository and finding documents based on their internal identifiers, then I suppose a UUID urn: is a good choice. But I expect it's a whole lot more work than just using a file: URI and it's only marginally more likely to be more persistent.
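For what it's worth, both kinds of name are cheap to mint; a small sketch (the document path is hypothetical):

```python
import uuid
from pathlib import Path

# A freshly minted UUID, expressed as a urn: URI.
print(uuid.uuid4().urn)   # e.g. urn:uuid:0b351c6e-...

# The file: URI for a (hypothetical) document on disk.
print(Path("/home/norm/docs/some_presentation.odp").resolve().as_uri())
# e.g. file:///home/norm/docs/some_presentation.odp
```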

It's true, if I move or rename the file, you might be able to find it again based on its internal UUID urn. But if I copy the file, you'll find both and that's a problem. And if I make the copy by cutting-and-pasting (then deleting the original), you'll lose it.

When it comes time to publish a document, I give it an http URI, but I don't consider that published document to be "the same" as the original (that I might continue to edit anyway) so I don't mind that it gets a different URI.

—Posted by Norman Walsh on 06 Sep 2006 @ 02:59 UTC #

While you can interpret an identifier as you see fit, why not refer to the social contract that maintains it? ISBNs are a lot less ambiguous than URLs ...

What is the purpose of an ISBN? "The purpose of the ISBN is to establish and identify one title or edition of a title from one specific publisher and is unique to that edition, allowing for more efficient marketing of products by booksellers, libraries, universities, wholesalers and distributors."

I.e., an ISBN is a SKU used in the booktrade. Different editions of the same title get different ISBNs. In FRBR terms, the set of books with the same ISBN are 'items' of the same 'manifestation'. The ISBN identifies a 'manifestation' (of an 'expression' of a 'work').

Wrt. using URLs as identifiers in place of legacy identifiers, I can't GET (pun intended) why I should want to introduce ambiguity where none existed, while at the same time being subjected to bizarre syntax rules and fragment identifier hacks.

BTW, the resolution of httpRange-14 is still attracting flack.

Kind regards

—Posted by Peter Ring on 07 Sep 2006 @ 01:46 UTC #

ISBNs are essentially useless as identifiers because publishers reuse them for different books.

—Posted by Norman Walsh on 07 Sep 2006 @ 02:24 UTC #

"It's true, if I move or rename the file, you might be able to find it again based on its internal UUID urn. But if I copy the file, you'll find both and that's a problem."

Why? If my goal is to find the document, I'm quite happy to find both copies.

—Posted by Bruce on 07 Sep 2006 @ 06:53 UTC #

Some publishers assign a new ISBN for a reprint (of the same edition), and some publishers re-use an ISBN when the book has been out of print for a year or so. They break the rules.

This does not change the intension of the ISBN system (identifying book editions) as opposed to the extension (some publishers break the rules of the ISBN system).

The ambiguity is introduced by someone breaking the social contract that maintains ISBN as a system of identifiers; not by ISBN per se.

No system of identifiers can avoid this problem. You can, of course, introduce control mechanisms that punish bad behaviour and encourage good behaviour. Some systems of networked identifiers, e.g. CrossRef (based on DOI), go to great lengths to maintain the quality of the identifiers and the associated bibliographic information.

If you are in the business of identifying book editions, you will always have to record a number of properties in order to identify this thingie called a book edition. And some of the properties that you will want to record are related to the process of record-keeping itself, asserting that someone at a certain date believed an ISBN (+ some other properties) to identify a certain book edition.

There is, BTW, a URN namespace for ISBN, described in this memo.

—Posted by Peter Ring on 08 Sep 2006 @ 09:35 UTC #

The example about documents and versions and uris resonates ... I'm most interested in this in the context of distributed data objects, some of which are formally published and some are not, but it's a world that expects duplicates, and would like to be able to tell when they are duplicates.

If I have an identifier (say X) within a document, I want it to change every time I save that document. That way, I might be able to believe that if I find two copies, I'll know whether they are the same.

I then want my protocol://location/id=X (say), to imply that I am pointing at the same thing as someOtherProtocol://someOtherLocation/id=X

To do that, surely I need some way of having faith that X is meaningful? That means (1) I need someone to allocate me X to avoid my X being used by someone else, and (2) I need to have faith that the application that manipulates my documents ensures that when the document is saved differently it gives me a different X.

(NB: an acceptable error would be for me to have a different X even if the doc didn't change, but the reverse would not be. Also, just to avoid arguments others have raised, I need only get an X once for each document, as long as it's used as a root identifier ...)

Norm: doesn't a URL only do part of this?

Bryan

—Posted by BryanL on 08 Sep 2006 @ 12:05 UTC #

Hi All,

Several of you seem to be pointing out difficulties with binding identifiers like ISBN or UUID to the intended resource, and maintaining that binding. For digital resources, what would you think about using a hash of a resource as its identifier? The hex representation of a 128-bit hash would be just 32 characters in length, which I think is shorter than a UUID.

If hashes are identifiers, then if you send me a file, but forget to let me know its identifier, I can just rehash the file to figure out what its identifier is. If someone changes the file, it automatically gets a new identifier; the original identifier always identifies the original file, or an unaltered copy of the original file.
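A minimal sketch of this, using MD5 only because it happens to produce a 128-bit digest; a stronger hash could be dropped in the same way:

```python
import hashlib

def content_identifier(path: str) -> str:
    """Return a 32-character hex identifier derived from the file's bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Any unaltered copy of the file yields the same identifier;
# editing the file yields a different one.
print(content_identifier("some_presentation.odp"))  # hypothetical file name
```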

You are all having so much fun discussing ISBNs and UUIDs that nobody is responding to my comments/questions from back on Sept 6: http://norman.walsh.name/2006/09/05/identifiers#comment0004. I'm wondering if you think there are compelling reasons for URIs beyond the http: scheme, and if the reasons I suggested are convincingly articulated.

—Posted by Marty on 08 Sep 2006 @ 04:38 UTC #

Marty -- it sounds good to me, but I wasn't feeling confident enough to say much about it.

The business of versioning and identification -- and then relationships based on those -- is a pretty tricky one as I think about it more. I mean, if every time I save a document -- not "save as" with a new file name -- it gets a new URI ID, then it becomes more awkward to express relations among those documents; doesn't it?

Say I have chapters as separate documents assembled into a book (think "master documents" in word processors). I'd want my software to simply know that those chapters are a part of the (conceptual, I guess) book, rather than for it to get tripped up because I happened to have saved the main document, thus assigning it a new URI.

Easier to handle, it seems to me, if you think about saving out new files as assigning new versions/identifiers. For example, I have a RAW image file, edit it, and then save it out as a JPEG. The two files have different IDs, but the second can include a dcterms:isVersionOf property pointing to the RAW image.
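A sketch of what that relation might look like as RDF, using the rdflib library and hypothetical file: URIs for the two images:

```python
# Hypothetical sketch using the rdflib library; the file: URIs are made up.
from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

raw = URIRef("file:///photos/dsc_0042.raw")
jpeg = URIRef("file:///photos/dsc_0042.jpg")

g = Graph()
g.add((jpeg, DCTERMS.isVersionOf, raw))   # the JPEG is a version of the RAW image
print(g.serialize(format="turtle"))
```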

Worth noting that Adobe side-steps these issues by giving their XMP embedded (RDF) metadata empty rdf:about attributes. But I can't help thinking that this might also be missing out on a lot of possibilities.

—Posted by Bruce on 08 Sep 2006 @ 08:00 UTC #

Norm: "It's true, if I move or rename the file, you might be able to find it again based on its internal UUID urn. But if I copy the file, you'll find both and that's a problem."

Bruce: Why? If my goal is to find the document, I'm quite happy to find both copies.

Sorry, Bruce, I should have said "But if I copy the file, then edit its content to be something different, you'll find both and that's a problem."

To produce a new presentation in some specific template, it's common (in my experience) to copy some existing presentation in that template, then open it up and edit every slide to something new.

—Posted by Norman Walsh on 08 Sep 2006 @ 08:26 UTC #

BryanL, it sounds like you want to search on some cryptographic hash of the document. That might be valuable in some contexts, but it seems like it would generate an overwhelming amount of data. And in many cases, on my laptop at least, there would be thousands of such identifiers that no longer had any referent. That is, I don't keep every version of every document I edit. (Though I suppose some sort of hook into my subversion repository would give you more of them than is at first immediately obvious.)

Anyway, it's not immediately obvious to me that the scheme of the URI is very relevant in this case.

—Posted by Norman Walsh on 08 Sep 2006 @ 08:31 UTC #

John Black: I agree with the basic argument of these essays, http URI make good names. But I take issue with some of the assumptions you make about names.

You're right, John. They're names and as such they don't solve any problems that names don't solve. My point, if I have one, is simply that http: URIs are good names, as good as the names defined in any other URI scheme, and better than many for reasons I've already tried to articulate.

—Posted by Norman Walsh on 08 Sep 2006 @ 08:41 UTC #

Marty, it's true. Inventing a new scheme means that you get to say how identifiers in that scheme are constructed and, therefore, you can do a better job of decoding those identifiers. The price you pay for this benefit is deploying an infrastructure for dealing with those identifiers.

In order to decode the identifier, I have to understand the scheme. The cost of changing deployed software on the net to understand a new scheme is enormous. How many applications are going to be updated to cover how many schemes?

And given that I have to know how to decode the identifier, it's not clear "legacy-identifier-scheme:12345" is really better than "http://legacy-identifier.org/12345". Software-wise, the work isn't much different.
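To illustrate: pulling the bare legacy identifier out of either form takes about the same amount of code. A hypothetical sketch:

```python
from urllib.parse import urlsplit

def legacy_id(uri: str) -> str:
    """Extract the bare legacy identifier from either URI form."""
    parts = urlsplit(uri)
    if parts.scheme == "legacy-identifier-scheme":
        return parts.path                      # "12345"
    if parts.scheme == "http" and parts.netloc == "legacy-identifier.org":
        return parts.path.lstrip("/")          # "12345"
    raise ValueError("not a legacy identifier: " + uri)

print(legacy_id("legacy-identifier-scheme:12345"))
print(legacy_id("http://legacy-identifier.org/12345"))
```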

Conversely, the http: scheme URI is potentially much more useful to an agent that doesn't understand the legacy-identifier scheme because it may be able to do a simple GET on the URI and learn something (or at least a human can).

The idea of having some mechanism for distributing information about how to decode identifiers is interesting, though I don't think it would require a new TLD. If they were http: scheme URIs, you could, you know, put a pointer to that information in the document that you get if you do a GET on it.

—Posted by Norman Walsh on 08 Sep 2006 @ 08:46 UTC #

Norm: The idea about a .spec TLD is just an idea that I think would work. Other ideas would be fine with me if they can unambiguously inform an application of the availability and location of decoding instructions. Then future applications could be dynamically "teachable" when they encounter a URI with unknown characteristics, and the cost of deploying URIs with new characteristics would go down.

There still needs to be something that lets applications know that a string is indeed a URI, and I think having a single scheme (i.e., http) would make that pretty easy. However, we can't just take all the concisely defined specifications currently associated with non-http schemes and nebulously toss them into a single big bucket with a single scheme. I argue that trying to teach applications to deal with that ambiguity would cost more than teaching them to deal with multiple concisely defined schemes. UNLESS there's a way to avoid the ambiguity.

The TLD idea is one way -- there may be other ways that I don't know about.

—Posted by Marty on 08 Sep 2006 @ 10:26 UTC #

http://www.ietf.org/internet-drafts/draft-gregorio-uritemplate-00.txt describes a "URI Template" as a way to define the structure of certain URIs. Some more info at http://weblogs.java.net/blog/mhadley/archive/2006/10/uri_templating.html
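A minimal sketch of the {name} substitution that draft describes (plain string replacement only, not the full template syntax; the dd.org template is made up):

```python
import re

def expand(template: str, values: dict) -> str:
    """Replace {name} markers in a URI template with the supplied values."""
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)], template)

print(expand("http://dd.org/DDC/{edition}/{lang}//{number}",
             {"edition": "22", "lang": "eng", "number": "959.7043"}))
# http://dd.org/DDC/22/eng//959.7043
```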

—Posted by Dave Brondsema on 06 Oct 2006 @ 05:09 UTC #