Caching is Not Enough

Volume 6, Issue 45; 26 Jun 2003

XML Catalogs solve problems that proxy caches can't. They both have their place, but neither is a practical substitute for the other.

When I explain XML Catalogs, some people suggest that they aren't really necessary, that the right answer to the problem of access to sporadically available resources is some form of cache. Most recently, I noticed that Mark Baker said it, but I've heard others say it too.

Caches have their place; I use wwwoffle myself, and it's very convenient. But there are at least three good reasons why an automatic, cache-based approach is not sufficient. (I say “automatic” because XML Catalogs are a kind of cache at a high level, so I think the folks who advocate caches instead of catalogs are thinking of the proxy sort.)

In case you aren't familiar with the way a cache like wwwoffle works, here's a quick summary. Instead of getting resources directly from the web, you use the caching software as a proxy. The caching software keeps copies of all the resources you retrieve and, provided you've successfully retrieved a resource recently enough to have a copy in your cache, can return that copy even when you aren't connected. The caching software manages the collection of documents in the cache, subject to policies that you may have some control over, discarding old documents so that you don't eventually fill your disk completely. Most browsers cache resources as well, but those caches aren't usually useful to applications that run outside the browser.

(One of the really cool benefits of using a caching proxy, unrelated to the subject of this essay, is that you can use it to manage VPN issues. When I'm behind the corporate VPN, I must use the corporate web proxy; when I'm not behind the VPN, I must not use it. In fact, I can't even see it. Switching the browser and all my applications back and forth was an enormous pain in the butt. The answer: always use the local cache as my proxy and configure the scripts that start and stop the VPN to adjust its configuration to either use the corporate web proxy or not. Problem solved.)

Back to catalogs. Here are three things you can do with XML Catalogs that you can't do with caching proxies:

Populate the Cache

Caching proxies rely on the fact that you can access the resource at least once from the web. What if you can't? Then the resource won't be in the cache when you need it.

This isn't as silly as it sounds. Not only have I sometimes installed software while disconnected, but there may very well be situations where the schemas, stylesheets, and other documents you need are not publicly available. I know of one medical consortium, for example, with very strict access control policies. Just because they identify a schema with the URI http://example.com/schemas/medicalRecord.xsd doesn't mean that you're ever going to be able to get it from there. You might find yourself in similar situations with various corporate firewalls.

XML Catalogs solve this problem by allowing you to identify the mapping from http://example.com/schemas/medicalRecord.xsd to the local copy (that you obtained by jumping through appropriate legal hoops) without ever having to retrieve it from the “real” URI.
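
A catalog entry for this case might look something like the sketch below. The `file:///` location is a hypothetical place where the local copy might live; only the `http:` URI comes from the example above.

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Assumed local path; substitute wherever your copy actually lives. -->
  <uri name="http://example.com/schemas/medicalRecord.xsd"
       uri="file:///usr/local/share/schemas/medicalRecord.xsd"/>
</catalog>
```

With an entry like this in place, a catalog-aware processor never has to contact example.com at all.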

Access Development Resources

Caching proxies return the resource you last retrieved, but what if that isn't the version you want?

For example, the DocBook XSL Stylesheets are a personal project to which I devote a fair amount of time. Not only do I work on the stylesheets themselves, but I also work on a bunch of customization layers of various sorts. It's not all that uncommon for me to identify a bug or misfeature in the base stylesheets while I'm working on some customization layer.

Fixing the bug is usually pretty easy, but then I have a problem. The customization layer refers to http://docbook.sf.net/releases/xsl/current/html/docbook.xsl, but I haven't fixed that version yet; I've only fixed my local copy. So a cache is useless. I could make another real, public release of the base stylesheets (that will cause the caching proxy to refresh its cache), but I may not have time or I may not be connected.

I could also change the customization layer so that it refers to file:///sourceforge/docbook/xsl/html/docbook.xsl, but if I do that I'm bound to forget I did it. The next time I publish the customization layer, I'll get 11 bug reports about the broken URI and nothing will work for end users until I publish another version.

XML Catalogs solve this problem by allowing me to identify a private mapping from http://docbook.sf.net/releases/xsl/current/html/docbook.xsl to my local development copy of the base stylesheets.
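
One way to express such a mapping, as a sketch, is a `rewriteURI` entry that redirects everything under the release URI to a working tree. The local prefix is assumed from the `file:` URI mentioned above; any other working copy would need its own path.

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Any URI starting with the release prefix is redirected to the
       local development tree (path assumed from the file: URI above). -->
  <rewriteURI uriStartString="http://docbook.sf.net/releases/xsl/current/"
              rewritePrefix="file:///sourceforge/docbook/xsl/"/>
</catalog>
```

Because the catalog is private to one system, the published customization layer keeps pointing at the public URI.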

Devise Your Own Resolution Policies

In connection with my work on DocBook and its stylesheets, a lot of people send me test documents. These test documents usually contain a reference to the DocBook DTD (and if they don't, 99 times out of 100, the bug is an invalid document. Always validate your documents!). People install DocBook on their local systems in accordance with local conventions that vary greatly. And because, alas, not everyone uses XML Catalogs, they often point directly to the local copy:
```
<!DOCTYPE book
SYSTEM "/random/schema/directory/docbookx.dtd">
...
```
And as you can see, they don't always use the public identifier either. (Public identifiers are a good thing; the failure of other working groups to allow authors to identify resources by name is a real shame, but that's an essay for another day.)

I got really tired of editing nearly every test document that came my way so that it identified DocBook in a way that my system could resolve. So I tweaked the resolver. XML Catalogs allow you to remap the beginnings of system identifiers and URIs. I added an extension that allows you to remap the end:
```
<ext:systemSuffix
     xmlns:ext="http://nwalsh.com/xcatalog/1.0"
     suffix="docbookx.dtd"
     uri="docbook/xml/docbookx.dtd"/>
```
This entry says that any system identifier that ends in “docbookx.dtd” will be mapped to my local copy of DocBook. (This extension is actually implemented in the Java resolver at Apache.)
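
For comparison, the standard prefix remapping looks roughly like this. Both the remote prefix and the local directory here are made up for illustration.

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Hypothetical prefix and local path: system identifiers that START
       with the given string have that leading part replaced. -->
  <rewriteSystem systemIdStartString="http://www.example.org/docbook/xml/"
                 rewritePrefix="file:///usr/local/share/docbook/xml/"/>
</catalog>
```

A prefix rule like this cannot help with a system identifier such as /random/schema/directory/docbookx.dtd, because the part that varies from one sender to the next is the beginning. That is the gap the suffix extension fills.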

In short, caches have their place, but they don't solve all the problems that XML Catalogs address.

Comments

Catalogs do not care about URI schemes; caches do.

Say you are using a custom URI scheme (which I know is a contentious issue all of its own) or some set of URNs you have cooked up, which map into an XML repository, for instance. If you find yourself disconnected from the repository, a caching proxy (an HTTP caching proxy anyway :) ) isn't going to help you no matter what, while a catalog will.

Response at: http://www.mnot.net/blog/archives/000106.html