<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="pto" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#">
<info>
<title>Caching is Not Enough</title><biblioid class="uri">http://norman.walsh.name/2003/06/26/cache</biblioid>
<volumenum>6</volumenum>
<issuenum>45</issuenum>
<pubdate>2003-06-26</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2003</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>XML Catalogs solve problems that proxy caches can't. They both have
their place, but neither is a practical substitute for the other.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#TheWeb"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XML"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XMLCatalogs"/>
</info>

<para xml:id="p1">When I <link xlink:href="../05/xmlcatalogs">explain XML
Catalogs</link>, some people suggest that they aren't really
necessary, that the right answer to the problem of access to
sporadically available resources is some form of cache. Most recently,
I noticed that
<personname>
      <firstname>Mark</firstname>
      <surname>Baker</surname>
    </personname>
<link xlink:href="http://www.markbaker.ca/2002/09/Blog/2003/06/05#2003-06-catalogs">said it</link>, but I've heard others say it too.
</para>

<para xml:id="p2">Caches have their place; I use
<link xlink:href="http://www.gedanken.demon.co.uk/wwwoffle/">wwwoffle</link>
myself, and it's very convenient.
But there are at least three good reasons why an automatic,
cache-based approach is not sufficient. (I say <quote>automatic</quote>
because XML Catalogs<indexterm>
      <primary>XML Catalogs</primary>
    </indexterm>
are a kind of cache at a high level, so I think the folks who advocate caches
instead of catalogs are thinking of the proxy sort.)</para>

<para xml:id="p3">In case you aren't familiar with the way a cache like
<application>wwwoffle</application> works, here's a quick summary.
Instead of getting resources directly from the web, you use the
caching software as a proxy<footnote>
<para xml:id="p4">One of the really cool benefits of using a caching proxy, unrelated
to the subject of this essay, is the fact that you can use it to manage
VPN issues. When I'm behind the corporate VPN, I must use the corporate
web proxy; when I'm not behind the VPN, I must not use it
(and can't even see it).</para>
<para xml:id="p5">Switching the browser and all my applications back and forth was an
enormous pain in the butt. The answer: always use the local cache as my proxy
and have the scripts that start and stop the VPN adjust the cache's configuration
to either use the corporate web proxy or not. Problem solved.</para>
</footnote>. The caching software keeps copies of all
the resources you retrieve and, provided you've retrieved a resource
recently enough that a copy is still in your cache, can return that copy
even when you aren't connected. The caching software manages the
collection of documents in the cache, discarding old documents according
to policies that you may have some control over, so that you don't
eventually fill your disk completely. Most browsers cache resources as
well, but those caches aren't usually useful to applications that run
outside the browser.</para>

<para xml:id="p6">Back to catalogs. Here are three things you can do with XML Catalogs
that you can't do with caching proxies:</para>

<orderedlist>
<listitem>
<para xml:id="p7">
	<emphasis role="bold">Populate the cache</emphasis>
      </para>
<para xml:id="p8">Caching proxies rely on the fact that you can access the resource
at least once from the web. What if you can't? Then the resource won't be
in the cache when you need it.</para>
<para xml:id="p9">This isn't as silly as it sounds. Not only have I sometimes installed
software while disconnected, but there may very well be situations where
the schemas, stylesheets, and other documents you need are not
publicly available. I know of one medical consortium, for example,
with very strict access control policies. Just because they identify a schema
with the URI <emphasis role="uri">http://example.com/schemas/medicalRecord.xsd</emphasis>
doesn't mean that you're ever going to be able to get it from there. You might
find yourself in similar situations with various corporate firewalls.</para>
<para xml:id="p10">XML Catalogs solve this problem by allowing you to identify the mapping
from <emphasis role="uri">http://example.com/schemas/medicalRecord.xsd</emphasis>
to the local copy (that you obtained by jumping through appropriate legal hoops)
without ever having to retrieve it from the <quote>real</quote> URI.</para>
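<para>A minimal sketch of such an entry, assuming the local copy sits at the
hypothetical relative path <literal>schemas/medicalRecord.xsd</literal>, might
look like this. Whether a <literal>uri</literal> or a <literal>system</literal>
entry is consulted depends on how your tools ask the resolver for the
schema:</para>

<screen>&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
  &lt;!-- Assumed local path, resolved relative to the catalog file --&gt;
  &lt;uri name="http://example.com/schemas/medicalRecord.xsd"
       uri="schemas/medicalRecord.xsd"/&gt;
&lt;/catalog&gt;</screen>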
</listitem>

<listitem>
<para xml:id="p11">
	<emphasis role="bold">Access Development Resources</emphasis>
      </para>
<para xml:id="p12">Caching proxies return the resource you last retrieved, but what if that
isn't the version you want?
</para>
<para xml:id="p13">For example, the <link xlink:href="http://sf.net/projects/docbook">DocBook
XSL Stylesheets</link> are a personal project to which I devote a
fair amount of time. Not only do I work on the stylesheets themselves, but I also
work on a bunch of customization layers of various sorts. It's not all that
uncommon for me to identify a bug or misfeature in the base stylesheets while
I'm working on some customization layer.</para>
<para xml:id="p14">Fixing the bug is usually pretty easy, but then I have a problem. The
customization layer refers to
<emphasis role="uri">http://docbook.sf.net/releases/xsl/current/html/docbook.xsl</emphasis>,
but I haven't fixed that version yet; I've only fixed my local copy. So a cache is
useless. I could make another real, public release of the base stylesheets (which
would cause the caching proxy to refresh its cache), but
I may not have time or I may not be connected.</para>
<para xml:id="p15">I could also change the customization layer so that it refers to
<emphasis role="uri">file:///sourceforge/docbook/xsl/html/docbook.xsl</emphasis>
but if I do that, I'm bound to forget I did it, and the next time I publish
the customization layer, I'll get 11 bug reports about the broken URI and
nothing will work for end users until I publish another version.</para>
<para xml:id="p16">XML Catalogs solve this problem by allowing me to identify a private mapping
from <emphasis role="uri">http://docbook.sf.net/releases/xsl/current/html/docbook.xsl</emphasis>
to my local development copy of the base stylesheets.</para>
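<para>A minimal sketch of that private mapping, assuming the development copy
lives under <emphasis role="uri">file:///sourceforge/docbook/xsl/</emphasis>
(the location used in the example above), rewrites the whole release prefix so
that the modules the stylesheet imports resolve locally too:</para>

<screen>&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
  &lt;!-- Redirect every URI under the public release to the working copy --&gt;
  &lt;rewriteURI uriStartString="http://docbook.sf.net/releases/xsl/current/"
              rewritePrefix="file:///sourceforge/docbook/xsl/"/&gt;
&lt;/catalog&gt;</screen>

<para>Because the entry lives in a private catalog, the published customization
layer keeps its public URI.</para>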
</listitem>

<listitem>
<para xml:id="p17">
	<emphasis role="bold">Devise Your Own Resolution Policies</emphasis>
      </para>
<para xml:id="p18">In connection with my work on DocBook and its stylesheets, a lot of people
send me test documents. These test documents usually contain a reference to the
DocBook DTD (and if they don't, 99 times out of 100, the bug is an invalid document.
<emphasis>Always</emphasis> validate your documents!). People install
DocBook on their local systems in accordance with local conventions
that vary greatly. And because, alas, not everyone uses XML Catalogs,
they often point directly to the local copy:</para>

<screen>&lt;!DOCTYPE book
SYSTEM "/random/schema/directory/docbookx.dtd"&gt;
...
</screen>

<para xml:id="p19">And as you can see, they don't always use the public identifier either.
(Public identifiers are a good thing; the failure of other working groups to
allow authors to identify resources by name is a real shame, but that's
an essay for another day.)</para>

<para xml:id="p20">I got really tired of editing nearly every test document that came my way
so that it identified DocBook in a way that my system could resolve. So
I tweaked the resolver. XML Catalogs allow you to remap the beginnings of
system identifiers and URIs. I added an extension that allows you to remap
the <emphasis>end</emphasis>:</para>

<screen>&lt;ext:systemSuffix
     xmlns:ext="http://nwalsh.com/xcatalog/1.0"
     suffix="docbookx.dtd"
     uri="docbook/xml/docbookx.dtd"/&gt;</screen>

<para xml:id="p21">This entry says that any system identifier that ends in
<quote><literal>docbookx.dtd</literal></quote> will be mapped to my
local copy of DocBook. (This extension is actually implemented in the
<application>Java</application> resolver at
Apache<indexterm>
	  <primary>Apache</primary>
	  <secondary>XML
Commons</secondary>
	</indexterm>.)</para>
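<para>For comparison, the standard prefix remapping mentioned above can be
sketched with a <literal>rewriteSystem</literal> entry. The DocBook version in
the prefix is only an assumption for illustration, and the target directory
mirrors the one in the suffix example:</para>

<screen>&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
  &lt;!-- Standard entry: matches the beginning of a system identifier --&gt;
  &lt;rewriteSystem
       systemIdStartString="http://www.oasis-open.org/docbook/xml/4.2/"
       rewritePrefix="docbook/xml/"/&gt;
&lt;/catalog&gt;</screen>

<para>An entry like this helps only when documents share a known prefix, which
is exactly what ad hoc local paths lack.</para>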

</listitem>
</orderedlist>

<para xml:id="p22">In short, caches have their place, but they don't solve all the
problems that XML Catalogs address.</para>

</essay>

