Building a better resolver

Volume 10, Issue 8; 06 Feb 2007; last modified 08 Oct 2010

I've been working on a reimplementation of my XML Catalog-based entity/URI resolver. It has a more sensible design, includes a caching feature, and supports a new API for dealing with XML Namespace names.

The first substantial body of Java code released into the wild with my name on it was the entity resolver code that eventually made its way into the Apache XML Commons project.

The origins of that code stretch back at least six years, maybe closer to ten. Its design seems…“odd”, at best, from a modern perspective but it's been in use for a long time and a lot of people use it every day. Some of the features that I implemented in that code eventually made it into the XML Catalogs V1.1 Standard.

One of the complaints raised against the catalog-based approach to URI management is that the end user has to write and maintain the catalog. On some systems, the catalog is automatically updated for packages that are installed locally, but that doesn't address the issue of random web resources accessed by the user.

If the resolver doesn't find an entry in the catalog for the resource requested, it goes out to the web and fetches it, so there's a straightforward and obvious way to attack the manual maintenance issue: have the resolver cache the resources that it fetches. I've been meaning to implement caching for years.
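To make the idea concrete, here is a minimal sketch of a "catalog first, then cache, then web" lookup. It is not the resolver's actual code: the class name, the hashed file naming scheme, and the pluggable fetcher function (which stands in for a real HTTP request) are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the resolver's real API.
class CachingLookupSketch {
    private final Map<String, Path> catalog;         // explicit catalog mappings
    private final Path cacheDir;                     // directory owned by the cache
    private final Function<String, byte[]> fetcher;  // stand-in for a web fetch

    CachingLookupSketch(Map<String, Path> catalog, Path cacheDir,
                        Function<String, byte[]> fetcher) {
        this.catalog = catalog;
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    // Catalog first, then the cache, and only then the (simulated) web.
    Path resolve(String uri) {
        Path mapped = catalog.get(uri);
        if (mapped != null) {
            return mapped;
        }
        Path cached = cacheDir.resolve(cacheName(uri));
        if (!Files.exists(cached)) {
            try {
                Files.createDirectories(cacheDir);
                Files.write(cached, fetcher.apply(uri));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return cached;
    }

    // Derive a file-system-safe name from the URI (SHA-256, hex encoded).
    static String cacheName(String uri) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(uri.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("resolver-cache");
        CachingLookupSketch lookup = new CachingLookupSketch(
                Collections.emptyMap(), dir,
                uri -> ("<!-- fetched " + uri + " -->").getBytes(StandardCharsets.UTF_8));
        // The second call is served from the cache directory, not the fetcher.
        System.out.println(lookup.resolve("http://example.org/hypothetical.dtd"));
        System.out.println(lookup.resolve("http://example.org/hypothetical.dtd"));
    }
}
```

A real cache would also need expiry and freshness checks; this sketch only shows why a cache removes the need to hand-write a catalog entry for every web resource.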

Another feature that occurred to me more recently is improved support for XML Namespace names. The RDDL approach of assigning a nature and purpose to a Namespace URI can easily be implemented as an XML Catalog extension.

A few weeks ago, I set out to refactor the resolver and add these features. The first fruit of that effort is now available at http://xmlresolver.dev.java.net/.

Feature-wise, the new resolver:

Is backwards compatible with the existing catalog resolver in Apache XML Commons.
Supports automatic caching of resources retrieved from the web, eliminating the need for manual catalog maintenance.

Implementation-wise, the new resolver:

Abandons the complex internal data structures used to represent catalogs. Each catalog is simply loaded as a DOM. This greatly simplifies the code and makes implementing extensions much more practical.
Uses the java.util.logging framework instead of a home-grown logging class.
Supports only OASIS XML Catalogs. It wouldn't be impossible to add support for other catalog formats, but I don't have any immediate plans to do so.
Has a more sensible design with three levels of catalog resolution: a simple, string-based lookup service that interrogates the catalog and determines what mapping, if any, the catalog specifies; a resolver that returns ordinary Java InputStreams; and a resolver that returns XML Source and InputSource objects.
Supports a new NamespaceResolver interface to retrieve resources associated with XML Namespace names.
Represents web resources (and catalog results) with a URI, a MIME type, and a content body.
Is thread-safe so that a single resolver instance can be shared across an entire application.
Uses file-based locking to assure that a single cache can be shared across an entire application or even multiple applications possibly running in different VMs.
Includes a (not quite complete) set of unit tests for catalog lookup results.
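For readers unfamiliar with file-based locking in Java, the following is a rough sketch of the general technique using java.nio's FileChannel and FileLock. It is not taken from the resolver: the class name and the cache.lock and index.txt file names are invented for the example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: the class and file names are illustrative only.
class CacheLockSketch {
    // File locks are held on behalf of the whole JVM, so threads inside one
    // VM are serialized with an ordinary monitor (the synchronized keyword),
    // while the FileLock guards against other processes.
    static synchronized void appendEntry(Path cacheDir, String entry) throws IOException {
        Files.createDirectories(cacheDir);
        Path lockFile = cacheDir.resolve("cache.lock");
        try (FileChannel channel = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) { // blocks until the lock is free
            Files.write(cacheDir.resolve("index.txt"),
                    (entry + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } // leaving the try block releases the lock and closes the channel
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("resolver-lock");
        appendEntry(dir, "http://example.org/a.dtd");
        appendEntry(dir, "http://example.org/b.dtd");
        System.out.println(Files.readAllLines(dir.resolve("index.txt")));
    }
}
```

Whether such a lock is mandatory or merely advisory depends on the operating system, which is one reason the cache directory is best left to the resolver alone.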

Using the resolver

If you've never used a resolver before, simply put the xmlresolver.jar file on your CLASSPATH, instantiate an org.xmlresolver.Resolver object, and use it as the entity resolver or URI resolver in your application.

As a convenience, you can simply instantiate an org.xmlresolver.tools.ResolvingXMLReader. That implementation of an XMLReader will automatically use the resolver.
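If you haven't wired an entity resolver into a parser before, the sketch below shows the standard SAX plumbing. So that it runs without xmlresolver.jar, it uses a small stand-in EntityResolver backed by an in-memory map; that stand-in, the class name, and the example.org system identifier are assumptions for the example. In a real application you would presumably pass the resolver object to setEntityResolver in its place.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the stand-in resolver is NOT org.xmlresolver.Resolver.
class EntityResolverWiringSketch {
    // Maps system identifiers to in-memory replacement text.
    static EntityResolver standIn(Map<String, String> mappings) {
        return (publicId, systemId) -> {
            String body = mappings.get(systemId);
            if (body == null) {
                return null; // null tells the parser to fetch the resource itself
            }
            InputSource source = new InputSource(new StringReader(body));
            source.setSystemId(systemId);
            return source;
        };
    }

    // Parses the document with the given resolver and returns its text content.
    static String parseText(String xml, EntityResolver resolver)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver); // this is the only resolver-specific line
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String dtdId = "http://example.org/hypothetical/doc.dtd";
        String xml = "<!DOCTYPE doc SYSTEM \"" + dtdId + "\"><doc>&greeting;</doc>";
        EntityResolver resolver =
                standIn(Collections.singletonMap(dtdId, "<!ENTITY greeting \"hello\">"));
        // The entity declaration comes from the stand-in DTD, not from the web.
        System.out.println(parseText(xml, resolver));
    }
}
```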

(The next release will probably include more convenience features including some code to plug into the standard JAXP factory mechanism making it trivial to add the resolver to all parsers used by any application.)

Upgrading to the new code

If you've been using the XML Commons resolver in your application, the new code is designed to be backwards compatible. Simply put the xmlresolver.jar file on your CLASSPATH and use org.xmlresolver.* instead of org.apache.xml.resolver.*.

For example, if you've been running:

java … com.saxonica.Transform \
        -x org.apache.xml.resolver.tools.ResolvingXMLReader \
        -y org.apache.xml.resolver.tools.ResolvingXMLReader \
        -r org.apache.xml.resolver.Resolver \
        …

You can run this instead:

java … com.saxonica.Transform \
        -x org.xmlresolver.tools.ResolvingXMLReader \
        -y org.xmlresolver.tools.ResolvingXMLReader \
        -r org.xmlresolver.Resolver \
        …

Other Java tools may have similar options.

Enabling the cache

To use the new caching feature, you have to enable it explicitly. Caching requires write access to a cache directory, which you must identify through a Catalog property. Note that this directory should be under the exclusive control of the resolver.

The format of the caching control file is described briefly in the JavaDoc.

Resolving XML Namespaces

The resolver proposes a new interface, NamespaceResolver, with a single method, resolveNamespace. The method takes three parameters: an absolute Namespace URI, a nature, and a purpose. The method returns a resource associated with the namespace URI that has the specified nature and purpose. If no matching resource can be found, the document at the namespace URI is returned.

The catalog can identify the nature and purpose of a URI with extension attributes:

<uri xmlns:r="http://www.rddl.org/"
     name="http://www.w3.org/2001/XMLSchema"
     r:nature="http://www.w3.org/2001/XMLSchema"
     r:purpose="http://www.rddl.org/purposes#schema-validation"
     uri="/cache/xrc1234.xsd"/>

If there isn't a match, the resolver attempts to parse the namespace document as an RDDL document (1.0 for the moment, though I plan to support more) and find the match that way.
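The catalog half of that lookup can be sketched with nothing more than the DOM. The code below is an illustration under several assumptions, not the resolver's implementation: the class name is invented, the method returns the mapped URI as a plain string rather than a resource object, the uri entry is wrapped in a catalog element using the OASIS catalog namespace so that it parses on its own, and the RDDL fallback is left out (an unmatched request simply gets the namespace URI back).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: only the method name and its three parameters
// (namespace URI, nature, purpose) are taken from the article.
class NamespaceCatalogSketch {
    static final String CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog";
    static final String RDDL_NS = "http://www.rddl.org/";

    private final Document catalog; // the catalog, loaded as a DOM

    NamespaceCatalogSketch(String catalogXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the r:nature and r:purpose lookups
        catalog = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(catalogXml)));
    }

    // Returns the mapped URI, or the namespace URI itself when nothing matches.
    String resolveNamespace(String namespaceUri, String nature, String purpose) {
        NodeList entries = catalog.getElementsByTagNameNS(CATALOG_NS, "uri");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            if (namespaceUri.equals(entry.getAttribute("name"))
                    && nature.equals(entry.getAttributeNS(RDDL_NS, "nature"))
                    && purpose.equals(entry.getAttributeNS(RDDL_NS, "purpose"))) {
                return entry.getAttribute("uri");
            }
        }
        return namespaceUri; // the RDDL fallback is deliberately omitted here
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog xmlns='" + CATALOG_NS + "'>"
                + "<uri xmlns:r='http://www.rddl.org/'"
                + " name='http://www.w3.org/2001/XMLSchema'"
                + " r:nature='http://www.w3.org/2001/XMLSchema'"
                + " r:purpose='http://www.rddl.org/purposes#schema-validation'"
                + " uri='/cache/xrc1234.xsd'/>"
                + "</catalog>";
        NamespaceCatalogSketch sketch = new NamespaceCatalogSketch(xml);
        System.out.println(sketch.resolveNamespace(
                "http://www.w3.org/2001/XMLSchema",
                "http://www.w3.org/2001/XMLSchema",
                "http://www.rddl.org/purposes#schema-validation"));
    }
}
```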

Disclaimer

I've been running this code for a week or two “in production” on my laptop. It seems to work for me, but I wouldn't put it into production use anywhere else without careful consideration. It's quite likely that some of the work to make it thread-safe is incomplete. It's not documented very well yet. In short: it's beta. Your mileage may vary. It may not work. It may work badly. It's not my fault.

Share and enjoy.

Comments

That's fantastic news. I've had implementing a caching resolver on my TODO list for a while, so it's nice to see that someone else has done it! :)

One item that was on my feature list was to have the cache interact with HTTP headers, e.g. making use of ETags and Last-Modified where available. Is this something you've considered?

Cool, thanks! Do you think that there is a chance of integrating this code directly into the JDK, so there will be no need to install and configure it manually for each Java application that deals with XML?

The caching seems a great addition, Norm.

You say 'Caching requires write-access to a cache directory which you must identify through a Catalog property.' Is that a relative path to the directory from the properties file (or the catalog file), and can we ignore the write permissions for Windows please?

regards DaveP

Leigh, the cache does check Last-Modified headers when it retrieves an HTTP URI from the cache. I'll see about doing ETag checking as well.

Jirka, I think that's a possibility. :-)

Dave, I'm not sure I've tested a relative path for the cache; I suppose making it relative to the property file (if there is one) is as good an idea as any.

There's no way to "ignore" the write permissions: either the application can write to the directory that you specify for the cache or it can't. If it can't, uhm, caching won't work :-)