I've been working on a reimplementation of my XML Catalog-based entity/URI resolver. It has a more sensible design, includes a caching feature, and supports a new API for dealing with XML Namespace names.

The first substantial body of Java code released into the wild with my name on it was the entity resolver code that eventually made its way into the Apache XML Commons project.

The origins of that code stretch back at least six years, maybe closer to ten. Its design seems…“odd”, at best, from a modern perspective but it's been in use for a long time and a lot of people use it every day. Some of the features that I implemented in that code eventually made it into the XML Catalogs V1.1 Standard.

One of the complaints raised against the catalog-based approach to URI management is that the end user has to write and maintain the catalog. On some systems, the catalog is automatically updated for packages that are installed locally, but that doesn't address the issue of random web resources accessed by the user.

If the resolver doesn't find an entry in the catalog for the resource requested, it goes out to the web and fetches it, so there's a straightforward and obvious way to attack the manual maintenance issue: have the resolver cache the resources that it fetches. I've been meaning to implement caching for years.

Another feature that occurred to me more recently is improved support for XML Namespace names. The RDDL approach of assigning a nature and purpose to a Namespace URI can easily be implemented as an XML Catalog extension.

A few weeks ago, I set out to refactor the resolver and add these features. The first fruit of that effort is now available at http://xmlresolver.dev.java.net/.

Feature-wise, the new resolver:

Implementation-wise, the new resolver:

Using the resolver 

If you've never used a resolver before, simply put the xmlresolver.jar file on your CLASSPATH, instantiate an org.xmlresolver.Resolver object, and use it as the entity resolver or URI resolver in your application.

As a convenience, you can simply instantiate an org.xmlresolver.tools.ResolvingXMLReader. That implementation of an XMLReader will automatically use the resolver.
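To make the wiring concrete, here is a minimal sketch of installing an entity resolver on a standard SAX parser. So that it runs without the jar, it uses a stand-in class (LocalResolver, invented for this example) that answers every request with a tiny inline DTD. With xmlresolver.jar on the CLASSPATH, an org.xmlresolver.Resolver instance would presumably be passed to setEntityResolver in its place.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ResolverDemo {
    // Hypothetical stand-in for org.xmlresolver.Resolver: it records each
    // system identifier it is asked for and returns a local copy instead
    // of letting the parser go out to the web.
    static class LocalResolver implements EntityResolver {
        final List<String> requested = new ArrayList<>();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            requested.add(systemId);
            InputSource source = new InputSource(
                    new StringReader("<!ELEMENT doc (#PCDATA)>"));
            source.setSystemId(systemId);
            return source;
        }
    }

    // Parses the document with the resolver installed and returns the
    // system identifiers the parser asked the resolver about.
    static List<String> parseWith(LocalResolver resolver, String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver); // the real resolver would go here
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return resolver.requested;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc SYSTEM \"http://example.com/sample.dtd\">"
                + "<doc>hi</doc>";
        System.out.println(parseWith(new LocalResolver(), xml));
    }
}
```

The same pattern applies to XSLT processors, which accept a javax.xml.transform.URIResolver through setURIResolver.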

(The next release will probably include more convenience features, including some code to plug into the standard JAXP factory mechanism, making it trivial to add the resolver to all parsers used by any application.)

Upgrading to the new code 

If you've been using the XML Commons resolver in your application, the new code is designed to be backwards compatible. Simply put the xmlresolver.jar file on your CLASSPATH and use org.xmlresolver.* instead of org.apache.xml.resolver.*.

For example, if you've been running:

java … com.saxonica.Transform \
        -x org.apache.xml.resolver.tools.ResolvingXMLReader \
        -y org.apache.xml.resolver.tools.ResolvingXMLReader \
        -r org.apache.xml.resolver.Resolver \
        …

You can run this instead:

java … com.saxonica.Transform \
        -x org.xmlresolver.tools.ResolvingXMLReader \
        -y org.xmlresolver.tools.ResolvingXMLReader \
        -r org.xmlresolver.Resolver \
        …

Other Java tools may have similar options.

Enabling the cache 

To use the new caching feature, you have to explicitly enable it. Caching requires write access to a cache directory, which you must identify through a Catalog property. Note that this directory should be under the exclusive control of the resolver.

The format of the caching control file is described briefly in the JavaDoc.
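The general idea behind a fetch-and-cache resolver can be sketched in a few lines. This is not the resolver's actual implementation: the class name CacheSketch, the hashed file names, and the pluggable fetcher function are all assumptions made for illustration, and the sketch ignores expiry and the caching control file entirely. It does show why the cache directory has to be writable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.function.Function;

public class CacheSketch {
    private final Path cacheDir;
    private final Function<String, byte[]> fetcher; // stands in for a web fetch
    int fetchCount = 0;                             // how often we went to the "web"

    CacheSketch(Path cacheDir, Function<String, byte[]> fetcher) throws IOException {
        Files.createDirectories(cacheDir);
        if (!Files.isWritable(cacheDir)) {
            // Without write access there is nowhere to store fetched resources.
            throw new IOException("Cache directory is not writable: " + cacheDir);
        }
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    // Returns the cached copy if there is one; otherwise fetches the
    // resource, stores it in the cache directory, and returns it.
    byte[] resolve(String uri) throws Exception {
        Path entry = cacheDir.resolve(fileNameFor(uri));
        if (Files.exists(entry)) {
            return Files.readAllBytes(entry);
        }
        fetchCount++;
        byte[] body = fetcher.apply(uri);
        Files.write(entry, body);
        return body;
    }

    // Derives a file-system-safe name from the URI (hex-encoded SHA-256).
    static String fileNameFor(String uri) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder name = new StringBuilder();
        for (byte b : digest.digest(uri.getBytes(StandardCharsets.UTF_8))) {
            name.append(String.format("%02x", b));
        }
        return name.toString();
    }
}
```

Because the sketch treats every file in the directory as a cache entry, anything else written there could be mistaken for one, which is one reason to keep the directory under the resolver's exclusive control.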

Resolving XML Namespaces 

The resolver proposes a new interface, NamespaceResolver, with a single method, resolveNamespace. The method takes three parameters: an absolute Namespace URI, a nature, and a purpose. The method returns a resource associated with the namespace URI that has the specified nature and purpose. If no matching resource can be found, the document at the namespace URI is returned.

The catalog can identify the nature and purpose of a URI with extension attributes:

<uri xmlns:r="http://www.rddl.org/"
     name="http://www.w3.org/2001/XMLSchema"
     r:nature="http://www.w3.org/2001/XMLSchema"
     r:purpose="http://www.rddl.org/purposes#schema-validation"
     uri="/cache/xrc1234.xsd"/>

If there isn't a match, the resolver attempts to parse the namespace document as an RDDL document (1.0 for the moment, though I plan to support more) and find the match that way.
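The catalog lookup described above can be sketched as follows. Only the interface name, the method name, and the three parameters come from the description; the String return type (a location rather than a loaded resource), the Entry class, and CatalogNamespaceResolver are assumptions for illustration. The sketch also skips the RDDL parsing step and falls straight back to the namespace URI.

```java
import java.util.ArrayList;
import java.util.List;

public class NamespaceSketch {
    // Assumed shape: the return type here is a guess, chosen to keep the
    // example self-contained.
    interface NamespaceResolver {
        String resolveNamespace(String namespaceUri, String nature, String purpose);
    }

    // One catalog <uri> entry carrying the r:nature and r:purpose attributes.
    static class Entry {
        final String name, nature, purpose, uri;

        Entry(String name, String nature, String purpose, String uri) {
            this.name = name;
            this.nature = nature;
            this.purpose = purpose;
            this.uri = uri;
        }
    }

    static class CatalogNamespaceResolver implements NamespaceResolver {
        private final List<Entry> entries = new ArrayList<>();

        void add(Entry entry) {
            entries.add(entry);
        }

        @Override
        public String resolveNamespace(String namespaceUri, String nature, String purpose) {
            for (Entry e : entries) {
                if (e.name.equals(namespaceUri)
                        && e.nature.equals(nature)
                        && e.purpose.equals(purpose)) {
                    return e.uri; // catalog match: use the local resource
                }
            }
            // No match: fall back to the namespace document itself.
            return namespaceUri;
        }
    }

    public static void main(String[] args) {
        CatalogNamespaceResolver resolver = new CatalogNamespaceResolver();
        resolver.add(new Entry(
                "http://www.w3.org/2001/XMLSchema",
                "http://www.w3.org/2001/XMLSchema",
                "http://www.rddl.org/purposes#schema-validation",
                "/cache/xrc1234.xsd"));
        System.out.println(resolver.resolveNamespace(
                "http://www.w3.org/2001/XMLSchema",
                "http://www.w3.org/2001/XMLSchema",
                "http://www.rddl.org/purposes#schema-validation"));
    }
}
```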

Disclaimer 

I've been running this code for a week or two “in production” on my laptop. It seems to work for me, but I wouldn't put it into production use anywhere else without careful consideration. It's quite likely that some of the work to make it thread-safe is incomplete. It's not documented very well yet. In short: it's beta. Your mileage may vary. It may not work. It may work badly. It's not my fault.

Share and enjoy.

Comments:

That's fantastic news. I've had implementing a caching resolver on my TODO list for a while, so it's nice to see that someone else has done it! :)

One item that was on my feature list was to have the cache interact with HTTP headers, e.g. making use of ETags and Last-Modified where available. Is this something you've considered?

Posted by Leigh Dodds on 07 Feb 2007 @ 09:18a UTC

Cool, thanks! Do you think that there is a chance of integrating this code directly into the JDK, so there will be no need to install and configure it manually for each Java application that deals with XML?

Posted by Jirka Kosek on 07 Feb 2007 @ 12:22p UTC

The caching seems a great addition, Norm.

You say 'Caching requires write-access to a cache directory which you must identify through a Catalog property.' Is that a relative path to the directory from the properties file (or the catalog file), and can we ignore the write permissions for Windows please?

regards DaveP

Posted by Dave Pawson on 07 Feb 2007 @ 12:39p UTC

Leigh, the cache does check Last-Modified headers when it retrieves an HTTP URI from the cache. I'll see about doing ETag checking as well.

Jirka, I think that's a possibility. :-)

Posted by Norman Walsh on 07 Feb 2007 @ 12:46p UTC

Dave, I'm not sure I've tested a relative path for the cache; I suppose making it relative to the property file (if there is one) is as good an idea as any.

There's no way to "ignore" the write permissions: either the application can write to the directory that you specify for the cache or it can't. If it can't, uhm, caching won't work :-)

Posted by Norman Walsh on 07 Feb 2007 @ 01:03p UTC