Tread lightly

Volume 10, Issue 89; 07 Sep 2007; last modified 08 Oct 2010

Take advantage of the catalog resolver built into GlassFish to treat your neighbors more gently and maybe improve performance.

By its very nature, the web encourages dereferencing URIs. This is a good thing: it's how we surf the web in our browser of choice, and it's how web applications take advantage of distributed resources.

The more popular a resource is, the more likely it is to get dereferenced. This, too, is usually a good thing. Lots of folks keep track of the number of “hits” they get (for weblog postings, press releases, product downloads, etc.). More is better.

But it is possible to get too much of a good thing, especially where web applications are concerned. A popular web application can hit a resource thousands of times an hour (maybe more), faster than even the most caffeine-fueled web surfer.

The W3C, for example, gets an astonishing number of hits for DTDs (especially the HTML and XHTML DTDs), schemas, and namespace documents. So many, in fact, that sometimes it looks like a denial-of-service attack. And sometimes that'll get you locked out completely for several days.

Scalable access to web resources is not a simple problem. It can be approached in a number of ways, at a number of different levels in the web architecture stack. The W3C Technical Architecture Group has agreed to investigate the issue.

In the meantime, if you're writing GlassFish servlets or other applications that perform XML processing, you can take advantage of the XML Catalog resolver built into GlassFish to directly reduce the burden your applications are creating. (Never heard of XML Catalogs? I wrote some background information a while back.)

The secret is to make local copies of static resources and then tell the catalog resolver to use them instead. I'm going to use the XHTML DTD for my example, but it applies to any web resource that your application might be accessing.

The first step is to make local copies of the representations you need and then create an XML Catalog for them. In this case, I grabbed the xhtml1-transitional.dtd file and the entities it relies on and built this catalog:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
         prefer="public">

  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="xhtml1-transitional.dtd"/>

  <public publicId="-//W3C//ENTITIES Latin 1" uri="xhtml-lat1.ent"/>
  <public publicId="-//W3C//ENTITIES Symbols" uri="xhtml-symbol.ent"/>
  <public publicId="-//W3C//ENTITIES Special" uri="xhtml-special.ent"/>
</catalog>
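
To see what a resolver does with a catalog like this, here's a minimal sketch. It rests on a few assumptions: it uses the javax.xml.catalog API that ships with Java 9 and later (not the Apache resolver bundled with GlassFish), it writes a trimmed copy of the catalog to a scratch directory, and the class and method names are made up for illustration. It only asks where the XHTML public identifier points; it doesn't read the DTD.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not part of GlassFish or the Apache resolver.
public class CatalogLookupSketch {
    // The catalog from the article, trimmed to the DTD entry.
    static final String CATALOG =
        "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\" prefer=\"public\">\n"
      + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
      + "          uri=\"xhtml1-transitional.dtd\"/>\n"
      + "</catalog>\n";

    /** Writes the catalog to a scratch directory and returns the URI the DTD resolves to. */
    static String resolveXhtmlDtd() {
        try {
            Path dir = Files.createTempDirectory("catalog-sketch");
            Path catalog = dir.resolve("catalog.xml");
            Files.write(catalog, CATALOG.getBytes(StandardCharsets.UTF_8));

            // Assumption: the JDK's own catalog resolver stands in for the Apache one.
            CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());

            // The relative uri in the catalog is resolved against the catalog's location.
            InputSource source = resolver.resolveEntity(
                "-//W3C//DTD XHTML 1.0 Transitional//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
            return source.getSystemId();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints a file: URI that sits next to the scratch catalog.
        System.out.println(resolveXhtmlDtd());
    }
}
```

The point to notice is that the document still names the W3C URI; only the resolver knows that the bytes come from the local copy.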

The next step relies on the fact that GlassFish ships with the entity resolver that was developed as part of the Apache XML Commons project. All you have to do is make sure that it's used as the entity resolver or URI resolver by your application.

In the interest of choosing a simple, common example, let's look at how we can use this resolver if we're parsing a document with SAX.

First, make sure that you're importing the resolver:

import com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver;

I know this ships with GlassFish, but if you're uncomfortable with the slightly dubious practice of relying on “private” classes, you can install the standard distribution yourself and import org.apache.xml.resolver.tools.CatalogResolver instead.

(And if you're using NetBeans, you can skip right over the import step: NetBeans will suggest the import when you need it and put it in the right place.)

Next, make yourself a catalog resolver:

CatalogResolver resolver = new CatalogResolver();

And finally, make sure it gets used. I set up my own SAX handler and used it in there:

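// Inner class: "resolver" is the CatalogResolver field created above.
private class MyHandler extends DefaultHandler {
    // Consult the catalog first; a null result lets the parser fetch the entity itself.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
        return resolver.resolveEntity(publicId, systemId);
    }
}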

The complete class file is available if you want to see all the code. I hacked the “hello2” example from Chapter 2 of The Java EE 5 Tutorial. It treats the “name” you give it as a URI and attempts to parse it.
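
For a self-contained view of the same wiring, here's a minimal sketch that runs without GlassFish or the network. It makes several assumptions: it swaps in the javax.xml.catalog resolver from Java 9 and later for the Apache one, the "DTD" is a one-line stand-in that declares a made-up greeting entity, and the class and method names are hypothetical. If the parse returns the greeting text, the DTD came from the local file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; a sketch under the assumptions stated above.
public class LocalDtdParseSketch {
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    /** Collects character data and sends every external entity through the catalog. */
    static class MyHandler extends DefaultHandler {
        private final CatalogResolver resolver;
        final StringBuilder text = new StringBuilder();

        MyHandler(CatalogResolver resolver) {
            this.resolver = resolver;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            return resolver.resolveEntity(publicId, systemId);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses a tiny document whose DOCTYPE names the W3C DTD, served from a local stand-in. */
    static String parseWithLocalDtd() {
        try {
            Path dir = Files.createTempDirectory("resolver-sketch");
            // Stand-in for the real DTD: declares one made-up entity so the effect is visible.
            Files.write(dir.resolve("xhtml1-transitional.dtd"),
                "<!ENTITY greeting \"hello from the local copy\">".getBytes(StandardCharsets.UTF_8));
            Path catalog = dir.resolve("catalog.xml");
            Files.write(catalog, (
                "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\" prefer=\"public\">"
              + "<public publicId=\"" + PUBLIC_ID + "\" uri=\"xhtml1-transitional.dtd\"/>"
              + "</catalog>").getBytes(StandardCharsets.UTF_8));

            CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());
            MyHandler handler = new MyHandler(resolver);

            String doc = "<!DOCTYPE html PUBLIC \"" + PUBLIC_ID + "\" \"" + SYSTEM_ID + "\">"
                       + "<html><body><p>&greeting;</p></body></html>";
            // parse(InputSource, DefaultHandler) installs the handler as the entity resolver too.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), handler);
            return handler.text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithLocalDtd());
    }
}
```

Passing the resolver into the handler's constructor, rather than reaching for an outer field, is just a convenience for keeping the sketch in one file.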

In the GlassFish context, there's one more step. You have to configure the server so that it sets the xml.catalog.files system property to point to your catalog. (There are other ways of getting to the right catalog, but this is the simplest.)

I added the system property to the domain configuration file:

<system-property name="xml.catalog.files" value="file:///tmp/catalog.xml"/>

Of course, /tmp/ is a silly place to put the file, but it was enough for this demonstration.
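
If the property is missing or mistyped, the resolver may find no catalog and the parser may quietly go back to the network, so it can be worth logging what the application actually sees at startup. Here's a minimal sketch of such a check. It assumes the value is a semicolon-delimited list, as the Apache resolver documentation describes for xml.catalog.files; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of GlassFish or the Apache resolver.
public class CatalogPropertyCheck {
    /** Splits a semicolon-delimited catalog list into its non-empty, trimmed entries. */
    static List<String> catalogEntries(String propertyValue) {
        List<String> entries = new ArrayList<>();
        if (propertyValue == null) {
            return entries;
        }
        for (String entry : propertyValue.split(";")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> entries = catalogEntries(System.getProperty("xml.catalog.files"));
        if (entries.isEmpty()) {
            System.err.println("xml.catalog.files is not set; entities will not be resolved locally");
        }
        for (String entry : entries) {
            System.out.println("catalog: " + entry);
        }
    }
}
```

Run with -Dxml.catalog.files=file:///tmp/catalog.xml, it lists that one catalog; run without the property, it prints the warning.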

Not only do you get the benefit of being a better net citizen by using a resolver to reduce your burden on your net neighbors, but you may see a performance improvement as well. No matter how good your server bandwidth is, it's still slower to hit the net than your local file system.

Next time, we'll look at using some slightly more bleeding-edge code to avoid the task of constructing the catalog by hand.

Comments

Nice tip; do you have any stats on the volume of requests for those w3c schemas? I'm trying to get a picture of an "astonishing number" :)

We have had some days of around 90 million hits. It has been lower lately, thanks to a few services and to some of our automated blocking at both the HTTP and TCP level.

Hi Norm,

AIUI, the argument for using a catalogue over an HTTP cache is that it assures that the documents are available, even when the network is segmented at the moment the cache needs to fetch something.

Putting aside the arguments that you can pre-seed a cache, and have it intelligently fail over, which I think address this concern about caching adequately in all but the most pathological case, why is a catalogue the appropriate solution in this use case?

I.e., if people are already going to the network to get their dependencies, and you just want to make that more efficient and less of a burden on the origin server, a cache is just as efficient, and far less intrusive and less failure prone; you don't have to do all of the mucking about, and you don't have to remember all of the different URIs you're using -- with the possibility of missing a few, or having more added later.

Hi Mark,

Yes, caching has a lot of advantages. And maybe anyone competent to set up a GlassFish application could install and manage one.

The advantage of catalogs (or one of them) is that they're dead simple to install and use. Many people with insufficient skill or experience or ability to set up their own local proxy cache have had great success with catalogs.

I've tried to bring the benefits of caching to the ease of catalogs with a caching catalog resolver. I'll write that up in the GlassFish context "real soon now".

Another advantage of catalogs, from my perspective, is that it's easy to use them to lie. Send me a document that claims to need a DTD that has a system identifier that ends with "/docbook.dtd" and I point it to my local copy of DocBook V4.5. I don't care what version you claimed to need or what else you said about its identifier. Similarly, I use them to tinker with entity sets for online and print publishing too.

Hello all,

I was trying to use XML catalog to try to solve my problem with intensive Internet shutdown. When I call the method

org.jdom.Document document = sxb.build(list[i]);

where sxb = new SAXBuilder()

I have the following error: 504 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent

How do I use your system to solve that?

I've done CatalogResolver resolver = new CatalogResolver();

But how do I make the build method use it?

Thx