Using XML Catalogs and XProc together

Volume 12, Issue 24; 22 Jul 2009; last modified 08 Oct 2010

XML Calabash, my implementation of XProc, is my go-to tool these days for manipulating XML documents. Adding XML Catalogs into the mix just makes it sweeter.

Recently, I was presented with several hundred books comprising many thousands of chapters. My goal: load them into the server so that they could become part of a larger application. Easy peasy.

Two snags: all the chapters contained references to named entities declared in an external subset, and none of the metadata in the files was actually reliable.

Still pretty straightforward. Parse the documents to expand the entity references, do a little cleanup, and push them into the database. The details of the pipeline aren't that important; the bit I want to highlight today is the parsing.

Everything remained pretty easy until I discovered that there were a half-dozen or more flavors of DTD in use across this corpus. And naturally, every external subset was referenced only by a system identifier with some random, absolute path:

<!DOCTYPE chapter SYSTEM "/path/to/dtd10.dtd">

Where “10” might be “10”, “11”, “21”, “25”, etc., for a set of versions large enough to go well beyond my limit for tedium.

Luckily, all of them included a standard suite of ISO entities and, as far as I could easily tell, those were the only entities the documents ever referenced.
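
To make the problem concrete, here is a made-up chapter of the kind described above. Only the DOCTYPE line reflects the corpus as described; the element content and the particular entities (eacute from the ISO Latin 1 set, hellip from the ISO Publishing set) are assumptions chosen for illustration:

<!DOCTYPE chapter SYSTEM "/path/to/dtd10.dtd">
<chapter>
  <!-- Hypothetical content; the real chapters are not shown in this post -->
  <title>Caf&eacute; society</title>
  <para>To be continued&hellip;</para>
</chapter>

If the parser can't read the external subset, it has no declarations for eacute or hellip and so has nothing to expand them to.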

XML Catalogs to the rescue.

First, grab a recent version of the DTD and stick it somewhere local, then construct the following catalog:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <systemSuffix systemIdSuffix=".dtd" uri="local/dtd21.dtd"/>
</catalog>
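
For comparison, here is a sketch of the catalog that the suffix match avoids: one system entry per distinct identifier. The two identifiers shown are placeholders in the same spirit as the DOCTYPE above (the real paths were arbitrary), and both map to the same local DTD:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- One entry for every absolute path that turns up in the corpus -->
  <system systemId="/path/to/dtd10.dtd" uri="local/dtd21.dtd"/>
  <system systemId="/path/to/dtd11.dtd" uri="local/dtd21.dtd"/>
  <!-- ...and so on for each remaining version and path -->
</catalog>

The systemSuffix entry (an XML Catalogs 1.1 feature) collapses all of these into a single line, at the cost of matching any system identifier that happens to end in “.dtd”.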

Next, tell XML Calabash to use catalogs. You can do this from the command line, but I set it up in my configuration file, ~/.calabash:

<cc:xproc-config xmlns:cc="http://xmlcalabash.com/ns/configuration">
  <cc:schema-aware>false</cc:schema-aware>
  <cc:log-level level="warning"/>
  <cc:serialization omit-xml-declaration="false"/>
  <cc:entity-resolver class-name="org.xmlresolver.Resolver"/>
  <cc:uri-resolver class-name="org.xmlresolver.Resolver"/>
</cc:xproc-config>

The first few lines just set some defaults I like; it's the last two that are relevant here. They tell XML Calabash to use my XML Resolver catalog implementation for entity and URI resolution.

Now my pipeline simply does The Right Thing™.

When the parser attempts to load the external subset, the catalog resolver returns the local DTD (because all the system identifiers end with “.dtd”). The p:load step doesn't do validation by default, so the fact that some of the files aren't valid according to the particular version of the DTD that I have locally doesn't matter. The entities get expanded correctly. (If any of the documents had relied on other entities only present in a particular version of the DTD, that would have been an error, so I know I didn't miss any.) I do a couple of lightweight transformations on the resulting document and shove it into the database FTW!
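
A minimal sketch of just the parsing part of such a pipeline, in XProc 1.0 syntax. This is not the actual pipeline: the href option name is an assumption, the cleanup and database steps are left out, and dtd-validate is spelled out only to make the default visible:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Hypothetical option: the location of one chapter file -->
  <p:option name="href" required="true"/>
  <p:output port="result"/>

  <!-- Parse without validating. The parser still reads the external
       subset (resolved through the catalog), so entity references
       are expanded in the document that p:load produces. -->
  <p:load dtd-validate="false">
    <p:with-option name="href" select="$href"/>
  </p:load>
</p:declare-step>

Nothing in the pipeline mentions the catalog; the redirection happens entirely in the resolver configured above.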

Nothing earth-shattering here, and not the only way to solve the problem, but one that looks like a nail to my particular hammer of choice at the moment.

Comments

One feature request is obvious: the ability to embed a catalog in the pipeline (in p:pipeinfo, for example). That'd either be quite hard or would require a change to the XML Resolver API. It's on my list, but not a real high priority.

—Posted by Norman Walsh on 22 Jul 2009 @ 09:10 UTC #

'Your' XML resolver catalog implementation Norm?

What's new about it please? Differences to the Apache one?

I'd read the documentation.... https://xmlresolver.dev.java.net/servlets/ProjectDocumentList but...

DaveP

—Posted by Dave Pawson on 23 Jul 2009 @ 06:49 UTC #

Nice. Catalog support in Calumet (yes, I admit it: we don't support XML catalogs yet and it has bitten us a couple of times already) is definitely on my immediate TO-DO list. I hope to get to it soon.

—Posted by Vojtěch Toman on 23 Jul 2009 @ 09:02 UTC #
—Posted by Norman Walsh on 23 Jul 2009 @ 11:17 UTC #