Using XML Catalogs and XProc together
XML Calabash, my implementation of XProc, is my go-to tool these days for manipulating XML documents. Adding XML Catalogs into the mix just makes it sweeter.
Recently, I was presented with several hundred books comprising many thousands of chapters. My goal: load them into the server so that they could become part of a larger application. Easy peasy.
Two snags: all the chapters contained references to named entities declared in an external subset and none of the metadata in each file was actually reliable.
Still pretty straightforward. Parse the documents to expand the entity references, do a little cleanup, and push them into the database. The details of the pipeline aren't that important; the bit I want to highlight today is the parsing.
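In outline, the pipeline is just a load, a transform, and a store. A minimal XProc 1.0 sketch (the file names and the stylesheet are placeholders, not my actual pipeline):

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Parse the chapter; entity references declared in the
       external subset are expanded at this point. -->
  <p:load href="chapter.xml"/>

  <!-- A little cleanup; cleanup.xsl is a placeholder. -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="cleanup.xsl"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

  <!-- Store the result; actually loading into the database
       is a bit more involved in practice. -->
  <p:store href="clean/chapter.xml"/>
</p:declare-step>
```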
Everything remained pretty easy until I discovered that there were a half-dozen or more flavors of DTD in use across this corpus. And naturally, every external subset was referenced only by a system identifier with some random, absolute path:
<!DOCTYPE chapter SYSTEM "/path/to/dtd10.dtd">
Where “10” was “10”, “11”, “21”, “25”, etc. for a set of versions substantial enough to go well beyond my limit for tedium.
Luckily, all of them included a standard suite of ISO entities and, as far as I could easily tell, those were the only entities ever referenced.
XML Catalogs to the rescue.
First, grab a recent version of the DTD and stick it somewhere local, then construct the following catalog:
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <systemSuffix systemIdSuffix=".dtd" uri="local/dtd21.dtd"/>
</catalog>
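The systemSuffix entry matches any system identifier that ends in “.dtd”, whatever absolute path precedes it. The brute-force alternative, one exact system entry per path, shows why the suffix match wins (paths here are illustrative):

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- One entry per version *and* per random absolute path... -->
  <system systemId="/path/to/dtd10.dtd" uri="local/dtd21.dtd"/>
  <system systemId="/path/to/dtd11.dtd" uri="local/dtd21.dtd"/>
  <!-- ...and so on, well past my limit for tedium. -->
</catalog>
```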
Next, tell XML Calabash to use catalogs. You can do this from the command line, but I set it up in my configuration file, ~/.calabash:
<cc:xproc-config xmlns:cc="http://xmlcalabash.com/ns/configuration">
  <cc:schema-aware>false</cc:schema-aware>
  <cc:log-level level="warning"/>
  <cc:serialization omit-xml-declaration="false"/>
  <cc:entity-resolver class-name="org.xmlresolver.Resolver"/>
  <cc:uri-resolver class-name="org.xmlresolver.Resolver"/>
</cc:xproc-config>
The first few lines just set some defaults I like; it's the last two that are relevant here. They tell XML Calabash to use my XML Resolver catalog implementation for both entity and URI resolution.
Now my pipeline simply does The Right Thing™.
When the parser attempts to load the external subset, the catalog resolver returns the local DTD (because all the system identifiers end with “.dtd”). The p:load step doesn't do validation by default, so the fact that some of the files aren't valid according to the particular version of the DTD that I have locally doesn't matter. The entities get expanded correctly. (If any of the documents had relied on other entities only present in a particular version of the DTD, that would have been an error, so I know I didn't miss any.) I do a couple of lightweight transformations on the resulting document and shove it into the database FTW!
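If you did want DTD validation, p:load takes an option for it; the default is what makes this trick work, stated explicitly here for emphasis (the href is a placeholder):

```xml
<!-- dtd-validate defaults to false in XProc 1.0; with it off,
     the DTD is only used to expand entities, not to validate. -->
<p:load href="chapter.xml" dtd-validate="false"/>
```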
Nothing earth shattering here, and not the only way to solve the problem, but one that looks like a nail to my particular hammer of choice at the moment.
Comments
One feature request is obvious: the ability to embed a catalog in the pipeline (in p:pipeinfo, for example). That'd either be quite hard or would require a change to the XML Resolver API. It's on my list, but not a real high priority.
'Your' XML resolver catalog implementation Norm?
What's new about it please? Differences to the Apache one?
I'd read the documentation.... https://xmlresolver.dev.java.net/servlets/ProjectDocumentList but...
DaveP
Nice. Catalog support in Calumet (yes, I admit it: we don't support XML catalogs yet and it has bitten us a couple of times already) is definitely on my immediate TO-DO list. I hope to get to it soon.
Dave, see http://norman.walsh.name/2007/02/06/xmlresolver