<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="5.0" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<info>
    
    
    
    
    
    
    
    
    
    
<title>Using XML Catalogs and XProc together</title><biblioid class="uri">http://norman.walsh.name/2009/07/22/xmlCatalogsandXProc</biblioid>
<volumenum>12</volumenum>
<issuenum>24</issuenum>
<pubdate>2009-07-22T16:15:27-04:00</pubdate>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2009</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>XML Calabash, my implementation of XProc, is my go-to tool these
days for manipulating XML documents. Adding XML Catalogs into the mix
just makes it sweeter.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#Calabash"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XMLCatalogs"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XProc"/>
</info>

<para xml:id="p1">Recently, I was presented with several hundred books
comprising many thousands of chapters. My goal: load them into the
server so that they could become part of a larger application. Easy
peasy.</para>

<para xml:id="p2">Two snags: all the chapters contained references to named entities
declared in an external subset, and none of the metadata in each file
was actually reliable.</para>

<para xml:id="p3">Still pretty straightforward. Parse each document to expand the
entity references, do a little cleanup, and push it into the
database. The details of the pipeline aren't that important; the bit
I want to highlight today is the parsing.</para>

<para xml:id="p4">Everything remained pretty easy until I discovered that there
were a half-dozen or more flavors of DTD in use across this corpus.
And naturally, every external subset was referenced
<emphasis>only</emphasis> by a system identifier with some random,
absolute path:</para>

<programlisting>&lt;!DOCTYPE chapter SYSTEM "/path/to/dtd10.dtd"&gt;</programlisting>

<para xml:id="p5">Where “10” might be “10”, “11”, “21”, “25”, etc., for a set
of versions large enough to go well beyond my limit for tedium.</para>

<para xml:id="p6">Luckily, all of them included a standard suite of ISO entities
and, as far as I could easily tell, those were the only entities the
chapters ever referenced.</para>
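
<para>For illustration, a chapter in such a corpus might look something
like the following sketch. The element names and text are hypothetical;
only the shape matters: a system identifier pointing at some absolute
path, and references to ISO entities such as
<literal>&amp;eacute;</literal> and <literal>&amp;mdash;</literal> that
can only be expanded if the parser finds the external subset.</para>

<programlisting>&lt;!DOCTYPE chapter SYSTEM "/path/to/dtd10.dtd"&gt;
&lt;chapter&gt;
  &lt;title&gt;Caf&amp;eacute; society&lt;/title&gt;
  &lt;para&gt;A hypothetical paragraph &amp;mdash; with ISO entities.&lt;/para&gt;
&lt;/chapter&gt;</programlisting>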

<para xml:id="p7">XML Catalogs to the rescue.</para>

<para xml:id="p8">First, grab a recent version of the DTD and stick it somewhere
local, then construct the following catalog:</para>

<programlisting>&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
  &lt;systemSuffix systemIdSuffix=".dtd" uri="local/dtd21.dtd"/&gt;
&lt;/catalog&gt;</programlisting>
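
<para>To see why the suffix match is convenient, compare what the same
mapping might look like with one <tag>system</tag> entry per system
identifier. This is only a sketch: the two paths shown are stand-ins for
the assorted absolute paths in the corpus, and every distinct path would
need an entry of its own.</para>

<programlisting>&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
  &lt;!-- One entry per exact system identifier; these paths are placeholders --&gt;
  &lt;system systemId="/path/to/dtd10.dtd" uri="local/dtd21.dtd"/&gt;
  &lt;system systemId="/path/to/dtd11.dtd" uri="local/dtd21.dtd"/&gt;
  &lt;!-- ...and so on for every other path that turns up --&gt;
&lt;/catalog&gt;</programlisting>

<para>A single <tag>systemSuffix</tag> entry matches any system
identifier ending in “<literal>.dtd</literal>”, so one line covers them
all.</para>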

<para xml:id="p9">Next, tell <link xlink:href="/2008/projects/calabash">XML Calabash</link>
to use catalogs. You can do this from the command line, but I set it up
in my configuration file, <filename>~/.calabash</filename>:</para>

<programlisting>&lt;cc:xproc-config xmlns:cc="http://xmlcalabash.com/ns/configuration"&gt;
  &lt;cc:schema-aware&gt;false&lt;/cc:schema-aware&gt;
  &lt;cc:log-level level="warning"/&gt;
  &lt;cc:serialization
      omit-xml-declaration="false"/&gt;
  &lt;cc:entity-resolver class-name="org.xmlresolver.Resolver"/&gt; 
  &lt;cc:uri-resolver class-name="org.xmlresolver.Resolver"/&gt;
&lt;/cc:xproc-config&gt;</programlisting>

<para xml:id="p10">The first few lines just set some defaults I like; it's the last
two that are relevant here. I tell <citetitle>XML Calabash</citetitle>
to use my <link xlink:href="http://xmlresolver.org/">XML Resolver</link>
catalog implementation for entity and URI resolution.</para>

<para xml:id="p11">Now my pipeline simply does The Right Thing™.</para>

<para xml:id="p12">When the parser attempts to load the external subset, the
catalog resolver returns the local DTD (because all the system
identifiers end with “<literal>.dtd</literal>”). The <tag>p:load</tag>
step doesn't do validation by default, so the fact that some of the
files aren't valid according to the particular version of the DTD that
I have locally doesn't matter. The entities get expanded correctly.
(If any of the documents had relied on other entities only present in
a particular version of the DTD, that would have been an error, so I
know I didn't miss any.) I do a couple of lightweight transformations
on the resulting document and shove it into the database FTW!</para>
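
<para>The parsing part of such a pipeline can be sketched roughly as
follows. This is a minimal XProc 1.0 example rather than the real
pipeline: the <literal>href</literal> option name and the
<literal>metadata</literal> element removed in the cleanup step are
assumptions made purely for illustration.</para>

<programlisting>&lt;p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"&gt;
  &lt;!-- Hypothetical option: the URI of the chapter to load --&gt;
  &lt;p:option name="href" required="true"/&gt;
  &lt;p:output port="result"/&gt;

  &lt;!-- The parser asks the configured entity resolver for the external
       subset; dtd-validate="false" is the default, shown for emphasis --&gt;
  &lt;p:load dtd-validate="false"&gt;
    &lt;p:with-option name="href" select="$href"/&gt;
  &lt;/p:load&gt;

  &lt;!-- Placeholder cleanup: drop a hypothetical, unreliable metadata element --&gt;
  &lt;p:delete match="chapter/metadata"/&gt;
&lt;/p:declare-step&gt;</programlisting>

<para>Because the entities are expanded during parsing, the steps after
<tag>p:load</tag> see ordinary text and never need to know which DTD a
chapter claimed to use.</para>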

<para xml:id="p13">Nothing earth-shattering here, and not the only way to solve the
problem, but this problem looked like a nail to my particular hammer of
choice at the moment.</para>

</essay>

