<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="5.0" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<info>
    
    
    
    
    
    
    
    
    
    
<title>Implementing XProc, III</title><biblioid class="uri">http://norman.walsh.name/2007/05/13/implXProcIII</biblioid>
<volumenum>10</volumenum>
<issuenum>47</issuenum>
<pubdate>2007-05-13T09:29:33-04:00</pubdate>
<date>$Date: 2007-05-13 15:52:46 -0400 (Sun, 13 May 2007) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2007</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>Part the third, in which we consider looping.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#Java"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XProc"/>
</info>

<para xml:id="p1">This essay is part of a series of essays about implementing an
<wikipedia page="XML_pipeline">XProc</wikipedia> processor.
<citetitle xlink:href="http://www.w3.org/TR/xproc/">XProc: An XML Pipeline
Language</citetitle> is a W3C specification for describing a sequence of operations
to be performed on one or more XML documents. I'm
<link xlink:href="http://xproc.dev.java.net/">implementing XProc</link> as
the specification progresses. Elsewhere you'll find background
<link xlink:href="http://norman.walsh.name/2004/06/20/pipelines">about
pipelines</link> and other essays
<link xlink:href="http://norman.walsh.name/knows/what/xproc">about XProc</link>.</para>

<para xml:id="p2">As a general rule, steps consume their input and produce output. Consider
<link xlink:href="examples/pipe1.xml">this</link> slightly contrived example:</para>

<programlisting>&lt;p:pipeline name="pipeline"
	    xmlns:p="http://www.w3.org/2007/03/xproc"&gt;
&lt;p:input port="source"&gt;
  &lt;p:document href="input.xml"/&gt;
&lt;/p:input&gt;
&lt;p:output port="result"/&gt;

&lt;p:load name="loadStyle"&gt;
  &lt;p:option name="href" value="style.xsl"/&gt;
&lt;/p:load&gt;

&lt;p:xslt&gt;
  &lt;p:input port="source"&gt;
    &lt;p:pipe step="pipeline" port="source"/&gt;
  &lt;/p:input&gt;
  &lt;p:input port="stylesheet"&gt;
    &lt;p:pipe step="loadStyle" port="result"/&gt;
  &lt;/p:input&gt;
&lt;/p:xslt&gt;

&lt;/p:pipeline&gt;</programlisting>

<para xml:id="p3">This pipeline takes an input document on its <literal>source</literal>
port (which defaults to <uri>input.xml</uri> if you don't specify a
<literal>source</literal>),
loads a stylesheet, formats the pipeline's <literal>source</literal> document
with the loaded stylesheet, and returns the result of that XSLT step on
its <literal>result</literal> port.</para>

<para xml:id="p4">Of particular interest is the interaction between the
<literal>p:load</literal> and <literal>p:xslt</literal> steps. The
<literal>p:load</literal> step produces some output and the
<literal>p:xslt</literal> step consumes it. (I've used a load step because I
want the input to come from a pipe in my next example; in practice,
you'd load the stylesheet directly in the <literal>p:xslt</literal>
step.)</para>

<para xml:id="p5">But what happens when we put the <literal>p:xslt</literal>
step in a loop? Consider
<link xlink:href="examples/pipe2.xml">this</link> example:</para>

<programlisting>&lt;?xml version="1.0"?&gt;
&lt;p:pipeline name="pipeline"
	    xmlns:p="http://www.w3.org/2007/03/xproc"&gt;
&lt;p:input port="source"&gt;
  &lt;p:document href="input.xml"/&gt;
&lt;/p:input&gt;
&lt;p:output port="result"/&gt;

&lt;p:load name="loadStyle"&gt;
  &lt;p:option name="href" value="style.xsl"/&gt;
&lt;/p:load&gt;

&lt;p:viewport name="format-sections" match="section"&gt;
  &lt;p:viewport-source&gt;
    &lt;p:pipe step="pipeline" port="source"/&gt;
  &lt;/p:viewport-source&gt;
  &lt;p:output port="result"/&gt;

  &lt;p:xslt&gt;
    &lt;p:input port="stylesheet"&gt;
      &lt;p:pipe step="loadStyle" port="result"/&gt;
    &lt;/p:input&gt;
  &lt;/p:xslt&gt;
&lt;/p:viewport&gt;

&lt;/p:pipeline&gt;</programlisting>

<para xml:id="p6">Now, instead of processing the entire document, we
process it one section at a time: the steps inside this
<literal>p:viewport</literal> get run once for each <tag>section</tag> in the
<literal>source</literal> document. That
means the <literal>p:xslt</literal> step is going to read the
stylesheet several times, once for each section.</para>

<para xml:id="p7">Unfortunately, the <literal>p:load</literal> step is outside the loop. It produced
its output, closed up shop, and went home. Somehow, we have to make the
loaded stylesheet available more than once.</para>

<para xml:id="p8">A really clever implementation might cache the underlying
transformer so that the stylesheet didn't have to be read and compiled
more than once, but let's satisfy ourselves today with just being able
to reread the data.</para>
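
<para>For what it's worth, the caching idea could be sketched with the
standard JAXP <classname>Templates</classname> interface, which holds a
compiled stylesheet that can be reused for many transformations. This is
only a sketch under assumptions: the class name
<classname>CachedStylesheet</classname> and its methods are hypothetical
and are not taken from the implementation described here.</para>

<programlisting language="java"><![CDATA[import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: compile a stylesheet once, reuse it on every iteration. */
final class CachedStylesheet {
    private final Templates templates;

    CachedStylesheet(String stylesheetText) throws TransformerException {
        // Compilation happens here, exactly once.
        templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheetText)));
    }

    /** Runs the already-compiled stylesheet against one input document. */
    String transform(String inputXml) throws TransformerException {
        // A Transformer is cheap to create from a Templates object.
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(inputXml)),
                              new StreamResult(out));
        return out.toString();
    }
}]]></programlisting>

<para>In a loop, a step would construct one
<classname>CachedStylesheet</classname> and call
<methodname>transform</methodname> once per iteration, so the stylesheet
text is parsed and compiled only the first time.</para>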

<para xml:id="p9">Inputs can come from three places: from <tag>p:inline</tag> elements
in the pipeline document, from URIs via <tag>p:document</tag> elements,
or through pipes from other components.</para>

<para xml:id="p10">The hardest problem is dealing with the pipes. I tackled this issue
by making it possible to “reset” a pipe. If the reset feature is enabled
for a pipe, then the pipe keeps a copy of all the events in all the documents
that flow through it. A subsequent “reset” of the pipe tells it to replay
those events from the beginning.</para>

<para xml:id="p11">This has to be managed with care. You can't enable the reset after you
begin writing to the pipe, naturally, and for the time being, it's an error
to attempt to write to the pipe after a reset. I'm not sure that's an absolutely
necessary condition, but I'm imposing it for the moment.</para>
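
<para>The reset behavior and its two restrictions might be sketched
roughly as follows. This is an illustration under assumptions, not the
actual implementation: the class <classname>ResettablePipe</classname>,
its method names, and the use of a generic type parameter in place of
StAX events are all hypothetical.</para>

<programlisting language="java"><![CDATA[import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch of a resettable pipe. E stands in for the event
 * type; every name in this class is illustrative.
 */
final class ResettablePipe<E> {
    private final List<E> buffer = new ArrayList<>();
    private boolean resettable = false;
    private boolean written = false;   // true once the first event is written
    private boolean wasReset = false;  // true once reset() has been called
    private int readPos = 0;

    /** Reset must be enabled before anything is written. */
    void enableReset() {
        if (written) {
            throw new IllegalStateException("cannot enable reset after writing has begun");
        }
        resettable = true;
    }

    void write(E event) {
        if (wasReset) {
            throw new IllegalStateException("cannot write to a pipe after a reset");
        }
        written = true;
        buffer.add(event);
    }

    boolean hasNext() {
        return readPos < buffer.size();
    }

    E read() {
        if (!hasNext()) {
            throw new NoSuchElementException("pipe is empty");
        }
        if (resettable) {
            return buffer.get(readPos++);  // keep the event for later replay
        }
        return buffer.remove(readPos);     // ordinary pipe: the event is consumed
    }

    /** Rewinds the pipe so the buffered events play from the beginning. */
    void reset() {
        if (!resettable) {
            throw new IllegalStateException("reset was not enabled for this pipe");
        }
        wasReset = true;
        readPos = 0;
    }
}]]></programlisting>

<para>With a class like this, a loop would call
<methodname>reset</methodname> on each pipe that crosses into the loop
before every iteration. The steps inside the loop just read as
usual.</para>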

<para xml:id="p12">A <literal>p:inline</literal> document is treated like a pipe: if the reset feature
has been enabled, it buffers the events before it sends them on. For
a <literal>p:document</literal>,
if the reset feature is enabled, the document is read into a buffer and treated
like an inline document.</para>
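
<para>Reading a document into a buffer of StAX events could look
something like the following. It uses the standard
<classname>XMLInputFactory</classname> and
<classname>XMLEventReader</classname> APIs; the class name
<classname>DocumentBuffer</classname> and its
<methodname>replay</methodname> method are hypothetical, chosen only for
this sketch.</para>

<programlisting language="java"><![CDATA[import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical buffer: reads a document once, replays its events many times. */
final class DocumentBuffer {
    private final List<XMLEvent> events = new ArrayList<>();

    DocumentBuffer(Reader source) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(source);
        try {
            // Drain the whole document into memory, one event at a time.
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
        } finally {
            reader.close();
        }
    }

    /** Each iteration over the returned list starts from the first event. */
    List<XMLEvent> replay() {
        return Collections.unmodifiableList(events);
    }
}]]></programlisting>

<para>Every pass over <methodname>replay</methodname> sees the same
events in the same order, which is all a step in a loop needs. The cost
is that the whole document stays in memory for as long as the buffer
does.</para>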

<para xml:id="p13">One practical consequence of this approach is that steps don't have
to care if they're in a loop or not. A step just does what a step does.
That's good. Another consequence is that pipes can wind up buffering
<emphasis>a lot</emphasis> of data. This is especially true in cases where
nested loops are involved. Given that <wikipedia>StAX</wikipedia>
<classname>XMLEvent</classname>s are nowhere near the most
compact representation imaginable for XML documents, I think it may become
necessary to allow pipes to dump buffers out to disk when memory runs low.
But that's a struggle for another day.</para>

</essay>

