Implementing XProc, III
Part the third, in which we consider looping.
This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for specifying a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.
As a general rule, steps consume their input and produce output. Consider this slightly contrived example:
<p:pipeline name="pipeline"
xmlns:p="http://www.w3.org/2007/03/xproc">
<p:input port="source">
<p:document href="input.xml"/>
</p:input>
<p:output port="result"/>
<p:load name="loadStyle">
<p:option name="href" value="style.xsl"/>
</p:load>
<p:xslt>
<p:input port="source">
<p:pipe step="pipeline" port="source"/>
</p:input>
<p:input port="stylesheet">
<p:pipe step="loadStyle" port="result"/>
</p:input>
</p:xslt>
</p:pipeline>
This pipeline takes an input document on its source
port (which defaults to input.xml
if you don't specify a
source
),
loads a stylesheet, formats the pipeline's source
document
with the loaded stylesheet, and returns the result of that XSLT step on
its result
port.
Of particular interest is the interaction between the
p:load
and p:xslt
steps. The
p:load
step produces some output and the
p:xslt
step consumes it. (I've used a load step because I
want the input to come from a pipe in my next example; in practice,
you'd load the stylesheet directly in the p:xslt
step.)
But what happens when we put the p:xslt
step in a loop? Consider
this example:
<?xml version="1.0"?>
<p:pipeline name="pipeline"
xmlns:p="http://www.w3.org/2007/03/xproc">
<p:input port="source">
<p:document href="input.xml"/>
</p:input>
<p:output port="result"/>
<p:load name="loadStyle">
<p:option name="href" value="style.xsl"/>
</p:load>
<p:viewport name="format-sections" match="section">
<p:viewport-source>
<p:pipe step="pipeline" port="source"/>
</p:viewport-source>
<p:output port="result"/>
<p:xslt>
<p:input port="stylesheet">
<p:pipe step="loadStyle" port="result"/>
</p:input>
</p:xslt>
</p:viewport>
</p:pipeline>
Now, instead of processing the entire document, we
process it one section at a time: the steps inside this
p:viewport
get run once for each section
in the
source
document. What that
means is that the p:xslt
step is going to read the
stylesheet several times.
Unfortunately, the p:load
is outside the loop. It produced
its output, closed up shop, and went home. Somehow, we have to make the
loaded stylesheet available more than once.
A really clever implementation might cache the underlying transformer so that the stylesheet didn't have to be read and compiled more than once, but let's satisfy ourselves today with just being able to reread the data.
Inputs can come from three places: from p:inline
elements
in the pipeline document, from URIs via p:document
elements,
or through pipes from other components.
The hardest problem is dealing with the pipes. I tackled this issue by making it possible to “reset” a pipe. If the reset feature is enabled for a pipe then the pipe keeps a copy of all the events in all the documents that flow through it. A subsequent “reset” of the pipe tells it to begin playing from the beginning again.
This has to be managed with care. You can't enable the reset after you begin writing to the pipe, naturally, and for the time being, it's an error to attempt to write to the pipe after a reset. I'm not sure that's an absolutely necessary condition, but I'm imposing it for the moment.
A p:inline
document is treated like a pipe, if the reset feature
has been enabled, it buffers the events before it sends them on. For
a p:document
,
if the reset feature is enabled, the document is read into a buffer and treated
like an inline document.
One practical consequence of this approach are that steps don't have
to care if they're in a loop or not. A step just does what a step does.
That's good. Another consequence is that pipes can wind up buffering
a lot of data. This is especially true in cases where
nested loops are involved. Given that StAX
XMLEvent
s are no where near the most
compact representation imaginable for XML documents, I think it may become
necessary to allow pipes to dump buffers out to disk when memory runs low.
But that's a struggle for another day.
Comments
It's a little hard to understand the limitations without access to the underlying ( hint hint ;), but why not implement some sort of lazy-evaluation for the pipeline elements.
You could construct a graph of the entire pipeline before evaluating any step. In this case, after the graph is constructed, you'd know to construct the view-ports, of the input document, before you loaded the stylesheet. Then, since you know the view-ports, you can register a "listener" for each of them to be fed the results of the pipe when it evaluates the stylesheet.
Thus you'd be able to only read from the pipe only once, and still be able to publish the stylesheet to all of the view-ports to transformation engine.
Then again, this might require the entire n-elemenets of the view-port list to be in memory, as opposed to resetting the pipe n times. It seems like it might be an either/or implementation decision or addition to standard.
There might be a way to hook up everything so that the events just "flow" once, but I can't think of a concrete example.
Thoughts?
Oh, and keep the implementation articles coming!
Thanks for the interest, Scott. I do have the whole graph before I begin. I think there's lots of room for interesting optimizations at that stage, but (1) I know I'll need to think hard about that when I get to trying to implement threads, so I'm not spending a lot of time thinking about it now, and (2) I'm trying to get something complete before worry too much about getting something that's necessarily fast, so I'm picking the easiest solution that will get the job done in every case.