Implementing XProc, III

Volume 10, Issue 47; 13 May 2007; last modified 08 Oct 2010

Part the third, in which we consider looping.

This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for describing a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.

As a general rule, steps consume their input and produce output. Consider this slightly contrived example:

<p:pipeline name="pipeline"
	    xmlns:p="http://www.w3.org/2007/03/xproc">
<p:input port="source">
  <p:document href="input.xml"/>
</p:input>
<p:output port="result"/>

<p:load name="loadStyle">
  <p:option name="href" value="style.xsl"/>
</p:load>

<p:xslt>
  <p:input port="source">
    <p:pipe step="pipeline" port="source"/>
  </p:input>
  <p:input port="stylesheet">
    <p:pipe step="loadStyle" port="result"/>
  </p:input>
</p:xslt>

</p:pipeline>

This pipeline takes an input document on its source port (which defaults to input.xml if you don't specify a source), loads a stylesheet, formats the pipeline's source document with the loaded stylesheet, and returns the result of that XSLT step on its result port.

Of particular interest is the interaction between the p:load and p:xslt steps. The p:load step produces some output and the p:xslt step consumes it. (I've used a load step because I want the input to come from a pipe in my next example; in practice, you'd load the stylesheet directly in the p:xslt step.)

But what happens when we put the p:xslt step in a loop? Consider this example:

<?xml version="1.0"?>
<p:pipeline name="pipeline"
	    xmlns:p="http://www.w3.org/2007/03/xproc">
<p:input port="source">
  <p:document href="input.xml"/>
</p:input>
<p:output port="result"/>

<p:load name="loadStyle">
  <p:option name="href" value="style.xsl"/>
</p:load>

<p:viewport name="format-sections" match="section">
  <p:viewport-source>
    <p:pipe step="pipeline" port="source"/>
  </p:viewport-source>
  <p:output port="result"/>

  <p:xslt>
    <p:input port="stylesheet">
      <p:pipe step="loadStyle" port="result"/>
    </p:input>
  </p:xslt>
</p:viewport>

</p:pipeline>

Now, instead of processing the entire document, we process it one section at a time: the steps inside this p:viewport get run once for each section in the source document. That means the p:xslt step is going to read the stylesheet several times, once for each section.

Unfortunately, the p:load is outside the loop. It produced its output, closed up shop, and went home. Somehow, we have to make the loaded stylesheet available more than once.

A really clever implementation might cache the underlying transformer so that the stylesheet didn't have to be read and compiled more than once, but let's satisfy ourselves today with just being able to reread the data.

Inputs can come from three places: from p:inline elements in the pipeline document, from URIs via p:document elements, or through pipes from other components.

The hardest problem is dealing with the pipes. I tackled this issue by making it possible to “reset” a pipe. If the reset feature is enabled for a pipe then the pipe keeps a copy of all the events in all the documents that flow through it. A subsequent “reset” of the pipe tells it to begin playing from the beginning again.

This has to be managed with care. You can't enable the reset after you begin writing to the pipe, naturally, and for the time being, it's an error to attempt to write to the pipe after a reset. I'm not sure that's an absolutely necessary condition, but I'm imposing it for the moment.
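To make the rules concrete, here is a minimal sketch of a resettable pipe. It is not the code from my implementation: the class name ResettablePipe and its methods are hypothetical, and it is generic over the event type so that it stays self-contained (in practice the events would be StAX XMLEvents). It assumes the two restrictions described above: reset must be enabled before the first write, and writing after a reset is an error.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a pipe that can replay everything written to it. */
class ResettablePipe<E> {
    private final Deque<E> pending = new ArrayDeque<>(); // events not yet read
    private final List<E> history = new ArrayList<>();   // copy kept for replay
    private boolean resettable = false;
    private boolean written = false;
    private boolean wasReset = false;

    /** Turns on buffering; must be called before the first write. */
    void enableReset() {
        if (written) {
            throw new IllegalStateException("cannot enable reset after writing has begun");
        }
        resettable = true;
    }

    /** Appends an event; rejected once the pipe has been reset. */
    void write(E event) {
        if (wasReset) {
            throw new IllegalStateException("cannot write to a pipe after it has been reset");
        }
        written = true;
        pending.add(event);
        if (resettable) {
            history.add(event); // only pay for the copy when replay is possible
        }
    }

    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Consumes the next event; throws NoSuchElementException if none is pending. */
    E read() {
        return pending.remove();
    }

    /** Rewinds the pipe so the next read starts from the first event again. */
    void reset() {
        if (!resettable) {
            throw new IllegalStateException("reset was not enabled for this pipe");
        }
        wasReset = true;
        pending.clear();
        pending.addAll(history);
    }
}
```

In the viewport example, the pipe from loadStyle to the p:xslt step would have reset enabled, and the loop would call reset() before each iteration after the first. A pipe that never has reset enabled keeps no history, so steps outside loops pay nothing extra.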

A p:inline document is treated like a pipe: if the reset feature has been enabled, it buffers the events before it sends them on. For a p:document, if the reset feature is enabled, the document is read into a buffer and treated like an inline document.
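Buffering a document amounts to reading it once into a list of StAX events and replaying that list on demand. The following is a rough sketch under some assumptions: the class BufferedDocument and its methods are hypothetical names, and it parses from a string to stay self-contained, where a real p:document would read from the URI in its href.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical sketch: a document read once into a replayable event buffer. */
class BufferedDocument {
    private final List<XMLEvent> events = new ArrayList<>();

    private BufferedDocument() {}

    /** Parses the XML text once and keeps every event it produces. */
    static BufferedDocument parse(String xml) {
        BufferedDocument doc = new BufferedDocument();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                doc.events.add(reader.nextEvent());
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers in this sketch need no checked-exception handling.
            throw new IllegalArgumentException("could not parse document", e);
        }
        return doc;
    }

    /** Returns a fresh read-only iterator over the buffered events. */
    Iterator<XMLEvent> replay() {
        return Collections.unmodifiableList(events).iterator();
    }

    /** Example consumer: walks one replay and counts the start tags. */
    int countStartElements() {
        int count = 0;
        for (Iterator<XMLEvent> it = replay(); it.hasNext();) {
            if (it.next().isStartElement()) {
                count++;
            }
        }
        return count;
    }
}
```

Each call to replay() starts from the first event again, so a step inside a loop can read the same document on every iteration without going back to the original source.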

One practical consequence of this approach is that steps don't have to care if they're in a loop or not. A step just does what a step does. That's good. Another consequence is that pipes can wind up buffering a lot of data. This is especially true in cases where nested loops are involved. Given that StAX XMLEvents are nowhere near the most compact representation imaginable for XML documents, I think it may become necessary to allow pipes to dump buffers out to disk when memory runs low. But that's a struggle for another day.

Comments

It's a little hard to understand the limitations without access to the underlying (hint hint ;), but why not implement some sort of lazy evaluation for the pipeline elements?

You could construct a graph of the entire pipeline before evaluating any step. In this case, after the graph is constructed, you'd know to construct the view-ports of the input document before you loaded the stylesheet. Then, since you know the view-ports, you can register a "listener" for each of them to be fed the results of the pipe when it evaluates the stylesheet.

Thus you'd be able to read from the pipe only once, and still be able to publish the stylesheet to the transformation engine for all of the view-ports.

Then again, this might require all n elements of the view-port list to be in memory, as opposed to resetting the pipe n times. It seems like it might be an either/or implementation decision or an addition to the standard.

There might be a way to hook up everything so that the events just "flow" once, but I can't think of a concrete example.

Thoughts?

Oh, and keep the implementation articles coming!

Thanks for the interest, Scott. I do have the whole graph before I begin. I think there's lots of room for interesting optimizations at that stage, but (1) I know I'll need to think hard about that when I get to trying to implement threads, so I'm not spending a lot of time thinking about it now, and (2) I'm trying to get something complete before I worry too much about getting something that's necessarily fast, so I'm picking the easiest solution that will get the job done in every case.