Implementing XProc, II

Volume 10, Issue 41; 02 May 2007; last modified 08 Oct 2010

Part the second, in which we consider pipeline documents.

This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for specifying a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.

It's all well and good to speak of a pipeline as a sequence of steps and talk about how events flow between them, but that doesn't answer the question of where the steps come from in the first place or how they're connected together.

What pipeline authors actually write, and what pipeline processors actually start with, isn't a connected sequence (read: acyclic graph) of steps joined by pipes; it's a pipeline document that describes such a pipeline.

One of the things that has changed most since the work on XProc began is the syntax of the pipeline document. Where some of the earliest proposals for the language were quite verbose and required the author to make very explicit connections, we now have a language that's much simpler but somewhat more difficult to parse. (This is as it should be: authoring should be as easy as possible, as long as ambiguity can be avoided.)

Here's a rather pointless pipeline that will illustrate some of the steps that an implementation (or at least my implementation) has to go through in order to transform a pipeline document into an actual pipeline.

<p:pipeline name="pipeline" xmlns:p="http://www.w3.org/2007/03/xproc">
<p:input port="source"/>
<p:output port="result"/>

<p:identity name="identity"/>

<p:identity name="joiner">
  <p:input port="source">
    <p:document href="example.xml"/>
    <p:inline>
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>This is a test.</p>
      </div>
    </p:inline>
    <p:pipe step="identity" port="result"/>
    <p:pipe step="pipeline" port="source"/>
  </p:input>
</p:identity>

<p:count name="counter"/>

</p:pipeline>

First, let's make sure we understand what this pipeline does:

  1. It's a pipeline that accepts a single input from the outside world and produces a single output.

  2. The first step is an identity step. It just copies the input it receives to the output. In this case, there isn't an explicit input, so the default input will be copied. This is the first step in the pipeline, so the pipeline input is the default input.

  3. The second step is also an identity step. This step has four explicit inputs: a document at the relative URI example.xml, an inline document that consists of a single XHTML div element, a pipe that reads the output of the step named “identity”, and a pipe that reads the input to the pipeline. It's an identity step, so it'll just blindly copy that sequence.

  4. The third step is a counter step. This step will read all the documents it receives and output a single document that contains a count of the number of documents it read. There isn't an explicit input, so the default input will be used. This step has a preceding sibling, so the output of the preceding sibling is the default input.

In other words, the pipeline input goes through an identity step, then gets combined with three other documents in a second identity step, and finally goes through a count step to produce the number “4”. I said it was pointless.
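
That count can be checked with a small Python sketch. It is only an illustration, not processor code: documents are modelled as plain strings, steps as functions over lists of documents, and the function names and placeholder strings are assumptions made up for this example.

```python
# Illustrative model only: documents are plain strings, steps are functions
# from a list of documents to a list of documents.

def identity(docs):
    """Copy the input sequence to the output unchanged."""
    return list(docs)

def count(docs):
    """Produce a single document holding the number of documents read."""
    return [str(len(docs))]

def run_pipeline(source_docs):
    # Step "identity": no explicit input, so it reads the pipeline input.
    identity_result = identity(source_docs)
    # Step "joiner": four explicit bindings, read in document order.
    joiner_input = (
        ["<doc from example.xml>"]        # p:document (placeholder content)
        + ["<div>This is a test.</div>"]  # p:inline
        + identity_result                 # p:pipe to result on identity
        + list(source_docs)               # p:pipe to source on pipeline
    )
    joiner_result = identity(joiner_input)
    # Step "counter": defaults to the output of its preceding sibling.
    return count(joiner_result)

print(run_pipeline(["<pipeline input>"]))  # prints ['4']
```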

But what does the pipeline processor do in order to instantiate that pipeline? In order to explore that, we're going to augment the pipeline document in ways that can't be expressed in the XProc language, so let's start by switching to another notation:

pipeline pipeline
  input source
  output result

  identity identity

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to source on pipeline

  count counter

Here, indentation shows nesting and we can read that as: a pipeline step named “pipeline” that has an input named “source”, an output named “result”, and a subpipeline that consists of an identity step named “identity” followed by an identity step named “joiner” followed by a count step named “counter”. The identity step named “joiner” has one input port with four bindings: a URI, an inline, and two pipes.

This notation isn't intended to be machine processable (though a compact syntax for XProc is something I think about from time to time). It's just a debugging aid doing double duty here for explaining what the processor does.
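
A debugging dump like this is easy to produce from a simple in-memory model. The following Python sketch is one possible shape for such a model; the class names (`Step`, `Port`), their fields, and the `dump` function are hypothetical and are not taken from any actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    kind: str                                     # "input" or "output"
    name: str
    bindings: list = field(default_factory=list)  # one description per binding

@dataclass
class Step:
    step_type: str
    name: str
    ports: list = field(default_factory=list)
    substeps: list = field(default_factory=list)

def dump(step, depth=0):
    """Return the indented debugging notation for a step as a list of lines."""
    pad = "  " * depth
    lines = [f"{pad}{step.step_type} {step.name}"]
    for port in step.ports:
        lines.append(f"{pad}  {port.kind} {port.name}")
        lines.extend(f"{pad}    {binding}" for binding in port.bindings)
    for sub in step.substeps:
        lines.append("")  # blank line before each nested step
        lines.extend(dump(sub, depth + 1))
    return lines

joiner = Step("identity", "joiner", ports=[Port("input", "source", [
    "URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml",
    "inline binding",
    "pipe name binding to result on identity",
    "pipe name binding to source on pipeline",
])])
pipeline = Step(
    "pipeline", "pipeline",
    ports=[Port("input", "source"), Port("output", "result")],
    substeps=[Step("identity", "identity"), joiner, Step("count", "counter")],
)
print("\n".join(dump(pipeline)))
```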

Eventually, we're going to be joining steps by making connections from inputs to outputs. In order to do that, we're going to have to figure out where some of the unspecified (defaulted) connections go. The first step is to make sure we know where all the explicit connections are; that means adding the “external” bindings:

pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  output result
    Binding to stdio

  identity identity

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to source on pipeline

  count counter

I passed “samples/defaults.xml” as the binding for source and didn't specify a binding for result, so it defaulted to the console.

Now we know that we have bindings for all the pipeline inputs and all the pipeline outputs. A pipeline can't run if it has inputs that aren't connected to anything, but now we know that isn't the case here.
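
As a rough illustration of that check, here is a Python sketch that attaches external bindings and refuses to continue if a pipeline input is left unconnected. The function name, its parameters, and the wording used for a URI-bound output are assumptions for this example; only the "Binding to stdio" default comes from the listing above.

```python
def bind_externals(inputs, outputs, given_inputs, given_outputs=None):
    """Return binding descriptions for every pipeline input and output port."""
    given_outputs = given_outputs or {}
    missing = [port for port in inputs if port not in given_inputs]
    if missing:
        # An unconnected input means the pipeline cannot run.
        raise ValueError("unbound pipeline input(s): " + ", ".join(missing))
    input_bindings = {
        port: f"URI binding to {given_inputs[port]}" for port in inputs
    }
    output_bindings = {
        # Outputs without an explicit binding fall back to the console.
        port: (f"URI binding to {given_outputs[port]}"
               if port in given_outputs else "Binding to stdio")
        for port in outputs
    }
    return input_bindings, output_bindings

ins, outs = bind_externals(["source"], ["result"],
                           {"source": "samples/defaults.xml"})
print(ins)   # {'source': 'URI binding to samples/defaults.xml'}
print(outs)  # {'result': 'Binding to stdio'}
```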

Next, we check the specified steps against their declarations. If a pipeline contains a step we've never heard of, or if a step uses a port that isn't declared, that's an error. Assuming there are no errors, we can use the declarations to make all the inputs and outputs explicit:


pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  output result
    Binding to stdio

  identity identity
    input source
    output result

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to source on pipeline
    output result

  count counter
    input source
    output result

Ok, there were no errors and now we know the names of all the ports on all the steps. (In this case, all the steps have just one input and one output, but in the general case there may be many.)
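
The declaration check can be sketched in a few lines of Python. The `DECLARATIONS` table below is hypothetical and covers only the two step types used in this example; a real processor would load declarations for every step it supports.

```python
# Hypothetical declaration table covering only the step types in the example.
DECLARATIONS = {
    "identity": {"inputs": ["source"], "outputs": ["result"]},
    "count": {"inputs": ["source"], "outputs": ["result"]},
}

def expand_ports(step_type, explicit_inputs):
    """Validate a step against its declaration and list all of its ports.

    explicit_inputs maps the port names used in the pipeline document to
    their lists of bindings.
    """
    declaration = DECLARATIONS.get(step_type)
    if declaration is None:
        raise ValueError(f"unknown step type: {step_type}")
    for port in explicit_inputs:
        if port not in declaration["inputs"]:
            raise ValueError(f"{step_type} has no input port named {port}")
    return {
        # Declared inputs that the author didn't mention get an empty
        # binding list, to be filled in by a later phase.
        "inputs": {port: list(explicit_inputs.get(port, []))
                   for port in declaration["inputs"]},
        "outputs": list(declaration["outputs"]),
    }

print(expand_ports("count", {}))
# {'inputs': {'source': []}, 'outputs': ['result']}
```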

In order to understand the next phase, we have to stop for a moment and talk about inputs and outputs. Imagine that you're holding an identity step (or the real world analog of one, anyway) in your hands. It's a closed box that has an input port (named “source”) on the top and an output port (named “result”) on the bottom. You pour water into the box through the source port and it flows out onto your feet through the result port.

From this perspective, an input is a “sink” into which you can pour events. An output is a “source” from which events flow.

Now imagine that you're standing inside a box which represents a pipeline. Above you, in the ceiling, is an input port (named “source”). In the floor between your feet is an output port (named “result”).

From this perspective, an input is a “source” from which events flow and an output is a “sink” into which you can pour them.

In other words, from the perspective of the steps inside a pipeline, inputs are really outputs and outputs are really inputs. It's a little confusing from the implementation perspective, but I think it would be a whole lot more confusing if we labelled them “correctly” and tried to explain to users that pipeline inputs come through p:output elements.

The point of all that explanation was to make it clear why the next thing we do is add new ports with magic names to each compound step. In this case, the pipeline is the only compound step.


pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  input |result
  output result
    Binding to stdio
  output source|

  identity identity
    input source
    output result

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to source on pipeline
    output result

  count counter
    input source
    output result

Now we can hook up the pipeline input port named “source” as a sink which will slurp up data from the outside world (from a URI in this case) and we can hook up steps that want to read from the source to the output named “source|”. In the same way, we can hook up the pipeline output port named “result” as a source which will blast data back to the real world (to STDIO in this case) and we can hook up the output of the step that actually produces that result to the input named “|result”. This allows us to uniformly connect sources to sinks. One of the jobs of the pipeline step, when we eventually get around to running it, will be to provide those transitions.
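
The naming rule for those magic ports is simple enough to state as code. This Python sketch only derives the names shown in the listing above; the function name is an assumption.

```python
def add_inner_ports(inputs, outputs):
    """Return the extra (inputs, outputs) a compound step needs on its inside.

    Each declared input X gains an inner output "X|" that substeps read from;
    each declared output Y gains an inner input "|Y" that substeps write to.
    """
    inner_inputs = ["|" + name for name in outputs]
    inner_outputs = [name + "|" for name in inputs]
    return inner_inputs, inner_outputs

print(add_inner_ports(["source"], ["result"]))
# (['|result'], ['source|'])
```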

With all the inputs and outputs in hand, and with the explicit bindings from the pipeline document in place, it's time to go through and make all the remaining connections explicit, validating them as we go. The result is a set of fully explicit connections:

pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  input |result
    pipe name binding to result on counter
  output result
    Binding to stdio
  output source|

  identity identity
    input source
      pipe name binding to source on pipeline
    output result

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to source on pipeline
    output result

  count counter
    input source
      pipe name binding to result on joiner
    output result
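
The defaulting rules used to get here (the first step reads the pipeline input, later steps read their preceding sibling, and the pipeline result reads the last step) can be sketched in Python. This is a simplification that assumes every step has exactly one input named “source” and one output named “result”, which happens to hold for this example but not in general; the function name is hypothetical.

```python
def fill_defaults(pipeline_name, steps):
    """Fill in missing bindings for an ordered list of (name, bindings) steps.

    Assumes every step has one input "source" and one output "result".
    Returns the resolved steps and the binding for the pipeline's |result.
    """
    if not steps:
        raise ValueError("a pipeline needs at least one step")
    resolved = []
    previous = None
    for name, bindings in steps:
        if not bindings:
            if previous is None:
                # First step: the default input is the pipeline input.
                bindings = [f"pipe name binding to source on {pipeline_name}"]
            else:
                # Otherwise: the default input is the preceding sibling's output.
                bindings = [f"pipe name binding to result on {previous}"]
        resolved.append((name, list(bindings)))
        previous = name
    # The pipeline's result comes from the last step.
    pipeline_result = [f"pipe name binding to result on {previous}"]
    return resolved, pipeline_result

steps = [
    ("identity", []),
    ("joiner", ["URI binding", "inline binding",
                "pipe name binding to result on identity",
                "pipe name binding to source on pipeline"]),
    ("counter", []),
]
resolved, pipeline_result = fill_defaults("pipeline", steps)
print(resolved[0])      # ('identity', ['pipe name binding to source on pipeline'])
print(resolved[2])      # ('counter', ['pipe name binding to result on joiner'])
print(pipeline_result)  # ['pipe name binding to result on counter']
```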

So far, we've only identified the pipes on one end in most cases. We'll fix that in a moment. But first, look closely at the bindings on the identity and joiner steps. Both of them read from the same port: source on pipeline.

Logically of course, that's not a problem. But in my implementation pipes are straight tubes with no logic. I work around this problem by introducing a new step to the pipeline, a custom step that copies its input to an arbitrary number of outputs:

pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  input |result
    pipe name binding to result on counter
  output result
    Binding to stdio
  output source|

  identity identity
    input source
      pipe name binding to S1 on #ANON.6.30.1
    output result

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe name binding to result on identity
      pipe name binding to S0 on #ANON.6.30.1
    output result

  count counter
    input source
      pipe name binding to result on joiner
    output result

  Step #ANON.6.30.1 ({http://xproc.org/2007/03/xproc/ex}split)
    input source
      pipe name binding to source on pipeline
    output S0
    output S1
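
A rewrite of this kind can be sketched in Python as follows. Connections are modelled as (reader step, reader port, source step, source port) tuples, and any source port with more than one reader gets a splitter in front of it. The tuple layout and the `#split1` naming are assumptions for this sketch, and the S0/S1 numbering follows reader order here, so it need not match the listing above.

```python
from collections import defaultdict

def insert_splitters(connections):
    """Give every source port exactly one reader by adding split steps.

    connections: list of (reader_step, reader_port, source_step, source_port).
    """
    readers = defaultdict(list)
    for index, (_, _, src_step, src_port) in enumerate(connections):
        readers[(src_step, src_port)].append(index)

    rewritten = list(connections)
    counter = 0
    for (src_step, src_port), indexes in readers.items():
        if len(indexes) < 2:
            continue  # a single reader can use a plain pipe
        counter += 1
        splitter = f"#split{counter}"  # hypothetical anonymous step name
        for n, index in enumerate(indexes):
            reader_step, reader_port, _, _ = connections[index]
            rewritten[index] = (reader_step, reader_port, splitter, f"S{n}")
        # The splitter itself becomes the only reader of the original port.
        rewritten.append((splitter, "source", src_step, src_port))
    return rewritten

connections = [
    ("identity", "source", "pipeline", "source"),
    ("joiner", "source", "identity", "result"),
    ("joiner", "source", "pipeline", "source"),
    ("counter", "source", "joiner", "result"),
    ("pipeline", "|result", "counter", "result"),
]
for connection in insert_splitters(connections):
    print(connection)
```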

So far so good. Next, we need to make sure that there are no loops in the graph and we need to put the steps into an execution order. The counter step consumes the output from the joiner step so it doesn't make sense to run the counter step before the joiner step. (In a multi-threaded implementation, you might be able to start several at once, but we'll come back to that another time.)

Luckily, finding loops falls naturally out of a search for an execution order.
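
One standard way to get both results at once is a topological sort. The Python sketch below uses Kahn's algorithm, chosen here purely for illustration since the essay doesn't say which algorithm the processor uses: steps with no unmet dependencies are emitted first, and anything left over at the end must be part of a loop. The step name `split` stands in for the anonymous splitter, and every dependency is assumed to name a known step.

```python
from collections import deque

def execution_order(depends_on):
    """Order steps so each runs after the steps whose output it reads.

    depends_on maps each step name to the set of step names it reads from.
    Raises ValueError if the dependencies contain a loop.
    """
    remaining = {step: set(deps) for step, deps in depends_on.items()}
    ready = deque(sorted(step for step, deps in remaining.items() if not deps))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        del remaining[step]
        for other in sorted(remaining):
            deps = remaining[other]
            if step in deps:
                deps.remove(step)
                if not deps:
                    ready.append(other)  # all of its inputs are now available
    if remaining:
        # Steps that never became ready depend (directly or not) on themselves.
        raise ValueError("loop in pipeline involving: "
                         + ", ".join(sorted(remaining)))
    return order

print(execution_order({
    "split": set(),
    "identity": {"split"},
    "joiner": {"identity", "split"},
    "counter": {"joiner"},
}))
# ['split', 'identity', 'joiner', 'counter']
```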

This is also a good time to replace the “pipe name” bindings with real pipes and connect both the input and the output ends. (I'll write another essay about all the different flavors of bindings.)
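
To make “connect both ends” concrete, here is a Python sketch in which each pipe is a single object registered on both the output it reads from and the input it feeds. The `Pipe` class, the connection tuple layout, and `make_pipes` are hypothetical; only the format of the description string is modelled on the debugging notation used in this essay.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Pipe:
    number: int
    source: tuple  # (step name, output port name)
    sink: tuple    # (step name, input port name)

    def describe(self):
        src_step, src_port = self.source
        sink_step, sink_port = self.sink
        return (f"pipe binding [pipe #{self.number}] "
                f"from [output {src_port} on {src_step}] "
                f"to [input {sink_port} on {sink_step}]")

def make_pipes(connections):
    """Create one Pipe per connection and index it by both of its ends.

    connections: list of (reader_step, reader_port, source_step, source_port).
    """
    numbers = count()
    by_output, by_input = {}, {}
    for reader_step, reader_port, src_step, src_port in connections:
        pipe = Pipe(next(numbers), (src_step, src_port),
                    (reader_step, reader_port))
        by_output.setdefault(pipe.source, []).append(pipe)
        by_input.setdefault(pipe.sink, []).append(pipe)
    return by_output, by_input

by_output, by_input = make_pipes([("counter", "source", "joiner", "result")])
print(by_input[("counter", "source")][0].describe())
# pipe binding [pipe #0] from [output result on joiner] to [input source on counter]
```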

With all these machinations, the fully decorated pipeline looks like this:

pipeline pipeline
  input source
    URI binding to samples/defaults.xml
  input |result
    pipe binding [pipe #0] from [output result on counter] to [input |result on pipeline]
  output result
    Binding to stdio
  output source|
    pipe binding [pipe #1] from [output source| on pipeline] to [input source on #ANON.6.30.1]

  Step #ANON.6.30.1 ({http://xproc.org/2007/03/xproc/ex}split)
    input source
      pipe binding [pipe #1] from [output source| on pipeline] to [input source on #ANON.6.30.1]
    output S0
      pipe binding [pipe #4] from [output S0 on #ANON.6.30.1] to [input source on joiner]
    output S1
      pipe binding [pipe #2] from [output S1 on #ANON.6.30.1] to [input source on identity]

  identity identity
    input source
      pipe binding [pipe #2] from [output S1 on #ANON.6.30.1] to [input source on identity]
    output result
      pipe binding [pipe #3] from [output result on identity] to [input source on joiner]

  identity joiner
    input source
      URI binding to http://norman.walsh.name/2007/05/02/examples/example.xml
      inline binding
      pipe binding [pipe #3] from [output result on identity] to [input source on joiner]
      pipe binding [pipe #4] from [output S0 on #ANON.6.30.1] to [input source on joiner]
    output result
      pipe binding [pipe #5] from [output result on joiner] to [input source on counter]

  count counter
    input source
      pipe binding [pipe #5] from [output result on joiner] to [input source on counter]
    output result
      pipe binding [pipe #0] from [output result on counter] to [input |result on pipeline]

Believe it or not, that's something I can run. Well, it's something from which I can build a set of components that will run. But that's a story for another time.

Comments

In the first example, the callout numbers have gone slightly awry.

—Posted by Ed Davies on 03 May 2007 @ 09:27 UTC #

Oops. Fixed. Thanks, Ed.

—Posted by Norman Walsh on 03 May 2007 @ 01:16 UTC #

So, short of implementing an ESB or something of that nature, what uses do you see for XProc?

I've been reading some of these entries with a detached interest. Interest, because it's novel, and I can see how the pieces will work in a non-abstract way. Detached, because none of the problems I face day-to-day seem well-suited for XProc.

Accordingly, I'm wondering if I'm missing something, if there are avenues I haven't considered where XProc will be very valuable in the sorts of work I do.

So, back to the question: what uses do you imagine for XProc?

—Posted by Geoffrey Wiseman on 07 May 2007 @ 06:25 UTC #

Re: Geoffrey Wiseman

"So, short of implementing an ESB or something of that nature, what uses do you see for XProc?"

I am quite interested in XProc personally because my job is intimately related to series of XML transformations with custom logic (currently in Java, mostly http calls to web services) in between. Our platform is currently a very expensive SOA/ESB product that we've never been quite happy with.

Norm, when do you expect that you'll make the source available? I've been waiting to get my hands on an implementation. I looked at yax (yax.sf.net) but I need to get my hands on the source to really evaluate it, and so far yax has only released binaries (and not since Feb).

Do you expect that you'll release the source to your implementation sometime in the near future?

—Posted by Chris Scott on 07 May 2007 @ 08:34 UTC #

Geoffrey, does the about pipelines link near the top of this essay help at all?

—Posted by Norman Walsh on 07 May 2007 @ 08:54 UTC #

Yes, Chris, "real soon now" :-)

I'm pushing it through the appropriate internal channels as fast as I can.

—Posted by Norman Walsh on 07 May 2007 @ 08:55 UTC #

Great. Just nice to know it's coming "soon" ;)

Anyway, I really think it will drive interest in the standard to have a decent (and active) implementation.

—Posted by Chris Scott on 08 May 2007 @ 12:49 UTC #