Pipeline graph analysis

Volume 13, Issue 45; 11 Nov 2010

Building a better XProc processor will depend, I think, on the ability to analyze the data flow graph of the pipeline. As a happy side-effect, it seems likely that this will produce nice pipeline diagrams.

Two of my goals for an XML Calabash V.next are streaming (or at least better memory utilization) and using multiple threads. Alex Milowski and I talked about this ages ago and again last week. Alex contends, and I think he's right, that the way to achieve these goals is to build the pipeline data flow graph and analyze that. The outcome of that analysis will include a possibly modified graph, an execution plan, and a set of streamable segments, among other things.

I've got a much nicer parser and object model for pipelines now, so I started working on the graph analysis. Step 1 is to build the graph, but step 0 is to find some way to output the graph such that I can tell if I got it right. Graphviz to the rescue, and not for the first time. It's straightforward to output a DOT language representation of the graph at which point the magic of Graphviz takes over.

Here's a toy pipeline that I used as my initial test case:

<p:declare-step xmlns:t="http://xproc.org/ns/testsuite"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:err="http://www.w3.org/ns/xproc-error"
                xmlns:test="http://xproc.org/ns/testsuite/test"
                version="1.0" name="main">
  <p:input port="source"/>
  <p:output port="result">
    <p:pipe step="count" port="result"/>
  </p:output>

  <p:identity name="style-source">
    <p:input port="source">
      <p:inline>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output method="xml" encoding="utf-8" indent="no"
	    omit-xml-declaration="yes"/>

<xsl:preserve-space elements="*"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<xsl:template match="comment()|processing-instruction()|text()">
  <xsl:copy/>
</xsl:template>

</xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:identity>

  <p:for-each name="chaps">
    <p:iteration-source select="/book/chap">
      <p:pipe step="main" port="source"/>
    </p:iteration-source>
    <p:output port="result"/>

    <p:xslt name="style">
      <p:input port="stylesheet">
        <p:pipe step="style-source" port="result"/>
      </p:input>
    </p:xslt>
  </p:for-each>

  <p:wrap-sequence>
    <p:with-option name="wrapper" select="local-name(/*)">
      <p:pipe step="main" port="source"/>
    </p:with-option>
    <p:input port="source">
      <p:pipe step="style-source" port="result"/>
      <p:pipe step="chaps" port="result"/>
    </p:input>
  </p:wrap-sequence>

  <p:count name="count"/>

</p:declare-step>

It's a totally stupid pipeline. It ostensibly styles the “chapters” of an input document (using an inlined copy of the identity transform) then wraps them, plus the stylesheet, in a sequence and counts the resulting documents.

Here's what the data flow looks like so far:

A few observations:

I'm representing the beginning and end of compound steps as two distinct nodes in the graph. This is consistent, I think, with the dichotimous nature of inputs and outputs on compound steps.
I'm not sure if the box around the “for each” steps is important or not from a graph analysis point of view. It's not really there in the graph, it's just for rendering purposes.
When two different steps read from the same output port, that's going to require a change to the graph. I think it'll be easier to stick in a special splitter step or steps than require every step be able to buffer its output.
When a step in a loop reads the output of a step outside the loop, that's going to require a special buffering step.
How much optimization I can do is an interesting question. For example, the output of a p:wrap-sequence is always a single document, irrespective of the input. That means the output of this pipeline is always <c:result>1<c:result>, so can I just produce that? I don't think so, but I'm not sure all the cases will be obvious.

There's still lots to do, but it's a start.