SXPipe: Simple XML Pipelines

Volume 7, Issue 103; 20 Jun 2004; last modified 08 Oct 2010

SXPipe is a language for building Simple XML Pipelines and a Java toolkit that implements it. This is hardly a new idea; a quick web search will turn up a number of similar projects. I’ve written elsewhere about why I did it and why I think pipelines are important. This essay just describes SXPipe.

SXPipe loads a document, subjects it to a number of processing stages, and (usually) writes out the result. Along the way, stages may load additional documents, but the essential model is that a pipeline functions as a simple linear sequence of operations over an Infoset. (Pragmatically, the Infoset is modelled with a Document object from the W3C Document Object Model.)
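To make that model concrete, here is a minimal sketch of a linear pipeline over a DOM Document. It is not SXPipe's actual code: the class name LinearPipelineSketch, the runAll method, and the two toy stages are invented for illustration, and a stage is assumed to be nothing more than a function from Document to Document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical illustration only; none of these names come from SXPipe.
class LinearPipelineSketch {

    // Feed the output of each stage to the next one, in order.
    static Document runAll(Document input, List<UnaryOperator<Document>> stages) {
        Document current = input;
        for (UnaryOperator<Document> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Build a tiny in-memory document: <doc/>
    static Document newDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("doc"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Run two toy stages over <doc/>: add a child, then set an attribute.
    static Document demo() {
        UnaryOperator<Document> addPara = d -> {
            d.getDocumentElement().appendChild(d.createElement("para"));
            return d;
        };
        UnaryOperator<Document> markProcessed = d -> {
            d.getDocumentElement().setAttribute("processed", "yes");
            return d;
        };
        return runAll(newDoc(), Arrays.asList(addPara, markProcessed));
    }

    public static void main(String[] args) {
        Document out = demo();
        System.out.println(out.getDocumentElement().getChildNodes().getLength());
        System.out.println(out.getDocumentElement().getAttribute("processed"));
    }
}
```

Each stage sees the whole tree and hands a whole tree on, which is what keeps the individual stages so simple.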

A few words about what SXPipe isn’t:

SXPipe isn’t implemented as a series of SAX Filters; instead, the stages of the pipeline operate by passing Infosets along.

I don’t think there’s anything intrinsically better (or worse) about this strategy than using SAX Filters, but it feels a little different to me and it makes the stages very simple.

There are probably good arguments in favor of the SAX approach; certainly a good, streaming pipeline implementation will be able to begin producing output faster and might require a smaller memory footprint, but neither of those things is particularly important to me.

SXPipe isn’t part of a larger framework. It runs from the command line and stands by itself: no web servers, no servlets, no containers, no content management infrastructure. It’s just a pipeline.

There’s nothing to stop you embedding it in another application, but that’s not how it works now.

SXPipe doesn’t have a complex expression language. It has one very primitive conditionality feature (maybe one too many). You can’t write loops, or track dependencies, or directly instantiate complex nested transformations. It’s just a pipeline.

What’s it good for? It’s good for reasonably straightforward pipelines like this one:

<pipeline>
<stage process="XInclude"/>
<stage process="Transform" stylesheet="profile.xsl"/>
<stage process="Validate" schema="schema.rng"/>
<stage process="Transform" stylesheet="doc.xsl"/>
</pipeline>

It is explicitly a lot simpler than shell scripts, makefiles, or Ant build scripts. Running it requires nothing more complex than the jar file that contains the classes:

java Pipeline pipe.xml < input.xml > output.xml

Here pipe.xml contains your pipeline, like the one shown above, and input.xml and output.xml are your input and output, respectively.

The Language

The language consists of four elements: pipeline, param, stage, and choose. The pipeline element is just the document element, param lets you set some simple parameters, and stage and choose do all the actual work.

The one conditionality feature is that each stage has an optional skip attribute. If skip is “yes”, then the stage is ignored. The choose element lets you make sure that exactly one of a list of stages is executed: the first one that isn’t skipped.

Here’s a slightly more complicated example:

<pipeline>
  <param name="draft" value="no"/>

  <stage skip="${draft}" process="XInclude"/>
  <choose>
    <stage skip="${draft}"
           process="Transform"
           stylesheet="profile.xsl"/>
    <stage process="Transform"
           stylesheet="strip.xsl"/>
  </choose>
  <stage skip="${draft}" process="Validate" schema="schema.rng"/>
  <stage process="Transform" stylesheet="doc.xsl"/>
</pipeline>

If the draft parameter is “no”, this pipeline will perform XInclude, then Transform with the profile.xsl stylesheet (which fulfills the choose), then Validate, then Transform with the doc.xsl stylesheet.

If the draft parameter is “yes”, which could be specified on the command line, XInclude will be skipped and so will the profiling, but the Transform with strip.xsl will be performed this time, then the Transform with the doc.xsl stylesheet (because the Validate will also be skipped).
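The two walkthroughs above can be checked with a small model of the skip and choose rules. This is only a sketch of the semantics as described in this essay, not the real implementation: the class SkipChooseSketch and its methods are invented, the command-line override is modelled as a plain map, and I am assuming that ${name} is replaced by the parameter's value before skip is compared with “yes”.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical model of skip/choose; not SXPipe's real implementation.
class SkipChooseSketch {
    // The example pipeline from above, as a string.
    static final String PIPELINE =
        "<pipeline>"
      + "<param name='draft' value='no'/>"
      + "<stage skip='${draft}' process='XInclude'/>"
      + "<choose>"
      + "<stage skip='${draft}' process='Transform' stylesheet='profile.xsl'/>"
      + "<stage process='Transform' stylesheet='strip.xsl'/>"
      + "</choose>"
      + "<stage skip='${draft}' process='Validate' schema='schema.rng'/>"
      + "<stage process='Transform' stylesheet='doc.xsl'/>"
      + "</pipeline>";

    // Lists the stages that would run; overrides stand in for command-line parameters.
    static List<String> plan(Map<String, String> overrides) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PIPELINE)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Map<String, String> params = new HashMap<>(overrides);
        List<String> executed = new ArrayList<>();
        for (Element child : childElements(root)) {
            String name = child.getTagName();
            if (name.equals("param")) {
                // A declared value is only a default; an override wins.
                params.putIfAbsent(child.getAttribute("name"), child.getAttribute("value"));
            } else if (name.equals("stage")) {
                if (!skipped(child, params)) executed.add(describe(child));
            } else if (name.equals("choose")) {
                // Run the first stage that isn't skipped (none, if all are skipped).
                for (Element option : childElements(child)) {
                    if (!skipped(option, params)) {
                        executed.add(describe(option));
                        break;
                    }
                }
            }
        }
        return executed;
    }

    // Substitute ${name} references, then test for skip="yes".
    static boolean skipped(Element stage, Map<String, String> params) {
        String skip = stage.getAttribute("skip"); // "" when the attribute is absent
        for (Map.Entry<String, String> p : params.entrySet()) {
            skip = skip.replace("${" + p.getKey() + "}", p.getValue());
        }
        return skip.equals("yes");
    }

    static String describe(Element stage) {
        String arg = stage.getAttribute("stylesheet") + stage.getAttribute("schema");
        return (stage.getAttribute("process") + " " + arg).trim();
    }

    static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) out.add((Element) kids.item(i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(plan(new HashMap<>()));
        Map<String, String> draft = new HashMap<>();
        draft.put("draft", "yes");
        System.out.println(plan(draft));
    }
}
```

With no overrides the plan is XInclude, Transform profile.xsl, Validate schema.rng, Transform doc.xsl; with draft set to “yes” it shrinks to Transform strip.xsl followed by Transform doc.xsl, matching the prose.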

There’s a little more detail in the JavaDocs, but clearly I should write a real spec. (Yeah, the irony is plain to me, thanks for asking.)

The Implementation

I’ve coded up an implementation in Java. I’m still in the process of setting up a home for it, so I don’t have pointers to the sources yet. I expect that will resolve itself fairly quickly.

The implementation is built on top of Java 1.5.0 because (a) 1.5 is really cool, (b) 1.5 includes JAXP 1.3 out of the box, and (c) well, it’s good for my career to be testing the latest releases, right :-).

In practice, I haven’t started using any of the cool new Java 1.5 features like generics, metadata, typesafe enumerations, and autoboxing. But I don’t promise not to, at least not after Java 1.5.0 has officially shipped. Until then, it should run under Java 1.3 or 1.4. You will need JAXP 1.3 though.

Out of the box, SXPipe implements six stages: reading, writing, XInclude processing, XSLT transformation, validation, and a no-op identity stage. I’ll probably code up an XSL FO processor stage at some point, and of course, you can write your own.
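For a sense of how little a DOM-to-DOM transformation step needs, here is a sketch using only standard JAXP calls. It is an assumption about how such a step could be written, not the code in SXPipe's Transform stage; the class TransformSketch, its helper methods, and the inline stylesheet are all made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch of a DOM-in, DOM-out XSLT step built on JAXP.
class TransformSketch {

    // Toy stylesheet: rename the <doc> root to <article>, keeping its content.
    static final String RENAME_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/doc'>"
      + "<article><xsl:copy-of select='node()'/></article>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parse a string into a namespace-aware DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Apply the stylesheet to the input tree and return a new result tree.
    static Document transform(Document input, String stylesheet) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            DOMResult result = new DOMResult(); // transformer creates the Document
            t.transform(new DOMSource(input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document out = transform(parse("<doc><para>hi</para></doc>"), RENAME_XSL);
        System.out.println(out.getDocumentElement().getNodeName());
    }
}
```

A real stage would load the stylesheet named by the stylesheet attribute rather than an inline string, and would report failures through the pipeline's own exceptions instead of IllegalStateException.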

The PipelineStage interface is nothing more than:


import org.w3c.dom.Document;
import org.w3c.dom.Element;

public interface PipelineStage {
  /**
   * <p>Initializes the pipeline stage.</p>
   *
   * @param config The PipelineConfiguration used by this
   * pipeline.
   * @param stage The <code>stage</code> element that is
   * being processed.
   * @throws PipelineException If there is something wrong.
   */
  public void init(PipelineConfiguration config,
                   Element stage) throws PipelineException;

  /**
   * <p>Run the stage.</p>
   *
   * @param input The input DOM.
   * @throws PipelineException If there is something wrong.
   * For example, if the attempt to load a schema needed for
   * validation failed.
   * @throws StageFailedException If the stage executed
   * properly but was unsuccessful. For example, if the
   * stage was able to validate the document but the
   * document was not valid.
   * @return The output DOM.
   */
  public Document run(Document input)
     throws PipelineException, StageFailedException;
}
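As an illustration of writing your own stage against this interface, here is a sketch of a stage that checks the name of the document element. Everything in it beyond the two method signatures is an assumption: the stage RequireRootStage and its root attribute are invented, and PipelineConfiguration, PipelineException, and StageFailedException are empty stand-ins declared only so that the example compiles by itself. The real classes will differ.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Stand-ins so the sketch is self-contained; not the real SXPipe classes.
class PipelineConfiguration { }

class PipelineException extends Exception {
    PipelineException(String message) { super(message); }
}

class StageFailedException extends Exception {
    StageFailedException(String message) { super(message); }
}

interface PipelineStage {
    void init(PipelineConfiguration config, Element stage) throws PipelineException;
    Document run(Document input) throws PipelineException, StageFailedException;
}

// Hypothetical stage: succeeds only if the document element has the expected
// name. The "root" attribute is invented for this example.
class RequireRootStage implements PipelineStage {
    private String expected;

    public void init(PipelineConfiguration config, Element stage)
            throws PipelineException {
        expected = stage.getAttribute("root"); // "" when the attribute is absent
        if (expected.isEmpty()) {
            // Misconfiguration: something is wrong with the pipeline itself.
            throw new PipelineException("stage needs a root attribute");
        }
    }

    public Document run(Document input) throws StageFailedException {
        String actual = input.getDocumentElement().getTagName();
        if (!actual.equals(expected)) {
            // The stage ran properly, but the document did not pass.
            throw new StageFailedException(
                "expected <" + expected + "> but found <" + actual + ">");
        }
        return input; // pass the document along unchanged
    }
}

class RequireRootDemo {
    // Returns "ok", "failed" (StageFailedException), or "error" (PipelineException).
    static String check(String requiredRoot, String actualRoot) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element stageElement = doc.createElement("stage");
            if (requiredRoot != null) stageElement.setAttribute("root", requiredRoot);
            doc.appendChild(doc.createElement(actualRoot));

            PipelineStage stage = new RequireRootStage();
            stage.init(new PipelineConfiguration(), stageElement);
            stage.run(doc);
            return "ok";
        } catch (StageFailedException e) {
            return "failed";
        } catch (PipelineException e) {
            return "error";
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(check("article", "article"));
        System.out.println(check("article", "book"));
        System.out.println(check(null, "book"));
    }
}
```

The sketch follows the split described in the JavaDoc comments: a configuration problem surfaces as a PipelineException, while a document that simply doesn't pass surfaces as a StageFailedException.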

Finally, SXPipe is the result of a few late nights of coding in anger. If it’s never good for anything else, it was good for my soul.

Comments

As a long-time Cocoon user, I completely agree with the pipe approach. I discovered it through Cocoon, so I thought it was their idea. I never noticed [http://www.w3.org/TR/xml-pipeline/] before. So, it's Norman Walsh behind pipelines?

You said: "I don’t have pointers to the sources yet. I expect that will resolve itself fairly quickly". Is it now resolved? I would be glad to try my "sitemaps" under a lighter logic.

This is an interesting approach because of its simplicity. I don't think that a first pass at this problem should be much more complex than that (though a few little things could be added).

If you keep the <choose> feature, you might want to make it closer to SVG's <switch>. The basic difference is that instead of having a single 'skip' attribute, there are several test attributes that are a little bit more powerful (the latest draft adds to that with, for instance, the ability to test for support for various MIME types, namespaces, etc. It's currently W3C member-only, so email me if you want the link). But I would call that quite optional.

Another optional feature is support for flagging a filter in the pipe as "last in chain". Both AxKit and Cocoon support this. It is basically a way of saying that the output from that filter will not be an Infoset, so it can only happen at the very end of the transformation. It allows for stricter checking and clearer self-documenting pipelines. AxKit and Cocoon also have Providers, which are first-in-chain filters that accept non-XML input.

Another interesting feature is support for multiple Infoset representations. I'm generally happy with passing DOMs around, but in some cases it's really not what you want. AxKit will normally use DOMs as much as it possibly can (because it's faster) but items in the pipeline can prefer XML or SAX for instance, and they negotiate what they get with the previous filter (if negotiations fail the pipeline manager will convert for them). The big gain is that each filter uses whatever is best for itself. I believe that JAXP should make this trivial to implement.

Anyway, cool stuff! :)