XProc: An XML Pipeline Language

Volume 9, Issue 91; 28 Sep 2006; last modified 08 Oct 2010

The XML Processing Model Working Group has published the First Public Working Draft of the pipeline language document.

Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.

Robert Heinlein

I'm delighted to report that the XML Processing Model Working Group has published the First Public Working Draft of the XProc specification: XProc: An XML Pipeline Language, our processing model language specification.

As you read it, bear in mind that, as a first working draft, it's not without its rough edges and unresolved issues. Nevertheless, I think it charts the working group's direction pretty clearly and covers a good chunk of the distance from requirements to Recommendation. I think we're on schedule. (Ok, it's a revised schedule, but we're pedaling as fast as we can.) I'm aware of two, possibly three, implementations tracking the spec pretty closely, so I'm expecting that we'll have some implementation experience to guide us the rest of the way.

Please tell us what you think.

For those of you still wondering what this is all about and why it makes sense to spend time on a standard processing model specification, consider the following simple pipeline:


<p:pipeline xmlns:p="http://www.w3.org/2006/09/xproc"
            name="pipeline">

<p:declare-input port="document"/>
<p:declare-input port="schema"/>
<p:declare-input port="stylesheet"/>
<p:declare-output port="result" step="transform" source="result"/>

<p:step name="xinclude" type="xinclude">
  <p:input port="document" step="pipeline" source="document"/>
</p:step>

<p:step name="validate" type="validate">
  <p:input port="document" step="xinclude" source="result"/>
  <p:input port="schema" step="pipeline" source="schema"/>
</p:step>

<p:step name="transform" type="xslt">
  <p:input port="document" step="validate" source="result"/>
  <p:input port="stylesheet" step="pipeline" source="stylesheet"/>
</p:step>

</p:pipeline>

Can you tell me what that pipeline does? I bet you can, without even reading the specification: it performs XInclude processing on a document, validates it against a schema, transforms it with XSLT, and returns the result.

Although I think this pipeline is relatively clear, it is arguably a little bit verbose. I think our final language will have some defaulting that will simplify many pipelines, including this one. But, with my chair's hat on, I've asked the working group to postpone all discussion of abbreviation and defaulting until after we have an unabbreviated syntax that describes a whole language on which we have consensus. (I'm of a mind to fiddle with a compact syntax too at that point.)
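
Purely as a sketch of where that might go — nothing below is in the draft, and the working group hasn't discussed it — suppose an omitted "document" input defaulted to the preceding step's "result" output (or, for the first step, to the pipeline's own "document" input). The pipeline above might then shrink to something like this:

<p:pipeline xmlns:p="http://www.w3.org/2006/09/xproc"
            name="pipeline">

<p:declare-input port="document"/>
<p:declare-input port="schema"/>
<p:declare-input port="stylesheet"/>
<p:declare-output port="result" step="transform" source="result"/>

<p:step name="xinclude" type="xinclude"/>

<p:step name="validate" type="validate">
  <p:input port="schema" step="pipeline" source="schema"/>
</p:step>

<p:step name="transform" type="xslt">
  <p:input port="stylesheet" step="pipeline" source="stylesheet"/>
</p:step>

</p:pipeline>

Same pipeline, roughly half the wiring.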

That explains what the pipeline is, but doesn't really address why it's a good thing. To understand that, let's look at an alternative. Suppose you're a Java coder and you're asked to implement the same pipeline using, for example, JAXP. You're likely to come up with something like this:

import javax.xml.XMLConstants;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.validation.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Parse the document with XInclude processing enabled
// (xmlURI, xsdURI, xslURI, and resultURI are assumed to be defined elsewhere).
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setNamespaceAware(true); // XInclude processing needs namespace awareness
docFactory.setXIncludeAware(true);
DocumentBuilder builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(xmlURI);

// Validate the parsed document against a W3C XML Schema
SAXSource xsdSource = new SAXSource(new InputSource(xsdURI));
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(xsdSource);
Validator validator = schema.newValidator();

validator.validate(new DOMSource(doc));

// Transform the validated document with XSLT and write out the result
SAXSource xslSource = new SAXSource(new InputSource(xslURI));
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(xslSource);

StreamResult resultStream = new StreamResult(resultURI);
transformer.transform(new DOMSource(doc), resultStream);

Could you have as quickly and as easily told me what that code does? Just on the surface of it, I think the pipeline example wins the clarity competition hands down. And just in case you're about to observe that this isn't an apples-to-apples comparison, let's look at what a Java implementation that uses the pipeline might look like:

Pipeline pipeline = PipelineFactory.newInstance().newPipeline();
pipeline.load(pipelineURI);
pipeline.input("document", xmlURI);
pipeline.input("schema", xsdURI);
pipeline.input("stylesheet", xslURI);
pipeline.output("result", resultURI);
pipeline.run();

In fairness, and in the spirit of full disclosure, the code fragment that demonstrates pipeline use is hypothetical. I expect my implementation to look something like that, and maybe one day for a Java standard API to look something like that, but I don't actually have running code that looks like that today.

I think from an ease-of-use and programmer productivity point of view, pipelines are an obvious win: they're more declarative, they allow application behavior to be modified (within limits) without touching a line of code, and they potentially have much better performance. (Changing the pipeline so that it performs validation before XInclude is straightforward; changing the Java code in the same way is, uhm, left as an exercise for the reader.)
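
To make that parenthetical concrete, here's roughly what the rewiring might look like (a sketch against the draft syntax above, not an example from the spec); only the step order and the port connections change, while the declared inputs and outputs stay as they were:

<p:step name="validate" type="validate">
  <p:input port="document" step="pipeline" source="document"/>
  <p:input port="schema" step="pipeline" source="schema"/>
</p:step>

<p:step name="xinclude" type="xinclude">
  <p:input port="document" step="validate" source="result"/>
</p:step>

<p:step name="transform" type="xslt">
  <p:input port="document" step="xinclude" source="result"/>
  <p:input port="stylesheet" step="pipeline" source="stylesheet"/>
</p:step>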

That last point is probably worth a little exploration. There are at least two ways in which using pipelines can lead to improved performance. One is a generality: if you're coding up your processing directly in Java (or C or Ruby or whatever), then optimization is your problem. If lots of folks are using pipelines, then it makes sense to invest resources in improving the performance of the code that implements them. Making that code perform better automatically helps you (and everyone using pipelines). (Yes, I'm aware that this argument can be run in reverse: there are situations where performance is so critical that hand coding the fastest possible solution is the right thing to do. That's why there's still assembler code in some applications. But those applications are rare compared to the legions of programmers processing XML for a living.)

A less immediately obvious benefit arises from the fact that XProc has a fairly large vocabulary of built-in steps. They aren't all spelled out in the current draft, but they will eventually include steps to add, rename, and delete elements and attributes; process regions of a document selected by XPath expressions; combine documents; split documents; extract content; inject content; and so on.
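
To give a flavor of what such a step might look like in a pipeline — and note that the step type and parameter names below are entirely invented for illustration, since the current draft doesn't define them — an element-renaming step might read something like this:

<p:step name="fixup" type="rename-element">
  <p:input port="document" step="pipeline" source="document"/>
  <p:param name="from" value="para"/>
  <p:param name="to" value="p"/>
</p:step>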

Programmers today often implement these operations using XSLT (I know I do). But in the general case, XSLT doesn't stream very well and requires building an entire in-memory representation of the document to be processed. The XProc operations, in contrast, are much simpler and many of them can (and will) be implemented in components that operate in a streaming fashion, neither slowing down the pipeline nor requiring enough memory to load the entire document.

Of course the important word way back up there in the paragraph before the pipeline example was “standard”. None of this work is a real win for users unless they can reasonably expect interoperable implementations. I want (desperately sometimes) to be able to distribute pipelines with the same ease that I now distribute XSLT stylesheets. By the same token, this work isn't a real win for programmers until they can reasonably expect some basic, interoperable APIs. But those can come a little later, and they're most emphatically not part of the charter of our W3C Working Group.

Comments

This does indeed look quite clear. The only *slight* problem I had was with the concept of "implicit output ports". I couldn't find anything in the draft about these - but as you say, it is kinda obvious :-)

—Posted by Evan Williams on 29 Sep 2006 @ 06:55 UTC #

ndw said

OK it's a pipeline step. I can see an input, I can guess what it does. I can't see an output?

then

OK a step with two inputs... yet the source is 'result'? And still no output?

Then

Can you tell me what that pipeline does? I bet you can, without even reading the specification:

No, but I feel confused seeing a source called result? And no I haven't read the spec (yet :-), but having just written a 5 stage transform I think I'm going to!

Please don't remove verbosity too much Norm. It helps the less able.

regards DaveP

—Posted by Dave Pawson on 29 Sep 2006 @ 11:43 UTC #

In working with DocBook, I've often thought that some kind of declarative approach for specifying the pipeline, etc. was needed.

In specifying the rules for building a (large) set of output documents from a (large) set of input documents, one problem I have wanted to solve is avoiding redundant processing when input documents have not changed.

XInclude and XSLT make this difficult to do in an Ant or Maven build file since there is no tool (that I know of) that can track whether any of the included files have changed. Currently, it seems, you can either do the entire transform each time or only run it if the "root" XML or XSLT document has changed.

I realize this would be an implementation detail/optimization but it might be worthwhile considering this possibility while working on the spec.

I remember the performance breakthrough when C language tools solved this problem and would like to see the same thing for XML/XSLT.

—Posted by Sean Gilligan on 30 Sep 2006 @ 09:03 UTC #

I'm wondering what the difference is between this "new" pipeline language and the one defined previously here: http://www.w3.org/TR/xml-pipeline/ also by the W3C* but dated 28 February 2002. Anyway it's great that there is work on this subject again, 'cause I'm actually digging into http://sxpipe.dev.java.net/ and this doesn't implement all things described in the schemas, although this is a good starting point.

* btw, the schemas provided here seem to be wrong regarding the use of two ID attributes in the processdef element, but I guess this has already been reported in the right place.

—Posted by Rastaman on 12 Nov 2006 @ 12:05 UTC #

Hi, thanks for the article - but I am having trouble getting to grips with the XProc pipeline.

Could someone please go through the above XProc document and explain the details, explicitly pointing out the 'implicit output ports' ;) Kind regards

—Posted by Daryn on 12 Apr 2007 @ 01:05 UTC #

How is this whole thing different from building a few Spring POJOs and wiring them together? I haven't read the most recent draft, but the original example here sure looks like it could be implemented in 35 minutes with the Spring IoC container.

—Posted by benson margulies on 23 Sep 2007 @ 07:50 UTC #

You could, no doubt, implement any particular pipeline with (insert application/platform specific framework of your choice) in reasonably short order.

But XProc is language and platform neutral and reasonably declarative.

Lots and lots of folks who write XSLT stylesheets wouldn't be comfortable writing Java code to do the same translations. I expect that XProc will provide that level of interoperability and accessibility for pipelining.

—Posted by Norman Walsh on 23 Sep 2007 @ 08:48 UTC #