Implementing XProc, IV

Volume 10, Issue 50; 16 May 2007; last modified 08 Oct 2010

Part the fourth, in which we consider more buffering.

This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for describing a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.

As we saw, looping requires the ability to buffer and replay some inputs. We also have to buffer in order to implement p:try/p:catch.

To the largest extent possible, steps should support streaming by producing results as soon as they can. For example, while an XSLT step has to build an entire tree because that's the nature of XSLT processing, a step that simply deletes an attribute can, in principle, be completely streaming. It can produce results as fast as it consumes them, never holding more than a single event in memory.
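To make the streaming case concrete, here is a minimal sketch in Python. It is not code from my implementation, and the event model is an assumption made up for illustration: events are plain tuples of the form ("start", name, attrs), ("text", data), or ("end", name). The hypothetical delete_attribute step passes each event along as soon as it has seen it.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical event model (an assumption for this sketch):
#   ("start", element_name, attrs_dict), ("text", data), ("end", element_name)
Event = Tuple


def delete_attribute(events: Iterable[Event], attr_name: str) -> Iterator[Event]:
    """Stream events through, dropping attr_name from every start tag."""
    for event in events:
        if event[0] == "start":
            kind, name, attrs = event
            # Copy the attributes without the one being deleted.
            kept = {k: v for k, v in attrs.items() if k != attr_name}
            yield (kind, name, kept)
        else:
            # Text and end events pass through untouched.
            yield event


if __name__ == "__main__":
    doc = [
        ("start", "doc", {"id": "d1", "lang": "en"}),
        ("text", "hello"),
        ("end", "doc"),
    ]
    for out in delete_attribute(doc, "id"):
        print(out)
```

Because the step is a generator, it holds only the current event; nothing downstream has to wait for the whole document.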

But the p:try component has to be more conservative. While it can allow all of its subpipeline components to stream, it has to buffer all of its output until the steps in its subpipeline have run successfully to completion.

If any of those steps fail, it discards all of the buffered output and processes the p:catch. The output from p:catch doesn't have to be buffered because a failure there causes the whole step to fail. (Of course, if the p:try/p:catch is nested inside another p:try, then there will be buffering at that level.)
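The buffering behavior described above can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not my actual implementation: the names run_try, subpipeline, and catch are hypothetical, a subpipeline is modeled as a callable returning an iterable of outputs, and a step failure is modeled as an ordinary exception.

```python
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def run_try(
    subpipeline: Callable[[], Iterable[T]],
    catch: Callable[[Exception], Iterable[T]],
) -> Iterator[T]:
    """Buffer the subpipeline's output; release it only on success."""
    buffered: List[T] = []
    failure: Optional[Exception] = None
    try:
        # The subpipeline itself may stream, but nothing escapes yet.
        for item in subpipeline():
            buffered.append(item)
    except Exception as err:
        failure = err

    if failure is None:
        # Success: it is now safe to release everything we held back.
        yield from buffered
    else:
        # Failure: throw away the partial output and run the catch.
        buffered.clear()
        # The catch output is not buffered; if it fails, the exception
        # simply propagates and the whole step fails.
        yield from catch(failure)
```

Note that the success path pays for the buffer too, which is exactly the penalty discussed next.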

Buffering all the output is a very conservative approach. It effectively blocks streaming across a p:try. In other words, we force every p:try to pay the penalty for failure, even when it succeeds!

A more optimistic approach would be to generate some sort of checkpoint event and initiate rollback behavior if the p:try failed. Even though a rollback could be far more expensive than buffering, you'd only have to pay the price when the p:try actually failed.

At least, that's the theory. In practice, some steps would still have to block. For example, there's no way to “roll back” a p:http-request in the general case. I'm also not motivated to design and implement the infrastructure that would be required to support checkpoints and rollback. Not today, anyway.

Repeat after me: “the simplest thing that can possibily work.”

Comments

“the simplest thing that can possibily (sic) work.”

Yeah, I agree, it's best to keep everything feature oriented for the foreseeable future since it's, in general, more important for reference implementations to be conformant rather than fast.

Not worrying too much about performance would also allow you to get a release out sooner rather than later ;)

We are implementing a .NET version of an XProc engine. Our approach is very simple: all inputs and outputs are XmlDocument instances, that is, they are DOM based, and of course a DOM is easy to cache. For a sequence of input documents, we create a temporary XmlDocument with a special root element, so that we can distinguish a single document from a sequence of documents. This makes our work extremely easy. My first version of the XProc engine took only one week, with processors such as identity, xinclude, xslt, xquery, load, store, choose, foreach, etc. I find XProc very powerful. Our workflow engine is in fact built on it.

Passing around DOM trees definitely simplifies some things. Does your .NET implementation have a home page?