Implementing XProc, IV
Part the fourth, in which we consider more buffering.
This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for specifying a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.
As we saw, looping requires the ability to buffer and replay some inputs. We also have to buffer in order to implement p:try/p:catch.
To the largest extent possible, steps should support streaming by producing results as soon as they can. For example, while an XSLT step has to build an entire tree because that's the nature of XSLT processing, a step that simply deletes an attribute can, in principle, be completely streaming. It can produce results as fast as it consumes them, never holding more than a single event in memory.
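To make the streaming case concrete, here is a minimal sketch of an attribute-deleting step written as a Python generator. The event classes (`StartElement`, `EndElement`, `Text`) and the function name `delete_attribute` are made up for illustration; they are not part of XProc or of any particular processor.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Union


# Hypothetical event model, loosely in the style of SAX/StAX events.
class StartElement(NamedTuple):
    name: str
    attrs: Dict[str, str]


class EndElement(NamedTuple):
    name: str


class Text(NamedTuple):
    data: str


Event = Union[StartElement, EndElement, Text]


def delete_attribute(events: Iterable[Event], attr_name: str) -> Iterator[Event]:
    """Yield each event as soon as it is read, dropping one attribute.

    Only the current event is held at any time, so memory use does not
    grow with the size of the document.
    """
    for event in events:
        if isinstance(event, StartElement) and attr_name in event.attrs:
            kept = {k: v for k, v in event.attrs.items() if k != attr_name}
            yield StartElement(event.name, kept)
        else:
            yield event
```

Because the function is a generator, a downstream step can start consuming results before the upstream step has finished producing them.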
But the p:try component has to be more conservative. While it can allow all of its subpipeline components to stream, it has to buffer all of its output until the steps in its subpipeline have run successfully to completion. If any of those steps fail, it discards all of the buffered output and processes the p:catch. The output from p:catch doesn't have to be buffered because a failure there causes the whole step to fail. (Of course, if the p:try/p:catch is nested inside another p:try, then there will be buffering at that level.)
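The buffering rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the subpipelines are modelled as functions that return iterators of documents, a step failure is modelled as a Python exception, and the name `run_try_catch` is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical model: a subpipeline takes a stream of documents and
# returns a stream of documents; a catch body takes the error instead.
Body = Callable[[Iterable[str]], Iterable[str]]
CatchBody = Callable[[Exception], Iterable[str]]


def run_try_catch(try_body: Body, catch_body: CatchBody,
                  source: Iterable[str]) -> Iterator[str]:
    buffered: List[str] = []
    try:
        # The steps inside the try may stream among themselves, but
        # nothing leaves the p:try until they have all finished.
        for doc in try_body(source):
            buffered.append(doc)
    except Exception as error:
        # Failure: throw away everything the try produced.
        buffered.clear()
        # The catch output is not buffered; if the catch fails, the
        # exception simply propagates and the whole step fails.
        yield from catch_body(error)
        return
    # Success: only now release the buffered output downstream.
    yield from buffered
```

Nesting falls out naturally in this model: an outer `run_try_catch` would buffer whatever the inner one yields.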
Buffering all the output is a very conservative approach. It effectively blocks streaming across a p:try. In other words, we force every p:try to pay the penalty for failure, even when it succeeds!
A more optimistic approach would be to generate some sort of checkpoint event and initiate roll back behavior if the p:try failed. Even though this could be hugely more expensive than buffering, you'd only have to pay the price when the p:try actually failed.
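For comparison, here is a rough sketch of what the optimistic variant might look like. Everything in it is hypothetical: `CheckpointedSink` stands in for a downstream consumer that happens to be able to undo its work, which is exactly the capability a real pipeline would have to provide for every step.

```python
from typing import Callable, Iterable, List

Body = Callable[[Iterable[str]], Iterable[str]]
CatchBody = Callable[[Exception], Iterable[str]]


class CheckpointedSink:
    """Hypothetical downstream consumer that can undo back to a checkpoint."""

    def __init__(self) -> None:
        self.received: List[str] = []
        self._marks: List[int] = []

    def checkpoint(self) -> None:
        # Remember how much had been received when the try started.
        self._marks.append(len(self.received))

    def write(self, doc: str) -> None:
        self.received.append(doc)

    def commit(self) -> None:
        # The try succeeded; forget the checkpoint.
        self._marks.pop()

    def rollback(self) -> None:
        # The try failed; discard everything received since the checkpoint.
        del self.received[self._marks.pop():]


def run_try_optimistic(try_body: Body, catch_body: CatchBody,
                       source: Iterable[str], sink: CheckpointedSink) -> None:
    sink.checkpoint()
    try:
        for doc in try_body(source):
            sink.write(doc)  # passed downstream immediately, no buffering
    except Exception as error:
        sink.rollback()
        for doc in catch_body(error):
            sink.write(doc)
    else:
        sink.commit()
```

In the success case nothing is held back, and the cost of a failure moves into `rollback`, which is trivial here only because the sink is a list in memory.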
At least, that's the theory. In practice, some steps would still have to block. For example, there's no way to “roll back” a p:http-request in the general case. I'm also not motivated to design and implement the infrastructure that would be required to support checkpoints and roll back. Not today, anyway.
Repeat after me: “the simplest thing that can possibily work.”
Comments
“the simplest thing that can possibily (sic) work.”
Yeah, I agree, it's best to keep everything feature oriented for the foreseeable future since it's, in general, more important for reference implementations to be conformant rather than fast.
Not worrying too much about performance would also allow you to get a release out sooner rather than later ;)
We are implementing a .NET version of an XProc engine. Our approach is very simple: all inputs and outputs are XmlDocument objects, that is, they are DOM based, and of course a DOM is easy to cache. For a sequence of input documents, we create a temporary XmlDocument with a special root element, so that we can distinguish a single document from a sequence of documents. It makes our work extremely easy. My first version of the XProc engine took only one week, with processors such as identity, xinclude, xslt, xquery, load, store, choose, foreach, etc. I find XProc very powerful. Our workflow engine is in fact built on it.
Passing around DOM trees definitely simplifies some things. Does your .NET implementation have a home page?