Implementing XProc, I

Volume 10, Issue 38; 25 Apr 2007; last modified 08 Oct 2010

Part the first, in which we consider the heart of the problem.

This essay is part of a series of essays about implementing an XProc processor. XProc: An XML Pipeline Language is a W3C specification for describing a sequence of operations to be performed on one or more XML documents. I'm implementing XProc as the specification progresses. Elsewhere you'll find background about pipelines and other essays about XProc.

I hope that my implementation evolves to be complete and robust; I also hope that it achieves respectable performance, but those are not the most important immediate goals. The most important immediate goal is to produce a conformant implementation of the whole spec. I'll cross the other bridges when I get to them. Presented with a decision about how something should be implemented, I have without reservation selected the answer that seemed easiest.

With that preamble out of the way, let's start in the middle.

The fundamental operation that an XML pipeline processor performs is passing the output of one process to the input of another. Consider a simple, two-step pipeline that expands XIncludes and then runs XSLT. At a high level, the processor:

Starts with an XML document (where that initial document comes from is an orthogonal issue).
Passes that XML document to an XInclude step.
The XInclude step does some work and produces, as its output, a new XML document.
The processor takes that new document and a stylesheet document and passes them both to an XSLT step.
The XSLT step does some work and produces, as its output, a new XML document.
That document is the result of the pipeline (and for the moment, like the source of the initial document, what the processor does with the final result is an orthogonal issue).
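The steps above can be sketched with nothing but the JAXP classes that ship with Java. This is only an illustration of the data flow, not code from the processor: the class name TwoStepPipeline and its methods are made up for the example, and it passes whole DOM documents between steps, which is just one of the options weighed below. It also assumes that XInclude expansion happens while parsing, because that is how JAXP exposes it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: a hard-wired XInclude -> XSLT pipeline over DOM documents.
class TwoStepPipeline {
    // Shared parsing helper; the sketch lumps all checked exceptions together.
    private static Document parse(InputSource in, boolean expandXIncludes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);              // needed for both XInclude and XSLT
        factory.setXIncludeAware(expandXIncludes);
        return factory.newDocumentBuilder().parse(in);
    }

    // Step 1: the "XInclude step" produces a new document with inclusions expanded.
    static Document xinclude(InputSource in) throws Exception {
        return parse(in, true);
    }

    // Step 2: the "XSLT step" takes a source document and a stylesheet document.
    static Document xslt(Document source, Document stylesheet) throws Exception {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer(new DOMSource(stylesheet));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(source), result);
        return (Document) result.getNode();
    }

    // The "processor": hands the output of step 1 to the input of step 2.
    static Document run(InputSource input, InputSource stylesheet) throws Exception {
        Document expanded = xinclude(input);
        Document style = parse(stylesheet, false);
        return xslt(expanded, style);
    }
}
```

Calling TwoStepPipeline.run with an input document and a stylesheet returns the document produced by the XSLT step. Everything interesting about a real pipeline processor lies in replacing that hard-wired hand-off with something general.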

The first question to ask then is, how are we going to pass documents from one step to the next?

There are lots of possibilities: documents could be passed as serialized octet streams, of course, or more efficiently as DOMs or object models of some sort. The steps could be wired together as SAX or StAX filters. StAX events could be passed between them. There are probably other choices too.

In this particular case, I know a little bit about what lies down the road. I know that some steps will have to accept multiple inputs and I know that some output streams will have to be “split” so that multiple steps can use them. I also know that while some components require whole documents, many can operate on streams, never needing the entire document at once.

With those things in mind, I chose to implement the connections between steps using StAX “XMLEvent” objects. This approach has the additional feature that it fits perfectly into the “water flowing through pipes” analogy that's sometimes used to describe pipelines.
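To make "XMLEvent objects" concrete, here is a minimal sketch of events flowing from a reader to a writer using the standard javax.xml.stream API. The class EventFlow and its method names are invented for the example; a real step would do something with each event rather than copy it straight through.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: the simplest possible streaming "step", an identity copy.
class EventFlow {
    // Pull events from the readable side and pour them into the writable side,
    // one at a time, without ever holding the whole document.
    static int copy(XMLEventReader in, XMLEventWriter out) throws XMLStreamException {
        int count = 0;
        while (in.hasNext()) {
            XMLEvent event = in.nextEvent();
            out.add(event);
            count++;
        }
        out.flush();
        return count;   // number of events that passed through
    }

    // Convenience wrapper: parse a string into events and serialize them again.
    static String roundTrip(String xml) throws XMLStreamException {
        XMLEventReader in =
            XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter sink = new StringWriter();
        XMLEventWriter out = XMLOutputFactory.newInstance().createXMLEventWriter(sink);
        copy(in, out);
        out.close();
        return sink.toString();
    }
}
```

Because copy only ever looks at one event at a time, the same loop serves a step that needs the whole document (it can accumulate events) and a step that can work on the stream as it goes by.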

A pipeline is a sequence (or, at any rate, a directed acyclic graph) of steps. The steps are connected by pipes. Just as water flows through the pipes in your home, XMLEvent objects flow through the pipes in my XProc pipelines.

Pipes naturally have a readable end, a faucet you can draw from, and a writable end, a drain into which you can pour things. From inside a step, you can only see the ends of the pipe, sources and sinks, readable pipes and writable pipes. The pipeline processor can see the whole pipe. It looks something like this:

public class Pipe implements ReadablePipe, WritablePipe {
    public XMLEventWriter getWriter() { … }
    public XMLEventReader getReader() { … }
    …
}

(There's more to it, of course, but we'll come back to look at other aspects of pipes later. In particular, we're going to have to deal with sequences of documents.)
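The idea that a step sees only one end of a pipe, while the processor sees the whole thing, can be shown with a deliberately simplified sketch. The names EventSink, EventSource, and SketchPipe are hypothetical, and the ends here expose bare write and read methods instead of a full XMLEventWriter and XMLEventReader, purely to keep the example short. It is single-threaded and handles one document only.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical narrow views of a pipe: one for writing, one for reading.
interface EventSink { void write(XMLEvent event); }

interface EventSource {
    boolean hasNext();
    XMLEvent read();
}

// The processor holds the whole pipe; each step is handed only one interface.
class SketchPipe implements EventSource, EventSink {
    private final Queue<XMLEvent> buffer = new ArrayDeque<>();

    public void write(XMLEvent event) { buffer.add(event); }
    public boolean hasNext() { return !buffer.isEmpty(); }
    public XMLEvent read() { return buffer.remove(); }  // throws if the pipe is empty
}

class PipeEndsDemo {
    // An upstream step: it can pour events in but cannot read them back.
    static void produce(EventSink out) {
        XMLEventFactory events = XMLEventFactory.newInstance();
        out.write(events.createStartElement("", "", "doc"));
        out.write(events.createCharacters("hello"));
        out.write(events.createEndElement("", "", "doc"));
    }

    // A downstream step: it can draw events out but cannot write any.
    static String consume(EventSource in) {
        StringBuilder text = new StringBuilder();
        while (in.hasNext()) {
            XMLEvent event = in.read();
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        return text.toString();
    }
}
```

The processor creates one SketchPipe and passes the same object to produce as a sink and to consume as a source. Neither step can tell what sits between the two ends, which is what leaves the internals free to change.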

The step holding the writable end of the pipe can get the XMLEventWriter and pour events into it. The step holding the readable end of the pipe can get the XMLEventReader and read events from it.

Like a real pipe, events poured in one end don't instantaneously get drawn out on the other. And just because you open the faucet, that doesn't mean there's water ready to flow through the pipe. So the implementation of pipes has to handle some capacity and must be prepared to block the reader while waiting for the writer.

At the moment, the pipe between the two ends is a simple Vector. This will require some synchronization when I enable threading, but for the moment, it's sufficient.
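For comparison, here is a sketch of what a threaded pipe with limited capacity and a blocking reader might look like, using a bounded java.util.concurrent.ArrayBlockingQueue instead of a Vector. This is not the current implementation. The class names are hypothetical, and using the EndDocument event to mark the end of the stream is an assumption made for the example (it would not survive sequences of documents unchanged).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: a pipe with fixed capacity whose ends block as needed.
class BlockingPipe {
    private final BlockingQueue<XMLEvent> queue;

    BlockingPipe(int capacity) { queue = new ArrayBlockingQueue<>(capacity); }

    // Blocks the writer while the pipe is full.
    void write(XMLEvent event) throws InterruptedException { queue.put(event); }

    // Blocks the reader while the pipe is empty.
    XMLEvent read() throws InterruptedException { return queue.take(); }
}

class BlockingPipeDemo {
    // Writes each word as a Characters event on one thread, reads on another.
    static String run(String... words) throws InterruptedException {
        BlockingPipe pipe = new BlockingPipe(2);   // small, so the writer must wait
        XMLEventFactory events = XMLEventFactory.newInstance();

        Thread writer = new Thread(() -> {
            try {
                for (String word : words) {
                    pipe.write(events.createCharacters(word));
                }
                pipe.write(events.createEndDocument());   // assumed end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        StringBuilder text = new StringBuilder();
        for (XMLEvent event = pipe.read(); !event.isEndDocument(); event = pipe.read()) {
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        writer.join();
        return text.toString();
    }
}
```

With a capacity of two, the writer stalls as soon as it gets two events ahead of the reader, and the reader stalls whenever it catches up. The queue does the synchronization that a plain buffer would otherwise need added by hand.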

Comments

You should just be using java.util.concurrent.ArrayBlockingQueue<XMLEvent> right away, and then there's nothing to change.

Indeed, John. Luckily, changing the internals of the Pipe will be painless and invisible to the users of the ReadablePipe and WritablePipe ends.