<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet href="/style/browser.xsl" type="text/xsl"?>
<essay xmlns="http://docbook.org/ns/docbook"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:dc='http://purl.org/dc/elements/1.1/'
       xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:gal='http://norman.walsh.name/rdf/gallery#'
       xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xml:lang="en"
       version='5.0'>
<info>
<title>XProc: An XML Pipeline Language</title>
<volumenum>9</volumenum>
<issuenum>91</issuenum>
<pubdate>2006-09-28T05:45:09-04:00</pubdate>
<date>$Date: 2007-04-05 09:55:02 -0400 (Thu, 05 Apr 2007) $</date>
<author><personname>
<firstname>Norman</firstname><surname>Walsh</surname>
</personname></author>
<copyright><year>2006</year><holder>Norman Walsh</holder></copyright>
<abstract>
<para>The XML Processing Model Working Group has published the First
Public Working Draft of the pipeline language document.</para>
</abstract>
</info>

<epigraph>
<attribution><personname>
<firstname>Robert</firstname><surname>Heinlein</surname>
</personname></attribution>
<para xml:id='p2'>Progress isn't made by early risers. It's made by lazy men
trying to find easier ways to do something.
</para>
</epigraph>

<para xml:id='p1'>I'm delighted to report that the
<link xlink:href="http://www.w3.org/XML/Processing/">XML Processing Model
Working Group</link> has published the First Public Working Draft
of the
<wikipedia page="XML_pipeline">XProc</wikipedia>
specification:
<link xlink:href="http://www.w3.org/TR/xproc/">XProc: An XML Pipeline
Language</link>, our processing model language specification.</para>

<para xml:id='p3'>As you read it, bear in mind that, as a first working draft, it's
not without its rough edges and unresolved issues. Nevertheless, I
think it charts the working group's direction pretty clearly and
covers a good chunk of the distance from
<link xlink:href="http://www.w3.org/TR/xproc-requirements/">requirements</link>
to Recommendation. I think we're
<link xlink:href="http://www.w3.org/XML/Processing/#schedule">on schedule</link>.
(OK, it's a <emphasis>revised</emphasis> schedule, but we're pedaling
as fast as we can.) I'm aware of two, possibly three, implementations
tracking the spec pretty closely, so I'm expecting that we'll have
some implementation experience to guide us the rest of the way.
</para>

<para xml:id='p4'>Please
<link xlink:href="mailto:public-xml-processing-model-comments@w3.org">tell
us</link> what <emphasis>you</emphasis> think.</para>

<para xml:id='p5'>For those of you still wondering what this is all about and why
it makes sense to spend time on a standard processing model
specification, consider the following simple pipeline:</para>

<programlisting><![CDATA[
<p:pipeline xmlns:p="http://www.w3.org/2006/09/xproc"
	    name="pipeline">

<p:declare-input port="document"/>
<p:declare-input port="schema"/>
<p:declare-input port="stylesheet"/>
<p:declare-output port="result" step="transform" source="result"/>

<p:step name="xinclude" type="xinclude">
  <p:input port="document" step="pipeline" source="document"/>
</p:step>

<p:step name="validate" type="validate">
  <p:input port="document" step="xinclude" source="result"/>
  <p:input port="schema" step="pipeline" source="schema"/>
</p:step>

<p:step name="transform" type="xslt">
  <p:input port="document" step="validate" source="result"/>
  <p:input port="stylesheet" step="pipeline" source="stylesheet"/>
</p:step>

</p:pipeline>]]></programlisting>

<para xml:id='p6'>Can you tell me what that pipeline does? I bet you can, without
even reading the specification: it performs XInclude processing on a
document, validates it against a schema, transforms it with XSLT, and
returns the result.</para>

<para xml:id='p7'>Although I think this pipeline is relatively clear, it is
arguably a little bit verbose. I think our final language will have
some defaulting that will simplify many pipelines, including this one.
But, with my chair's hat on, I've asked the working group to postpone
all discussion of abbreviation and defaulting until after we have an
<emphasis>unabbreviated</emphasis> syntax that describes a whole
language on which we have consensus. (I'm of a mind to fiddle with a
compact syntax too at that point.)</para>

<para xml:id='p8'>That explains what the pipeline is, but doesn't really address why
it's a good thing. To understand that, let's look at an alternative. Suppose
you're a Java coder and you're asked to implement the same pipeline using, for
example, <wikipedia>JAXP</wikipedia>. You're likely to come up with something
like this:</para>

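<programlisting>DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
// XInclude, schema validation, and XSLT all need a namespace-aware parse
docFactory.setNamespaceAware(true);
docFactory.setXIncludeAware(true);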
DocumentBuilder builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(xmlURI);

SAXSource xsdSource = new SAXSource(new InputSource(xsdURI));
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(xsdSource);
Validator validator = schema.newValidator();

validator.validate(new DOMSource(doc));

SAXSource xslSource = new SAXSource(new InputSource(xslURI));
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(xslSource);

StreamResult resultStream = new StreamResult(resultURI);
transformer.transform(new DOMSource(doc), resultStream);</programlisting>

<para xml:id='p9'>Could you have as quickly and as easily told me what that code does?
Just on the surface of it, I think the pipeline example wins the clarity
competition hands down. And just in case you're about to observe that this isn't
an apples-to-apples comparison, let's look at what a
<wikipedia page="Java_%28Sun%29">Java</wikipedia> implementation
that <emphasis>uses</emphasis> the pipeline might look like:</para>

<programlisting>Pipeline pipeline = PipelineFactory.newInstance().newPipeline();
pipeline.load(pipelineURI);
pipeline.input("document", xmlURI);
pipeline.input("schema", xsdURI);
pipeline.input("stylesheet", xslURI);
pipeline.output("result", resultURI);
pipeline.run();</programlisting>

<para xml:id='p10'>In fairness, and in the spirit of full disclosure, the code
fragment that demonstrates pipeline use is hypothetical. I expect my
implementation to look something like that, and maybe one day for a
Java standard API to look something like that, but I don't actually
have running code that looks like that today.</para>

<para xml:id='p11'>I think from an ease-of-use and programmer productivity point of
view, pipelines are an obvious win: they're more declarative, they
allow application behavior to be modified (within limits) without
touching a line of code<footnote><para xml:id='p12'>Changing the pipeline so that
it performs validation <emphasis>before</emphasis> XInclude is straightforward.
Changing the Java code in the same way is, uhm, left as an exercise for
the reader.</para></footnote>, and they potentially have much better performance.</para>
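
<para>For readers who do attempt the footnote's exercise, here is one
way it might go. It is a sketch under stated assumptions, not a
definitive answer. JAXP exposes XInclude as a parser setting rather than
as a separate step, so this version parses the document twice: once
unexpanded, for validation, and once with XInclude turned on. The class
and method names (<literal>ValidateThenXInclude</literal>,
<literal>parse</literal>, <literal>validateThenExpand</literal>) are
hypothetical. They come from neither the draft nor any XProc
implementation.</para>

<programlisting language="java"><![CDATA[
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;

// Hypothetical helper: validate the raw document first, expand XInclude second.
public class ValidateThenXInclude {

    // Parse a document, with or without XInclude expansion.
    static Document parse(String uri, boolean xinclude) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);   // required for XInclude and validation
        factory.setXIncludeAware(xinclude);
        return factory.newDocumentBuilder().parse(uri);
    }

    // Validate the unexpanded document, then return the expanded one.
    static Document validateThenExpand(String xmlURI, String xsdURI) throws Exception {
        Document raw = parse(xmlURI, false);

        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(new StreamSource(xsdURI));
        // Throws SAXException if the unexpanded document is invalid.
        schema.newValidator().validate(new DOMSource(raw));

        // Second parse: the same document, this time with includes resolved.
        return parse(xmlURI, true);
    }

    public static void main(String[] args) throws Exception {
        Document doc = validateThenExpand(args[0], args[1]);
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
]]></programlisting>

<para>In the pipeline, by comparison, the same change amounts to
rewiring the <literal>step</literal> and <literal>source</literal>
attributes on the <literal>p:input</literal> elements.</para>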

<para xml:id='p13'>That last point is probably worth a little exploration. There
are at least two ways in which using pipelines can lead
to improved performance. One is a generality: if you're coding up your
processing directly in Java (or C or Ruby or whatever), then
optimization is your problem. If lots of folks are using pipelines
then it makes sense to invest resources in improving the performance
of the code that implements them. Making that code perform better
automatically helps you (and everyone using pipelines).<footnote>
<para xml:id='p14'>Yes, I'm aware that this argument can be
run in reverse, that there are situations where performance is so
critical that hand coding the fastest possible solution is the right
thing to do. That's why there's still assembler code in some
applications. But those applications are rare compared
to the legions of programmers processing XML for a
living.</para></footnote>
</para>

<para xml:id='p15'>A less immediately obvious benefit arises from the fact that
XProc has a fairly large vocabulary of built-in steps. They aren't
all spelled out in the current draft, but they will eventually include
steps to add, rename, and delete elements and attributes; process
regions of a document based on XPath expressions; combine documents;
split documents; extract content; inject content; and so on.</para>

<para xml:id='p16'>Programmers today often implement these operations using XSLT (I
know I do). But in the general case, XSLT doesn't stream very well and
requires building an entire in-memory representation of the document
to be processed. The XProc operations, in contrast, are much simpler
and many of them can (and will) be implemented in components that
operate in a streaming fashion, neither slowing down the pipeline nor
requiring enough memory to load the entire document.</para>
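
<para>As a rough illustration of what operating in a streaming fashion
means in practice, the sketch below renames elements as they flow past,
holding only one parse event in memory at a time. It is not an XProc
step, and it isn't taken from the draft. It uses the StAX event API
(<literal>javax.xml.stream</literal>) from the Java standard library as
a stand-in. The class name <literal>StreamingRename</literal> and its
<literal>rename</literal> method are hypothetical, and namespaces are
ignored for brevity.</para>

<programlisting language="java"><![CDATA[
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration: rename elements without building a tree.
public class StreamingRename {

    // Copy events from in to out, renaming elements whose local name is `from`.
    static void rename(Reader in, Writer out, String from, String to) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(from)) {
                // Rebuild the start tag under the new name, keeping its attributes.
                StartElement start = event.asStartElement();
                event = events.createStartElement(
                    new QName(to), start.getAttributes(), start.getNamespaces());
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(from)) {
                event = events.createEndElement(
                    new QName(to), event.asEndElement().getNamespaces());
            }
            // Each event is written out immediately; nothing accumulates.
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        rename(new InputStreamReader(System.in, StandardCharsets.UTF_8), out, args[0], args[1]);
        out.flush();
    }
}
]]></programlisting>

<para>Because each event is written as soon as it is read, memory use
stays roughly constant no matter how large the input document is. The
equivalent XSLT identity-plus-rename transform would, in the general
case, need the whole tree in memory.</para>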

<para xml:id='p17'>Of course, the important word way back up there in the paragraph
before the pipeline example was “standard”. None of this work is a real
win for users unless they can reasonably expect interoperable implementations.
I want (<emphasis>desperately</emphasis> sometimes) to be able to distribute
pipelines with the same ease that I now distribute XSLT stylesheets. By the
same token, this work isn't a real win for programmers until they can reasonably
expect some basic, interoperable APIs. But those can come a little later, and
they're most emphatically <emphasis>not</emphasis> part of the charter of our
W3C Working Group.</para>

</essay>
