“Default” XML Processing

Volume 13, Issue 13; 09 Apr 2010; last modified 08 Oct 2010

A look at the intersection of the XML model PI, the XML stylesheet PI, and XProc.

What is the “default XML processing model”? That question has been open for a long time, since the very beginning of XML really. There are a lot of different opinions, some of them captured in the W3C’s Technical Architecture Group discussion of the issue “xmlFunctions-34”. (Disclaimer: I contributed to that issue while I was a member of the TAG.)

It's in the charter of the XML Processing Model Working Group, which I chair, to provide an answer to this question. I don't think it has an answer. I don't subscribe to the notion that XML documents have one and only one intrinsic meaning. I think the best we can do is describe one (or a few) possible models and give them labels. That will allow the authors of other specifications, and applications, to say “we do TYPEX processing on XML documents”, where “TYPEX” is one of the labels. That'll give us a shorthand for talking about some common processing models.

That may not seem very satisfying. Maybe it isn't. The point of this essay isn't to make or defend that position. When the WG produces its first public working draft of a document that attempts to answer the “default XML processing model” question, I'll let you know. The right answer isn't what I think it is; it's what community consensus drives us to.

No, the point of this essay is something else, something a little more complicated than I think we would reasonably expect to put in that document (though I could be wrong).

The XML community has had an Associating Style Sheets with XML documents specification for a long time. It will soon have an Associating Schemas with XML documents specification. (That link is to an early editor's draft; there's nothing official yet, but it's coming soon.)

What are the two most common things that many (not all!) users want to do with XML documents? Validate them and transform them.

Well, if the document tells you how to validate it and how to style it, then isn't one possible answer to the default processing question simple: validate like I say and style like I say? If it is, shouldn't we be able to express that processing using an XProc pipeline?

Of course we should. And we can: default.xpl.

I'm happy and relieved to find that we can express that processing in XProc. I'm a little, but only a little, surprised to see how complex that pipeline is. Weighing in at 320+ lines, it works for a few narrow cases. I still need to integrate support for RELAX NG compact syntax schemas and NVDL processing, at least. I may also want to support a few more stylesheet options; I'm not sure.
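
To make the first step of that idea concrete, here is a minimal sketch in Python (not XProc, and not what default.xpl does internally) of how a processor might discover what a document says about its own validation and styling. The function names, the sample document, and the file names doc.rng and doc.xsl are all hypothetical. The pseudo-attribute parsing is deliberately simplified: it ignores character references in values, for example.

```python
import re
from xml.dom import minidom

# Matches pseudo-attributes such as href="doc.rng" inside a PI's data.
# Simplified: character references inside values are not expanded.
PSEUDO_ATTR = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def pseudo_attrs(data):
    """Return the pseudo-attributes of a processing instruction as a dict."""
    return {name: dq or sq for name, dq, sq in PSEUDO_ATTR.findall(data)}


def find_associations(xml_text):
    """Collect xml-model and xml-stylesheet PIs from the document prolog."""
    doc = minidom.parseString(xml_text)
    found = {"xml-model": [], "xml-stylesheet": []}
    for node in doc.childNodes:
        if node is doc.documentElement:
            break  # only PIs that appear before the root element count
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target in found):
            found[node.target].append(pseudo_attrs(node.data))
    return found


# Hypothetical sample document that names its own schema and stylesheet.
SAMPLE = """<?xml version="1.0"?>
<?xml-model href="doc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-stylesheet href="doc.xsl" type="text/xsl"?>
<doc><p>Hello</p></doc>"""

if __name__ == "__main__":
    for target, items in find_associations(SAMPLE).items():
        for attrs in items:
            print(target, "->", attrs.get("href"))
```

A real pipeline would then have to load those resources, pick a validator based on the schema type, and choose among multiple stylesheets, which is presumably where much of the complexity comes from.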

Comments

""" if the document tells you how to validate it and how to style it """

The problem is that this is rarely what we want (well ok, "rarely" is rather subjective, but that's my experience). For some cases, I guess that could be useful, but I expect those cases to be rather marginal.

—Posted by Florent Georges on 09 Apr 2010 @ 04:31 UTC #

I don't know. I see useful XML Stylesheet PIs all the time and there's been clamoring for the XML Model PI for years, suggesting that folks at least think it would be useful. Several editors already support it (or their own flavor of it).

As an author, I know what schema I'm authoring to and I usually know what style I want to use for my own display purposes (not likely to be the same as the production stylesheet someone else will apply, but that's not my problem while I'm authoring).

So if PIs point to those artifacts in a consistent way, maybe I can do more processing with less work. "Edit this" and the editor uses those PIs to find the documents it needs. "Format this" and the pipeline uses those PIs to find the documents it needs. Etc.

The PIs probably have no value (perhaps even pose a threat) in interop scenarios, but for authors working in a common environment, I bet they can be handy.

—Posted by Norman Walsh on 09 Apr 2010 @ 05:18 UTC #

IIRC James Clark once said that 'processing an instance' is a user application, hence the schema|stylesheet is a property of that combination of XML instance and Schema|Stylesheet. Different applications will use different tools/data to process the XML. Why pick on these two as 'default' and give them special status? Common yes, default? I'm less convinced.

DaveP

—Posted by Dave Pawson on 10 Apr 2010 @ 06:53 UTC #

In fact, no one has ever claimed that documents can be processed in one and only one way, have they?

The issue is whether there should be a default processing that can be expected by some defined class of applications like web browsers.

When we look at the technologies that have failed, like xml:link, xinclude, xml:id, xfragments and so on, we see that none of them have a reliable processing model. Not having a processing model is the #1 predictor of failure for these standards. Who is going to use them if they don't know whether the browser at the other end can use them?

So it is great if XProc can be used with the stylesheet PI.

—Posted by Rick Jelliffe on 10 Apr 2010 @ 11:33 UTC #

Any chance of seeing what's in tee.xpl?

—Posted by James Fuller on 10 Apr 2010 @ 05:33 UTC #

Oh, sure. That's just a little debugging step. I didn't actually mean to leave the import in place :-)
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                type="cx:tee" name="main" version="1.0">
  <p:input port="source" sequence="true" primary="true"/>
  <p:output port="result" sequence="true" primary="true"/>
  <!-- href: where to write the debugging copy; debug: non-zero enables it -->
  <p:option name="href" required="true"/>
  <p:option name="debug" select="0"/>

  <!-- When debugging, save a copy of the input; otherwise discard it -->
  <p:choose>
    <p:when test="$debug != 0">
      <p:store name="saving-debugging-output" method="xml" indent="true">
        <p:with-option name="href" select="$href"/>
      </p:store>
    </p:when>
    <p:otherwise>
      <p:sink name="discarding-debugging-output"/>
    </p:otherwise>
  </p:choose>

  <!-- Either way, pass the original input through unchanged -->
  <p:identity name="identity">
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
  </p:identity>
</p:declare-step>

—Posted by Norman Walsh on 10 Apr 2010 @ 06:52 UTC #

How does/will this interact with xsi:schemaLocation? (which seems to go even farther down the road of tying the content of a particular document to a particular processing frame)

-m

—Posted by Micah Dubinko on 11 Apr 2010 @ 05:03 UTC #

I think schema location hints come into play if the document doesn't have an xml-model PI. (And probably if it uses more than one namespace.) I think the default XML Schema validation behavior should be to obey schema location hints and to follow namespace URIs. So it should all "just work" if it can work.

In the RELAX NG and Schematron cases, I don't know what, if any, fallback behavior should be applied. Maybe just skip validation if there's no xml-model PI.

And maybe in the XML Schema case, if there are no hints, validation should be skipped (or only performed at user option).

Something like that.

—Posted by Norman Walsh on 11 Apr 2010 @ 03:37 UTC #
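
The fallback order floated in the reply above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not part of default.xpl: the function name choose_validation and its return labels are hypothetical, the xml-model PIs are assumed to have been parsed already into dictionaries of pseudo-attributes, and the "follow namespace URIs" idea is left out entirely.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def choose_validation(model_pis, xml_text):
    """Pick a validation strategy for a document (hypothetical sketch).

    model_pis: list of dicts of xml-model pseudo-attributes, already parsed.
    Returns a (label, locations) pair.
    """
    # 1. An xml-model PI wins if the document has one.
    if model_pis:
        return ("xml-model", [pi.get("href") for pi in model_pis])

    # 2. Otherwise, fall back to XML Schema location hints, if any.
    hints = []
    for element in ET.fromstring(xml_text).iter():
        pairs = element.get("{%s}schemaLocation" % XSI)
        if pairs:
            # The value is a list of (namespace, location) pairs;
            # keep only the locations.
            hints.extend(pairs.split()[1::2])
        location = element.get("{%s}noNamespaceSchemaLocation" % XSI)
        if location:
            hints.append(location)
    if hints:
        return ("xsd-hints", hints)

    # 3. No PI and no hints: skip validation.
    return ("skip", [])


if __name__ == "__main__":
    doc = ('<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
           'xsi:schemaLocation="urn:x x.xsd"/>')
    print(choose_validation([], doc))
```

Whether the last case should skip validation silently or only at user option is exactly the open question in the comment; the sketch simply picks "skip".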

Thx for this posting Norman ... I think it revives an important discussion.

On first read, I decided to decompose the accompanying XProc and I too was a little surprised at the complexity ... I may have an alternate approach but for now I think we should include these kinds of advanced things in an informal area of the XProc test suite that includes others (like Mohamed's recursive XProc example from some time ago, etc).

Apart from transformation and validation maybe we should consider the concept of 'comprehension' ... I am not suggesting autodiscovery, but anything to help the author communicate intentional usage ... this could be as simple as generating javadoc style html documentation with svg diagram.

I know, it's (very) debatable that this is a common XML processing scenario but I think it could be an important adjunct, especially considering the complexities that come with nested importing and different technologies (XQuery, XSLT, XPath).

—Posted by James Fuller on 12 Apr 2010 @ 09:27 UTC #