XProc + RDF

Volume 16, Issue 10; 12 Oct 2013

I finally got around to implementing some RDF extension steps for XML Calabash. I think they play nicely with semantics support in MarkLogic 7 and I'm really quite pleased with the results.

Semantics, by which I mean RDF, is hot again. Support for triples is one of the most talked-about new features in MarkLogic 7.

There are a lot of triples in this weblog (whack “.rdf” onto the end of the URI and take a look) but they aren't really used very much and they're not published in a very useful way.

One way that they could be published is in RDFa. I have friends who are big fans and it's something I've been wanting to explore. But I'm one of those “validation guys”. You know, the sort who think that it's important to check that the inputs and outputs from a process are actually correct; not just whether or not they “look right”.

Before I invest time trying to use RDFa, I want a tool that will let me check the triples I've encoded, to see if they're complete and correct. (A tool I can run locally, not a web service.) One of the things that didn't make it into MarkLogic 7 was native support for pulling triples out of RDFa documents.

It occurred to me that an XProc step that could perform that task and produce results that would be easy to ingest into MarkLogic server would both satisfy my immediate requirements and also possibly be useful to a bunch of folks.

After some struggles with RDFa parsing libraries that didn't work reliably, I settled on the Semargl parser. Here's the step declaration:

<p:declare-step type="cx:rdfa"
                xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>
  <p:option name="max-triples-per-document" select="100"/>
</p:declare-step>
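As a minimal sketch of how the step might be called, the pipeline below repeats the declaration inline and hands its own source document straight to cx:rdfa. It assumes the input is well-formed XML (XHTML+RDFa, say) and an XML Calabash release that includes the step; the option is left at its default.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
  <!-- The RDFa document to check; assumed to be well-formed XML. -->
  <p:input port="source"/>
  <!-- Receives the sem:triples documents produced by cx:rdfa. -->
  <p:output port="result" sequence="true"/>

  <!-- Declaration repeated from above so the step type is known here. -->
  <p:declare-step type="cx:rdfa">
    <p:input port="source"/>
    <p:output port="result" sequence="true"/>
    <p:option name="max-triples-per-document" select="100"/>
  </p:declare-step>

  <!-- Reads the pipeline's source port by default; its result port
       becomes the pipeline's result. -->
  <cx:rdfa/>
</p:declare-step>
```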

On 12 October 2013, using the Semargl 0.6.1 libraries, the following triples are extracted from http://examples.tobyinkster.co.uk/hcard (given the intended purpose of the page, I'm surprised more triples aren't found; perhaps the page is encoded in a way that the Semargl libraries don't recognize):

<sem:triples xmlns:sem="http://marklogic.com/semantics">
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard</sem:subject>
      <sem:predicate>http://purl.org/dc/terms/abstract</sem:predicate>
      <sem:object xml:lang="en">This page is intended to be a demonstration of
                the use of RDFa (including FOAF, Dublin Core and W3C PIM vocabularies) in
                conjunction with Microformats (including hCard and rel-tag).</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
      <sem:predicate>http://www.w3.org/2006/vcard/ns#category</sem:predicate>
      <sem:object xml:lang="en">Counter-Terrorist Unit</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
      <sem:predicate>http://xmlns.com/foaf/0.1/plan</sem:predicate>
      <sem:object xml:lang="en">I will kick your terrorist ass!</sem:object>
   </sem:triple>
</sem:triples>

The format of a sem:triples file is straightforward: it contains a set of one or more sem:triple elements. Each sem:triple in turn contains a sem:subject, a sem:predicate, and a sem:object.

The subject and predicate are always IRIs; the object is either an IRI or a literal value. The object is an IRI unless it has a datatype or xml:lang attribute, in which case it is a literal.

If any IRI begins with “http://marklogic.com/semantics/blank/”, it represents a blank node.
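To make those rules concrete, here is a small hand-written sketch (not step output) showing each case. The example.org IRIs and the blank node identifier are made up for illustration, and I'm assuming the datatype attribute carries the full datatype IRI.

```xml
<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <!-- No datatype or xml:lang attribute: the object is an IRI. -->
  <sem:triple>
    <sem:subject>http://example.org/people#alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
    <sem:object>http://example.org/people#bob</sem:object>
  </sem:triple>
  <!-- A datatype attribute: the object is a typed literal. -->
  <sem:triple>
    <sem:subject>http://example.org/people#alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/age</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#integer">42</sem:object>
  </sem:triple>
  <!-- The subject IRI starts with the blank-node prefix, so it is a
       blank node; the xml:lang attribute makes the object a literal.
       The identifier after the prefix is hypothetical. -->
  <sem:triple>
    <sem:subject>http://marklogic.com/semantics/blank/1234</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/name</sem:predicate>
    <sem:object xml:lang="en">Bob</sem:object>
  </sem:triple>
</sem:triples>
```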

This format is a serialization of the internal format that MarkLogic uses to represent semantics data. It's convenient for me and easy to convert into other formats.

It seemed odd to me at this point that I could load data parsed from RDFa documents, but couldn't actually load triples from RDF files. One of the things I did for MarkLogic 7 was extend mlcp to load semantics data. Building on that experience, I cooked up an RDF loader:

<p:declare-step type="cx:rdf-load"
                xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>
  <p:option name="href" required="true"/>
  <p:option name="language"/>
  <p:option name="graph"/>
  <p:option name="max-triples-per-document" select="100"/>
</p:declare-step>

It accepts zero or more sem:triples documents on its source port. Those documents are loaded into a model. The document identified by the href option is loaded and added to that model. The result is serialized in sem:triples documents on the result port. If a graph is specified, only triples in that named graph are output.
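A minimal sketch of the "zero documents" case: the source port is explicitly bound to an empty sequence, so the model contains only what the file contributes. The file name data.rdf is hypothetical, and the language and graph options are simply left unset here.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
  <!-- Receives the sem:triples documents built from the loaded file. -->
  <p:output port="result" sequence="true"/>

  <!-- Declaration repeated so the step type is known in this pipeline. -->
  <p:declare-step type="cx:rdf-load">
    <p:input port="source" sequence="true"/>
    <p:output port="result" sequence="true"/>
    <p:option name="href" required="true"/>
    <p:option name="language"/>
    <p:option name="graph"/>
    <p:option name="max-triples-per-document" select="100"/>
  </p:declare-step>

  <!-- data.rdf is a placeholder file name. -->
  <cx:rdf-load href="data.rdf">
    <!-- This pipeline has no input of its own, so the source port
         must be bound explicitly; p:empty supplies zero documents. -->
    <p:input port="source">
      <p:empty/>
    </p:input>
  </cx:rdf-load>
</p:declare-step>
```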

Once you can load a bunch of data, it seems weird not to be able to do anything with it. If it was just me, I'd store the documents in MarkLogic and run the query there, but since I already had the Jena libraries around, it was easy to construct a SPARQL step:

<p:declare-step type="cx:sparql"
                xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <p:input port="source" sequence="true" primary="true"/>
  <p:input port="query"/>
  <p:output port="result" sequence="true"/>
</p:declare-step>

This step runs the query and returns the results in SPARQL Query Results XML Format.
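For readers who haven't seen that format, here is roughly what a SELECT query asking for the foaf:plan of #jack in the triples above might produce. This is a hand-written sketch of the shape the W3C format defines, not pasted step output, and the variable name plan is hypothetical.

```xml
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <!-- head lists the variables named in the SELECT clause. -->
  <head>
    <variable name="plan"/>
  </head>
  <!-- results holds one result element per solution. -->
  <results>
    <result>
      <!-- A binding wraps a uri, literal, or bnode element. -->
      <binding name="plan">
        <literal xml:lang="en">I will kick your terrorist ass!</literal>
      </binding>
    </result>
  </results>
</sparql>
```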

And finally, by now, it seems odd that I've got no way to serialize the results in an interoperable semantics format. Enter the RDF store step:

<p:declare-step type="cx:rdf-store"
                xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <p:input port="source" sequence="true"/>
  <p:output port="result" primary="false"/>
  <p:option name="href"/>
  <p:option name="language"/>
  <p:option name="graph"/>
</p:declare-step>

The model loaded from the documents that arrive on the source port is serialized in the specified format. If the graph is specified, only triples in that named graph are output.

But serializing isn't really what I want to do. I want to store the documents in MarkLogic server. I wrote a step to make that easier:

<p:declare-step type="ml:store-triples"
                xmlns:ml="http://xmlcalabash.com/ns/extensions/marklogic"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:sem="http://marklogic.com/semantics">
  <p:input port="source" sequence="true"/>
  <p:output port="result" primary="false" sequence="true">
    <p:pipe step="loop" port="result"/>
  </p:output>
  <p:option name="host"/>
  <p:option name="port"/>
  <p:option name="user"/>
  <p:option name="password"/>
  <p:option name="collections"/>

  <p:declare-step type="ml:insert-document">
    <p:input port="source"/>
    <p:output port="result" primary="false"/>
    <p:option name="host"/>
    <p:option name="port"/>
    <p:option name="user"/>
    <p:option name="password"/>
    <p:option name="content-base"/>
    <p:option name="uri" required="true"/>
    <p:option name="buffer-size"/>
    <p:option name="collections"/>
    <p:option name="format"/>
    <p:option name="language"/>
    <p:option name="locale"/>
    <p:option name="auth-method"/>
  </p:declare-step>

  <p:for-each name="loop">
    <p:iteration-source select="/sem:triples[cx:database-uri]"/>
    <p:output port="result">
      <p:pipe step="insert" port="result"/>
    </p:output>

    <p:variable name="uri" select="/sem:triples/cx:database-uri"/>
    <p:variable name="graph-name" select="/sem:triples/cx:graph-name"/>

    <p:delete match="/sem:triples/cx:graph-name"/>
    <p:delete match="/sem:triples/cx:database-uri"/>

    <p:choose name="insert">
      <p:when test="p:value-available('collections')">
        <p:output port="result">
          <p:pipe step="ml" port="result"/>
        </p:output>
        <ml:insert-document name="ml" format="xml">
          <p:with-option name="host" select="$host"/>
          <p:with-option name="port" select="$port"/>
          <p:with-option name="user" select="$user"/>
          <p:with-option name="password" select="$password"/>
          <p:with-option name="collections"
                         select="concat($collections, ' ', $graph-name)"/>
          <p:with-option name="uri" select="$uri"/>
        </ml:insert-document>
      </p:when>
      <p:otherwise>
        <p:output port="result">
          <p:pipe step="gins" port="result"/>
        </p:output>

        <p:choose name="gins">
          <p:when test="$graph-name = ''">
            <p:output port="result">
              <p:pipe step="ml" port="result"/>
            </p:output>
            <ml:insert-document name="ml" format="xml">
              <p:with-option name="host" select="$host"/>
              <p:with-option name="port" select="$port"/>
              <p:with-option name="user" select="$user"/>
              <p:with-option name="password" select="$password"/>
              <p:with-option name="uri" select="$uri"/>
            </ml:insert-document>
          </p:when>
          <p:otherwise>
            <p:output port="result">
              <p:pipe step="ml" port="result"/>
            </p:output>
            <ml:insert-document name="ml" format="xml">
              <p:with-option name="host" select="$host"/>
              <p:with-option name="port" select="$port"/>
              <p:with-option name="user" select="$user"/>
              <p:with-option name="password" select="$password"/>
              <p:with-option name="collections" select="$graph-name"/>
              <p:with-option name="uri" select="$uri"/>
            </ml:insert-document>
          </p:otherwise>
        </p:choose>
      </p:otherwise>
    </p:choose>
  </p:for-each>
</p:declare-step>

This step relies on the fact that my XProc RDF steps store a little extra data in the sem:triples documents: a randomly generated database URI and the name of the graph (if the triples come from a named graph).
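As a sketch of what the pipeline is selecting on, a document arriving at ml:store-triples might look something like this. The element names come from the XPath expressions in the pipeline; the URI, the graph name, and the position of the two extra elements within sem:triples are made up for illustration.

```xml
<sem:triples xmlns:sem="http://marklogic.com/semantics"
             xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <!-- Hypothetical values; both elements are deleted by the pipeline
       before the document is inserted. -->
  <cx:database-uri>/triples/0123456789abcdef.xml</cx:database-uri>
  <cx:graph-name>http://example.org/graphs/people</cx:graph-name>
  <sem:triple>
    <sem:subject>http://example.org/people#alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/name</sem:predicate>
    <sem:object xml:lang="en">Alice</sem:object>
  </sem:triple>
</sem:triples>
```

With a document like this and no collections option, the pipeline takes the inner p:otherwise branch and inserts the document at the given URI with the graph name as its only collection.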

I hope that the XProc Working Group's efforts to make XProc V.next easier to use will simplify pipelines like this one.

In the meantime, I hope these steps are useful. They'll be in the next XML Calabash release. Share and enjoy.