XProc + RDF in use

Volume 16, Issue 11; 13 Oct 2013

Putting my extension steps to use.

Apparently it wasn't clear yesterday why I went to all the trouble I described. At least, I got mail to that effect.

The only step I really needed was the cx:rdfa step; everything else was just me satisfying a compulsion to make something complete and possibly useful to others.

I wanted the cx:rdfa step so that I could check the RDFa on pages, particularly any that I write. With the RDFa step in hand, I banged out a quick pipeline to use it:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0" exclude-inline-prefixes="c cx">
  <p:input port="parameters" kind="parameter"/>
  <p:output port="result" sequence="true"/>
  <p:serialization port="result" indent="true"/>
  <p:option name="uri" required="true"/>

  <p:declare-step type="cx:rdfa">
    <p:input port="source"/>
    <p:output port="result" sequence="true"/>
    <p:option name="max-triples-per-document" select="10000"/>
  </p:declare-step>

  <p:template>
    <p:input port="template">
      <p:inline><c:request method="get" href="{$uri}"/></p:inline>
    </p:input>
    <p:input port="source"><p:empty/></p:input>
    <p:with-param name="uri" select="$uri"/>
  </p:template>

  <p:http-request/>

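  <!-- XHTML arrives already parsed; HTML arrives as escaped text in c:body -->
  <p:choose xmlns:h="http://www.w3.org/1999/xhtml">
    <p:when test="/h:html">
      <p:identity/>
    </p:when>
    <p:otherwise>
      <!-- Parse the escaped HTML, then discard the c:body wrapper -->
      <p:unescape-markup content-type="text/html"/>
      <p:unwrap match="/c:body"/>
    </p:otherwise>
  </p:choose>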

  <cx:rdfa name="rdfa"/>
</p:declare-step>

The only really interesting bit is the use of p:http-request, which lets me deal with text/html documents as well as application/xhtml+xml documents.
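To make the two branches of the p:choose concrete: for an XML media type, p:http-request returns the parsed document itself, so /h:html matches. For a text media type such as text/html, it returns the response as escaped text wrapped in a c:body element. The following is only an illustrative sketch of that second case; the page content is made up, and the exact content-type value will depend on the server:

```xml
<!-- Illustrative only: roughly what p:http-request yields for a text/html response -->
<c:body xmlns:c="http://www.w3.org/ns/xproc-step"
        content-type="text/html">&lt;!DOCTYPE html&gt;
&lt;html&gt;&lt;head&gt;&lt;title&gt;Example&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;&lt;p&gt;Hello&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</c:body>
```

In the pipeline, p:unescape-markup parses that text into elements inside c:body, and p:unwrap then removes the wrapper so that the html element becomes the document element handed to cx:rdfa.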

The first thing I did with this pipeline was point it at my homepage.

$ calabash rdfa.xpl uri=http://nwalsh.com

I didn't actually expect to get any results because I wasn't aware that I'd attempted to put any RDFa on that page (yet).

<sem:triples xmlns:sem="http://marklogic.com/semantics">
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://www.w3.org/1999/xhtml/vocab#stylesheet</sem:predicate>
      <sem:object>http://nwalsh.com/css/tabs.css</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://www.w3.org/1999/xhtml/vocab#stylesheet</sem:predicate>
      <sem:object>http://nwalsh.com/css/nwalsh.css</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://www.w3.org/1999/xhtml/vocab#icon</sem:predicate>
      <sem:object>http://nwalsh.com/images/nwalsh-icon16.png</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://www.w3.org/1999/xhtml/vocab#stylesheet</sem:predicate>
      <sem:object>http://nwalsh.com/css/website.css</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://marklogic.com/semantics/blank/6be75486b8c7433/_:n0sbl</sem:subject>
      <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
      <sem:object>http://www.w3.org/ns/rdfa#UnresolvedTerm</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://marklogic.com/semantics/blank/6be75486b8c7433/_:n0sbl</sem:subject>
      <sem:predicate>http://www.w3.org/ns/rdfa#context</sem:predicate>
      <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">Can't resolve term publisher at 88:103</sem:object>
   </sem:triple>
</sem:triples>

Holy cow! There's RDFa on that page! What's more, there's broken RDFa on that page! So, from my point of view, yesterday's effort has already paid for itself.

It turns out that I don't care much about the XHTML vocabulary triples. Yes, there are stylesheets and icons and such, but that's a distraction. What I really care about is the unresolved terms, so I expanded my pipeline a bit to clean up the output.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:sem="http://marklogic.com/semantics"
                version="1.0" exclude-inline-prefixes="c cx sem">
  <p:input port="parameters" kind="parameter"/>
  <p:output port="result" sequence="true"/>
  <p:serialization port="result" indent="true"/>
  <p:option name="uri" required="true"/>

  <p:declare-step type="cx:rdfa">
    <p:input port="source"/>
    <p:output port="result" sequence="true"/>
    <p:option name="max-triples-per-document" select="10000"/>
  </p:declare-step>

  <p:variable name="r.type"  select="'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'"/>
  <p:variable name="r.term"  select="'http://www.w3.org/ns/rdfa#UnresolvedTerm'"/>

  <p:template>
    <p:input port="template">
      <p:inline><c:request method="get" href="{$uri}"/></p:inline>
    </p:input>
    <p:input port="source"><p:empty/></p:input>
    <p:with-param name="uri" select="$uri"/>
  </p:template>

  <p:http-request/>

  <p:choose xmlns:h="http://www.w3.org/1999/xhtml">
    <p:when test="/h:html">
      <p:identity/>
    </p:when>
    <p:otherwise>
      <p:unescape-markup content-type="text/html"/>
      <p:unwrap match="/c:body"/>
    </p:otherwise>
  </p:choose>

  <cx:rdfa name="rdfa"/>

  <p:template name="note">
    <p:input port="template">
      <p:inline><note>Suppressed {$count} HTML vocabulary triples</note></p:inline>
    </p:input>
    <p:with-param name="count"
                  select="count(//sem:triple[starts-with(sem:predicate,
                                      'http://www.w3.org/1999/xhtml/vocab#')])"/>
  </p:template>

  <p:insert match="/*" position="first-child">
    <p:input port="source">
      <p:pipe step="rdfa" port="result"/>
    </p:input>
    <p:input port="insertion">
      <p:pipe step="note" port="result"/>
    </p:input>
  </p:insert>

  <p:delete name="delete"
            match="sem:triple[starts-with(sem:predicate,
                              'http://www.w3.org/1999/xhtml/vocab#')]"/>

  <p:choose>
    <p:when test="//sem:triple[sem:predicate=$r.type and sem:object=$r.term]">
      <p:template name="warning">
        <p:input port="template">
          <p:inline><WARNING>{$count} unresolved terms</WARNING></p:inline>
        </p:input>
        <p:with-param name="count"
                      select="count(//sem:triple[sem:predicate=$r.type
                                                 and sem:object=$r.term])"/>
      </p:template>
      <p:insert match="/*" position="last-child">
        <p:input port="source">
          <p:pipe step="delete" port="result"/>
        </p:input>
        <p:input port="insertion">
          <p:pipe step="warning" port="result"/>
        </p:input>
      </p:insert>
    </p:when>
    <p:otherwise>
      <p:identity/>
    </p:otherwise>
  </p:choose>

</p:declare-step>

Much better:

<sem:triples xmlns:sem="http://marklogic.com/semantics">
   <note>Suppressed 4 HTML vocabulary triples</note>
   <sem:triple>
      <sem:subject>http://marklogic.com/semantics/blank/b35ddb2f3a791188/_:n0sbl</sem:subject>
      <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
      <sem:object>http://www.w3.org/ns/rdfa#UnresolvedTerm</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://marklogic.com/semantics/blank/b35ddb2f3a791188/_:n0sbl</sem:subject>
      <sem:predicate>http://www.w3.org/ns/rdfa#context</sem:predicate>
      <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">Can't resolve term publisher at 88:103</sem:object>
   </sem:triple>
   <WARNING>1 unresolved terms</WARNING>
</sem:triples>

Next, I fixed the broken RDFa by adding an appropriate vocabulary. Much, much better.

<sem:triples xmlns:sem="http://marklogic.com/semantics">
   <note>Suppressed 4 HTML vocabulary triples</note>
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://www.w3.org/ns/rdfa#usesVocabulary</sem:predicate>
      <sem:object>http://schema.org/</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://nwalsh.com/</sem:subject>
      <sem:predicate>http://schema.org/publisher</sem:predicate>
      <sem:object>https://plus.google.com/u/0/112652695581272951676</sem:object>
   </sem:triple>
</sem:triples>
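The markup change itself isn't shown here. As a rough guess at the kind of fix involved, assuming the page had a link with rel="publisher" and no vocabulary in scope, declaring schema.org with a vocab attribute would let the term resolve. The element and the placement of the attribute are assumptions, not the actual source of the page:

```xml
<!-- Hypothetical markup; the real page source is not shown in this post -->
<link vocab="http://schema.org/"
      rel="publisher"
      href="https://plus.google.com/u/0/112652695581272951676"/>
```

With a vocabulary in scope, an RDFa 1.1 processor resolves the bare term publisher to http://schema.org/publisher and also reports the vocabulary with rdfa:usesVocabulary, which is consistent with the two triples in the output. Scoping vocab to the one element, rather than to the whole page, should leave terms such as stylesheet and icon resolving as they did before.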

Then I started pointing my new tool at random web pages. Who'd have guessed, for example, that there's so much RDFa on the MarkLogic homepage? Kind of broken RDFa, apparently, but still. And really, when there are that many triples, maybe a simpler summary format would be better. Let's bang in some XSLT:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:sem="http://marklogic.com/semantics"
                version="1.0" exclude-inline-prefixes="c cx">
  <p:input port="parameters" kind="parameter"/>
  <p:output port="result" sequence="true"/>
  <p:serialization port="result" method="text"/>
  <p:option name="uri" required="true"/>

  <p:declare-step type="cx:rdfa">
    <p:input port="source"/>
    <p:output port="result" sequence="true"/>
    <p:option name="max-triples-per-document" select="10000"/>
  </p:declare-step>

  <p:template>
    <p:input port="template">
      <p:inline><c:request method="get" href="{$uri}"/></p:inline>
    </p:input>
    <p:input port="source"><p:empty/></p:input>
    <p:with-param name="uri" select="$uri"/>
  </p:template>

  <p:http-request/>

  <p:choose xmlns:h="http://www.w3.org/1999/xhtml">
    <p:when test="/h:html">
      <p:identity/>
    </p:when>
    <p:otherwise>
      <p:unescape-markup content-type="text/html"/>
      <p:unwrap match="/c:body"/>
    </p:otherwise>
  </p:choose>

  <cx:rdfa name="rdfa"/>

  <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:sem="http://marklogic.com/semantics"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:output method="xml"/>

<xsl:variable name="v.ss"   select="'http://www.w3.org/1999/xhtml/vocab#stylesheet'"/>
<xsl:variable name="v.icon" select="'http://www.w3.org/1999/xhtml/vocab#icon'"/>
<xsl:variable name="r.type" select="'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'"/>
<xsl:variable name="r.term" select="'http://www.w3.org/ns/rdfa#UnresolvedTerm'"/>
<xsl:variable name="r.ctxt" select="'http://www.w3.org/ns/rdfa#context'"/>

<xsl:template match="/">
  <doc>
    <xsl:apply-templates/>
  </doc>
</xsl:template>

<xsl:template match="sem:triples">
  <xsl:variable name="unresolvedterms"
                select="sem:triple[sem:predicate = $r.type and sem:object = $r.term]"/>
  <xsl:variable name="unresolved"
                select="sem:triple[sem:subject = $unresolvedterms/sem:subject]"/>

  <xsl:variable name="stylesheets"
                select="sem:triple[sem:predicate = $v.ss]"/>

  <xsl:variable name="icons"
                select="sem:triple[sem:predicate = $v.icon]"/>

  <xsl:variable name="skip" select="$unresolved | $stylesheets | $icons"/>

  <xsl:value-of select="concat('Omitting stylesheets: ', count($stylesheets),
                               ' and icons: ', count($icons), '&#10;')"/>


  <xsl:for-each-group select="sem:triple except $skip" group-by="sem:subject">
    <xsl:sort select="current-grouping-key()"/>
    <xsl:value-of select="current-grouping-key()"/>

    <xsl:if test="count(current-group()) &gt; 1">
      <xsl:text> (</xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text> triples)</xsl:text>
    </xsl:if>

    <xsl:text>&#10;</xsl:text>

    <xsl:for-each-group select="current-group()" group-by="sem:predicate">
      <xsl:sort select="current-grouping-key()"/>
      <xsl:text>  </xsl:text>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:if test="count(current-group()) &gt; 1">
        <xsl:text> (</xsl:text>
        <xsl:value-of select="count(current-group())"/>
        <xsl:text>)</xsl:text>
      </xsl:if>
      <xsl:text>&#10;</xsl:text>

      <xsl:for-each select="current-group()">
        <xsl:text>    </xsl:text>
        <xsl:choose>
          <xsl:when test="sem:object/@xml:lang">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="sem:object"/>
            <xsl:text>"@</xsl:text>
            <xsl:value-of select="sem:object/@xml:lang"/>
          </xsl:when>
          <xsl:when test="sem:object/@datatype">
            <xsl:text>"</xsl:text>
            <xsl:value-of select="sem:object"/>
            <xsl:text>"^^</xsl:text>
            <xsl:value-of select="substring-after(sem:object/@datatype, '#')"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="sem:object"/>
          </xsl:otherwise>
        </xsl:choose>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each-group>
  </xsl:for-each-group>

  <xsl:variable name="uterms" as="xs:string*">
    <xsl:for-each select="$unresolved[sem:predicate = $r.ctxt]/sem:object">
      <xsl:value-of select="substring-before(substring-after(., 'term '), ' at')"/>
    </xsl:for-each>
  </xsl:variable>

  <xsl:if test="exists($uterms)">
    <xsl:text>&#10;WARNING: Unresolved terms: </xsl:text>
    <xsl:value-of select="distinct-values($uterms)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:if>
</xsl:template>

<xsl:template match="attribute()|text()|comment()|processing-instruction()">
  <xsl:copy/>
</xsl:template>

</xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>

</p:declare-step>

(Honestly, I wouldn't usually stick the stylesheet inline like that, but it does make the pipeline a single, self-contained file.)
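For comparison, pointing p:xslt at an external stylesheet only changes the binding on the stylesheet port. A minimal sketch, assuming the stylesheet has been saved next to the pipeline as summary.xsl (a hypothetical file name):

```xml
<p:xslt xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="stylesheet">
    <!-- summary.xsl is a hypothetical file name, resolved relative to the pipeline -->
    <p:document href="summary.xsl"/>
  </p:input>
</p:xslt>
```

The source and parameters ports are left unbound here, just as in the inline version, so they pick up the default connections.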

These results are definitely easier to read:

Omitting stylesheets: 7 and icons: 1
http://marklogic.com/semantics/blank/35c308ff3cd44ad2/_:n24sbl
  http://www.w3.org/1999/xhtml/vocab#role
    http://www.w3.org/1999/xhtml/vocab#main
http://marklogic.com/semantics/blank/35c308ff3cd44ad2/_:n2sbl
  http://www.w3.org/1999/xhtml/vocab#role
    http://www.w3.org/1999/xhtml/vocab#main
http://marklogic.com/semantics/blank/35c308ff3cd44ad2/_:n3sbl
  http://www.w3.org/1999/xhtml/vocab#role
    http://www.w3.org/1999/xhtml/vocab#main
http://www.marklogic.com/ (6 triples)
  http://ogp.me/ns#description
    "MarkLogic is the trusted enterprise NoSQL platform for Big Data applications to drive revenue, streamline operations, manage risk, and make the world safer."@en-US
  http://ogp.me/ns#locale
    "en_US"@en-US
  http://ogp.me/ns#site_name
    "MarkLogic"@en-US
  http://ogp.me/ns#title
    "Enterprise NoSQL Database | MarkLogic"@en-US
  http://ogp.me/ns#type
    "article"@en-US
  http://ogp.me/ns#url
    "http://www.marklogic.com/"@en-US

WARNING: Unresolved terms: canonical shortcut external 0 1 2 3

As you can see, I decided that the actual triples about the unresolved terms weren't very useful, so I simply enumerate the terms that are unresolved.
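The enumeration relies on pulling the term name out of the rdfa:context message with substring-after and substring-before. Here is a small standalone sketch of just that step, using the message from the earlier output as sample input; it can be run against any input document:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="text"/>

  <!-- Sample rdfa:context message, copied from the output shown earlier -->
  <xsl:variable name="message"
                select="&quot;Can't resolve term publisher at 88:103&quot;"/>

  <xsl:template match="/">
    <!-- Everything after 'term ': "publisher at 88:103" -->
    <xsl:variable name="tail" select="substring-after($message, 'term ')"/>
    <!-- Everything before the first ' at': "publisher" -->
    <xsl:value-of select="substring-before($tail, ' at')"/>
  </xsl:template>

</xsl:stylesheet>
```

This depends entirely on the wording of the processor's message, so it is a convenience rather than anything robust: a different message format, or a term that itself contains " at", would produce the wrong name.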

So there. That's why I wrote it.