RDFa for DocBook?

Volume 12, Issue 30; 22 Sep 2009; last modified 08 Oct 2010

Adding RDFa to DocBook would make it possible to add a class of semantic annotations to DocBook without changing the schema. But is that a good idea?

Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information on it.

—Samuel Johnson

When Bob DuCharme introduced the semantic web track at XML Summer School this morning, he mentioned briefly the idea of adding RDFa to vocabularies other than (X)HTML. In particular, he's investigated how to do it in DocBook.

The DocBook TC gets periodic requests to add new inline elements and attributes for bits of metadata. Sometimes the requests are entirely legitimate, in the sense that they're clearly about technical documentation, but seem to apply to such a small audience that the TC is reluctant to add them to all of DocBook.

With this in mind, the idea of adding RDFa has some appeal: we add a few new attributes and henceforth users will be able to add new bits of metadata without having to change the DocBook schema.

But I'm not sure.

First, lots of DocBook elements have more discrete semantics than HTML elements. We don't need to say

<phrase property="dc:title">Beautiful Sunset</phrase>

because we have citetitle. We don't need to say:

<info>
  <bibliomisc>
    <phrase rel="mpc:editor" href="http://mypubco.com/empid/53234"/>
  </bibliomisc>
</info>

because we have

<info>
  <editor role="mpc:editor">
    <personname>Some Name</personname>
    <uri>http://mypubco.com/empid/53234</uri>
  </editor>
</info>

I'm not suggesting those are exactly the same (they clearly are not), but I'm comfortable that existing DocBook elements are sufficient for the task.

(Yes, you'd need a DocBook-specific tool to extract the metadata, which is a disadvantage, but you probably want one anyway for the existing DocBook semantics.)
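To give a rough idea of what such a DocBook-specific tool involves, here is a minimal sketch in Python using only the standard library. It reads the editor example above and returns the role, name, and URI. The function name `extract_editors` is made up for illustration, and the sample omits the DocBook namespace for brevity; a real DocBook 5 document would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The editor example from above, without a namespace (an assumption
# made to keep the sketch short).
DOCBOOK = """
<info>
  <editor role="mpc:editor">
    <personname>Some Name</personname>
    <uri>http://mypubco.com/empid/53234</uri>
  </editor>
</info>
"""


def extract_editors(xml_text):
    """Return a (role, name, uri) tuple for each editor element.

    Hypothetical helper: it knows about DocBook's editor, personname,
    and uri elements, which is exactly what makes it DocBook-specific.
    """
    root = ET.fromstring(xml_text)
    editors = []
    for editor in root.iter("editor"):
        role = editor.get("role")
        name = (editor.findtext("personname") or "").strip()
        uri = (editor.findtext("uri") or "").strip()
        editors.append((role, name, uri))
    return editors


print(extract_editors(DOCBOOK))
```

Every additional DocBook element that carries metadata would need its own rule of this kind, which is the disadvantage mentioned above.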

Second, it would allow you to construct statements with conflicting or, at best, odd semantics:

<section>
  <title property="dc:creator">Alice1</title>
  <para xml:id='p12'>This is from section 2.2.</para>
</section>

I can just about imagine a sense in which “Alice1” can be both the title of a section and the Dublin Core creator of the section, but it doesn't make a lot of sense.

Third, Bob's example seems to suggest that it would encourage markup like this:

<para about="/alice/posts/trouble_with_bob" xml:id='p15'>
  <phrase property="dc:title">The trouble with Bob2</phrase>
  <phrase property="dc:creator">Alice2</phrase>
</para>

which seems like a bad idea to me.
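For reference, the statements a generic reader would take from that markup can be sketched roughly as follows. This is a deliberate simplification, not a conforming RDFa processor: it assumes the subject is the nearest enclosing about attribute, it takes the object from content when present and from the element's text otherwise, and it ignores prefix resolution, rel/href, and datatypes. The function name `rough_triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The paragraph from above, copied as-is.
SAMPLE = """
<para about="/alice/posts/trouble_with_bob" xml:id='p15'>
  <phrase property="dc:title">The trouble with Bob2</phrase>
  <phrase property="dc:creator">Alice2</phrase>
</para>
"""


def rough_triples(xml_text, base_subject=""):
    """Collect (subject, property, object) tuples from RDFa-style attributes.

    Simplified on purpose: only about, property, and content are handled.
    """
    root = ET.fromstring(xml_text)
    found = []

    def walk(element, subject):
        # An about attribute sets the subject for this element and its children.
        subject = element.get("about", subject)
        prop = element.get("property")
        if prop is not None:
            obj = element.get("content")
            if obj is None:
                # Fall back to the element's text content.
                obj = "".join(element.itertext()).strip()
            found.append((subject, prop, obj))
        for child in element:
            walk(child, subject)

    walk(root, base_subject)
    return found


for triple in rough_triples(SAMPLE):
    print(triple)
```

Nothing in this sketch depends on the element names, so the same code would read the same attributes from phrase, bibliomisc, or title alike.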

On the other hand, some of the examples do seem useful for exactly the sort of case that motivated my interest in the first place:

<bibliomisc property="mpc:lastScreenShotDate" content="2009-08-01T15:31:00"/>
<bibliomisc property="mpc:softwareRelease"    content="3.1"/>

In fairness, Bob set out to recreate the triples from the original tutorial, so some of the markup choices were forced upon him.

So I'm not sure.

Comments

Whenever you have a general vocabulary, you have the opportunity to speak flanken petunia abwehr stoogling. Kemmer infamous prong that general vocabularies aren't worth having.

First, lots of DocBook elements have more discrete semantics than HTML elements. We don't need to say...

This is quite likely true for RDFa about the document (metadata) but may well not be true when the RDFa is being used to mark up the actual content of the document in a machine readable manner.

Second, it would allow you to construct statements with conflicting or, at best, odd semantics...

Show me a language which doesn't allow the construction of conflicting or odd statements.

Second [Third?], Bob's example seems to suggest that it would encourage markup like this: ... which seems like a bad idea to me.

Sorry, what's wrong with that markup? The paragraph is about a thing (named by a relative URI) which has a given title and creator name. Is it just the absence of "connective tissue" in the example text you object to or is there something more fundamental?

Somewhat tangentially, is there a definition anywhere for RDFa in SVG? If there isn't then I hope there soon will be. In general I think any XML format where individual parts (elements) can be said to be about real world things is a good candidate for RDFa.

>you'd need a DocBook-specific tool to extract the metadata

I'd rather not have to write new XSLT to handle every new document-type/metadata-format combination that comes along, which is probably why GRDDL never looked too attractive to me. A nice thing about the DocBook + RDFa combination is that you can continue to use the same tools that you used before, unmodified, to create HTML, PDF, etc. from your DocBook, and you can use pre-existing tools (e.g. Fabien Gandon's XSLT stylesheet) with no modification to extract the RDFa metadata.

The flexibility of RDFa means that there are both sensible and less sensible places to add those attributes. I'd forgotten about DocBook's editor element, and should have dug a little deeper before falling back on bibliomisc/phrase to name the document's editor. So much HTML usage of RDFa relies on the span element that to reproduce the examples from the RDFa Primer I immediately thought "What's the DocBook equivalent of that? Aha, phrase!" More knowledgeable use of DocBook could take better advantage of RDFa than I did there.

I agree that DocBook markup already carries the semantics of the document. It would be interesting to see a converter that maps DocBook semantics into XHTML+RDFa semantics, using Dublin Core.

That said, I do not see DocBook supporting all the ontologies that are around, for example chemical ontologies or FOAF. DocBook cannot ever support all domain ontologies. Then again, people could, of course, use matching domain schemata to add such semantics.

Still, I would say there are enough use cases to have DocBook support RDFa.

We've been using DocBook XML coupled with Cocoon since the early 2000s to publish our entire website. It has worked wonderfully, and we have added attributes to meet our specific needs at times. Much of the data needed to create XHTML + RDFa is already marked up in DocBook, so as long as we can get our transforms to create the RDFa too, we're good, as mentioned above. We can then build the semantic web apps off the generated XHTML + RDFa.

However, the issue comes in when we try to use the raw XML (DocBook) to create the semantic app. We're working on moving our content store to a native XML DB (eXist) and also turning that into a SPARQL endpoint. It seems that without RDF in the XML, as DocBook/RDF, our best search will only be what XQuery can give us out of our own content store. The idea is that SPARQL will give us the links and queryability to the external data sources we're looking for.

At present, I think that adding RDFa support would be of great benefit. There is metadata about content that DocBook doesn't encode, like workflow states, that would be great to have in the DocBook source rather than in an external record (where we currently keep it).

Also, DocBook topics can participate in a semantic web with RDFa. Domain-specific ontologies are used to locate information in a semantic web, and RDFa support in DocBook would enable these.

With DocBook 5 the RDFa can exist in the DocBook XML source, and can be used to query the DocBook topics about their workflow state and their location in the semantic web. If RDFa also flowed through to HTML, it would enable the same SPARQL-based navigation of the semantic web by end users and authors.