<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="5.0" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<info>
<title>Validating microformats</title>
<biblioid class="uri">http://norman.walsh.name/2006/04/13/validatingMicroformats</biblioid>
<volumenum>9</volumenum>
<issuenum>43</issuenum>
<pubdate>2006-04-13T08:46:14-04:00</pubdate>
<date>$Date: 2006-04-13 13:10:56 -0400 (Thu, 13 Apr 2006) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2006</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>I'm on record as having concerns about the microformats approach
to marking up data on the web. One of those concerns is validation.
Can microformats be validated?</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#Microformats"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#TheWeb"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#XML"/>
</info>

<epigraph>
<attribution>Yoyodyne Propulsion Systems</attribution>
<para xml:id="p2">The Future Begins Tomorrow!</para>
</epigraph>

<para xml:id="p1">I'm on record as having concerns about the
<link xlink:href="http://en.wikipedia.org/wiki/Microformats">microformats</link>
approach to marking up data on the web. One of those
concerns is validation: without it, I assert, vast quantities of
the data will be invalid and hence only marginally useful.</para>

<para xml:id="p3">So what can we do about validation? I believe I've learned two things
since the
<link xlink:href="http://norman.walsh.name/2005/09/05/microformats#p11">last
time</link> I considered this problem:</para>

<orderedlist>
<listitem>
<para xml:id="p4">The element names are utterly irrelevant. It isn't a question of
putting the <tag class="attribute">class</tag> attributes on
<tag>div</tag>s and <tag>span</tag>s in the right way; it's just a
matter of getting the <tag class="attribute">class</tag> attributes on
elements in the right order. So if you want to put
<link xlink:href="http://en.wikipedia.org/wiki/HCalendar">hCalendar</link>
on table rows and cells,
<link xlink:href="/2006/itinerary/06-19-xsl">go for it</link>.</para>
</listitem>
<listitem>
<para xml:id="p5">I was reminded that <tag class="attribute">class</tag> attribute
values aren't a single token; they're a whitespace-separated list of tokens.</para>
</listitem>
</orderedlist>
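
<para>To make the second point concrete, here is a minimal Python sketch
that treats a <tag class="attribute">class</tag> value as a token list
rather than as a single string. The function names
<code>class_tokens</code> and <code>has_class</code> are hypothetical,
chosen only for this illustration.</para>

<programlisting language="python">def class_tokens(value):
    """Split a class attribute value into its whitespace-separated tokens."""
    return value.split()


def has_class(value, name):
    """Return True if name is one of the tokens in the class value."""
    return name in class_tokens(value)


# Sample values of the kind found in hReview markup.
print(class_tokens("description item vcard"))
print(has_class("reviewer vcard", "vcard"))
print(has_class("hreview", "review"))</programlisting>

<para>The last call is the interesting one: a plain substring test would
find “review” inside “hreview”, whereas a token test does not. Any
validation approach has to do the equivalent of this split before it can
ask whether a particular class is present.</para>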

<para xml:id="p6">The second point blows direct
<link xlink:href="http://en.wikipedia.org/wiki/RELAX_NG">RELAX NG</link>
validation completely
out of the water. My concerns about interleaving turn out to be unfounded
and irrelevant: there's just no way to express a pattern that matches
an attribute that contains some token. Direct
<link xlink:href="http://en.wikipedia.org/wiki/XML_Schema">W3C XML Schema</link>
validation is impossible too.</para>

<para xml:id="p7">In fact, the only direct validation technology that I know of that
might work is
<link xlink:href="http://en.wikipedia.org/wiki/Schematron">Schematron</link>.
But the Schematron rules would be hard to write too, because
you have to express both what is and what isn't allowed (I think) and
because you still have to mess with all that token parsing.</para>

<para xml:id="p8">So let's imagine a future in which we have a standard
<link xlink:href="http://xproc.org/">XML pipeline language</link>
so we can easily chain a few processes together. Then we might employ
the following strategy:</para>

<para xml:id="p9">For each microformat that a document contains (which I hope
is identified in a profile or something; otherwise I guess you just
have to test them all):</para>

<orderedlist>
<listitem>
<para xml:id="p10">Run a transformation that promotes the
<tag class="attribute">class</tag> attribute values allowed in that
microformat to element names. Discard irrelevant elements.</para>
</listitem>
<listitem>
<para xml:id="p11">If the microformat supports arbitrary optimizations, such as
<link xlink:href="http://en.wikipedia.org/wiki/HCard">hCard</link>'s
rule about “n”:</para>
<blockquote>
<para xml:id="p12">If “fn” and “org”
are not the same, and the value of the “fn”
property is exactly two words (separated by whitespace), and there is
no explicit “n” property, then the “n” property is inferred from the
“fn” property.</para>
</blockquote>
<para xml:id="p13">perform those optimizations. This pipeline component
clearly has to be arbitrarily powerful. I suggest that a good design
principle for microformats would be to avoid such optimizations wherever
practical.
</para>
</listitem>
<listitem>
<para xml:id="p14">Validate the resulting document with a grammar for the relevant
microformat.</para>
</listitem>
</orderedlist>
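
<para>As a rough illustration of why the second step needs
general-purpose code, the hCard rule quoted above might be sketched in
Python as follows. The function name <code>infer_n</code>, its
parameters, and the mapping of the first word to the given name and the
second to the family name are all assumptions made for this sketch; the
hCard specification may define further special cases that it
ignores.</para>

<programlisting language="python">def infer_n(fn, org=None, n=None):
    """Return the explicit or inferred "n" value, or None if there is neither."""
    if n is not None:
        return n  # an explicit "n" always wins
    if fn == org:
        return None  # "fn" and "org" are the same, so nothing is inferred
    words = fn.split()
    if len(words) != 2:
        return None  # the rule only covers exactly two words
    # Assumption: the first word is the given name, the second the family name.
    given, family = words
    return {"given-name": given, "family-name": family}


print(infer_n("Norman Walsh"))
print(infer_n("Crepes on Cole", org="Crepes on Cole"))
print(infer_n("Tantek"))</programlisting>

<para>Only the first call infers anything. Even this small rule involves
string comparison, word counting, and a conditional default, which is
more than a grammar can express.</para>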

<para xml:id="p15">If all of the microformats validate, then the document is
valid. I think it makes sense to also check whether the document is valid
overall (according to an XHTML schema, for example), though I suppose
that step is technically optional if you're willing to accept random
well-formed XHTML element names.</para>

<para xml:id="p16">One interesting observation is that the necessary transformation
can (often? always?) be deduced from the grammar. In fact, I coded up
<link xlink:href="examples/microval.xsl">an XSLT stylesheet</link>
that does this. It reads a RELAX NG grammar and an XML document and
transforms <tag class="attribute">class</tag> attribute values to
element names. To test this, I constructed some toy grammars for
<link xlink:href="examples/hcal.rng">hCalendar</link>,
<link xlink:href="examples/hcard.rng">hCard</link>, and
<link xlink:href="examples/hreview.rng">hReview</link>.
(<emphasis>These are toy grammars; I don't assert that they're complete
or correct.</emphasis>)</para>
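
<para>The stylesheet itself is XSLT. As a separate illustration of the
idea, here is a much-simplified Python sketch of the promotion step. It
is not a port of the stylesheet: the <code>ALLOWED</code> table is
written by hand rather than derived from a grammar, covers only a few
hReview names, looks up only one level of context, and keeps the text of
elements it suppresses. All names in it are hypothetical.</para>

<programlisting language="python">import xml.etree.ElementTree as ET

# Hypothetical, hand-written context table: parent name to the class names
# that may be promoted directly inside it. None means "outside any microformat".
ALLOWED = {
    None: {"hreview"},
    "hreview": {"rating", "summary", "reviewer", "dtreviewed",
                "description", "item"},
    "item": {"fn"},
}


def append_text(target, text):
    """Append text to target, after its last child if it already has one."""
    if not text:
        return
    if len(target):
        target[-1].tail = (target[-1].tail or "") + text
    else:
        target.text = (target.text or "") + text


def copy_children(source, target, context):
    """Copy the text and child elements of source into target."""
    append_text(target, source.text)
    for child in source:
        promote(child, target, context)
        append_text(target, child.tail)


def promote(source, target, context):
    """Promote the class tokens of source that are allowed in this context."""
    tokens = source.get("class", "").split()
    names = [t for t in tokens if t in ALLOWED.get(context, ())]
    if names:
        # One new element per allowed token, each with its own copy of the content.
        for name in names:
            copy_children(source, ET.SubElement(target, name), name)
    else:
        # Not allowed here: drop the element itself but keep walking its content.
        copy_children(source, target, context)


# Build a cut-down version of an hReview without parsing any markup.
review = ET.Element("div", {"class": "hreview"})
reviewer = ET.SubElement(review, "span", {"class": "reviewer vcard"})
reviewer.text = "Reviewer: "
ET.SubElement(reviewer, "span", {"class": "fn"}).text = "Tantek"
item = ET.SubElement(review, "div", {"class": "description item vcard"})
ET.SubElement(item, "span", {"class": "fn org"}).text = "Crepes on Cole"

result = ET.Element("reviews")
promote(review, result, None)
print(ET.tostring(result, encoding="unicode"))</programlisting>

<para>Because “fn” is listed only under “item” in the table, the
reviewer's “fn” is not promoted while the item's is. An element carrying
several allowed tokens, such as “description item”, yields one promoted
element per token.</para>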

<para xml:id="p17">Consider the first
<link xlink:href="http://microformats.org/wiki/hreview#Restaurant_reviews">restaurant
review</link> example in the hReview specification:</para>

<programlisting>&lt;div class="hreview"&gt;
 &lt;span&gt;&lt;span class="rating"&gt;5&lt;/span&gt; out of 5 stars&lt;/span&gt;
 &lt;h4 class="summary"&gt;Crepes on Cole is awesome&lt;/h4&gt;
 &lt;span class="reviewer vcard"&gt;Reviewer: &lt;span class="fn"&gt;Tantek&lt;/span&gt; - 
 &lt;abbr class="dtreviewed" title="20050418T2300-0700"&gt;April 18, 2005&lt;/abbr&gt;&lt;/span&gt;
 &lt;div class="description item vcard"&gt;&lt;p&gt;
  &lt;span class="fn org"&gt;Crepes on Cole&lt;/span&gt; is one of the best little 
  creperies in &lt;span class="adr"&gt;&lt;span class="locality"&gt;San Francisco&lt;/span&gt;&lt;/span&gt;.
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 &lt;/p&gt;&lt;/div&gt;
 &lt;p&gt;Visit date: &lt;span&gt;April 2005&lt;/span&gt;&lt;/p&gt;
 &lt;p&gt;Food eaten: &lt;span&gt;Florentine crepe&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;</programlisting>

<para xml:id="p18">If we apply the pre-validation stylesheet to this document with
the hReview grammar, we get:</para>

<programlisting>&lt;reviews&gt;&lt;hreview&gt;
 &lt;rating&gt;5&lt;/rating&gt; out of 5 stars
 &lt;summary&gt;Crepes on Cole is awesome&lt;/summary&gt;
 &lt;reviewer&gt;Reviewer:  - 
 &lt;/reviewer&gt;&lt;dtreviewed title="20050418T2300-0700"&gt;April 18, 2005&lt;/dtreviewed&gt;
 &lt;description&gt;
   is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 &lt;/description&gt;&lt;item&gt;
  &lt;fn&gt;Crepes on Cole&lt;/fn&gt; is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 &lt;/item&gt;
 Visit date: April 2005
 Food eaten: Florentine crepe
&lt;/hreview&gt;&lt;/reviews&gt;</programlisting>

<para xml:id="p19">Applying the same transformation with the hCard grammar produces:</para>

<programlisting>&lt;bag-of-cards&gt;&lt;vcard&gt;Reviewer: &lt;fn&gt;Tantek&lt;/fn&gt; - 
 &lt;/vcard&gt;&lt;vcard&gt;
  &lt;fn&gt;Crepes on Cole&lt;/fn&gt;&lt;org&gt;Crepes on Cole&lt;/org&gt; is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 &lt;/vcard&gt;&lt;/bag-of-cards&gt;</programlisting>

<para xml:id="p20">It's very hard to tell what text can be discarded, so we wind up
with a bunch of extra text. I think that's ok; we can make every
pattern in our grammars allow mixed content.</para>

<para xml:id="p21">The output of each of the preceding transformations is valid
according to my toy grammar for the respective
microformat.</para>
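
<para>A real run would hand the promoted document to a RELAX NG
validator. Python's standard library doesn't include one, so the
following sketch is only a stand-in that checks for required children.
The <code>REQUIRED</code> table and the function names are hypothetical,
and the single rule it encodes (an “hreview” needs an “item”) is just
one example of what a grammar would enforce.</para>

<programlisting language="python">import xml.etree.ElementTree as ET

# Hypothetical toy rule set: promoted element name to the children it must have.
REQUIRED = {"hreview": {"item"}}


def missing_children(element):
    """Return the sorted names of required children that element lacks."""
    present = {child.tag for child in element}
    return sorted(REQUIRED.get(element.tag, set()) - present)


def validate(root):
    """Return one error message per missing required child in the tree."""
    errors = []
    for element in root.iter():
        for name in missing_children(element):
            errors.append(element.tag + " is missing " + name)
    return errors


good = ET.Element("hreview")
ET.SubElement(good, "item")
bad = ET.Element("hreview")
ET.SubElement(bad, "summary")

print(validate(good))
print(validate(bad))</programlisting>

<para>The first tree produces no errors and the second produces one. A
grammar does far more than this, of course, constraining order,
repetition, and content as well.</para>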

<para xml:id="p22">Another possibility would be to simply discard all the text:</para>

<programlisting>&lt;reviews&gt;
   &lt;hreview&gt;
      &lt;rating/&gt;
      &lt;summary/&gt;
      &lt;reviewer/&gt;
      &lt;dtreviewed title="20050418T2300-0700"/&gt;
      &lt;description/&gt;
      &lt;item&gt;
         &lt;fn/&gt;
      &lt;/item&gt;
   &lt;/hreview&gt;
&lt;/reviews&gt;</programlisting>

<para xml:id="p23">I don't know. On the whole, I'm going to guess it makes more
sense to preserve the (sometimes extra) text. That way your schema
could test for content too when appropriate.</para>
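
<para>For comparison, discarding the text is the easy variant. A minimal
Python sketch, assuming the promoted document is held as an ElementTree
tree (the function name <code>discard_text</code> is hypothetical):</para>

<programlisting language="python">import xml.etree.ElementTree as ET


def discard_text(root):
    """Remove all character data from root and its descendants, in place."""
    for element in root.iter():
        element.text = None
        element.tail = None
    return root


reviews = ET.Element("reviews")
hreview = ET.SubElement(reviews, "hreview")
rating = ET.SubElement(hreview, "rating")
rating.text = "5"
rating.tail = " out of 5 stars"
reviewed = ET.SubElement(hreview, "dtreviewed", {"title": "20050418T2300-0700"})
reviewed.text = "April 18, 2005"

discard_text(reviews)
print(ET.tostring(reviews, encoding="unicode"))</programlisting>

<para>Attributes such as <tag class="attribute">title</tag> survive, so
a grammar could still check them, but nothing remains for it to say
about element content.</para>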

<para xml:id="p24">There are still some hard issues though:</para>

<orderedlist>
<listitem>
<para xml:id="p25">There's the whole optimization step. If a special-purpose component
is required for that, maybe that component ought to just do the whole
validation task.</para>
</listitem>
<listitem>
<para xml:id="p26">Because class values are simple strings (and not, for example,
in a namespace), there's the possibility of ambiguity. The
transformation can't promote all the possibly legal class values
indiscriminately; it has to consider the context. You can see this in
practice in the example above. Both hCard and hReview allow the “fn”
class, but they do so in different places. If the “fn” from the hCard
(Tantek) were left in the hReview transformation, it would be in the
wrong place.</para>

<para xml:id="p27">At the moment, the transformation only looks up one level (is
“fn” allowed in “reviewer”?). More complex context checking might be
required in general.</para>
</listitem>

<listitem>
<para xml:id="p28">Suppressing elements where they aren't allowed means that if
their presence <emphasis>is actually</emphasis> a markup error, this
approach won't catch the error.</para>
</listitem>
<listitem>
<para xml:id="p29">This sketch doesn't attempt to address nested microformats
(though perhaps it could be extended to do so).
The “reviewer” in hReview should <emphasis>be</emphasis> an
hCard, but that isn't checked.</para>
</listitem>
</orderedlist>

<para xml:id="p30">In short, this approach works sometimes (it does detect an
“hreview” with a missing “item”, for example), but I'm still not
satisfied. And I remain convinced that this problem
has to be solved before microformats can be considered a reliable way to
encode data.</para>

</essay>

