Validating microformats

Volume 9, Issue 43; 13 Apr 2006; last modified 08 Oct 2010

I'm on record as having concerns about the microformats approach to marking up data on the web. One of those concerns is validation. Can microformats be validated?

The Future Begins Tomorrow!

—Yoyodyne Propulsion Systems

I'm on record as having concerns about the microformats approach to marking up data on the web. One of those concerns is validation, without which I assert that vast quantities of the data will be invalid and hence only marginally useful.

So what can we do about validation? I believe I've learned two things since the last time I considered this problem:

The element names are utterly irrelevant. It isn't a question of putting the class attributes on divs and spans in the right way; it's just a matter of getting the class attributes on elements in the right order. So if you want to put hCalendar on table rows and cells, go for it.
I was reminded that class attribute values aren't a single token; they're a token list.
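To make the token-list point concrete, here is a minimal Python sketch. The function names are purely illustrative, and it assumes tokens are separated by arbitrary whitespace. Testing for a class means testing for membership in a list, not comparing or substring-matching the whole attribute value.

```python
def class_tokens(value):
    """Split a class attribute value into its whitespace-separated tokens."""
    return value.split()


def has_class(value, name):
    """True if `name` is one of the tokens in the class attribute value."""
    return name in class_tokens(value)


print(has_class("description item vcard", "item"))  # True
print(has_class("reviewer vcard", "card"))          # False: "card" is only a substring
```

A naive substring test would wrongly accept the second case, which is why the token parsing can't be skipped.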

The second point blows direct RELAX NG validation completely out of the water. My concerns about interleaving turn out to be unfounded and irrelevant: there's just no way to express a pattern that matches an attribute that contains some token. Direct W3C XML Schema validation is impossible too.

In fact, the only direct validation technology that I know of that might work is Schematron. But the Schematron rules would be hard to write too because you have to express both what is and what isn't allowed, I think, and because you still have to mess with all that token parsing.

So let's imagine a future in which we have a standard XML pipeline language so we can easily chain a few processes together. Then we might employ the following strategy:

For each microformat that a document contains (which I hope is identified in a profile or something, otherwise I guess you just have to test them all):

Run a transformation that promotes class attribute values that are allowed in that microformat to element names. Discard irrelevant elements.
If the microformat supports arbitrary optimizations, such as hCard's rule about “n”:

If “fn” and “org” are not the same, and the value of the “fn” property is exactly two words (separated by whitespace), and there is no explicit “n” property, then the “n” property is inferred from the “fn” property.

perform those optimizations. This pipeline component clearly has to be arbitrarily powerful. I suggest that a good design principle for microformats would be to avoid such optimizations wherever practical.
Validate the resulting document with a grammar for the relevant microformat.

If all of the microformats validate, then the document is valid. I think it makes sense to check if the document is valid overall (according to an XHTML schema, for example), though I suppose that step is technically optional if you're willing to accept random well-formed XHTML element names.
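The hCard rule about “n” quoted above gives a sense of what the optimization step has to do. The following Python sketch is only an approximation: the function name and the returned dictionary are hypothetical, and it assumes that the first word of “fn” is the given name and the second the family name. The real hCard rule has further cases that this ignores.

```python
def implied_n(fn, org=None, explicit_n=None):
    """Return the "n" property for a card, inferring it from "fn" when allowed.

    Returns None when nothing can be inferred.
    """
    if explicit_n is not None:
        return explicit_n          # an explicit "n" always wins
    if fn == org:
        return None                # "fn" names an organization, not a person
    words = fn.split()
    if len(words) != 2:
        return None                # the rule applies only to exactly two words
    given, family = words          # assumption: "given family" word order
    return {"given-name": given, "family-name": family}
```

Even this small rule needs string handling and conditional logic that a grammar can't express, which is why the component that performs it has to be a general-purpose one.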

One interesting observation is that the necessary transformation can (often? always?) be deduced from the grammar. In fact, I coded up an XSLT stylesheet that does this. It reads a RELAX NG grammar and an XML document and transforms class attribute values to element names. To test this, I constructed some toy grammars for hCalendar, hCard, and hReview. (These are toy grammars; I don't assert that they're complete or correct.)
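The promotion step can also be sketched in Python. This is not the stylesheet itself. The `ALLOWED` table is a hypothetical stand-in for what would be derived from a RELAX NG grammar, and `promote` is an illustrative name. To stay short, the sketch drops all text and checks each class token only against the nearest promoted ancestor.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy table: promoted parent name -> class names allowed inside it.
# The key None stands for "outside any promoted element".
ALLOWED = {
    None: {"hreview"},
    "hreview": {"rating", "summary", "reviewer", "dtreviewed",
                "description", "item"},
    "item": {"fn"},
}


def promote(elem, context=None):
    """Return new elements named after elem's class tokens allowed in context."""
    allowed = ALLOWED.get(context, set())
    names = [t for t in elem.get("class", "").split() if t in allowed]
    if not names:
        # Irrelevant element: discard it, but keep looking inside it.
        found = []
        for child in elem:
            found.extend(promote(child, context))
        return found
    # Carry over attributes such as title, but not class itself.
    attrs = {k: v for k, v in elem.attrib.items() if k != "class"}
    result = []
    for name in names:
        new = ET.Element(name, attrs)
        for child in elem:
            new.extend(promote(child, name))
        result.append(new)
    return result
```

Calling `promote` on the parsed root of a document returns a list of promoted elements. An element with several relevant tokens, such as `class="description item vcard"`, yields one new element per token.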

Consider the first restaurant review example in the hReview specification:

<div class="hreview">
 <span><span class="rating">5</span> out of 5 stars</span>
 <h4 class="summary">Crepes on Cole is awesome</h4>
 <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> - 
 <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
 <div class="description item vcard"><p>
  <span class="fn org">Crepes on Cole</span> is one of the best little 
  creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 </p></div>
 <p>Visit date: <span>April 2005</span></p>
 <p>Food eaten: <span>Florentine crepe</span></p>
</div>

If we apply the pre-validation stylesheet to this document with the hReview grammar, we get:

<reviews><hreview>
 <rating>5</rating> out of 5 stars
 <summary>Crepes on Cole is awesome</summary>
 <reviewer>Reviewer:  - 
 </reviewer><dtreviewed title="20050418T2300-0700">April 18, 2005</dtreviewed>
 <description>
   is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 </description><item>
  <fn>Crepes on Cole</fn> is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 </item>
 Visit date: April 2005
 Food eaten: Florentine crepe
</hreview></reviews>

Applying the same transformation with the hCard schema produces:

<bag-of-cards><vcard>Reviewer: <fn>Tantek</fn> - 
 </vcard><vcard>
  <fn>Crepes on Cole</fn><org>Crepes on Cole</org> is one of the best little 
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes 
  for parties large and small.  Window seating makes for excellent 
  people watching to/from the N-Judah which stops right outside.  
  I've had many fun social gatherings here, as well as gotten 
  plenty of work done thanks to neighborhood WiFi.
 </vcard></bag-of-cards>

It's very hard to tell what text can be discarded, so we wind up with a bunch of extra text. I think that's OK; we can make all of our grammar patterns mixed content.

The output of each of the preceding transformations is valid according to my manufactured schemas for the respective microformat.
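As an illustration of what checking a promoted document against a toy grammar might look like, here is a Python sketch. The `MODEL` table is a hypothetical content model, not taken from the hReview specification and not one of the RELAX NG schemas mentioned above. Text is ignored throughout, which corresponds to treating every pattern as mixed content.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> (required children, optional children).
MODEL = {
    "hreview": ({"item"},
                {"rating", "summary", "reviewer", "dtreviewed", "description"}),
    "item": (set(), {"fn"}),
}


def validate(elem):
    """Return a list of error messages; an empty list means the tree is valid."""
    errors = []
    required, optional = MODEL.get(elem.tag, (set(), set()))
    present = [child.tag for child in elem]
    for name in sorted(required - set(present)):
        errors.append(f"<{elem.tag}> is missing required <{name}>")
    for name in present:
        if name not in required | optional:
            errors.append(f"<{name}> is not allowed in <{elem.tag}>")
    for child in elem:
        errors.extend(validate(child))
    return errors
```

With this model, a promoted `hreview` that has no `item` child is reported as an error, while any amount of leftover text is accepted.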

Another possibility would be to simply discard all the text:

<reviews>
   <hreview>
      <rating/>
      <summary/>
      <reviewer/>
      <dtreviewed title="20050418T2300-0700"/>
      <description/>
      <item>
         <fn/>
      </item>
   </hreview>
</reviews>

I don't know. On the whole, I'm going to guess it makes more sense to preserve the (sometimes extra) text. That way your schema could test for content too when appropriate.

There are still some hard issues though:

There's the whole optimization step. If a special-purpose component is required for that, maybe that component ought to just do the whole validation task.
Because class values are simple strings (and not, for example, in a namespace), there's the possibility of ambiguity. The transformation can't promote all the possibly legal class values indiscriminately; it has to consider the context. You can see this in practice in the example above. Both hCard and hReview allow the “fn” class, but they do so in different places. If the “fn” from the hCard (Tantek) were left in the hReview transformation, it would be in the wrong place.

At the moment, the transformation only looks up one level (is “fn” allowed in “reviewer”?). More complex context checking might be required in general.
Suppressing elements where they aren't allowed means that if their presence is actually a markup error, this approach won't catch the error.
This sketch doesn't attempt to address nested microformats (though perhaps it could be extended to do so). The “reviewer” in hReview should be an hCard, but that isn't checked.
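The ambiguity in the second issue is easier to see if you list the context in which each “fn” appears. The Python sketch below uses an illustrative function name and makes no claim about how a real processor should resolve the ambiguity. It records the class tokens of every ancestor of each element that carries a given token.

```python
import xml.etree.ElementTree as ET


def contexts(elem, name, stack=()):
    """Yield, for each element carrying class token `name`, the class tokens
    of its ancestors, outermost first. Ancestors without a class are skipped."""
    tokens = elem.get("class", "").split()
    if name in tokens:
        yield stack
    inner = stack + (tuple(tokens),) if tokens else stack
    for child in elem:
        yield from contexts(child, name, inner)
```

Applied to the review example, the first “fn” sits inside `reviewer vcard` and the second inside `description item vcard`. Deciding which microformat each one belongs to takes more than a one-level lookup when an ancestor carries tokens from several microformats at once.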

In short, this approach works sometimes (it does detect an “hreview” with a missing “item”, for example), but I'm still not satisfied. And I remain convinced that this problem has to be solved before microformats can be considered a reliable way to encode data.

Comments

Not completely related, but the XML version is easier to read than the XHTML version. Maybe just a question of presentation of the markup.

The more I think about it, the more I am convinced that Schematron is really well fitted to validate microformats.

Eric

Nice work. I'm not sure what can be done about the optimisations (there must be some neat way...) but re. the ambiguities - the general µF strategy is to avoid clashes by prior agreement, so I would think that is how the "fn" issue could be dealt with - tweaking the specs/profiles as needed. (Hoping to have a play with Schematron later ;-)

PS. I've just been looking at the specs around the "fn" issue, it doesn't seem ambiguous. Use of hCard inside hReview is encouraged, but I would imagine some kind of stack for context would be enough - er, but what about interleaving..?

http://microformats.org/wiki/hcard

http://microformats.org/wiki/hreview

The ambiguity issue arises when different microformats reuse class values. Consider the case where you're looking through a document for hReview markup. If you find an "fn", it might be because you're in an "item" in the "hreview" or it might be because you're inside some "hcard" (which might or might not be inside the "hreview"). You can't tell what kind of "fn" it is without looking around.

Could the XSLT step be avoided by using a custom datatype that took the class name as a parameter and accepted any list of tokens that contains the class name? Something like:

attribute class {
    foo:class {
        name = 'vevent'
    }
}

…which would match class='foo vevent bar' but would not match class='foo bar'.

Also, how did you decide what to write in the schemas? I tried to extract conformance criteria from the hCalendar and hCard spec, but I did not find any criteria to extract.

    element * {
        attribute class {
            xsd:token { pattern = "(.+\s)?foo(\s.+)?" }
        }
    }

I have written a more detailed explanation of the reason why RELAX NG cannot be used to validate microformats, how Schematron could be used and how Schematron schemas could be generated from your RELAX NG schemas.

Hi Norm. I read this a while back, but it didn't click until now that what you've done here is what I described in my blog post on Domain Specific Intermediate Languages. In my case, I was working from XMI (UML in XML), which is not much less generic than an HTML microformat. Looks like we both came to the same conclusion (as did Sean McGrath in his CSW XML Summer School talk last year) that an intermediate XML format is a very useful filter in processing data that is contained in an unconstrained, generic format.