I'm on record as having concerns about the microformats approach to marking up data on the web. One of those concerns is validation. Can microformats be validated?
The Future Begins Tomorrow!
I'm on record as having concerns about the microformats approach to marking up data on the web. One of those concerns is validation, without which I assert that vast quantities of the data will be invalid and hence only marginally useful.
So what can we do about validation? I believe I've learned two things since the last time I considered this problem:
The element names are utterly irrelevant. It isn't a question of putting the <span>s in the right way, it's just a matter of getting the class attributes on elements in the right order. So if you want to put hCalendar on table rows and cells, go for it.
I was reminded that class attribute values aren't a single token, they're a token list.
The second point blows direct RELAX NG validation completely out of the water. My concerns about interleaving turn out to be unfounded and irrelevant: there's just no way to express a pattern that matches an attribute that contains some token. Direct W3C XML Schema validation is impossible too.
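To make the token-list point concrete, here is a minimal Python sketch of what "contains some token" means for a class attribute. The function names are hypothetical and are not part of any microformats tooling; the point is only that the test is set membership over whitespace-separated tokens, not string or substring equality.

```python
def class_tokens(value):
    """Split a class attribute value into its whitespace-separated tokens."""
    return value.split()


def has_class(value, name):
    """True if `name` is one of the tokens in the class attribute value."""
    return name in class_tokens(value)


if __name__ == "__main__":
    print(class_tokens("description item vcard"))
    print(has_class("fn org", "fn"))   # "fn" is one of two tokens
    print(has_class("fnord", "fn"))    # a substring match is not a token match
```

Any validator that works directly on the attribute has to do this parsing somewhere, which is the part a grammar alone can't express.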
In fact, the only direct validation technology that I know of that might work is Schematron. But the Schematron rules would be hard to write too because you have to express both what is and what isn't allowed, I think, and because you still have to mess with all that token parsing.
So let's imagine a future in which we have a standard XML pipeline language so we can easily chain a few processes together. Then we might employ the following strategy:
For each microformat that a document contains (which I hope is identified in a profile or something, otherwise I guess you just have to test them all):
Run a transformation that promotes class attribute values that are allowed in that microformat to element names. Discard irrelevant elements.
If the microformat supports arbitrary optimizations, such as hCard's rule about “n”:
If “fn” and “org” are not the same, and the value of the “fn” property is exactly two words (separated by whitespace), and there is no explicit “n” property, then the “n” property is inferred from the “fn” property.
perform those optimizations. This pipeline component clearly has to be arbitrarily powerful. I suggest that a good design principle for microformats would be to avoid such optimizations wherever practical.
Validate the resulting document with a grammar for the relevant microformat.
If all of the microformats validate, then the document is valid. I think it makes sense to check if the document is valid overall (according to an XHTML schema, for example), though I suppose that step is technically optional if you're willing to accept random well-formed XHTML element names.
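The hCard rule quoted in the steps above is a good example of why the optimization component has to be a general-purpose program. A minimal Python sketch of just that one rule might look like the following. The function name and the dictionary shape are hypothetical, and the mapping of the first word to the given name and the second to the family name is an assumption about the simple case; the real hCard specification has further exceptions that this sketch ignores.

```python
def infer_n(fn, org=None, n=None):
    """Return the "n" property, inferring it from "fn" when the quoted rule applies.

    Returns None when no "n" is given and none can be inferred.
    """
    if n is not None:
        return n                      # an explicit "n" always wins
    if org is not None and fn == org:
        return None                   # "fn" names an organization; infer nothing
    words = fn.split()
    if len(words) != 2:
        return None                   # the rule requires exactly two words
    # ASSUMPTION: first word is the given name, second is the family name.
    return {"given-name": words[0], "family-name": words[1]}


if __name__ == "__main__":
    print(infer_n("Jane Doe"))
    print(infer_n("Crepes on Cole", org="Crepes on Cole"))
    print(infer_n("Tantek"))
```

Even this small rule needs string comparison, tokenizing, and a conditional, none of which a grammar-driven transformation can supply by itself.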
One interesting observation is that the necessary transformation can (often? always?) be deduced from the grammar. In fact, I coded up an XSLT stylesheet that does this. It reads a RELAX NG grammar and an XML document and promotes class attribute values to element names. To test this, I constructed some toy grammars for hReview and hCard. (These are toy grammars; I don't assert that they're complete.)
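The promotion idea can be sketched in Python rather than XSLT. This is not the stylesheet described above: instead of reading a RELAX NG grammar it uses a hypothetical, hand-written table (ALLOWED) of which names may appear inside which, and it makes no attempt to reproduce the stylesheet's exact output. It does show the one-level context check: a class token is promoted only if it is allowed inside the nearest promoted ancestor. Other elements are dropped, but their text is kept.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a grammar: element name -> names allowed as children.
ALLOWED = {
    "reviews": {"hreview"},
    "hreview": {"summary", "reviewer", "dtreviewed", "item"},
    "item": {"fn"},
}


def append_text(out, text):
    """Append text at the current end of out's content."""
    if not text:
        return
    if len(out):  # text that follows the last child lives in that child's tail
        out[-1].tail = (out[-1].tail or "") + text
    else:
        out.text = (out.text or "") + text


def promote(src, out):
    """Copy src's content into out, promoting class tokens allowed inside out."""
    append_text(out, src.text)
    for child in src:
        tokens = child.get("class", "").split()
        names = [t for t in tokens if t in ALLOWED.get(out.tag, set())]
        if names:
            for name in names:  # one output element per allowed token
                el = ET.SubElement(out, name)
                for key, value in child.attrib.items():
                    if key != "class":
                        el.set(key, value)
                promote(child, el)
        else:  # the element is discarded but its content is kept
            promote(child, out)
        append_text(out, child.tail)


def prevalidate(xhtml, root_name="reviews"):
    """Return the promoted document, wrapped in root_name, as a string."""
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xhtml))
    out = ET.Element(root_name)
    promote(wrapper, out)
    return ET.tostring(out, encoding="unicode")


if __name__ == "__main__":
    doc = ('<div class="hreview"><h4 class="summary">Good</h4>'
           '<span class="reviewer vcard"><span class="fn">Tantek</span></span>'
           '<div class="item vcard"><span class="fn org">Crepes on Cole</span></div>'
           '</div>')
    print(prevalidate(doc))
```

In this sketch the reviewer's "fn" is not promoted, because the table does not allow "fn" inside "reviewer", while the item's "fn" is.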
Consider the first restaurant review example in the hReview specification:
<div class="hreview">
 <span><span class="rating">5</span> out of 5 stars</span>
 <h4 class="summary">Crepes on Cole is awesome</h4>
 <span class="reviewer vcard">Reviewer: <span class="fn">Tantek</span> -
 <abbr class="dtreviewed" title="20050418T2300-0700">April 18, 2005</abbr></span>
 <div class="description item vcard"><p>
  <span class="fn org">Crepes on Cole</span> is one of the best little
  creperies in <span class="adr"><span class="locality">San Francisco</span></span>.
  Excellent food and service. Plenty of tables in a variety of sizes
  for parties large and small. Window seating makes for excellent
  people watching to/from the N-Judah which stops right outside.
  I've had many fun social gatherings here, as well as gotten
  plenty of work done thanks to neighborhood WiFi.
 </p></div>
 <p>Visit date: <span>April 2005</span></p>
 <p>Food eaten: <span>Florentine crepe</span></p>
</div>
If we apply the pre-validation stylesheet to this document with the hReview grammar, we get:
<reviews><hreview>
 <rating>5</rating> out of 5 stars
 <summary>Crepes on Cole is awesome</summary>
 <reviewer>Reviewer: -
 </reviewer><dtreviewed title="20050418T2300-0700">April 18, 2005</dtreviewed>
 <description>
  is one of the best little
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes
  for parties large and small. Window seating makes for excellent
  people watching to/from the N-Judah which stops right outside.
  I've had many fun social gatherings here, as well as gotten
  plenty of work done thanks to neighborhood WiFi.
 </description><item>
  <fn>Crepes on Cole</fn> is one of the best little
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes
  for parties large and small. Window seating makes for excellent
  people watching to/from the N-Judah which stops right outside.
  I've had many fun social gatherings here, as well as gotten
  plenty of work done thanks to neighborhood WiFi.
 </item>
 Visit date: April 2005
 Food eaten: Florentine crepe
</hreview></reviews>
Applying the same transformation with the hCard schema produces:
<bag-of-cards><vcard>Reviewer: <fn>Tantek</fn> -
 </vcard><vcard>
  <fn>Crepes on Cole</fn><org>Crepes on Cole</org> is one of the best little
  creperies in .
  Excellent food and service. Plenty of tables in a variety of sizes
  for parties large and small. Window seating makes for excellent
  people watching to/from the N-Judah which stops right outside.
  I've had many fun social gatherings here, as well as gotten
  plenty of work done thanks to neighborhood WiFi.
 </vcard></bag-of-cards>
It's very hard to tell what text can be discarded, so we wind up with a bunch of extra text. I think that's ok; we can make all of our grammar patterns mixed content.
The output of each of the preceding transformations is valid according to my manufactured schemas for the respective microformat.
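For readers without a RELAX NG validator to hand, the final validation step can be approximated with a much cruder check. The Python sketch below is a stand-in, not a grammar: MODEL is a hypothetical table of required and optional child names, text is ignored (so every pattern is effectively mixed content), and order is not checked. Treating "fn" as required inside "item" is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> (required children, optional children).
MODEL = {
    "hreview": ({"item"},
                {"rating", "summary", "reviewer", "dtreviewed", "description"}),
    "item": ({"fn"}, set()),  # ASSUMPTION: an item must carry an "fn"
}


def validate(el, path=""):
    """Return a list of error strings for el and its descendants."""
    errors = []
    here = f"{path}/{el.tag}"
    if el.tag in MODEL:
        required, optional = MODEL[el.tag]
        present = {child.tag for child in el}
        for name in sorted(required - present):
            errors.append(f"{here}: missing required <{name}>")
        for name in sorted(present - required - optional):
            errors.append(f"{here}: unexpected <{name}>")
    for child in el:
        errors.extend(validate(child, here))
    return errors


if __name__ == "__main__":
    good = ET.fromstring(
        "<reviews><hreview><rating>5</rating> out of 5 stars"
        "<item><fn>Crepes on Cole</fn></item></hreview></reviews>")
    bad = ET.fromstring(
        "<reviews><hreview><summary>No item here</summary></hreview></reviews>")
    print(validate(good))
    print(validate(bad))
```

A real grammar would also constrain order, cardinality, and attribute values such as the date in "dtreviewed"; this only shows where a missing "item" would surface.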
Another possibility would be to simply discard all the text:
<reviews>
 <hreview>
  <rating/>
  <summary/>
  <reviewer/>
  <dtreviewed title="20050418T2300-0700"/>
  <description/>
  <item>
   <fn/>
  </item>
 </hreview>
</reviews>
I don't know. On the whole, I'm going to guess it makes more sense to preserve the (sometimes extra) text. That way your schema could test for content too when appropriate.
There are still some hard issues though:
There's the whole optimization step. If a special-purpose component is required for that, maybe that component ought to just do the whole validation task.
Because class values are simple strings (and not, for example, in a namespace), there's the possibility of ambiguity. The transformation can't promote all the possibly legal class values indiscriminately; it has to consider the context. You can see this in practice in the example above. Both hCard and hReview allow the “fn” class, but they do so in different places. If the “fn” from the hCard (Tantek) were left in the hReview transformation, it would be in the wrong place.
At the moment, the transformation only looks up one level (is “fn” allowed in “reviewer”?). More complex context checking might be required in general.
Suppressing elements where they aren't allowed means that if their presence is actually a markup error, this approach won't catch the error.
This sketch doesn't attempt to address nested microformats (though perhaps it could be extended to do so). The “reviewer” in hReview should be an hCard, but that isn't checked.
In short, this approach works sometimes (it does detect an “hreview” with a missing “item”, for example), but I'm still not satisfied. And I remain convinced that this problem has to be solved before microformats can be considered a reliable way to encode data.