Supporting Microformats

Volume 8, Issue 116; 05 Sep 2005; last modified 08 Oct 2010

Microformats, a technique for embedding machine readable data in human readable formats, are growing in popularity. I've added support for the hCalendar microformat in travel itineraries, but I'm not optimistic about the technique.

Dan Connolly and I are travelling to Edinburgh later this month for a TAG meeting. In the course of looking at our respective online schedules, we got to chatting about microformats, specifically hCalendar. Dan's been experimenting with using it and I've obviously got calendar data on the web, so you'd think I could use it too.

One of the reasons this blog exists is so that I have a place to experiment, so I spent a few hours one evening last week tinkering to get hCalendar supported on my itineraries pages. It turned out to be a little tricky because that page doesn't have all the detailed information needed to generate the event data, but I managed to work around that. Right now, only a few of the events are actually formatted with hCalendar, but over time I'll probably get all of them into that format.

Microformats are becoming quite popular. Old timers like myself recognize that these are what we used to call “architectural forms” being reinvented. Exactly what constitutes a microformat is probably open to debate. On one end of the scale there are really simple things, like adding a rel attribute to anchor tags, and on the other, considerably more complex things like hCalendar and hCard which have nested structure.

The idea develops pretty naturally. You start with some markup vocabulary (DocBook, XHTML, whatever you have lying around) that has an attribute that's used for specialization. In DocBook, we call it role. In XHTML, it's called class. You use it when you want to distinguish two pieces of data that are marked up with the same element.

This works perfectly well on an ad hoc basis, and if you pass the document to someone who isn't familiar with your extensions, the fallback is natural and obvious.

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain, the fallback is natural and obvious, and new code can be written to use this “extra” information without any change being required to existing applications; they just ignore it.
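As a rough illustration of new code using the “extra” information, here is a minimal Python sketch that pulls event data out of class attributes. The sample fragment, its dates, and the names HCalendarSketch and extract_events are assumptions made up for this example; only the class names (vevent, summary, dtstart, dtend, location) come from hCalendar. It is not a complete hCalendar parser.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the class names follow hCalendar, but the text
# and dates are invented for illustration.
SAMPLE = """
<div class="vevent trip">
  <span class="summary">TAG meeting</span>:
  <abbr class="dtstart" title="2005-09-20">20 Sep</abbr> to
  <abbr class="dtend" title="2005-09-23">22 Sep</abbr>
  in <span class="location">Edinburgh</span>
</div>
"""

# The handful of properties this sketch understands.
PROPERTIES = {"summary", "location", "dtstart", "dtend"}


class HCalendarSketch(HTMLParser):
    """Collects one dict per element whose class list contains 'vevent'.

    Limitation: unclosed void elements such as <br> inside an event
    would throw off the depth counter; a real parser must handle them.
    """

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None    # event currently being collected
        self._depth = 0       # open elements inside the current event
        self._capture = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._event is None:
            if "vevent" in classes:
                self._event, self._depth = {}, 1
            return
        self._depth += 1
        for name in PROPERTIES.intersection(classes):
            if tag == "abbr" and attrs.get("title"):
                # Machine-readable value lives in the title attribute.
                self._event[name] = attrs["title"]
            else:
                self._event[name] = ""
                self._capture = name

    def handle_data(self, data):
        if self._event is not None and self._capture:
            self._event[self._capture] += data

    def handle_endtag(self, tag):
        if self._event is None:
            return
        self._capture = None
        self._depth -= 1
        if self._depth == 0:
            self.events.append(self._event)
            self._event = None


def extract_events(html):
    parser = HCalendarSketch()
    parser.feed(html)
    parser.close()
    return parser.events
```

A browser renders the same fragment as ordinary text and ignores the class names it doesn't know, which is the fallback described above.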

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using div and span to mark up the individual appointments, so I can't add hCalendar to them.
I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.
They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So in the general case, where you want to say that the content of one of these special elements is “an abbr with class="dtstart" interleaved with an abbr with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.
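To make the rule-based alternative concrete, here is a minimal Python sketch of an order-independent check, in the spirit of what a Schematron rule would assert. It is not Schematron, and the RULES table is an illustrative assumption rather than a statement of what the hCalendar specification requires; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# (class name, minimum occurrences, maximum occurrences) per event.
# Illustrative values only; consult the hCalendar spec for the real rules.
RULES = [("dtstart", 1, 1), ("dtend", 0, 1), ("summary", 1, 1)]


def has_class(element, name):
    """True if the space-separated class attribute contains name."""
    return name in element.get("class", "").split()


def check_events(xhtml):
    """Return a list of problems; an empty list means every rule passed."""
    problems = []
    root = ET.fromstring(xhtml)
    events = [e for e in root.iter() if has_class(e, "vevent")]
    for index, event in enumerate(events, start=1):
        for prop, minimum, maximum in RULES:
            # Count descendants in any order: this is the "interleave"
            # behaviour that a grammar has trouble expressing here.
            found = [e for e in event.iter()
                     if e is not event and has_class(e, prop)]
            if not minimum <= len(found) <= maximum:
                problems.append(
                    f"event {index}: expected {minimum}..{maximum} "
                    f"'{prop}', found {len(found)}")
    return problems


# dtend appears before dtstart; the check does not care about order.
GOOD = ('<div><p class="vevent">'
        '<abbr class="dtend" title="2005-09-23">22 Sep</abbr> '
        '<abbr class="dtstart" title="2005-09-20">20 Sep</abbr> '
        '<span class="summary">TAG meeting</span></p></div>')

# No dtstart at all.
BAD = ('<div><p class="vevent">'
       '<span class="summary">TAG meeting</span></p></div>')
```

The catch is the one already noted: this is an extra step that has to be run deliberately, alongside whatever grammar-based validation is already in place.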

Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.

At the end of the day, I'm not a fan of microformats, at least not on the complex end of the spectrum. There are undoubtedly lots and lots of situations where they're the only practical answer, but if you don't have to use them, I'm not sure you should.

If you want to embed data in your documents, embed data. The XML source for the individual itinerary pages, for example, doesn't use DocBook littered with role attributes to store itinerary information; it uses markup suited to that purpose:

<trip xmlns="http://nwalsh.com/rdf/itinerary#"
      startDate="2005-09-18T12:45:00" endDate="2005-09-30T23:59:59"
      trip="09-18-tagxsl">
   <itinerary>
      <leg class="flight">
         <startDate>2005-09-18T12:45:00</startDate>
         <endDate>2005-09-18T14:30:00</endDate>
         <description>BDL-RDU/AA 4695</description>
         <depart>#BDL</depart>
         <arrive>#RDU</arrive>
         <flight>4695</flight>
         <airline>American Airlines</airline>
      </leg>
      …

I think that's a better answer when it's a practical answer.
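One practical consequence of purpose-built markup is that it can be queried directly. The following Python sketch reads the flight leg shown above; the sample simply closes the open elements after the first leg (the elided legs are not reconstructed), and the read_legs function and its dictionary keys are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Prefix "it" is arbitrary; only the namespace URI matters.
NS = {"it": "http://nwalsh.com/rdf/itinerary#"}

# The first leg from the example, with the open elements closed so the
# fragment is well-formed. Later legs are omitted, not invented.
TRIP = """<trip xmlns="http://nwalsh.com/rdf/itinerary#"
      startDate="2005-09-18T12:45:00" endDate="2005-09-30T23:59:59"
      trip="09-18-tagxsl">
   <itinerary>
      <leg class="flight">
         <startDate>2005-09-18T12:45:00</startDate>
         <endDate>2005-09-18T14:30:00</endDate>
         <description>BDL-RDU/AA 4695</description>
         <depart>#BDL</depart>
         <arrive>#RDU</arrive>
         <flight>4695</flight>
         <airline>American Airlines</airline>
      </leg>
   </itinerary>
</trip>"""


def read_legs(xml_text):
    """Return one dict per leg, with start and end parsed as datetimes."""
    root = ET.fromstring(xml_text)
    legs = []
    for leg in root.findall("it:itinerary/it:leg", NS):
        legs.append({
            "kind": leg.get("class"),
            "start": datetime.fromisoformat(
                leg.findtext("it:startDate", namespaces=NS)),
            "end": datetime.fromisoformat(
                leg.findtext("it:endDate", namespaces=NS)),
            "description": leg.findtext("it:description", namespaces=NS),
        })
    return legs
```

Because the element names live in their own namespace, there is no need to guess which class values carry meaning, and a grammar-based schema can validate the structure.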

Comments

"As such, they consume that extensibility point, leaving me without one for any other use I may have."

Not quite true, since an element can have multiple classes. What it does consume is that particular class name. It doesn't stop you from adding other class names, as in <span class="vevent myclass1 myclass2">.

Isn't the fact that the class names exist in a global namespace also a significant problem? The hcalendar-issues page includes a number of items which can be seen in that light. You could usefully emphasize the namespace advantage for your <trip xmlns="http://nwalsh.com/rdf/itinerary#" ... example.

The only way I can see that it could be made easier is with a templating system, one that does not ask the user to enter the data. :)

I was thinking of another possibility too, but I have to try first.

@Ed-

About class values being in a global namespace.... yes, in a sense they are in a flat namespace. On the other hand, they can be (relatively) easily disambiguated by context. For example, .hreview .description and .vevent .description.

@Norman-

"I'm not using div and span to mark up the individual appointments, so I can't add hCalendar to them."

I don't follow what you're saying here.

re: validation...

A validation approach is being explored by Brian Suda in his X2V system- he's running a preflight procedure before trying to parse hcards, which catches nearly all errors. Also, I'm finding that having a reference implementation (X2V), helps about as much as validation. When I write code in hcard or hcalendar, I'll check it against X2V to make sure that things come out the way I expect them to. This method seems to work pretty well.

It's good to hear your feedback about microformats. Feel free to offer constructive feedback anytime.

-ryan king

@Ryan,

I think Norm is saying that you are limited in the type of markup you can use, in the sense that a certain architecture of nested span/div or other kinds of elements is required, rather than a bit here and there with the elements of your choice.

The good thing would be to find a universal solution: whatever the person is using for the class names, to have it easily processable by transformation rules. Maybe a profile saying that this class name corresponds to this in hcard, this one to that, and so really give freedom to the user.

Unfortunately, one thing is still painful for everyone: as soon as you want to make data structures explicit, you have to work for it (be it n3, rdf/xml, microformats, wiki markup, etc.). That's why a template paradigm would be very useful in this domain of explicit data. You could look at datablogging for example, which has very neat features for templating.

@karl: The good thing would be to find a universal solution: whatever the person is using for the class names, to have it easily processable by transformation rules.

Absolutely right. It would have been much better if the hCalendar people had taken a step back to get a wider view. Using an approach like that of GRDDL would have allowed a lot more flexibility. Having a standard transformation available for the markup as they have defined it would allow those who don't wish to write XSLT to play.

On the other hand - I'm not too impressed with the details of GRDDL anyway.

Re: extensibility, my bad. I always forget that in XHTML, class="foo bar baz" means the union of the foo, bar, and baz classes.

I expected this essay to generate comments, and it did :-) I'll follow up in a few days with an attempt to synthesize what I learn from them.

Calendar data is very complicated and most people DON'T use divs and spans; instead they use tables to separate date-time, location, and description information into individual cells. We are aware of this and have looked into special rules for extracting information from tabular data. These are in the early stages, but HTML tables have little-used attributes called 'headers' and 'axis' that can be used to associate rows and cells with other cells.

There are no mandatory elements that MUST be used in microformats, just more semantic ones. The most common is the abbr element. In all examples, properties like 'dtstart' and 'dtend' use abbr elements to give a human readable date-time as the node value and a machine readable date-time in the title attribute. There is nothing stopping someone from using any element they wish, just that the node value would have to be the machine readable date-time.

Another advantage of microformats is the ability to extract the data from the HTML to any format. hCard can be converted nicely to a vCard, but another advantage is the ability to use GRDDL to extract hCards to RDF vCards, RDF FoaF, or any other format.

XMDP is a way to describe a microformat, and for simpler microformats, you can use it to validate against. I have built some basic validators for XFN, GRDDL, and no-follow. When you get to more complex systems, it does get more difficult because it relies on knowledge about some of the RFCs referenced. Even with that drawback it is possible to create a unique validator for a microformat based on the XMDP file. A universal validator/parser has been proposed, using GRDDL and RDF to extract more of a machine readable schema from XMDP, but at the end of the day, microformats are designed first and foremost as human readable, machine readable second.

I look forward to a follow-up post about this topic.

-brian

which reminds me of something completely different.

A while ago I created a calendar in the form of a list, rendered as a table. :)

So, given the objections you have, how would you respond to Tim Bray's "No New XML Languages" post on http://www.tbray.org/ongoing/When/200x/2006/01/08/No-New-XML-Languages if you haven't already?

Brian, "XMDP is a way to describe a microformat, and for simpler microformats, you can use it to validate against." - I'd say this isn't a true statement. I don't think that eyeballing something is the same as validation, which at least to my mind implies some sort of official stamp of approval.

I was wondering if interleave is really required for RELAX NG validation of microformats, cf. our very brief chat on #swig about it. I suppose that the problem is that you need to be able to select which child element classes will be used for each parent element class, for example... Couldn't you do that with substitution groups in W3C XML Schema, or something like that? I don't like to think that this is impossible.

Or perhaps the Schematron approach would be worth it. After all, it's just a load of XSLT, so the barrier for setting up your own local validator is actually a lot lower than for, especially, WXS, and to some extent RELAX NG (I use jing or rnv, but that hardly beats XSLT's implementation proliferation).