XML FTW!

Volume 13, Issue 4; 25 Jan 2010; last modified 08 Oct 2010

On the serendipitous joy of finding XML.

As I've said before, I'm very reluctant to use your application if it's a roach motel for my data. It would not be fair to say that I'll refuse to use your application, it's just a lot less likely.

For example, when it came to GSD, I decided that open access wasn't as important as picking an application that I'd actually use. If I let myself get distracted by exploring APIs, there'd be other things not getting done! (Priorities!)

Having made my bed, I figured I should see what I was lying in. Today I took a peek at how OmniFocus stores data. Now, the title of this essay no doubt gives away the punch line, so consider for a moment how this would have been done in the time before XML.

…go on, have a think, I'll wait…

In my experience it would probably have been in some proprietary format, almost certainly binary, and utterly opaque. How many tools document(ed) their proprietary data formats? On some platforms, there might have been system services for storing data, some sort of platform-supported database perhaps. Those systems are (often) only marginally better. They produce, instead of an opaque stream of bits, an opaque stream of atomic values. (Don't get me wrong, I've done the reverse-engineering thing on binary formats, I'd prefer the stream of atomic values, believe you me.)

What did I find when I went looking at the OmniFocus data? A directory full of ZIP files. And what's in each ZIP file? Why contents.xml, of course!
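Taking that peek needs very little code. Here is a minimal Python sketch, assuming the archives end in `.zip` and sit directly in one directory. The function name `read_contents` and the directory layout are illustrative assumptions, not anything OmniFocus documents.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_contents(directory):
    """Yield (zip_name, root_element) for each ZIP in `directory` holding a contents.xml.

    Assumption: archives use a .zip extension and keep contents.xml at the top level.
    """
    for path in sorted(Path(directory).glob("*.zip")):
        with zipfile.ZipFile(path) as archive:
            if "contents.xml" not in archive.namelist():
                continue  # skip archives without the file rather than fail
            with archive.open("contents.xml") as stream:
                yield path.name, ET.parse(stream).getroot()
```

Iterating over `read_contents(some_directory)` gives one parsed tree per archive, which is enough to start poking at element names and attributes.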

Now, it would not be fair to assert that this is perfectly transparent. XML isn't magic. There are clearly some cross-reference relationships in there that will take a little mental gymnastics to decode. But still, I'll trade this:

...
<task id="pJhk6REkEHC" op="update">
  <task idref="ggQv63WgCbw"/>
  <added>2010-01-21T16:23:08.983Z</added>
  <modified>2010-01-24T21:01:41.632Z</modified>
  <name>Add server-side support for multipart MIME to tests.xproc.org</name>
  <rank>2113929216</rank>
  <context idref="jYnYAAVroBT"/>
  <due>2010-01-27T22:00:00.000Z</due>
  <completed>2010-01-24T21:01:41.622Z</completed>
  <order>parallel</order>
</task>
...

for anything I would ever have gotten at any other point in the history of file formats!
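To make the point concrete, here is a small Python sketch that pulls the interesting fields out of the excerpt above using only the standard library. It assumes the element carries no XML namespace, which is how the excerpt appears; the real files may differ. The helper names `parse_timestamp` and `summarize` are mine.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The task element from the excerpt, copied as-is.
SAMPLE = """<task id="pJhk6REkEHC" op="update">
  <task idref="ggQv63WgCbw"/>
  <added>2010-01-21T16:23:08.983Z</added>
  <modified>2010-01-24T21:01:41.632Z</modified>
  <name>Add server-side support for multipart MIME to tests.xproc.org</name>
  <rank>2113929216</rank>
  <context idref="jYnYAAVroBT"/>
  <due>2010-01-27T22:00:00.000Z</due>
  <completed>2010-01-24T21:01:41.622Z</completed>
  <order>parallel</order>
</task>"""


def parse_timestamp(text):
    """Parse the UTC timestamps used in the excerpt (trailing 'Z', milliseconds)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def summarize(task):
    """Collect a few fields from a task element into a plain dictionary."""
    parent = task.find("task")       # nested task element, apparently a reference
    context = task.find("context")
    due = task.findtext("due")
    return {
        "id": task.get("id"),
        "name": task.findtext("name"),
        "parent": parent.get("idref") if parent is not None else None,
        "context": context.get("idref") if context is not None else None,
        "due": parse_timestamp(due) if due else None,
        "done": task.findtext("completed") is not None,
    }


info = summarize(ET.fromstring(SAMPLE))
```

What the `idref` values point at still has to be worked out by reading more of the data, but nothing here required a hex editor.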

XML has its detractors. It would not be fair to say they are all wrong. But I'll take XML over fair any day!

Comments

That business of the task element having a child element also named task is rather weird.

In addition, the JSON version of this would also be Quite Decent.

The "Add server-side support..." task is part of a larger task (or project) called "XProc Actions". I think that's the origin of the nesting.

And yes, in this particular case, JSON would also work. But I don't need another serialization for a subset of the full scope of documents I want to create. At least not outside the context of exchanging packets of data between JavaScript and some server process.

XML good. Turtle, N3... better. For example, John probably wouldn't find a <task> within a <task> weird if they were related by of:subTask or something. And you'd know what was going on with the idrefs.

I try to be an RDF fan, but I'm not sure I see how it would help here. The inner element could have simply been named subtask if that mattered, but I don't think it does.

How would RDF have made the ID/IDREF relationships any clearer?

The first thing I'd wonder, seeing data like that, would be whether the inner task element is always a reference or if sometimes it's a complete task element, e.g., if it's not referenced anywhere else. With RDF you wouldn't care, that decision is merely a whim of the serializer which will be dealt with by whatever library you use to read the serial form back in. It's of no more interest than whether the id attributes use single or double quotes.

For XML, in a way it would be better if the inner element was named subtask as that would clear up any confusion as to whether it's actually a reference to the supertask, for example. But the element wasn't so named, probably because XML, while mostly being fairly self-descriptive, leaves implicit the relationship between parent and child elements thereby encouraging designers to assume it's obvious.

Also, giving the inner task element a different name makes it difficult to include, in-line, tasks that are not referenced elsewhere.

Similarly, the ID/IDREF relationship would be dealt with completely by the reader; you'd only need to think about the subtask relationship. Though id attributes are usually unique throughout the document, that's only a convention; in the absence of an accessible DTD you might have to do some careful investigation to make sure that other element types can't have clashing ids, for example.
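That "careful investigation" could start with something like the following Python sketch. It assumes the attributes are literally named `id` and `idref`, as in the excerpt, and that a whole document has already been parsed. The function name `audit_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def audit_ids(root):
    """Return (clashes, dangling) for a parsed document.

    clashes:  id value -> list of element names that declare it more than once
    dangling: (element name, idref value) pairs that point at no declared id
    """
    declared = defaultdict(list)
    references = []
    for element in root.iter():
        if element.get("id") is not None:
            declared[element.get("id")].append(element.tag)
        if element.get("idref") is not None:
            references.append((element.tag, element.get("idref")))
    clashes = {value: tags for value, tags in declared.items() if len(tags) > 1}
    dangling = [(tag, ref) for tag, ref in references if ref not in declared]
    return clashes, dangling
```

An empty `clashes` result over a reasonable sample of files would not prove that ids are unique by design, but it is decent evidence. A non-empty `dangling` list would suggest that references can cross file boundaries.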

Yes, there are alternatives to XML... I'm fond of JSON, and seeing SQLite replace previous binary gorp under stuff like iPhoto makes me very happy.

But I still credit the XML movement with establishing the culture of open exchange behind all of these.

For example, see slide 7 in my ALA Midwinter 2000 talk.

Of course there are alternatives; they're just not as good. Yes, JSON is fine, I suppose, for simple, perhaps even nested, key/value pairs. And SQLite is better than random binary goo, but mostly, IMHO, because there's a standard path to text.

I reverse engineered AddressBook's SQLite data and it was a minor PITA.

Sometimes there are good reasons to do something else, no doubt, but I'm still happiest when I find XML when I go looking for data.