Micro-blogging Backup, part the third

Volume 12, Issue 28; 03 Sep 2009; last modified 08 Oct 2010

In which we peel back the covers on what's been built so far.

There's more functionality to come, but first, I thought it might be useful to spend a few minutes looking at what we've got so far.

The setup code in /mbb/init isn't very interesting, and I'm not going to attempt to explain how CQ works, so we'll begin in the /mbb/modules directory.

accounts.xqy

This module contains some utility and convenience functions for dealing with account data. I changed my mind about how to store the data a couple of times early on, so these functions were supposed to protect me a little bit from that. I didn't follow through all the way, so the account abstraction is pretty leaky, but I left this module in place anyway.
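
For illustration, the kind of convenience function I mean looks something like this (the namespace URI, element names, and collection name are placeholders, not the actual ones in accounts.xqy):

xquery version "1.0-ml";

declare namespace mbb="http://example.com/ns/mbb";  (: hypothetical namespace :)

(: Return the stored account element for a given screen name, if we have one :)
declare function local:account($name as xs:string) as element(mbb:account)?
{
  fn:collection("accounts")/mbb:account[mbb:name = $name]
};

local:account("someuser")  (: placeholder screen name :)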

twitter.xqy

This module is a thin skin over the actual Twitter API. Ideally, I'd flesh this module out to support the rest of the endpoints, but I haven't bothered yet.

One school of thought on this kind of API module is that it should be as thin as possible, providing only the thinnest skin over the underlying API. I mostly agree, but I did take a few liberties. If you wanted to adapt this module for some other purpose, you might have reason to carve it a little closer to the bone.

One decision I made was to have the account/rate_limit_status method return the number of calls remaining directly as a number, rather than returning the XML response. That's pretty simple. The other changes I made are a bit deeper.
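
For example, instead of handing back the raw XML, the rate limit call boils down to something like this (the endpoint URL and the remaining-hits element name are from memory, and authentication is elided; treat it as a sketch):

xquery version "1.0-ml";

(: Return the number of API calls remaining as a number, not as XML :)
declare function local:rate-limit-remaining() as xs:integer
{
  (: authentication options elided :)
  let $resp := xdmp:http-get("http://twitter.com/account/rate_limit_status.xml")
  return xs:integer($resp[2]//remaining-hits)
};

local:rate-limit-remaining()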

The Twitter timeline methods are designed to be “paged”; the caller can select the page size and the page they want to retrieve. I decided that what I really want is all the pages, so my versions of the timeline methods always request all the pages and return all of the results in a single call (by performing the requisite paging for you, behind the scenes). Twitter limits you to 16 pages, but Identi.ca servers seem to offer more pages. In order to avoid recursing beyond the size of the call stack, I placed an arbitrary limit on the number of pages.
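
The paging itself looks roughly like this (the function name, the timeline URL, and the limit value are mine for the sketch; the real module differs in detail):

xquery version "1.0-ml";

declare variable $max-pages as xs:integer := 16;  (: arbitrary safety limit :)

(: Fetch $page of a timeline; stop when a page comes back empty
   or when we hit the arbitrary page limit :)
declare function local:get-pages($base as xs:string, $page as xs:integer)
  as element()*
{
  if ($page > $max-pages)
  then ()
  else
    let $resp     := xdmp:http-get(fn:concat($base, "&amp;page=", $page))
    let $statuses := $resp[2]//status
    return
      ($statuses,
       if (fn:empty($statuses)) then () else local:get-pages($base, $page + 1))
};

(: requires authentication in practice :)
local:get-pages("http://twitter.com/statuses/friends_timeline.xml?count=200", 1)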

Finally, I decided to protect the caller from exceptions that can occur if the underlying HTTP requests fail. Most of the public methods in twitter.xqy return an element, either the Twitter API response, or a t:error element containing the HTTP error code if an error occurred.
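
In outline, that wrapping looks something like this (the t: namespace URI and the exact shape of the t:error element are placeholders for the idea):

xquery version "1.0-ml";

declare namespace t="http://example.com/ns/twitter";  (: hypothetical namespace :)

(: Return the API response element, or a t:error wrapping the error code :)
declare function local:get($uri as xs:string) as element()
{
  try {
    let $resp := xdmp:http-get($uri)
    return
      if (xs:integer($resp[1]/*:code) = 200)
      then $resp[2]/*
      else <t:error>{fn:string($resp[1]/*:code)}</t:error>
  }
  catch ($e) {
    (: the request itself failed; report the server's error code instead :)
    <t:error>{fn:string($e/*:code)}</t:error>
  }
};

local:get("http://twitter.com/account/rate_limit_status.xml")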

I think an argument could be made for not doing this, for letting the lowest-level API calls throw the exception, but I decided not to. You're free to change that, of course.

twitproc.xqy

This module is mostly responsible for taking Twitter status and user elements and inserting them into the database. Along the way, we transform them just a little (a rough sketch of the transformations follows the list):

  1. I move them from no namespace into the “t:” namespace. First, I subscribe to the position that XML vocabularies should place elements in a namespace. I'm aware that there are people who believe otherwise. They're wrong. Second, XQuery’s interpretation of unqualified names exacerbates the problem. So you could look at this as patching a bug in the Twitter API.

  2. I transform the contents of the created_at element into ISO 8601 format (so it fits more naturally into the data model).

  3. The Twitter APIs return a user element embedded in each status message. This is probably a net win for limiting round-trip calls to the API, but it doesn't strike me as a very sensible way to store things in the database. I break out the users and store them separately.

  4. I add a few more elements to each status message. These record information about subsequent processing to perform (more about that later), the screen name of the user who uttered the message, and information about who was logged in to retrieve this message.

    This is a little lazy on my part. Arguably, I should introduce another namespace for these additional elements (so that some future Twitter API change doesn't walk all over them), or maybe not store them in the messages at all. I invite you to fix it if it bothers you.

    If you know something about MarkLogic Server, this may sound like a job for document properties. That's a good idea, particularly for the downstream processing markers. However, document properties are associated, as the name suggests, with documents in the database. Later on, in the code for displaying messages, we're sometimes going to make a copy of the message (giving it a new parent element). Doing that breaks the association with document properties. I was trying to keep things simple, so rather than use properties for one set of information and child nodes for another, I just pushed it all into child nodes. My bad.
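
To make the first two transformations concrete, here is roughly what they amount to (the namespace URI and function names are mine; the real code in twitproc.xqy differs in detail):

xquery version "1.0-ml";

declare namespace t="http://example.com/ns/twitter";  (: hypothetical namespace :)

(: Recursively shift an element from no namespace into the t: namespace :)
declare function local:t-ize($node as element()) as element()
{
  element { fn:QName("http://example.com/ns/twitter",
                     fn:concat("t:", fn:local-name($node))) }
  {
    $node/@*,
    for $child in $node/node()
    return if ($child instance of element()) then local:t-ize($child) else $child
  }
};

(: Convert Twitter's "Thu Sep 03 14:02:11 +0000 2009" into ISO 8601 :)
declare function local:iso-date($s as xs:string) as xs:string
{
  let $p      := fn:tokenize($s, " ")
  let $months := ("Jan","Feb","Mar","Apr","May","Jun",
                  "Jul","Aug","Sep","Oct","Nov","Dec")
  let $m      := fn:index-of($months, $p[2])
  return fn:concat($p[6], "-",
                   if ($m lt 10) then fn:concat("0", $m) else fn:string($m),
                   "-", $p[3], "T", $p[4], "Z")  (: the offset is always +0000 :)
};

local:t-ize(<status><id>123</id><text>Hello, world</text></status>),
local:iso-date("Thu Sep 03 14:02:11 +0000 2009")

Breaking out the embedded user and adding the extra bookkeeping children are more element construction of the same flavor, so I've left them out of the sketch.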

update.xqy

This module wraps up the functionality of the twitter.xqy and twitproc.xqy modules, getting all the tweets for a user and inserting them into the database. The code for finding the most recent messages by (and not by) a particular user might be interesting to you. Ignore the $tweet-collection variable; it's a holdover from an earlier approach, no longer used.
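
The “most recent message” lookup (for the “by” half, at least) amounts to something along these lines (again, the t: namespace and element names are placeholders):

xquery version "1.0-ml";

declare namespace t="http://example.com/ns/twitter";  (: hypothetical namespace :)

(: The id of the most recent status we've already stored for a screen name.
   In MarkLogic, a leading / searches every document in the database. :)
declare function local:latest-id($name as xs:string) as xs:unsignedLong?
{
  fn:max(
    for $status in /t:status[t:screen_name = $name]
    return xs:unsignedLong($status/t:id))
};

local:latest-id("someuser")  (: placeholder screen name :)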

get-new-tweets.xqy

This module exists only to be invoked from another module. It declares an external variable that identifies a single account, then simply calls the get-tweets function from the update.xqy module for that account.
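
Stripped of detail, the whole module is essentially this (the module namespace, path, and variable name are placeholders following the description above):

xquery version "1.0-ml";

import module namespace upd="http://example.com/ns/mbb/update"
       at "/mbb/modules/update.xqy";  (: hypothetical namespace and path :)

(: The account to fetch, supplied by the caller via xdmp:invoke :)
declare variable $account as xs:string external;

upd:get-tweets($account)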

The last bit of code that we've got so far is get-tweets.xqy in the top level of the application server. This module loops over all the accounts that we've defined and, for each one, downloads and inserts any new status messages into the database. It does this by invoking the get-new-tweets.xqy module.

What's all this invoking stuff about?

The server takes a completely safe, transactional approach to database updates. You are guaranteed that every query that updates the database either succeeds in its entirety or fails. One of the things that you aren't allowed to do is make conflicting updates to the same document in the same transaction. You can demonstrate this easily: just run the following expression in CQ:

let $doc := <foo>some document</foo>
return
  (xdmp:document-insert("/scratch/foo", $doc),
   xdmp:document-insert("/scratch/foo", $doc))

The server will bark “XDMP-CONFLICTINGUPDATES” and no inserts will be made to the database.

Why does this matter to us? Well, imagine that you set up two Twitter accounts in our micro-blogging backup system. Imagine further that both of those accounts follow marklogic.

What's going to happen when we run the backup? Both accounts are going to download all of the status messages on their “friends” timeline, so they're both going to download all of the recent marklogic tweets. And they're both going to try to insert them into the database. And that's going to generate a “conflicting updates” error.

Using the two-step xdmp:invoke dance as shown in get-tweets.xqy and get-new-tweets.xqy avoids this problem. The semantics of xdmp:invoke are that it runs the specified module in a separate transaction.
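
Concretely, the dance amounts to something like this (the account names, module path, and variable name are placeholders; the point is the external-variable binding and the separate transaction per call):

xquery version "1.0-ml";

(: Each xdmp:invoke runs get-new-tweets.xqy in its own transaction,
   so two accounts downloading the same tweet can never conflict :)
for $name in ("someaccount", "anotheraccount")  (: placeholder screen names :)
return
  xdmp:invoke("/mbb/modules/get-new-tweets.xqy",
              (xs:QName("account"), $name))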

Since no single user is going to download the same message twice, each transaction will succeed. In fact, some messages will get inserted twice, but that does no harm because the content of the message is the same in each case; the second insert simply replaces the first.

An alternate approach to this problem is to manage the messages with greater care, identifying duplicates when they occur and not attempting to insert them in the database. This is the approach taken in twitproc.xqy for the simpler problem of dealing with duplicate users.
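
For the users, that care amounts to something along these lines (the URI scheme, element names, and namespace are illustrative; the real check in twitproc.xqy may differ):

xquery version "1.0-ml";

declare namespace t="http://example.com/ns/twitter";  (: hypothetical namespace :)

(: Keep only one user element per distinct id, and only ids we haven't stored yet.
   Deduplicating within the batch matters because updates made in a single
   transaction don't see each other. :)
declare function local:new-users($users as element(t:user)*) as element(t:user)*
{
  for $id in fn:distinct-values($users/t:id)
  where fn:empty(fn:doc(fn:concat("/users/", $id, ".xml")))
  return ($users[t:id = $id])[1]
};

(: in the real flow, the users come from the freshly downloaded statuses :)
local:new-users(
  (<t:user><t:id>14</t:id></t:user>,
   <t:user><t:id>14</t:id></t:user>))  (: duplicate ids collapse to one :)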

It would certainly be possible to refactor the code so that the xdmp:invoke call could be avoided, but in this case splitting work into several transactions feels like the more elegant solution. And any performance penalties associated with a few calls to xdmp:invoke are going to be totally swamped by the latency in the underlying HTTP requests, so there isn't really a downside.

What next?

In the next part, we'll push a little further forward, getting some code in place to display the messages we've downloaded. We'll also look at the subsequent processing I hinted at. Further down the road, we'll look at search, and then we'll add some JavaScript, refactor things a bit, and make an AJAXy/Web 2.0 UI for our application.

I hope you're enjoying the ride.

Comments

Hmm. If you insert two identical documents into a database, shouldn't that Just Work and produce only one of them?

Just asking.

—Posted by John Cowan on 04 Sep 2009 @ 03:12 UTC #

Oh, I think two identical documents is just a degenerate case of the general problem. Given the potential expense of determining that two documents are "identical", I don't think it's worth making a special case. The problem doesn't actually arise in real life very often, this special case even less often.

—Posted by Norman Walsh on 04 Sep 2009 @ 11:04 UTC #

Do you think we could get the "seperate" spelling widely accepted? That would have saved me some embarrassment in a grade-school spelling bee.

—Posted by Dan Connolly on 05 Sep 2009 @ 03:02 UTC #

No, but thanks for spreading the embarrassment around :-)

I fixed the typo.

—Posted by Norman Walsh on 05 Sep 2009 @ 08:05 UTC #