Micro-blogging Backup, part the fifth

Volume 12, Issue 36; 18 Oct 2009; last modified 08 Oct 2010

In which we clean things up.

If you've been using one of the micro-blogging services for a while, you're probably familiar with a set of conventions that have evolved for adding metadata to your status messages. The ones I'm familiar with are:

@user to identify another user.
#tag to add a “tag” to your message.
!group to identify a group (at the moment, this seems only to be an Identi.ca convention).

In addition to those conventions, the use of “URL shorteners” (http://tinyurl.com/, http://bit.ly/, http://is.gd/, etc.) is common. And, finally, although it may not be apparent in the client you use, at the API level, individual status messages may indicate that they are “in-reply-to” some other message.

So far, our micro-blogging backup system doesn't take advantage of any of this extra information.

One of the first things I decided to do was expand shortened URIs. There's no 140-character limit in the database and if you link to something on http://youtube.com/, the odds that I want to follow that link are within ε of zero. I'd like to know before I click.

As long as we're grovelling through the text of each message, it makes sense to expand the other conventions, turning them into the appropriate links.
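To make the idea concrete, here is a minimal sketch in Python of turning the three conventions into links. It is not the XQuery in cleanup.xqy; the function name, the example.com link templates, and the rule for what counts as a marker are all assumptions made for illustration.

```python
import re
from html import escape

# Hypothetical link templates; the real module may well build different URIs.
LINK_TEMPLATES = {
    "@": "https://example.com/user/{name}",
    "#": "https://example.com/tag/{name}",
    "!": "https://example.com/group/{name}",
}

# A marker only counts when it is not glued to a preceding word character
# or another marker, so "me@example.com" is left alone.
MARKER = re.compile(r"(?<![\w@#!])([@#!])(\w+)")


def link_conventions(text):
    """Return (html, found): the linked text and the names seen per marker."""
    found = {"@": [], "#": [], "!": []}

    def replace(match):
        marker, name = match.group(1), match.group(2)
        found[marker].append(name)
        href = LINK_TEMPLATES[marker].format(name=name)
        return f'{marker}<a href="{href}">{name}</a>'

    # Escape &, < and > first so the result is safe to drop into HTML.
    return MARKER.sub(replace, escape(text, quote=False)), found
```

The `found` dictionary is returned alongside the markup because the same pass that builds the links is a convenient place to record which users, tags, and groups a message mentions.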

It also makes sense to download any messages that an existing message is “in-reply-to”. If those messages are also replies, we'll follow them too until the trail ends. This allows us to display whole conversations, even if they involve participants that we don't follow.
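Following a reply trail is a short loop. The sketch below is Python rather than the XQuery the system actually uses, and it assumes each status is a dictionary with an `in_reply_to_status_id` field and that `fetch_status` is some function that retrieves one status by id (returning `None` if it is gone). Both names are placeholders. The `budget` parameter stands in for the rate-limit cap.

```python
def follow_replies(status, fetch_status, known_ids, budget=50):
    """Walk the in-reply-to trail from `status`, fetching what we lack.

    Returns (fetched, remaining_budget). Stops when the trail ends, when
    we reach a status we already have, or when the budget is used up.
    """
    fetched = []
    parent_id = status.get("in_reply_to_status_id")
    while parent_id is not None and parent_id not in known_ids and budget > 0:
        parent = fetch_status(parent_id)  # one API call against the rate limit
        budget -= 1
        if parent is None:  # deleted or otherwise unavailable; trail ends here
            break
        fetched.append(parent)
        known_ids.add(parent_id)
        parent_id = parent.get("in_reply_to_status_id")
    return fetched, budget
```

Passing the remaining budget back out lets a caller share one allowance across many messages and stop cleanly when it reaches zero.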

All of this can be accomplished with one new module, /modules/cleanup.xqy, and a new top-level query to drive it, /clean-tweets.xqy. The interesting bits are in the cleanup.xqy module:

The actual work is just string manipulation: regular expressions and tokenize, mostly.
Following replies counts against your rate-limit, so we do at most 50 at a time.
To expand URIs, we perform HTTP HEAD requests against the URIs we find in the status messages. In the worst case, some of those may time out, so we do at most 500 at a time. That way we're unlikely to perform a query that takes so long that it times out.
If you look closely, you'll see that in addition to doing the expansions, we also add new elements to the status document: t:mention for mentions of another user, t:tag for tags, t:group for groups, and t:host for the host names of expanded URIs.

We'll come back in some future installment and use those for faceted searches (e.g., “find all the messages by @xmlcalabash that include links to tests.xproc.org”).
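The URI expansion step described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in cleanup.xqy: the function names, the URI regular expression, and the ten-second timeout are mine. The resolver is passed in as a parameter so the batching logic can be exercised without touching the network.

```python
import http.client
import re
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Rough URI matcher; trailing sentence punctuation is left out of the match.
URI = re.compile(r"https?://[^\s<>\"]*[^\s<>\".,;:!?)]")


def resolve_with_head(uri, timeout=10):
    """Send an HTTP HEAD request and return the URI we finally land on."""
    try:
        # urllib follows any redirects for us; .url is where we ended up.
        with urlopen(Request(uri, method="HEAD"), timeout=timeout) as response:
            return response.url
    except (OSError, ValueError, http.client.HTTPException):
        return uri  # timeout, DNS failure, HTTP error: keep the original


def expand_uris(texts, resolve=resolve_with_head, limit=500):
    """Expand at most `limit` distinct URIs; return (new_texts, hosts)."""
    cache = {}
    hosts = set()

    def replace(match):
        uri = match.group(0)
        if uri not in cache:
            if len(cache) >= limit:
                return uri  # over budget; leave it for the next run
            cache[uri] = resolve(uri)
        host = urlsplit(cache[uri]).hostname
        if host:
            hosts.add(host)
        return cache[uri]

    return [URI.sub(replace, text) for text in texts], sorted(hosts)
```

A URI that fails to resolve is simply left as it was, which keeps one slow or dead shortener from spoiling the rest of the batch. The collected host names are the sort of thing that could feed an element like t:host.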

Pop the two files mentioned above into your setup (if this is your first encounter with my micro-blogging backup series, make sure you start at the beginning).

After you've installed the files, running http://localhost:8330/clean-tweets.xqy will start cleaning up your database. If you've downloaded a lot of messages, you'll have to run it several times. If you have a lot of replies, you'll have to spread the runs over a few hours.

The fact that you sometimes have to run the cleanup scripts several times is a bit inconvenient. I'm experimenting with some JavaScript to improve that, but I'm still looking for better solutions.

Comments

Another convention is the use of "RT" (for "ReTweet") followed by "@username" to indicate that a message is being forwarded/rebroadcast from another user.