<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="5.0" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<info>
<title>Micro-blogging Backup, part the third</title><biblioid class="uri">http://norman.walsh.name/2009/09/03/mbb03</biblioid>
<volumenum>12</volumenum>
<issuenum>28</issuenum>
<pubdate>2009-09-03T16:12:40-04:00</pubdate>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2009</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>In which we peel back the covers on what's been built so far.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#MarkLogic"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#Microblogging"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#TheWeb"/>
</info>

<para xml:id="p1">There's more functionality to come, but first, I
thought it might be useful to spend a few minutes looking at what
we've got so far.</para>

<para xml:id="p2">The setup code in <filename>/mbb/init</filename> isn't very interesting,
and I'm not going to attempt to explain how CQ works, so we'll begin in
the <filename>/mbb/modules</filename> directory.</para>

<variablelist>
<varlistentry>
      <term>
	<filename>accounts.xqy</filename>
      </term>
<listitem>
<para xml:id="p3">This module contains some utility and convenience functions for
dealing with account data. I changed my mind about how to store the data
a couple of times early on, so these functions were supposed to protect
me a little bit from that.  I didn't follow through all the way so the
account abstraction is pretty leaky, but I left this module in place anyway.
</para>
</listitem>
</varlistentry>
<varlistentry>
      <term>
	<filename>twitter.xqy</filename>
      </term>
<listitem>
<para xml:id="p4">This module is a thin skin over the actual
<link xlink:href="http://apiwiki.twitter.com/Twitter-API-Documentation">Twitter
API</link>. Ideally, I'd flesh this module out to support the rest of
the endpoints, but I haven't bothered yet.
</para>
<para xml:id="p5">One school of thought on this kind of API module is that it should
provide only the thinnest possible skin over the underlying API.
I mostly agree, but I did take a few liberties. If you wanted to adapt this 
module for some other purpose, you might have reason to carve it a little closer
to the bone.</para>
<para xml:id="p6">One decision I made was to have the
<methodname>account/rate_limit_status</methodname> method return the number of
calls remaining directly as a number, rather than returning the XML response.
That's pretty simple. The other changes I made are a bit deeper.</para>
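<para>As a rough sketch of that first change, the function below pulls the
count out of a response and returns it as a number. It is an illustration
rather than the code in <filename>twitter.xqy</filename>: the name
<function>local:remaining-calls</function> is hypothetical, and the sample
response assumes the API wraps a <tag>remaining-hits</tag> element in a
<tag>hash</tag> element.</para>

<programlisting>xquery version "1.0-ml";

(: Hypothetical helper: reduce a rate-limit response to a single number.
   Assumes the response looks like the sample element below. :)
declare function local:remaining-calls($response as element(hash)) as xs:integer
{
  xs:integer($response/remaining-hits)
};

(: Canned sample so the sketch runs without calling the API. :)
local:remaining-calls(
  &lt;hash&gt;
    &lt;remaining-hits type="integer"&gt;148&lt;/remaining-hits&gt;
    &lt;hourly-limit type="integer"&gt;150&lt;/hourly-limit&gt;
  &lt;/hash&gt;)</programlisting>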
<para xml:id="p7">The Twitter timeline methods are designed to be “paged”; the caller can select
the page size and the page they want to retrieve. I decided that what I really
want is <emphasis>all</emphasis> the pages; so my versions of the timeline
methods always request all the pages and return all of the results in a single
call (by performing the requisite paging for you, behind the scenes). 
Twitter limits you to 16 pages, but Identi.ca servers seem to offer more
pages. In order to avoid recursing beyond the size of the call stack, I placed
an arbitrary limit on the number of pages.
</para>
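<para>The paging described above can be sketched roughly as follows. This is
an illustration, not the code in <filename>twitter.xqy</filename>: the names
<function>local:fetch-page</function>, <function>local:all-pages</function>, and
<varname>$MAX-PAGES</varname> are hypothetical, the limit of 40 is arbitrary,
and the stub returns canned data so that the sketch runs without a network
connection.</para>

<programlisting>xquery version "1.0-ml";

(: Hypothetical cap on recursion depth; the value is arbitrary. :)
declare variable $MAX-PAGES as xs:integer := 40;

(: Stand-in for one paged API request. A real version would make an HTTP
   call; this stub pretends the timeline has three pages of one status each. :)
declare function local:fetch-page($page as xs:integer) as element(status)*
{
  if ($page le 3)
  then &lt;status&gt;&lt;id&gt;{ $page }&lt;/id&gt;&lt;/status&gt;
  else ()
};

(: Collect pages until one comes back empty or the cap is reached. :)
declare function local:all-pages($page as xs:integer) as element(status)*
{
  if ($page gt $MAX-PAGES)
  then ()
  else
    let $statuses := local:fetch-page($page)
    return
      if (empty($statuses))
      then ()
      else ($statuses, local:all-pages($page + 1))
};

(: Start with the first page; with the stub above this yields three statuses. :)
local:all-pages(1)</programlisting>

<para>The cap bounds the recursion even if a server keeps handing back
non-empty pages indefinitely.</para>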
<para xml:id="p8">Finally, I decided to protect the caller from exceptions that can occur
if the underlying HTTP requests fail. Most of the public methods in
<filename>twitter.xqy</filename> return an element, either the Twitter API
response, or a <tag>t:error</tag> element containing the HTTP error code if
an error occurred.</para>
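<para>One way to get that behavior is sketched below, assuming the request is
made with <function>xdmp:http-get</function>. The function name
<function>local:safe-get</function>, the namespace URI bound to
<literal>t</literal>, and the URI in the final call are placeholders of mine,
not taken from <filename>twitter.xqy</filename>.</para>

<programlisting>xquery version "1.0-ml";

declare namespace http  = "xdmp:http";
declare namespace error = "http://marklogic.com/xdmp/error";
(: Hypothetical URI; use whatever twitter.xqy actually binds to "t". :)
declare namespace t     = "http://example.com/ns/twitter";

(: Return the root element of the response body, or a t:error element.
   The caller never sees an exception from the HTTP layer. :)
declare function local:safe-get($uri as xs:string) as element()?
{
  try {
    (: xdmp:http-get returns the response metadata first, then the body. :)
    let $result := xdmp:http-get($uri)
    let $code   := xs:integer($result[1]/http:code)
    return
      if ($code eq 200)
      then $result[2]/*
      else &lt;t:error&gt;{ $code }&lt;/t:error&gt;
  } catch ($e) {
    (: The request itself failed (no connection, bad host, and so on). :)
    &lt;t:error&gt;{ string($e/error:message) }&lt;/t:error&gt;
  }
};

(: Placeholder URI, chosen only to exercise the failure path. :)
local:safe-get("http://localhost:1/does-not-exist")</programlisting>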
<para xml:id="p9">I think an argument could be made for
<emphasis>not</emphasis> doing this, for letting the lowest-level API
calls throw the exception, but I decided to catch it instead. You're
free to change that, of course.</para>
</listitem>
</varlistentry>
<varlistentry>
      <term>
	<filename>twitproc.xqy</filename>
      </term>
<listitem>
<para xml:id="p10">This module is mostly responsible for taking Twitter <tag>status</tag>
and <tag>user</tag> elements and inserting them into the database. Along
the way, we transform them just a little:</para>

<orderedlist>
<listitem>
<para xml:id="p11">I move them from no namespace into the “t:” namespace. First, I subscribe
to the position that XML vocabularies
<link xlink:href="http://www.w3.org/TR/webarch/#use-namespaces">should
place elements in a namespace</link>. I'm aware that there are people who believe
otherwise. They're wrong. Second, <wikipedia>XQuery</wikipedia>’s interpretation
of unqualified names
<link xlink:href="http://norman.walsh.name/2008/07/02/xquery#p11">exacerbates
the problem</link>. So you could look at this as patching a bug in the Twitter API.
</para>
</listitem>
<listitem>
<para xml:id="p12">I transform the contents of the <tag>created_at</tag> element into
ISO 8601 format (so it fits more naturally into the data model).
</para>
</listitem>
<listitem>
<para xml:id="p13">The Twitter APIs return a <tag>user</tag> element embedded in each
<tag>status</tag> message. This is probably a net win for limiting round-trip calls
to the API, but it
doesn't strike me as a very sensible way to store things in the database.
I break out the users and store them separately.
</para>
</listitem>
<listitem>
<para xml:id="p14">I add a few more elements to each status message. These record information
about subsequent processing to perform (more about that later),
the screen name of the user who uttered the message, and information about
who was logged in to retrieve this message.</para>
<para xml:id="p15">This is a little lazy on my part. Arguably, I should introduce another
namespace for these additional elements (so that some future Twitter API
change doesn't walk all over them), or maybe not store them <emphasis>in</emphasis>
the messages at all. I invite you to fix it if it bothers you.</para>
<para xml:id="p16">If you know something about 
<link xlink:href="http://www.marklogic.com/product/marklogic-server.html">MarkLogic
Server</link>, this may sound like a job for document properties. That's a
good idea, particularly for the downstream processing markers. However,
document properties are associated, as the name suggests, with
<emphasis>documents</emphasis> in the database. Later on in the code for displaying
messages, we're sometimes going to make a copy of the message (giving it a new
parent element). Doing that breaks the association with document properties.
I was trying to keep things simple, so rather than use properties for one set
of information and child nodes for another, I just pushed it all into child
nodes. My bad.</para>
</listitem>
</orderedlist>
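<para>The first three transformations can be sketched in a few lines. This is
a simplified illustration, not the code in <filename>twitproc.xqy</filename>:
the function names and the namespace URI are hypothetical, and the date
conversion assumes timestamps shaped like the one in the sample message (a
zero-padded day and a numeric time zone offset).</para>

<programlisting>xquery version "1.0-ml";

(: Hypothetical URI; the real one is whatever twitproc.xqy binds to "t". :)
declare namespace t = "http://example.com/ns/twitter";
declare variable $T-NS as xs:string := "http://example.com/ns/twitter";

(: Turn "Thu Sep 03 20:12:40 +0000 2009" into an xs:dateTime. :)
declare function local:iso-date($created-at as xs:string) as xs:dateTime
{
  let $months := ("Jan","Feb","Mar","Apr","May","Jun",
                  "Jul","Aug","Sep","Oct","Nov","Dec")
  let $t     := tokenize(normalize-space($created-at), " ")
  let $mm    := index-of($months, $t[2])
  let $month := if ($mm lt 10) then concat("0", $mm) else string($mm)
  (: "+0000" becomes "+00:00" :)
  let $tz    := replace($t[5], "^([+-]\d\d)(\d\d)$", "$1:$2")
  return xs:dateTime(concat($t[6], "-", $month, "-", $t[3], "T", $t[4], $tz))
};

(: Copy a tree, moving every element into the t: namespace and
   rewriting created_at along the way. :)
declare function local:to-t($node as node()) as node()
{
  typeswitch ($node)
    case element(created_at) return
      &lt;t:created_at&gt;{ local:iso-date(string($node)) }&lt;/t:created_at&gt;
    case element() return
      element { QName($T-NS, concat("t:", local-name($node))) } {
        $node/@*,
        for $child in $node/node() return local:to-t($child)
      }
    default return $node
};

(: Canned sample message with an embedded user. :)
let $status :=
  &lt;status&gt;
    &lt;created_at&gt;Thu Sep 03 20:12:40 +0000 2009&lt;/created_at&gt;
    &lt;id&gt;1&lt;/id&gt;
    &lt;text&gt;Hello&lt;/text&gt;
    &lt;user&gt;&lt;id&gt;2&lt;/id&gt;&lt;screen_name&gt;example&lt;/screen_name&gt;&lt;/user&gt;
  &lt;/status&gt;
(: Break the user out so it can be stored separately from the message. :)
let $user    := local:to-t($status/user)
let $message := local:to-t(
  &lt;status&gt;{ $status/@*, $status/node() except $status/user }&lt;/status&gt;)
return ($user, $message)</programlisting>

<para>The sketch returns the user and the message as two separate elements; a
real version would insert each into the database instead of returning
them.</para>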

</listitem>
</varlistentry>
<varlistentry>
      <term>
	<filename>update.xqy</filename>
      </term>
<listitem>
<para xml:id="p17">This module wraps up the functionality of the
<filename>twitter.xqy</filename> and <filename>twitproc.xqy</filename>
modules, getting all the tweets for a user and inserting them into the
database. The code for finding the most recent messages by (and not by)
a particular user might be interesting to you. Ignore the
<varname>$tweet-collection</varname> variable; it's a holdover from an earlier
approach, no longer used.</para>
</listitem>
</varlistentry>
<varlistentry>
      <term>
	<filename>get-new-tweets.xqy</filename>
      </term>
<listitem>
<para xml:id="p18">This module exists only to be invoked from another module. It
declares an external variable that identifies a single account
and then simply calls the <function>get-tweets</function> function from
the <filename>update.xqy</filename> module for that account.</para>
</listitem>
</varlistentry>
</variablelist>

<para xml:id="p19">The last bit of code that we've got so far is
<filename>get-tweets.xqy</filename> in the top level of the application
server. This module loops over
all the accounts that we've defined and, for each one, downloads and inserts
any new status messages into the database. It does this by invoking the
<filename>get-new-tweets.xqy</filename> module.</para>

<section xml:id="invoke">
<title>What's all this invoking stuff about?</title>

<para xml:id="p20">The server takes a completely safe,
<wikipedia page="ACID">transactional</wikipedia> approach to database updates.
You are guaranteed that every query that updates the database either succeeds
in its entirety or fails in its entirety. One of the things that you aren't allowed to do is
make conflicting updates to the same document in the same transaction. You can
demonstrate this easily by running the following expression in CQ:</para>

<programlisting>let $doc := &lt;foo&gt;some document&lt;/foo&gt;
return
  (xdmp:document-insert("/scratch/foo", $doc),
   xdmp:document-insert("/scratch/foo", $doc))</programlisting>

<para xml:id="p21">The server will bark “XDMP-CONFLICTINGUPDATES” and no inserts will be
made to the database.</para>

<para xml:id="p22">Why does this matter to us? Well, imagine that you set up two Twitter
accounts in our micro-blogging backup system. Imagine further that both of
those accounts follow
<link xlink:href="http://twitter.com/marklogic">marklogic</link>.
</para>

<para xml:id="p23">What's going to happen when we run the backup? Both accounts are going
to download all of the status messages on their “friends” timeline, so they're
both going to download all of the recent
<link xlink:href="http://twitter.com/marklogic">marklogic</link>
tweets. And they're both
going to try to insert them into the database. And that's going to generate
a “conflicting updates” error.</para>

<para xml:id="p24">Using the two-step <function>xdmp:invoke</function> dance as shown in
<filename>get-tweets.xqy</filename> and <filename>get-new-tweets.xqy</filename>
avoids this problem. The semantics of <function>xdmp:invoke</function> are that
it runs the specified module in a <emphasis>separate</emphasis> transaction.
</para>
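<para>In outline, the two modules might look something like the sketch below.
The account names, the module path, and the namespace URI and function
signature assumed for <filename>update.xqy</filename> are all placeholders of
mine; only the <function>xdmp:invoke</function> call and its options are
standard. The outer module, in the role of
<filename>get-tweets.xqy</filename>:</para>

<programlisting>xquery version "1.0-ml";

(: Hypothetical account list; the real module loops over stored accounts. :)
for $account in ("alice", "bob")
return
  (: Each invocation runs the module in its own transaction. The path is
     an assumption about where the module sits under the server root. :)
  xdmp:invoke("/mbb/modules/get-new-tweets.xqy",
              (xs:QName("account"), $account),
              &lt;options xmlns="xdmp:eval"&gt;
                &lt;isolation&gt;different-transaction&lt;/isolation&gt;
              &lt;/options&gt;)</programlisting>

<para>And the invoked module, in the role of
<filename>get-new-tweets.xqy</filename>:</para>

<programlisting>xquery version "1.0-ml";

(: Hypothetical namespace URI for update.xqy. :)
import module namespace upd = "http://example.com/ns/update" at "update.xqy";

(: Bound by the (xs:QName("account"), $account) pair passed to xdmp:invoke. :)
declare variable $account as xs:string external;

(: Assumes get-tweets takes the account name as its only argument. :)
upd:get-tweets($account)</programlisting>

<para>The isolation option is spelled out here only to make the transaction
boundary visible.</para>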

<para xml:id="p25">Since each account is processed in its own transaction, and no single
account is going to download the same message twice, each
transaction will succeed. In fact, some messages will get updated twice in the
database, but that doesn't do any harm because the content of the message will
be the same in each case.</para>

<para xml:id="p26">An alternative approach to this problem is to manage the messages with
greater care, identifying duplicates when they occur and not attempting to
insert them in the database. This is the approach taken in 
<filename>twitproc.xqy</filename> for the simpler problem of dealing with
duplicate <tag>user</tag>s.</para>
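<para>A minimal sketch of that kind of duplicate handling follows. It is not
necessarily how <filename>twitproc.xqy</filename> does it: the function name
<function>local:insert-users</function> and the
<filename>/scratch/users/</filename> URI scheme are hypothetical. The idea is
to reduce the batch to one element per user before inserting anything, so that
a single transaction never inserts the same document twice.</para>

<programlisting>xquery version "1.0-ml";

(: Insert each distinct user once, even if several status messages
   in the batch embed the same user. :)
declare function local:insert-users($users as element(user)*) as empty-sequence()
{
  for $id in distinct-values($users/id)
  (: Keep only the first copy of each user. :)
  let $user := ($users[id = $id])[1]
  return xdmp:document-insert(concat("/scratch/users/", $id, ".xml"), $user)
};

(: Two copies of the same user produce a single insert, not a conflict. :)
local:insert-users((
  &lt;user&gt;&lt;id&gt;2&lt;/id&gt;&lt;screen_name&gt;example&lt;/screen_name&gt;&lt;/user&gt;,
  &lt;user&gt;&lt;id&gt;2&lt;/id&gt;&lt;screen_name&gt;example&lt;/screen_name&gt;&lt;/user&gt;
))</programlisting>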

<para xml:id="p27">It would certainly be possible to refactor the code so that the
<function>xdmp:invoke</function> call could be avoided, but in this case
splitting work into several transactions
feels like the more elegant solution. And any performance penalties
associated with a few calls to 
<function>xdmp:invoke</function> are going to be totally swamped by the
latency in the underlying HTTP requests, so there isn't really a downside.
</para>

</section>

<section xml:id="next">
<title>What next?</title>

<para xml:id="p28">In the next part, we'll push a little further forward, getting some
code in place to display the messages we've downloaded. We'll also look at
the subsequent processing I hinted at. Further down the road, we'll look
at search, and then we'll add some <wikipedia>JavaScript</wikipedia>,
refactor things a bit, and make an AJAXy/Web 2.0 UI for our application.
</para>

<para xml:id="p29">I hope you're enjoying the ride.</para>
</section>
</essay>

