<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="5.0" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<info>
<title>Micro-blogging Backup, part the fifth</title><biblioid class="uri">http://norman.walsh.name/2009/10/18/mbb05</biblioid>
<volumenum>12</volumenum>
<issuenum>36</issuenum>
<pubdate>2009-10-18T16:14:49-04:00</pubdate>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2009</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>In which we clean things up.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#MarkLogic"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#Microblogging"/>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#TheWeb"/>
</info>

<para xml:id="p1">If you've been using one of the micro-blogging services
for a while, you're probably familiar with a set of conventions that
have evolved for adding metadata to your status messages. The ones I'm familiar
with are:</para>

<itemizedlist>
<listitem>
<para xml:id="p2"><emphasis>@user</emphasis> to identify another user.</para>
</listitem>
<listitem>
<para xml:id="p3"><emphasis>#tag</emphasis> to add a “tag” to your message.</para>
</listitem>
<listitem>
<para xml:id="p4"><emphasis>!group</emphasis> to identify a group (at the moment, this
seems only to be an <link xlink:href="http://identi.ca/">Identi.ca</link>
convention).</para>
</listitem>
</itemizedlist>

<para xml:id="p5">In addition to those conventions, the use of “URL shorteners”
(<link xlink:href="http://tinyurl.com/"/>,
<link xlink:href="http://bit.ly/"/>,
<link xlink:href="http://is.gd/"/>, etc.)
is common. Finally, although it may not be apparent in the client you use,
individual status messages can indicate, at the API level, that they are
“in-reply-to” some other message.
</para>

<para xml:id="p6">So far, our micro-blogging backup system doesn't take advantage of any
of this extra information.</para>

<para xml:id="p7">One of the first things I decided to do was
expand shortened URIs. There's no 140-character limit in the database, and
if you link to something on
<link xlink:href="http://youtube.com/"/>, the odds that I want to
follow that link are within
<wikipedia page="Limit_%28mathematics%29">ε</wikipedia> of zero. I'd like
to know where a link leads before I click.</para>

<para xml:id="p8">As long as we're grovelling through the text of
each message, it makes sense to expand the other conventions, turning them
into the appropriate links.
</para>
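<para>As a rough illustration of that kind of text processing, here is a
minimal Python sketch (not the XQuery in this series). The regular expression,
the function names, the <literal>base</literal> URL, and the
<literal>tag/</literal> and <literal>group/</literal> path layout are all
assumptions made for the example.</para>
<programlisting language="python"><![CDATA[
```python
import re

# One pattern for all three conventions. The sigil must not follow a word
# character, so "me@example.com" is not read as a mention. This pattern is
# an assumption for illustration, not the one the cleanup module uses.
CONVENTION = re.compile(r"(?<!\w)([@#!])(\w+)")


def extract_conventions(text):
    """Return the users, tags, and groups named in a status message."""
    found = {"@": [], "#": [], "!": []}
    for sigil, name in CONVENTION.findall(text):
        found[sigil].append(name)
    return {"mentions": found["@"], "tags": found["#"], "groups": found["!"]}


def linkify(text, base="http://example.org"):
    """Wrap each convention in a link; the URL layout here is hypothetical."""
    paths = {"@": "", "#": "tag/", "!": "group/"}

    def repl(match):
        sigil, name = match.group(1), match.group(2)
        return '<a href="%s/%s%s">%s%s</a>' % (base, paths[sigil], name, sigil, name)

    return CONVENTION.sub(repl, text)
```
]]></programlisting>
<para>A pattern this simple will misread some input, for example a URI
fragment that directly follows a slash, so real messages need more care.</para>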

<para xml:id="p9">It also makes sense to download any messages that an existing message
is “in-reply-to”. If those messages are also replies, we'll follow them
too, until the trail ends. This allows us to display whole conversations, even if they
involve participants that we don't follow.</para>
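<para>The trail-following idea can be sketched in Python as below. This is an
illustration under stated assumptions, not the module's actual logic: statuses
are modelled as dictionaries with hypothetical <literal>id</literal> and
<literal>in_reply_to</literal> keys, <literal>fetch</literal> stands in for
whatever API call retrieves a single message, and <literal>budget</literal>
caps the number of fetches because each one is an API request.</para>
<programlisting language="python"><![CDATA[
```python
def follow_replies(status, fetch, have, budget=50):
    """Walk the in-reply-to chain from `status`, fetching messages we lack.

    `status` is a dict with an "id" and an optional "in_reply_to" id.
    `fetch(id)` returns such a dict, or None if the message is unavailable.
    `have` maps ids to statuses already stored; it is updated in place.
    Returns the number of fetches spent, which never exceeds `budget`.
    """
    spent = 0
    seen = {status["id"]}  # guards against a malformed, circular chain
    parent_id = status.get("in_reply_to")
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = have.get(parent_id)
        if parent is None:
            if spent >= budget:
                break  # out of budget; a later run can pick up the trail
            parent = fetch(parent_id)
            spent += 1
            if parent is None:
                break  # message unavailable: the trail ends here
            have[parent_id] = parent
        parent_id = parent.get("in_reply_to")
    return spent
```
]]></programlisting>
<para>Because messages already in <literal>have</literal> cost nothing, a
second run resumes where the first one ran out of budget.</para>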

<para xml:id="p10">All of this can be accomplished with one new module,
<link xlink:href="examples/cleanup.xqy">/modules/cleanup.xqy</link>,
and a new top-level query to drive it,
<link xlink:href="examples/clean-tweets.xqy">/clean-tweets.xqy</link>.
The interesting bits are in the <filename>cleanup.xqy</filename> module:
</para>

<orderedlist>
<listitem>
<para xml:id="p11">The actual work is just string manipulation: regular expressions
and tokenize, mostly.</para>
</listitem>
<listitem>
<para xml:id="p12">Following replies counts against your rate limit, so we follow at most
50 per run.</para>
</listitem>
<listitem>
<para xml:id="p13">To expand URIs, we perform HTTP HEAD requests against the URIs we find
in the status messages. In the worst case, some of those may time out, so we
expand at most 500 per run. That way we're unlikely to perform a query that
takes so long that <emphasis>it</emphasis> times out.</para>
</listitem>
<listitem>
<para xml:id="p14">If you look closely, you'll see that in addition to doing the expansions,
we also add new elements to the status document: 
<tag>t:mention</tag> for mentions of another user,
<tag>t:tag</tag> for tags,
<tag>t:group</tag> for groups,
and 
<tag>t:host</tag> for the host names of expanded URIs.</para>
<para xml:id="p15">We'll come back in some future installment
and use those for faceted searches (e.g., 
“find all the messages by <literal>@xmlcalabash</literal> that include
links to <literal>tests.xproc.org</literal>”).
</para>
</listitem>
</orderedlist>
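<para>For readers who want to see the URI expansion step in concrete terms,
here is a small Python sketch of the same idea: a HEAD request per URI, a cap
on how many URIs are handled in one run, and the host name of each result.
The function names, the ten-second timeout, and the choice to follow up to
five redirects are assumptions for the example; the XQuery module may well do
it differently.</para>
<programlisting language="python"><![CDATA[
```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_location(uri, timeout=10):
    """Issue one HEAD request; return the redirect target, or None."""
    parts = urlsplit(uri)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            location = response.getheader("Location")
            return urljoin(uri, location) if location else None
        return None
    except (OSError, http.client.HTTPException):
        return None  # timeout or connection failure: leave the URI alone
    finally:
        conn.close()


def expand(uri, resolve=head_location, max_hops=5):
    """Follow redirects from `uri` for at most `max_hops` hops."""
    for _ in range(max_hops):
        target = resolve(uri)
        if target is None or target == uri:
            break
        uri = target
    return uri


def expand_batch(uris, resolve=head_location, limit=500):
    """Expand at most `limit` URIs; return (results, URIs left for later)."""
    todo, rest = uris[:limit], uris[limit:]
    expanded = {}
    for uri in todo:
        final = expand(uri, resolve)
        expanded[uri] = {"uri": final, "host": urlsplit(final).hostname}
    return expanded, rest
```
]]></programlisting>
<para>Passing the resolver in as a parameter keeps the batching logic
testable without touching the network: a dictionary's
<literal>get</literal> method works as a stand-in for
<literal>head_location</literal>.</para>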

<para xml:id="p16">Pop the two files mentioned above into your setup (if this is your
first encounter with my micro-blogging backup series, make sure you start
at <link xlink:href="/2009/08/27/mbb01">the beginning</link>).
</para>

<para xml:id="p17">After you've installed the files, running
<link xlink:href="http://localhost:8330/clean-tweets.xqy"/> will start
cleaning up your database. If you've downloaded a lot of messages, you'll
have to run it several times. If you have a lot of replies, you'll have
to spread the runs over a few hours, because following replies counts
against your rate limit.</para>

<para xml:id="p18">Having to run the cleanup query several times
is a bit inconvenient. I'm experimenting with some JavaScript to improve
that, but I'm
still looking for better solutions.</para>

</essay>

