Geographical proximity

Volume 13, Issue 39; 23 Oct 2010

A quick look at how “nearby” is implemented on this weblog.

Linda van den Brink asks how the geographical “nearby” function works on this weblog.

The first thing I should say is that one key aspect of the technique was accidentally obscured a bit. If you look at the XML for an essay, you'll see that there is a geospatial coverage element in the metadata. That was accidentally left out of the RDF view, but I've fixed that bug.

It's a simple fact that the questions a database can answer fastest are the ones it can answer by looking at the indexes alone. If you make a database look at individual documents to find an answer then, even if the query is instantaneous, you have to perform at least one disk I/O on every document you look at. On a collection the size of this weblog, that cost is probably inconsequential, but I've learned to avoid it.

One way to help build the right indexes is to augment the metadata associated with a document when it's ingested. That metadata can be stripped out again before the document is served, so it's entirely transparent to users. That's what I do here. If you could see the raw XML, which you can't, you'd find a few additional metadata elements in the info:

<mldb:id>13,33</mldb:id>
<mldb:pubdate>2010-10-12T19:15:15-04:00</mldb:pubdate>
<mldb:updated>2010-10-12T22:41:32.694007-04:00</mldb:updated>
<mldb:topic>MarkLogic</mldb:topic>
<mldb:topic>SelfReference</mldb:topic>
<mldb:subject>Mark Logic</mldb:subject>
<mldb:subject>McChord Crothers Samuel</mldb:subject>
<mldb:subject>Self Reference</mldb:subject>
<mldb:subject>Walsh Norman</mldb:subject>
<mldb:coverage>us-pa-philadelphia</mldb:coverage>
<mldb:geoloc>
  <geo:lat>39.9500</geo:lat>
  <geo:long>-75.1500</geo:long>
</mldb:geoloc>

Range indexes built on those elements allow me to answer all the “runtime” query questions with indexes. The dc:coverage value is used to look up some RDF metadata about locations at ingestion time and augment the essay with the actual geospatial coordinates.
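To make the ingest-time augmentation concrete, here is a minimal sketch in Python. It is not the code that runs this weblog. The namespace URIs, the LOCATIONS table (a stand-in for the RDF location metadata), and the function names augment and strip_augmentation are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the essay does not show the real bindings.
DC = "http://purl.org/dc/elements/1.1/"
MLDB = "http://example.com/ns/mldb"  # hypothetical placeholder
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Hypothetical stand-in for the RDF metadata about locations.
LOCATIONS = {"us-pa-philadelphia": (39.95, -75.15)}


def augment(info, locations=LOCATIONS):
    """At ingestion, copy dc:coverage into mldb:coverage and add coordinates."""
    coverage = info.findtext(f"{{{DC}}}coverage")
    if coverage is None or coverage not in locations:
        return info  # nothing to add for unknown or missing locations
    lat, lon = locations[coverage]
    ET.SubElement(info, f"{{{MLDB}}}coverage").text = coverage
    geoloc = ET.SubElement(info, f"{{{MLDB}}}geoloc")
    ET.SubElement(geoloc, f"{{{GEO}}}lat").text = f"{lat:.4f}"
    ET.SubElement(geoloc, f"{{{GEO}}}long").text = f"{lon:.4f}"
    return info


def strip_augmentation(info):
    """Before serving, remove every direct child in the mldb namespace."""
    for child in list(info):
        if child.tag.startswith(f"{{{MLDB}}}"):
            info.remove(child)
    return info
```

Because everything the ingestion step adds lives in one namespace, removing it again before the document is served is a single pass over the children of the info element.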

I've met folks who are uncomfortable with the technique of adding these additional elements. Me, I try to be pragmatic. As long as they can be managed entirely automatically by the ingestion process and discarded when the documents are served out, I don't let it bother me.

Now, the answer to the geospatial question is straightforward. I've got a geospatial index over the mldb:geoloc elements, so if an essay has a location, I use that location as the center of a series of geospatial queries with radii of 1, 2, 5, 10, 15, 25, 50, 100, 250, and 500 miles. I stop as soon as I find a query that contains at least three (other) documents.

Those documents are the ones listed as “nearby”. If there are more than three, then the trailing ellipsis links to a page that shows them all.
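The expanding-radius search can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation: a linear scan with a haversine distance stands in for the geospatial index, the names nearby and haversine_miles are hypothetical, and returning nothing when even 500 miles finds fewer than three documents is a guess, since that case isn't covered above.

```python
import math

# Radii from the essay, in miles, tried from smallest to largest.
RADII_MILES = (1, 2, 5, 10, 15, 25, 50, 100, 250, 500)
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, long) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def nearby(center_id, docs, minimum=3):
    """Return (radius, ids) for the smallest radius holding >= minimum other docs.

    docs maps a document id to its (lat, long). A real system would run each
    radius as an index-backed query instead of scanning every document.
    """
    center = docs[center_id]
    distances = sorted(
        (haversine_miles(center, loc), doc_id)
        for doc_id, loc in docs.items()
        if doc_id != center_id  # only *other* documents count
    )
    for radius in RADII_MILES:
        hits = [doc_id for dist, doc_id in distances if dist <= radius]
        if len(hits) >= minimum:
            return radius, hits
    return None, []  # assumption: no "nearby" list if 500 miles is not enough
```

The function returns every document inside the winning radius, closest first, so a caller can show the first three and link to the rest.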

In the fullness of time, I'll build some more interesting geospatial features using the geospatial metadata associated with photos and perhaps other things.