Timezones are annoying and inconvenient. And that's before legislatures get involved and start mucking about with them. Nevertheless, in the real world, sometimes you just gotta deal.

It is not enough to be exceptionally mad, licentious and fanatical in order to win a great reputation; it is still necessary to arrive on the scene at the right time.

Voltaire

Ever since I started using PDAs and building my own XML representations of “personal information management” data, I've wrestled with timezones:

The only rational way to deal with these issues is to convert dates to UTC for internal use and convert back again when displaying the results. (In principle, any timezone would do, but UTC is easy.)

Given a time in UTC, the question “what time is it?” boils down to knowing what the offset from UTC is in the locale in question at the (absolute) time in question. Luckily, all of the rules for determining that offset, including adjustments for present and historical daylight savings time, are encoded in a publicly available database.

If you know that it's 8:20pm on 14 May 2013 in the “America/Chicago” timezone[1], you know that it's 1:20a on 15 May 2013 UTC.
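To make the lookup concrete, here is a minimal Python sketch; it is not the XML/XQuery implementation described in this essay. The three-row table is a hand-copied, 2013-only stand-in for the “America/Chicago” rules, and the names `CHICAGO_2013` and `offset_at` are made up for illustration. The real database carries the full rule history.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Hand-copied, 2013-only stand-in for the America/Chicago rules:
# (UTC instant at which the offset takes effect, offset from UTC in hours).
CHICAGO_2013 = [
    (datetime(2013, 1, 1, 0, 0, tzinfo=UTC), -6),   # standard time (table start)
    (datetime(2013, 3, 10, 8, 0, tzinfo=UTC), -5),  # daylight time begins
    (datetime(2013, 11, 3, 7, 0, tzinfo=UTC), -6),  # standard time resumes
]

def offset_at(utc_instant, transitions=CHICAGO_2013):
    """Return the UTC offset in force at an absolute (UTC) instant."""
    starts = [start for start, _ in transitions]
    i = bisect_right(starts, utc_instant) - 1
    if i < 0:
        raise ValueError("instant precedes the first rule in the table")
    return timedelta(hours=transitions[i][1])

# 1:20a on 15 May 2013 UTC, shown on a Chicago wall clock.
utc_time = datetime(2013, 5, 15, 1, 20, tzinfo=UTC)
local_time = utc_time.astimezone(timezone(offset_at(utc_time)))
print(local_time.isoformat())  # 2013-05-14T20:20:00-05:00
```

The table is keyed on UTC instants because the offset depends on the absolute time, not on the wall-clock time. Anything outside 2013 would need more rows.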

For my purposes, working out how to interpret that database involved converting it to XML and writing the necessary XQuery to use it. Along the way, I discovered that there's also a publicly available map of the world's timezones.

[Photo]

This led naturally to the question: if I know where I am, can I work out what timezone I'm in? This isn't as (completely) pointless as it might at first sound. Geolocation APIs in browsers mean that I can often tell where I am with considerable accuracy.

[Photo]

My laptop browser's geolocation API is accurate to within what appears to be inches.

Also, XQuery's implicit timezone isn't really very useful for these kinds of applications. As I write this, my offset from GMT is -5 hours. That could be “America/New_York” in winter or “America/Chicago” in summer and there's no way to tell.

Being able to work out from my geolocation that I'm in “America/Chicago” and not “America/New_York” would be useful.

So can I?

“Yes.” But not without a bit of effort.

The initial problem was how to decode the shape files into something useful. I eventually found Perl libraries that could read the shape files. That “reduced” the problem to one of dealing with just over 27,000 polygons, some with more than 20,000 vertices.

MarkLogic Server has very capable geospatial APIs, but asking it to compute in which of 27,000 hugely complex polygons a particular point appears is not practical.

What is a practical answer?

First, let me introduce a geospatial feature of MarkLogic Server, then I'll outline my answer. Given a complex polygon, you can ask MarkLogic Server to decompose it into a set of bounding boxes: horizontal slices across the complex polygon, each fitting as tightly to the shape as possible. Boxes are nice because, computationally, it's fairly easy to answer the question, is this point in this box?

[Photo]

Of course, if the shape is irregular, there may be points within the box that are not within the actual polygon. If you ask for more boxes, there will be fewer such points, but you'll have to deal with…more boxes.
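The idea of slicing a shape into boxes can be sketched in a few lines of Python. This is only an illustration and makes no claim about how MarkLogic Server computes its boxes: it cuts the polygon into equal-height horizontal bands (clipping with the Sutherland-Hodgman method) and takes the bounding box of each band. The function names and the toy triangle are assumptions.

```python
def clip_halfplane(poly, y, keep_above):
    """Clip a polygon (list of (x, y) vertices) against the line at height y,
    keeping the part above it (keep_above=True) or below it."""
    def inside(p):
        return p[1] >= y if keep_above else p[1] <= y

    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        if inside(cur) != inside(prev):
            # The edge prev->cur crosses the line; add the crossing point.
            t = (y - prev[1]) / (cur[1] - prev[1])
            out.append((prev[0] + t * (cur[0] - prev[0]), y))
        if inside(cur):
            out.append(cur)
    return out

def slice_boxes(poly, n):
    """Return up to n boxes (west, south, east, north), one per horizontal band."""
    ys = [p[1] for p in poly]
    lo, hi = min(ys), max(ys)
    step = (hi - lo) / n
    boxes = []
    for k in range(n):
        south = lo + k * step
        north = hi if k == n - 1 else lo + (k + 1) * step
        band = clip_halfplane(clip_halfplane(poly, south, True), north, False)
        if band:
            xs = [p[0] for p in band]
            boxes.append((min(xs), south, max(xs), north))
    return boxes

def in_box(pt, box):
    west, south, east, north = box
    return west <= pt[0] <= east and south <= pt[1] <= north

triangle = [(0, 0), (4, 0), (0, 4)]
boxes = slice_boxes(triangle, 2)
print(boxes)
```

With two bands, the lower box spans x from 0 to 4 and the upper box only x from 0 to 2. The point (3, 3) sits inside the triangle's overall bounding box but in neither slice, so it is rejected cheaply. The point (1.9, 2.5) is in the upper box but outside the triangle, which is exactly the kind of false positive that more boxes would reduce.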

For each timezone polygon, I compute two sets of boxes: the first is the overall bounding box for the entire timezone. The second divides each of the polygons into boxes.

Here, then, is the technique that I use to answer the question, what timezone is this point in?

  1. From the set of overall bounding boxes, which boxes contain the point?

    If the answer is zero, then the point is off in international waters or something and I don't care.

    If the answer is exactly one, then I'm done. That's the timezone the point is in.

  2. If the answer is more than one, then I ask, from the set of all bounding boxes associated with the timezones identified in step 1, which boxes contain the point?

    If the answer is zero, then the point is off in international waters or something and I don't care.

    If the answer is exactly one, then I'm done. That's the timezone the point is in.

  3. If the answer is more than one, then I have to actually do the computationally difficult task of working out which complex polygon the point is in. But I've reduced the number of polygons that have to be compared to only those that have overlapping bounding boxes. There are rarely more than two such polygons and when there are more than two, it's usually the case that the polygons themselves are smaller.
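The three steps above can be sketched in Python. This is an illustration only, not the MarkLogic implementation: the zones, their coordinates, the `ZONES` layout and the function names are all invented, and the expensive polygon test is a plain ray-casting check.

```python
def in_box(pt, box):
    west, south, east, north = box
    return west <= pt[0] <= east and south <= pt[1] <= north

def in_polygon(pt, poly):
    """Ray casting: count edges crossed by a ray going east from pt."""
    x, y = pt
    inside = False
    for i, (x1, y1) in enumerate(poly):
        x2, y2 = poly[i - 1]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Toy data (hypothetical): two triangles sharing one square, plus a separate square.
ZONES = {
    "Zone/A": {"overall": (0, 0, 4, 4), "polygons": [
        {"vertices": [(0, 0), (4, 0), (0, 4)],
         "boxes": [(0, 0, 4, 2), (0, 2, 2, 4)]}]},
    "Zone/B": {"overall": (0, 0, 4, 4), "polygons": [
        {"vertices": [(4, 0), (4, 4), (0, 4)],
         "boxes": [(2, 0, 4, 2), (0, 2, 4, 4)]}]},
    "Zone/C": {"overall": (5, 0, 9, 4), "polygons": [
        {"vertices": [(5, 0), (9, 0), (9, 4), (5, 4)],
         "boxes": [(5, 0, 9, 4)]}]},
}

def find_zone(pt, zones=ZONES):
    # Step 1: overall bounding boxes.
    candidates = [name for name, z in zones.items() if in_box(pt, z["overall"])]
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    # Step 2: per-polygon slice boxes, only for the zones found in step 1.
    hits = sorted({(name, i)
                   for name in candidates
                   for i, poly in enumerate(zones[name]["polygons"])
                   if any(in_box(pt, b) for b in poly["boxes"])})
    if len(hits) <= 1:
        return hits[0][0] if hits else None
    # Step 3: the expensive test, but only on the few remaining polygons.
    for name, i in hits:
        if in_polygon(pt, zones[name]["polygons"][i]["vertices"]):
            return name
    return None

for pt in [(7, 1), (1, 1), (3, 1.5), (20, 20)]:
    print(pt, find_zone(pt))
```

In this toy data, (7, 1) is settled by step 1, (1, 1) by step 2, and (3, 1.5) needs the polygon test in step 3.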

Here's an example of overlapping boxes in Europe.

[Photo]

This point is in two boxes, one that slices across the Czech Republic and one that slices across Austria, including a big chunk of the Czech Republic where Austria's border is “concave”.

[Photo]

It's all well and good to say that's the technique, but how is it actually implemented?

To answer that question, I need to introduce another feature of MarkLogic Server: reverse queries. The overwhelming majority of applications use queries to find documents: what documents contain the phrase “John Bigboote”? Which documents were created on 1 Nov 1938? How many documents are about Grover's Mill, NJ?

But MarkLogic Server can use its indexes “the other way around”. It's a little brain bending, but if you store a bunch of queries in the database, then you can efficiently ask the question, which queries match this document that I'm about to insert?

For each bounding box and polygon, I create a separate query (more than 223,000 of them, in fact) and store it in the database. I use collections to distinguish between the boxes and polygons.

When presented with a point, say latitude 39.82N, longitude 101.9W, I construct a dummy document and use a reverse query limited to the collection of overall bounding boxes to ask: which queries match this point?

The answer I get is two queries. I've crafted the URIs for those query documents to identify their timezones:

/tzinfo/boxes/America/Chicago/0
/tzinfo/boxes/America/Denver/0
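A toy Python model of that first reverse query may help. It is not MarkLogic code: the stored "queries" are just boxes, the box coordinates and the `overall-boxes` collection name are invented for illustration, and a linear scan stands in for the index lookup the server actually performs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredQuery:
    uri: str                 # URI of the stored query document
    collections: frozenset   # collections the query document belongs to
    box: tuple               # (west, south, east, north), made-up coordinates

# Hypothetical store: one overall box per timezone.
STORE = [
    StoredQuery("/tzinfo/boxes/America/Chicago/0",
                frozenset({"overall-boxes"}), (-105.0, 25.8, -84.0, 49.4)),
    StoredQuery("/tzinfo/boxes/America/Denver/0",
                frozenset({"overall-boxes"}), (-116.0, 31.3, -100.0, 49.0)),
    StoredQuery("/tzinfo/boxes/America/New_York/0",
                frozenset({"overall-boxes"}), (-90.0, 24.0, -66.0, 48.0)),
]

def reverse_query(store, point, collection):
    """Return the URIs of stored queries in `collection` that match `point`.

    `point` is (longitude, latitude); west and south are negative.
    """
    lon, lat = point
    return [q.uri for q in store
            if collection in q.collections
            and q.box[0] <= lon <= q.box[2]
            and q.box[1] <= lat <= q.box[3]]

matches = reverse_query(STORE, (-101.9, 39.82), "overall-boxes")
print(matches)
```

The point is the "document" and the boxes are the stored queries, so the question is turned around: not "which documents match this query?" but "which queries match this document?".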

I've also used those URIs as the names of collections for the relevant bounding boxes. That means I can turn around and use a reverse query limited to those collections to ask again, which queries match?

The answer again is two queries:

/tzinfo/pboxes/America/Chicago/354-1085
/tzinfo/pboxes/America/Denver/2-102

They're the blue bounding boxes here:

[Photo]

These URIs are also used as the names for collections, which means I can turn around and use a reverse query limited to exactly the documents that contain the 354th polygon in the “America/Chicago” timezone and the 2nd polygon in the “America/Denver” timezone to work out that the answer is:

/tzinfo/pboxes/America/Chicago/354
[Photo]

Putting the point in the “America/Chicago” timezone!

But wait! There is yet another wrinkle. Some timezone polygons are nested. Nested, I hear you say? Yes! Consider, for example, the north eastern corner of Arizona.

[Photo]

Most of that corner is in the “America/Denver” timezone, except for a little island of “America/Phoenix” which, I kid you not, contains an island of “America/Denver”.

[Photo]

This stumped me for a long time because the timezone polygons aren't annotated in any way. Then just recently, it occurred to me that if a point is in two “America/Denver” polygons (not boxes, but polygons), then it must be in an exclusion. (And if it's in two polygons it will necessarily be in two boxes, so I'm guaranteed to get down to the polygon level in these cases.) If it's in three, then it must be in an inclusion, etc. Because there's no reason for timezone polygons to share an edge, I think this will always work.
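That counting rule is easy to sketch in Python. The nested squares and zone names below are invented stand-ins for the real polygons, and the sketch assumes, as the rule does, that every exclusion and inclusion shows up as another polygon of the same timezone.

```python
def in_polygon(pt, poly):
    """Ray casting: count edges crossed by a ray going east from pt."""
    x, y = pt
    inside = False
    for i, (x1, y1) in enumerate(poly):
        x2, y2 = poly[i - 1]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def square(lo, hi):
    return [(lo, lo), (hi, lo), (hi, hi), (lo, hi)]

# Hypothetical nesting: Zone/Outer contains an island of Zone/Inner,
# which in turn contains an island of Zone/Outer.
POLYGONS = {
    "Zone/Outer": [square(0, 10), square(3, 7), square(4, 6)],
    "Zone/Inner": [square(3, 7), square(4, 6)],
}

def zone_by_parity(pt, polygons=POLYGONS):
    """A point belongs to a zone if it lies in an odd number of its polygons."""
    for zone, polys in polygons.items():
        depth = sum(in_polygon(pt, p) for p in polys)
        if depth % 2 == 1:
            return zone
    return None

for pt in [(1, 1), (3.5, 3.5), (5, 5)]:
    print(pt, zone_by_parity(pt))
```

The point (3.5, 3.5) is in two Zone/Outer polygons, so it is in an exclusion and falls to Zone/Inner. The point (5, 5) is in three, so it is back in Zone/Outer.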

As your reward for reading all the way to the end, you can get the code from the ML-tzinfo project on GitHub. If you just want to play with the sample applications, they're up at http://tzmap.nwalsh.com/.


[1]The conventional way to represent timezones in the timezone database is with a country/city pair. Often this is the country and its capital city, but for larger countries, various representative cities have been chosen (with some provision for aliases).

Comments:

“And that's before legislatures get involved and start mucking about with them.”

Here you touched on one big mystery of life. See the latest update of the Fedora tzdata package. Why in the world do legislatures have such an unquenchable need to muck with timezones? Why in the world did Paraguay decide to muck with its timezone anyway? I have never really understood this.

BTW, what’s wrong with <blockquote> in comments?

Posted by Matěj Cepl on 31 May 2013 @ 10:05am UTC #

Just a couple of things:

Beware of converting future times to UTC prematurely; there are comprehensive data about timezones for the past and present, but by no means for the future. Israel, notably, changes its DST rules every year, trying to juggle religious and secular issues. When in UTC is an option that expires at noon on January 12, 2020, actually expiring? We think we know, but Congress might change the U.S. DST rules again.

That's primarily why there is at least one time zone in the Olson database per country, even countries that have always kept the same time at least since the Epoch. Each one is a separate time zone jurisdiction.

You don't explain why there's an island within an island, and I thought you might want to know at least part of it. The state of Arizona, for reasons best known to itself, doesn't observe DST, and that's why the America/Phoenix timezone exists. However, the Navajo Nation, which has jurisdiction in parts of three states, uses the same time throughout its territory, observing DST, as New Mexico and Utah do. That's why America/Denver extends into Arizona. The Hopi Reservation is embedded in the Navajo Nation and is exclusively in Arizona, so it keeps Arizona time, and that's your inclusion. I am unable to account for the exclusion, and I wonder if it's an error.

In principle it is not the case that polygons can't share an edge. Suppose your exclusion above had actually belonged to a third timezone altogether, and had extended slightly further east or south?

Posted by John Cowan on 31 May 2013 @ 12:42pm UTC #

"America/Chicago" is "continent/city pair" not "country/city pair" isn't it?

Posted by Derek Read on 08 Jun 2013 @ 12:46am UTC #

Matěj: Who knows why they muck with timezones. Beats the heck out of me. And blockquote just isn't one of the elements I allow. I suppose I could.

John: Indeed, looking into the future is fraught with peril. Nevertheless, the module supports looking forward with the current rules. I haven't checked what happens with Israel.

I actually don't think it would matter if they shared an edge. Though the question of which timezone the edge is in would have to be answered somehow.

Derek: Yes, I guess "America/Chicago" is really a continent/city pair. But lots of timezone names are country/city.

Posted by Norman Walsh on 08 Jun 2013 @ 01:50pm UTC #

How does “The only rational way to deal with these issues is to convert dates to UTC for internal use and convert back again when displaying the results” help with “A repeating appointment at 11:00a on Thursdays in Boston occurs at different times (in other timezones) depending on whether it's winter in Boston (“standard” time) or summer (“daylight savings” time).”?

I'd have thought the rational way to deal with these issues is to store all dates in their “natural” timezones and convert them, perhaps via UTC, when displaying the results.

Posted by Ed Davies on 08 Jun 2013 @ 02:01pm UTC #

I did a little more investigation, and your inclusion is just one of several Navajo enclaves within the Hopi Reservation. All this is the result of decades of problems between traditionalists both Navajo and Hopi (who want cooperation and peaceful sharing), the rez governments (who want control of surface and mineral rights), and the Feds (who want the problem to go away by any means necessary). See the Cooch Behar map for a nightmare of enclaves and exclaves on the India-Bangladesh border: these have serious consequences, as the people living on them basically get no services from the nation of which they are nominally citizens, and since they can't buy or sell anything without incurring customs duties, are the poorest of the poor.

Posted by John Cowan on 08 Jun 2013 @ 10:54pm UTC #

Fair cop, Ed. I don't think it really helps much with that case. For the case of repeating appointments tied to a locale, you basically have to work out the time at that locale for each appointment.

I think what I actually do for repeating appointments is assume that the time is fixed for the locale and do the computations the other way around: what time is 11:00a US/Eastern in GMT on 20 June vs what time is it in GMT on 12 December? Which I think boils down to what you suggested.

Timezones. What a joy.

Posted by Norman Walsh on 20 Jun 2013 @ 01:59am UTC #
Comments on this essay are closed. Thank you, spammers.