Mercurial heartburn

Volume 10, Issue 81; 09 Aug 2007; last modified 08 Oct 2010

A tale of woe with bad parts, good parts, and a moral.

I've been using Mercurial for a few weeks now. I had no trouble switching; it's similar enough to the other version control systems I've used. And even though I'm usually connected to the net, I'm loving fast, local commits. So much so that I have six or eight repositories up and running now: my home directory, one for each of the web sites I maintain, one for papers I've written, another for presentations, basically most of (eventually, all of) the stuff that isn't in some other version control system.

The bad part

Then a few days ago, I happened to notice something odd (alas, I don't recall what). Verifying local and remote repositories revealed that my home directory repository was totally borked on the server (it showed 1 revision where locally I had 23) and the repository for this weblog showed a bunch of errors, both locally and remotely.

The very worst part is that there had been no error messages to indicate that anything had gone wrong. In fact, running an hg push in my home directory appeared to complete successfully.

It's hard to imagine a worse sort of error than that in a version control system.

Much digging (with tremendous assistance from Matt Mackall of Selenic) revealed the problem.

In a nutshell, my hosting company, where the server I'm pushing to is located, terminates with extreme prejudice any process that uses too much memory or too much CPU. My attempt to push a commit containing a 50MB image caused the Mercurial process on the server to chew up hundreds of megabytes of memory computing diffs, and when it crossed some threshold, it was SIGKILLed.

I think the client should have noticed this rather than appearing to complete successfully, but I'm not sure how realistic that is. I'm willing to stipulate that kill -9ing one half of the system isn't something you'd expect to happen.

In practice, I hadn't realized I'd attempted to put a 50MB image into the system, and I have no practical reason to do so.

Matt maintains that a similar circumstance caused the errors in my other repository. I'm not really convinced, but since there doesn't appear to be any way to tell, I'm not pushing the point.

So now I've got two repositories with errors deep in their version history. (The second of 23 transactions in one case, the 1366th of 1420 in the other.)

The good part

Luckily, Mercurial has a feature that allowed me to deal with these issues: “queues” (the mq extension). I don't really grok them yet, but Matt helped me use them to fix the problems.

In each case, I turned all of the commits that had occurred after the point where the error occurred into a queue of patches. Then I rolled back to the error version, fixed the errors, refreshed the patch for that version, reapplied all the patches, and finally deleted the queue.
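For the record, the recipe corresponds roughly to this sequence of mq commands. This is a sketch, not my exact session: the revision number is illustrative, and the command names are the current mq spellings.

```shell
# Enable the mq extension in ~/.hgrc first:
#   [extensions]
#   mq =

hg qinit                 # create a patch queue for this repository
hg qimport -r 2:tip      # turn the broken commit and everything after it into patches
hg qpop -a               # unapply them all
hg qpush                 # reapply just the first (broken) patch
# ... fix the errors in the working directory ...
hg qrefresh              # fold the fix into that patch
hg qpush -a              # reapply the remaining patches
hg qfinish -a            # turn the applied patches back into real changesets
```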

With fixed local repositories in hand, I was able to migrate them back to the server and all seems to be well.

The moral

You might think the moral is “don't use Mercurial,” but I'm not willing to go that far. A lot of big projects are (or soon will be) using it, and I'm confident the bugs will get worked out. The queues functionality provides a way to work around problems when they occur. No, the moral is: an unverified backup is a worthless backup. Backing up your data (and really, that's what I'm using Mercurial for here) without checking that the backup is both readable and correct means accepting a risk proportional to the importance of your data. And you're backing it up because it's important, right?

Your gun, your bullet, your foot.

No more blindly pushing changes and assuming they worked, at least not for me.

#!/usr/bin/perl -- # -*- Perl -*-
# rdf:
# dc:title HGPush
# dc:date 2007-08-09T10:35:04-04:00
# cvs:date $Date$
# dc:creator
# dc:rights Copyright © 2007 Norman Walsh.
# cc:license
# dc:description Pushes Mercurial commits, verifying both repositories
# Bugs:
#  - only supports the default server path
#  - only supports server paths of the form
#    ssh://user@host/path/to/repository
#  - should check the return code of the local verify and abort
#    if it fails

open (HG, ".hg/hgrc") || die "No .hg directory?\n";

my $host = undef;
my $section = "";
my $key = "";
while (<HG>) {
    next if /^\#/ || /^\s*$/;
    $section = $1 if /^\[(.*?)\]/;
    $key = $1 if /^([^\[\s=]+)/;
    if ($section eq 'paths' && $key eq 'default') {
        chomp;
        ($host = $_) =~ s/^.+\s//;
        last;
    }
}
close(HG);

die "No default path?\n" if !defined($host);

die "Only prepared for ssh paths; found \"$host\"\n"
    unless $host =~ /^ssh:\/\/(.*?)\/(.*)$/;

my $rsh = $1;
my $path = $2;

print "Verifying local repository\n";
system("hg verify");
print "Pushing local changes to $host\n";
system("hg push");
print "Verifying remote repository\n";
system("rsh $rsh \"(cd $path; hg verify)\"");
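The third bug in the list above — not aborting when the local verify fails — comes down to checking each command's exit status before moving on. A minimal sketch in shell; run_or_die is a hypothetical helper, not part of the script above, and the hg lines are commented out so the sketch stands alone.

```shell
#!/bin/sh
# Run a command and abort the whole script if it exits nonzero,
# so a failed "hg verify" stops us before we push anything.
run_or_die() {
    "$@" || {
        status=$?
        echo "command failed (exit $status): $*" >&2
        exit "$status"
    }
}

run_or_die echo "Verifying local repository"
# run_or_die hg verify    # uncomment in a real repository
# run_or_die hg push
```

The same check belongs on the remote verify as well; a corrupt remote repository should be just as fatal as a corrupt local one.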

Share and enjoy.


Comments

# weekly Friday ZFS snapshot of ON-bldscripts gate
1 7 * * 4       /sbin/zfs snapshot local/Public/ON10-bldscripts@`/usr/bin/date +%Y-%m-%d`
—Posted by Vladimir Kotal on 09 Aug 2007 @ 08:09 UTC #

Mmmmm. ZFS. Sweeeet :-)

The point wasn't just that I wanted to back up the repository; it was that the integrity of the repository, which is my backup of the real data, needs to be tested every time it's updated.

—Posted by Norman Walsh on 09 Aug 2007 @ 08:52 UTC #

Are you saying mercurial is not transactional? That's a little bit scary.

—Posted by Matthias on 10 Aug 2007 @ 07:32 UTC #

It's transactional; it has atomic commits and all the goodness you expect. I'm saying that if you kill -9 the server while it's in the middle of a push, things can go horribly awry in a way that isn't detected.

I think it should be detected, but...

—Posted by Norman Walsh on 10 Aug 2007 @ 07:52 UTC #

I thought it was a distributed system. Calling it "server" seems to be a stretch.

I'd say kill'ing -9 the pulled side should give an error in the pushing side, or the system can't be called transactional.

I still find it really scary that a revision control system ends without a "proper" error when it breaks and after it is broken.

—Posted by SantiagoGala on 22 Aug 2007 @ 01:10 UTC #

Very helpful! I've just signed up for a VPS (Virtual Private Server, e.g. a Xen DomU) with limited memory, and this is something I worried about. Thanks :]


—Posted by Tyler Oderkirk on 06 Sep 2007 @ 09:34 UTC #

Note that system(...) in Perl returns a nonzero value when the command fails. Perhaps you wanted:

system('hg verify') and die "$!";

—Posted by Jesse Glick on 30 Sep 2008 @ 12:41 UTC #