Up again

Volume 8, Issue 96; 22 Jun 2005; last modified 08 Oct 2010

The fact that you can read this demonstrates that…you can read this.

What happened was, yesterday, someone sent a robot (not a very bright one) to scrape this site. The robot managed to bring two unrelated bugs and a bad idea together in the worst possible way.

Bug the first: by following all the links on each page, it managed to expose a recently introduced bug in site navigation. This bug caused a server error and a 404.

Bug the second: the robot tripped over a URI rewriting bug and went into an infinite loop.

Bad idea: two years ago or thereabouts, when I moved all my travel pictures off of nwalsh.com, I set up the 404 error document as a Perl script. This script tried to offer helpful suggestions for a variety of URIs that I decided not to handle with redirects.

Server admin types can probably see it coming. Recipe for bringing down your server: set it up so that handling a 404 requires starting up a Perl process and then generate an infinite number of 404's as fast as you can. When the server load broke 40, my ISP pulled the plug.
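For the curious, the setup looked roughly like the sketch below. It assumes Apache's ErrorDocument directive pointing at a CGI script (say, ErrorDocument 404 /cgi-bin/notfound.pl, where the path and the script body are purely illustrative, not the real ones on this server). The point is simply that every 404 forks a fresh Perl process.

    #!/usr/bin/perl
    # Illustrative 404 handler invoked via Apache's ErrorDocument
    # directive. Apache passes the originally requested URI to the
    # script in the REDIRECT_URL environment variable.
    use strict;
    use warnings;

    my $uri = $ENV{'REDIRECT_URL'} || 'unknown';

    print "Status: 404 Not Found\r\n";
    print "Content-Type: text/html\r\n\r\n";

    print "<html><head><title>Not found</title></head><body>\n";
    print "<p>Sorry, <code>$uri</code> isn't here any more.</p>\n";

    # A hypothetical suggestion for one family of old URIs; the real
    # script tried to be helpful about several of them.
    if ($uri =~ m{^/photos/}) {
        print "<p>The travel pictures have moved elsewhere.</p>\n";
    }

    print "</body></html>\n";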

Unfortunately, I didn't get an immediate notification of this event. I'm still investigating why, but I suspect it has something to do with the fact that I'm in the middle of moving something else around. It wasn't until John Cowan, and later Luis Miguel Morillas and Danny Ayers, pointed it out that I even knew something was wrong. And I didn't know I was the one who needed to take corrective action until this morning.

But I think it's all better now.

One final thought: if you're going to write a robot to scrape an entire website, could you please try to be a little bit polite about it? Throttle the blasted thing so that it gets 100 pages and then waits five minutes or something before getting the next 100. There are almost 27,000 files on this server. Trying to slurp them all down as fast as you can just seems…insensitive.
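If it helps, here's a minimal sketch of what I mean, in Perl with LWP::UserAgent. The URL list, the batch size of 100, and the five-minute pause are just the numbers from the paragraph above; adjust to taste.

    #!/usr/bin/perl
    # A polite scraper: fetch pages in batches and pause between
    # batches instead of hammering the server flat out.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my @urls  = @ARGV;          # URIs to fetch, however you collected them
    my $batch = 100;            # pages per burst
    my $pause = 5 * 60;         # seconds to wait between bursts

    my $ua = LWP::UserAgent->new(agent => 'polite-scraper/0.1');

    my $count = 0;
    for my $url (@urls) {
        my $response = $ua->get($url);
        warn "Failed to get $url\n" unless $response->is_success;

        if (++$count % $batch == 0) {
            sleep $pause;       # give the server a breather
        }
    }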