<?xml version="1.0" encoding="UTF-8"?>
<essay xml:lang="en" version="lillet" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:gal="http://norman.walsh.name/rdf/gallery#">
<info>
<title>Up again</title><biblioid class="uri">http://norman.walsh.name/2005/06/22/upagain</biblioid>
<volumenum>8</volumenum>
<issuenum>96</issuenum>
<pubdate>2005-06-22T06:14:56-04:00</pubdate>
<date>$Date: 2005-09-11 10:27:02 -0400 (Sun, 11 Sep 2005) $</date>
<author>
      <personname>
<firstname>Norman</firstname>
	<surname>Walsh</surname>
</personname>
    </author>
<copyright>
      <year>2005</year>
      <holder>Norman Walsh</holder>
    </copyright>
<abstract>
<para>The fact that you can read this demonstrates that…you can read this.</para>
</abstract>
<dc:subject rdf:resource="http://norman.walsh.name/knows/taxonomy#SelfReference"/>
</info>

<para xml:id="p1">What happened was, yesterday, someone sent a robot
(not a very bright one) to scrape this site. The robot managed to
bring two unrelated bugs and a bad idea together in the worst possible
way.</para>

<para xml:id="p2">Bug the first: by following <emphasis>all</emphasis>
the links on each page, it managed to expose a recently introduced bug
in site navigation. This bug caused a server error and a 404.</para>

<para xml:id="p3">Bug the second: the robot tripped over a URI rewriting bug and
went into an infinite loop.</para>
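
<para>A robot can protect itself from this sort of trap. What follows is
only a minimal sketch in Python, assuming the loop shows up as URIs that
either repeat or keep growing deeper. The name
<function>should_visit</function> and the limit of 12 path segments are
made up for illustration; nothing here describes how the robot in question
actually worked.</para>

<programlisting language="python"><![CDATA[from urllib.parse import urldefrag, urlsplit


def should_visit(url, seen, max_segments=12):
    """Return True only for a URL that is new and not suspiciously deep.

    Hypothetical helper: `seen` is a set the caller keeps for the whole
    crawl, and `max_segments` is an arbitrary cut-off, not a standard.
    """
    # Drop the fragment so /page#a and /page#b count as the same document.
    clean, _fragment = urldefrag(url)
    if clean in seen:
        return False
    # A rewriting bug that keeps appending to the path produces ever
    # deeper URLs, so refuse anything beyond the chosen depth.
    segments = [part for part in urlsplit(clean).path.split("/") if part]
    if len(segments) > max_segments:
        return False
    seen.add(clean)
    return True
]]></programlisting>

<para>A crawl loop would call this before every request and simply skip any
URI for which it returns false.</para>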

<para xml:id="p4">Bad idea: two years ago or thereabouts, when I moved
all my travel pictures off of
<systemitem class="systemname">nwalsh.com</systemitem>, I set up the 404 error
document as a Perl script. This script tried to offer helpful suggestions
for a variety of URIs that I decided not to handle with redirects.
</para>

<para xml:id="p5">Server admin types can probably see it coming. Recipe for
bringing down your server: set it up so that handling a 404 requires
starting up a Perl process and then generate an infinite number of
404s as fast as you can. When the server load broke 40, my ISP pulled
the plug.</para>

<para xml:id="p6">Unfortunately, I didn't get an immediate notification of this
event. I'm still investigating why, but I suspect that it has something
to do with the fact that I'm in the middle of moving something else
around. It wasn't until
<personname>
      <firstname>John</firstname>
      <surname>Cowan</surname>
</personname> and later
<personname>
      <firstname>Luis Miguel</firstname>
      <surname>Morillas</surname>
</personname> and
<personname>
      <firstname>Danny</firstname>
      <surname>Ayers</surname>
</personname>
<link xlink:href="http://dannyayers.com/archives/2005/06/21/qotd-5/">pointed
it out</link> that I even knew something was wrong. And I didn't know
I was the one who needed to take corrective action until this morning.</para>

<para xml:id="p7">But I think it's all better now.</para>

<para xml:id="p8">One final thought: if you're going to write a robot to scrape
an entire website, could you please try to be a little bit polite about it?
Throttle the blasted thing so that it gets 100 pages and then waits five
minutes or something before getting the next 100. There are
almost 27,000 files on this server. Trying to slurp them all down as fast
as you can just seems…insensitive.</para>
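
<para>The throttling itself takes only a few lines. Here is a rough sketch
in Python, using the 100 pages and five minutes from the paragraph above.
The function name <function>fetch_politely</function> and its
<parameter>fetch</parameter> and <parameter>sleep</parameter> parameters
are hypothetical; a real robot would also want to honor
<filename>robots.txt</filename> and handle errors.</para>

<programlisting language="python"><![CDATA[import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    batch_size: int = 100,
    pause_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in order, resting between batches.

    `fetch` is whatever function retrieves one page (hypothetical here);
    `sleep` can be replaced so the pause is testable without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        # Before starting batch 2, 3, ... give the server a rest.
        # Nothing is slept before the first page or after the last one.
        if index > 0 and index % batch_size == 0:
            sleep(pause_seconds)
        results.append(fetch(url))
    return results
]]></programlisting>

<para>At that pace, 27,000 files take roughly a day to collect, which is
still plenty fast for a scrape that nobody asked for.</para>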

</essay>

