[Dirvish] [administrivia] taming web spiders, particularly Baidu?

Keith Lofstrom keithl at gate.kl-ic.com
Mon Jul 23 18:31:41 UTC 2012


This is not about dirvish, but about the website.  Perhaps some
of you sysadmins can help.

You may occasionally see the dirvish.org website stop responding
to web requests.

dirvish.org is running on my virtual machine at rimuhosting in
Dallas, along with half a dozen other low-usage sites.  Some of
the contents on other sites are lectures and videos, about 5GB
of total content.

Baidu, the Chinese search engine, spiders my sites every 15 minutes,
looking for changes, which means it attempts to download 20GB
an hour from my server.  Sometimes it does not complete the requests
in time, and they accumulate.  During the last slowdown, netstat
reported 140 open connections to baiduspider, many of them for big
files.  Apache stopped taking most new requests, and browsers timed out.
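
As a rough check of the arithmetic above, here is a minimal sketch that
turns "5GB of content, crawled every 15 minutes" into an hourly volume
and a sustained bandwidth.  It assumes every crawl fetches the full 5GB
and that 1GB is 10^9 bytes; the function name is made up for
illustration.

```python
def crawl_load(content_gb, crawl_interval_minutes):
    """Return (GB per hour, sustained Mbit/s) if each crawl fetches everything.

    Assumes 1 GB = 10**9 bytes and that the spider re-downloads the full
    content on every visit, which is the worst case described above.
    """
    crawls_per_hour = 60 / crawl_interval_minutes
    gb_per_hour = content_gb * crawls_per_hour
    # GB/hour -> Mbit/s: 8 bits per byte, 1000 Mbit per Gbit, 3600 s per hour
    mbit_per_s = gb_per_hour * 8 * 1000 / 3600
    return gb_per_hour, mbit_per_s


gb_per_hour, mbit_per_s = crawl_load(content_gb=5, crawl_interval_minutes=15)
print(f"{gb_per_hour:.0f} GB/hour, about {mbit_per_s:.1f} Mbit/s sustained")
```

Under those assumptions the spider would need roughly 44 Mbit/s around
the clock to keep up, which would explain why requests pile up on a
small virtual machine.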

As a temporary measure, I've disallowed baiduspider in robots.txt 
for all my sites.  I will move the videos and large files to some
of the free file hosting services over time.  But I want to keep
serving China's 20% of the world's population with reasonably
up-to-date search results.  So, the question:

Is there any way to tell the search spiders to visit once a day
or once a week, rather than four times per hour?  Or send them
"recent changes" lists instead of them repeatedly downloading the
same files?  Any other ideas for calming down the web crawlers?
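
For reference, the temporary robots.txt measure mentioned above can be
checked with Python's standard urllib.robotparser.  This is only a
sketch: the robots.txt text below is an assumed example of such a rule,
not a copy of the real file, and the URL path is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed example of a robots.txt that shuts out Baiduspider only.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URL, used only to exercise the rules.
url = "http://www.dirvish.org/lectures/video.mp4"
print("Baiduspider allowed:", can_fetch("Baiduspider", url))
print("Googlebot allowed:  ", can_fetch("Googlebot", url))
```

This only confirms what the file asks for.  Whether a given spider
honors it is up to the spider.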

Keith

-- 
Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs


