[Dirvish] [administrivia] taming web spiders, particularly Baidu?

James Stanley james at incoherency.co.uk
Mon Jul 23 19:04:02 UTC 2012

Hi Keith,

I've not noticed any problems with the dirvish site (though I don't
visit it particularly often), but you may have better luck using
robots.txt to keep all robots away from the large videos while still
letting them crawl the text content. Something like:

  User-agent: *
  Disallow: /directory-full-of-videos
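
If you want to sanity-check a rule like that before deploying it, here is a
rough sketch using Python's standard urllib.robotparser. The host
example.org and the file names are made up for illustration, and
/directory-full-of-videos is just the placeholder from above; substitute
your real paths. It only shows how a well-behaved parser reads the rule,
not what any particular spider will actually do.

```python
from urllib.robotparser import RobotFileParser

# The rule set from above; the directory name is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory-full-of-videos
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: one under the video directory, one ordinary page.
video_url = "http://example.org/directory-full-of-videos/lecture.mp4"
page_url = "http://example.org/index.html"

video_allowed = parser.can_fetch("Baiduspider", video_url)
page_allowed = parser.can_fetch("Baiduspider", page_url)

print("video crawlable:", video_allowed)
print("page crawlable:", page_allowed)
```

Note that Disallow matches by path prefix, so the rule also covers
everything beneath that directory.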

Good luck!

James Stanley

On Mon, 23 Jul 2012 11:31:41 -0700
Keith Lofstrom <keithl at kl-ic.com> wrote:

> This is not about dirvish, but about the website.  Perhaps some
> of you sysadmins can help.
> You may occasionally see the dirvish.org website stop responding
> to web requests.
> dirvish.org is running on my virtual machine at rimuhosting in
> Dallas, along with half a dozen other low-usage sites.  Some of
> the contents on other sites are lectures and videos, about 5GB
> of total content.
> Baidu, the Chinese search engine, spiders the net every 15 minutes,
> looking for changes.  Which means it attempts to download 20GB
> an hour from my server.  Sometimes it does not complete the requests
> in time, and they accumulate.  During the last slowdown, netstat
> reported 140 open ports to baiduspider, including many big files. 
> Apache stopped taking most new requests, and browsers timed out.
> As a temporary measure, I've disallowed baiduspider in robots.txt 
> for all my sites.  I will move the videos and large files to some
> of the free file hosting services over time.  But I want to keep
> serving China's 20% of the world's population with reasonably
> up-to-date search results.  So, the question:
> Is there any way to tell the search spiders to visit once a day
> or once a week, rather than four times per hour?  Or send them
> "recent changes" lists instead of them repeatedly downloading the
> same files?  Any other ideas for calming down the web crawlers?
> Keith
