[Dirvish] [administrivia] taming web spiders, particularly Baidu?

Loren M. Lang lorenl at alzatex.com
Mon Jul 23 22:07:29 UTC 2012

On 7/23/2012 11:31 AM, Keith Lofstrom wrote:
> This is not about dirvish, but about the website.  Perhaps some
> of you sysadmins can help.
> You may occasionally see the dirvish.org website stop responding
> to web requests.
> dirvish.org is running on my virtual machine at rimuhosting in
> Dallas, along with half a dozen other low-usage sites.  Some of
> the contents on other sites are lectures and videos, about 5GB
> of total content.
> Baidu, the Chinese search engine, spiders the net every 15 minutes,
> looking for changes.  Which means it attempts to download 20GB
> an hour from my server.  Sometimes it does not complete the requests
> in time, and they accumulate.  During the last slowdown, netstat
> reported 140 open ports to baiduspider, including many big files. 
> Apache stopped taking most new requests, and browsers timed out.

There are a variety of HTTP headers that tell User Agents such as
spiders how to cache responses.  The important ones are ETag,
Last-Modified, Expires, and Cache-Control.  When I visit dirvish.org,
I see both the ETag and Last-Modified headers, which should allow
reasonable caching of that page; when I visit any page under
wiki.dirvish.org, they are missing, which is typical of dynamically
generated pages.

There are actually two distinct mechanisms in play.  First, a response
can be declared fresh for some period of time (via Expires or
Cache-Control).  While a page is fresh, a cached copy can be used with
no network traffic at all.  Second, once a cached page may be stale,
a User Agent that saved enough information from the initial response
(the ETag or Last-Modified value) can issue a conditional GET asking
whether anything changed.  If the old page is still valid, the web
server simply responds with a short "304 Not Modified", which can save
a lot of bandwidth on larger files.

The MoinMoin wiki recommends setting up mod_expires to help, but does
not seem to support ETag yet.  mod_expires should help regardless.
Look at the bottom of this page:
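For illustration, a minimal mod_expires stanza might look like the
following.  The paths, types, and intervals here are my own guesses,
not what the MoinMoin page prescribes; tune them to how often the wiki
actually changes.

```apache
# Sketch of mod_expires configuration -- intervals are illustrative.
<IfModule mod_expires.c>
    ExpiresActive On
    # Wiki pages change often: let caches revalidate after an hour.
    ExpiresDefault "access plus 1 hour"
    # Static assets rarely change: let caches hold them for a month.
    ExpiresByType image/png "access plus 1 month"
    ExpiresByType text/css  "access plus 1 month"
</IfModule>
```

With an Expires or Cache-Control header present, a well-behaved spider
has no reason to re-fetch a page before the freshness window runs out.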


If you have other dynamic pages, or even static pages, there should be
add-ons or Apache tweaks to improve their freshness and caching
behavior.  I'm using the Live HTTP Headers add-on for Firefox to see
what headers are provided as I navigate the Dirvish website.
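To make the conditional-GET mechanism concrete, here is a small Python
sketch of what a well-behaved client (or spider) does with the saved
validators.  The function names and the URL are mine, not from any
particular crawler:

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET from the
    validators (ETag and/or Last-Modified) saved off a cached response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def revalidate(url, cached_body, etag=None, last_modified=None):
    """Return the current body for url, reusing cached_body when the
    server answers 304 Not Modified (no body is transferred)."""
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()      # 200: page changed, use the new body
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return cached_body      # unchanged: the cache is still good
        raise
```

A spider that did this would pay only a few hundred bytes per unchanged
file instead of re-downloading gigabytes of video every pass.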

> As a temporary measure, I've disallowed baiduspider in robots.txt 
> for all my sites.  I will move the videos and large files to some
> of the free file hosting services over time.  But I want to keep
> serving China's 20% of the world's population with reasonably
> up-to-date search results.  So, the question:
> Is there any way to tell the search spiders to visit once a day
> or once a week, rather than four times per hour?  Or send them
> "recent changes" lists instead of them repeatedly downloading the
> same files?  Any other ideas for calming down the web crawlers?
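For the record, some crawlers honor a Crawl-delay line in robots.txt
(support varies by spider, and Googlebot ignores it), and publishing a
sitemap with last-modification dates gives spiders the "recent changes"
list you describe, so they need not re-download unchanged files.  A
sketch, where the delay value and sitemap URL are only illustrative:

```text
# robots.txt sketch -- 86400 seconds asks for at most one fetch a day;
# honoring Crawl-delay is up to each spider.
User-agent: Baiduspider
Crawl-delay: 86400

Sitemap: http://dirvish.org/sitemap.xml
```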
> Keith

Loren M. Lang
lorenl at alzatex.com

Public Key: ftp://ftp.tallye.com/pub/lorenl_pubkey.asc
Fingerprint: 10A0 7AE2 DAF5 4780 888A  3FA4 DCEE BB39 7654 DE5B
