[Dirvish] [administrivia] taming web spiders, particularly Baidu?

f-dirvish at media.mit.edu f-dirvish at media.mit.edu
Mon Jul 23 21:30:58 UTC 2012

The spider -should- be using HEAD requests to see whether the
timestamps have changed, not downloading everything repeatedly!
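
For illustration only, here is a rough sketch of the polite behavior
described above: ask for the headers with HEAD, compare Last-Modified
against the time of the last download, and fetch the body only if the
page is newer. The function names (head_last_modified, needs_refetch)
are made up for this example, and it assumes the server sends a
Last-Modified header at all. It says nothing about how Baidu's crawler
actually works.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def head_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified header, if any.

    Hypothetical helper: only the headers are transferred, not the body.
    """
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Last-Modified")


def needs_refetch(last_modified: Optional[str], fetched_at: datetime) -> bool:
    """Decide whether a page must be downloaded again.

    fetched_at must be a timezone-aware datetime recording when our copy
    was last downloaded. If the header is missing or unparseable we
    cannot tell, so we report that a refetch is needed.
    """
    if last_modified is None:
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True
    if modified.tzinfo is None:
        # Treat a timestamp without a zone as UTC (an assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > fetched_at
```

A crawler would call head_last_modified(url) first and download the
page only when needs_refetch(...) returns True. A conditional GET with
If-Modified-Since gets the same result in a single request.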

I found that Baidu was so badly behaved on a site I run that I just
barred it completely, by telling Apache not to serve anything with
that user-agent string.  Good riddance.  (The site is for a makerspace
in the US, which requires physical presence to use, so I frankly don't
care if a few Chinese searchers don't know we exist or have to use
someone else's search engine to figure it out.  If they care, they
should fix their damned search engines not to be so obnoxious.)

I don't know whether Baidu actually pays any attention to robots.txt.
(I can't remember whether that was part of the reason I barred it,
besides its ridiculously high load.)  But if it does, you can disallow
just that one crawler from scanning your large content.
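
As a rough sketch of what such a rule could look like, the snippet
below holds a small robots.txt in a string and checks it with Python's
standard urllib.robotparser. "Baiduspider" is the user-agent token
Baidu's crawler identifies itself with; the /large-content/ path and
the "SomeOtherBot" name are placeholders invented for this example.
Whether Baidu actually honors the rule is exactly the open question
above, so this only shows what a well-behaved crawler would conclude
from the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Baiduspider out of one large directory
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /large-content/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if a compliant crawler named agent may fetch path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Baiduspider", "SomeOtherBot"):
        for path in ("/large-content/file.tar", "/index.html"):
            print(agent, path, allowed(agent, path))
```

The empty "Disallow:" line under "User-agent: *" means "nothing is
disallowed", so other search engines still see the whole site.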

