I was trying to gather data from an NBA stats website, and I'm fairly certain my IP got blacklisted for making too many requests. It hadn't occurred to me that this is "rude", so I did some research and learned about "polite" crawling with crawl delays. I looked at their robots.txt file and found the following:

    User-agent: *
    Disallow: /blazers/
    Disallow: /dump/
    Disallow: /fc/
    Disallow: /my/
    Disallow: /7103
    Disallow: /play-index/plus/.cgi?
    Crawl-delay: 3

They specify a crawl-delay of 3 seconds. Doesn't that mean that if I space my requests out by 3 seconds, they're okay with it? To be safe, I put a 4-second delay between each request. Furthermore, I'm not requesting any of the routes they've disallowed. Yet I got blacklisted again halfway through the run. Does anyone know why?
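For context, here's a minimal sketch of the sort of delay loop I'm describing, using Python's standard urllib.robotparser. The base URL, user-agent string, and paths are placeholders, not the actual site:

    import time
    import urllib.request
    import urllib.robotparser

    BASE = "https://www.example-stats-site.com"  # placeholder, not the real site
    USER_AGENT = "my-stats-scraper"              # placeholder identifier
    PATHS = ["/players/", "/teams/"]             # placeholder routes

    # Fetch and parse robots.txt once up front.
    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()

    # Use the advertised Crawl-delay plus one second of padding;
    # fall back to 3 seconds if no Crawl-delay is listed.
    delay = (rp.crawl_delay("*") or 3) + 1

    for path in PATHS:
        url = BASE + path
        if not rp.can_fetch("*", url):
            continue  # skip anything robots.txt disallows
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
        # ... parse html here ...
        time.sleep(delay)  # space requests out by the crawl-delay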
Submitted September 05, 2017 at 02:01AM by bskilly