How to ignore robots.txt at a spider level in Scrapy

Jul 30, 2020 · by Tim Kamanin

Scrapy has the ROBOTSTXT_OBEY setting that defines whether your spiders should respect robots.txt policies. The problem is that this setting is global: it applies to every spider in the project. But what if you want to override it for only some of them?
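
For context, the project-wide value normally lives in your project's settings.py. A minimal sketch (projects generated with `scrapy startproject` set this to True by default):

```python
# settings.py -- project-wide settings, applied to every spider
# unless a spider overrides them.
ROBOTSTXT_OBEY = True
```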

It turns out it's easy, and the following technique can be used to override any Scrapy setting (not only ROBOTSTXT_OBEY) at a spider level.

All you need to do is add a custom_settings dictionary with the values you want to override to your spider class. In our case it looks like this:

```python
import scrapy

class MyPoliteSpider(scrapy.Spider):
    name = 'my_polite_spider'

    # Override the project-wide setting for this spider only
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }
```

I've set MyPoliteSpider not to respect robots.txt policies, which is not very polite...
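
The same custom_settings mechanism works for any other setting too. As a sketch, here's a hypothetical spider (the name, URL, and parse logic are placeholders) that throttles itself using the DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN settings:

```python
import scrapy

class MyThrottledSpider(scrapy.Spider):
    # Hypothetical spider: name and start_urls are placeholders
    name = 'my_throttled_spider'
    start_urls = ['https://example.com']

    # custom_settings can override any Scrapy setting per spider,
    # not just ROBOTSTXT_OBEY. Here we slow this spider down
    # without affecting the rest of the project.
    custom_settings = {
        'DOWNLOAD_DELAY': 2.0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
```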
