Scrapy: Better handling (and docs) of multiple spiders

Created on 28 Mar 2018 · 4 comments · Source: scrapy/scrapy

Python 3.6, Scrapy 1.5, Twisted 17.9.0

I'm running multiple spiders in the same process per:
https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

My code is basically:
```
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(scrapy_settings)

# Schedule the same spider class num_spiders times; all of them run in this one process.
for _ in range(num_spiders):
    process.crawl(MySpiderClass)

process.start()
```

When I run this against a test LOCALHOST site I have, on my Windows 7 box, it runs fine with 30 spiders: negligible CPU usage, low RAM usage, very low network I/O, and about 25 requests per second in total.

However, when I scrape real sites (still from my PC) using these settings, but with just 15 spiders, I get a huge list of DNS ERROR and TIMEOUT ERROR results returned near-immediately for sites that work perfectly fine from a browser (and quite quickly too).
With 15 spiders against localhost, Scrapy is using about:

  • 5-10 KB/s up
  • 300-600 KB/s down (bursting to 800)
  • 40 TCP connections opened (and closed) per second
  • 543 TCP packets per second
  • 14 actual URLs crawled per second, along with all related background processing

My internet connection is good for about:

  • 1,000 KB/s up
  • 4,400 KB/s down

So I figure, hey, maybe it's a latency thing with my home connection, and I spin up a VPS (CentOS 7.4) somewhere across the continent in a datacentre that likely has short, fat pipes... but I get exactly the same thing!
Even just 7 spiders returns quite a lot of those errors for sites that do work fine; it just takes longer and there are fewer of them.

Now, it's true that I changed the DNS and download timeout values to something low, because DNS has a response time measured in milliseconds (and I tested mine extensively yesterday to confirm it), and similarly for download timeouts. But even if I change these back to something higher, I still get a wall of errors; it just takes longer.

For example, in the first 10 seconds of running 15 spiders against live hosts (with a 2s DNS_TIMEOUT), there are DNS errors for over 500 URLs covering around 100 domains in my logs. Yet at a system level, I can see that only 35 DNS requests were actually made, of which 4 failed, 8 didn't receive a reply, and the others mostly received responses in a 20-800 ms range. Only 3 were slower than my 2s value.

I'm not CPU-bound: the Scrapy Python process is using far less than a single core, both on my box and on the VPS (which only had one core anyway).

At a guess, this is probably a Twisted thing (though I know nothing of Twisted).

So this raises a few issues:

  • Make multiple spiders better at sharing Twisted's resources.

  • Better document Scrapy's handling of multiple spiders. Suggestions include:

    • Currently it's not even clear from the docs whether multiple spiders run concurrently or consecutively (I've determined through testing that they run concurrently, but the docs should be explicit on this point).

    • Clarification on which Scrapy settings are a "shared pool" across spiders and which are not. I.e. is the CONCURRENT_REQUESTS pool shared across all spiders, or per spider? Same for CONCURRENT_REQUESTS_PER_IP, AUTOTHROTTLE, etc. (See the sketch below for one way to make the question concrete.)

    • Any notes on Scrapy's/Twisted's limitations when it comes to multiple spiders. On the one hand I can't be the first person to try this, but on the other hand I can't find anything about it. This person seems to be able to run over 30 spiders but is using a different mechanism: https://kirankoduru.github.io/python/multiple-scrapy-spiders.html

    • DNS_TIMEOUT does not mean what I'd expect it to mean. It's not the amount of time a DNS request has to get a response from when it leaves the host (because most of my requests were never sent); it includes Twisted/Scrapy queuing time too.

(Note: my reason for splitting spiders is simply to speed up the crawl. I have no interest in distributed crawling - my target URL list (far fewer than a million URLs) is small but split across tens of thousands of hosts. I've successfully crawled it with one spider over about a week in the past. I'm not interested in scrapyd.)
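To make the settings-scope question concrete, here is a minimal sketch (hypothetical spider and values, not something the docs currently describe) of the two places a limit like CONCURRENT_REQUESTS can be set when running several crawlers in one process: the settings passed to CrawlerProcess act as the defaults for every crawler it starts, while a spider class can override values for itself via custom_settings. What the docs should spell out is whether such limits are then enforced per crawler or drawn from one shared pool.

```
import scrapy
from scrapy.crawler import CrawlerProcess

# Hypothetical spider, purely to illustrate where settings can be scoped.
class SiteSpider(scrapy.Spider):
    name = 'site'
    # Overrides the process-wide defaults for this spider class only.
    custom_settings = {'CONCURRENT_REQUESTS': 8}

    def parse(self, response):
        pass

# Settings passed here become the defaults for every crawler this process starts.
process = CrawlerProcess({'CONCURRENT_REQUESTS': 64})
for _ in range(4):
    process.crawl(SiteSpider)
process.start()

# Open question the docs should answer: do the four crawlers above each get
# their own budget of 8 concurrent requests, or do they share a single pool?
```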

docs enhancement


All 4 comments

I haven't checked all your points yet, but a quick note: ISPs' DNS servers often can't handle the load if you're making many DNS requests; they may return errors. It can be a problem in broad crawls even if you're using a single spider. Using OpenDNS may allow you to make more DNS requests; Google DNS (8.8.8.8) may also help, though it is worse if I recall correctly.

Hi, thanks for the suggestion. I'm not using an ISP's DNS (in part because my ISP cheaps out and actually just points to Google's DNS).
The problem, as noted, is that I used a tool to monitor DNS requests (https://nirsoft.net/utils/dns_query_sniffer.html) and the requests simply were not made. To me this indicates that somewhere deep inside Twisted's code there's a queue for DNS requests, and the timer starts as soon as something is put in there. I've just tried taking a look, but there's too much interlinking between the modules and, as I said, I'm not familiar with Twisted. I also tried googling for any mention of such a queue but couldn't find anything.
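To make that theory concrete, here is a minimal stdlib-only analogy (plain Python, not Scrapy's or Twisted's actual code, and all the numbers are made up): a resolver behind a small thread pool where the timeout clock starts when a lookup is queued rather than when it actually runs. Lookups still waiting when the deadline passes are reported as DNS timeouts even though nothing was ever sent on the wire.

```
import time
from concurrent.futures import ThreadPoolExecutor

# All numbers here are hypothetical, purely to illustrate the mechanism.
POOL_SIZE = 3        # stands in for a small reactor thread pool
DNS_TIMEOUT = 2.0    # deadline measured from the moment a lookup is queued
LOOKUP_TIME = 0.8    # pretend every real DNS lookup takes this long (seconds)

def resolve(domain, queued_at):
    waited = time.monotonic() - queued_at
    if waited > DNS_TIMEOUT:
        # Reported as a DNS timeout even though no request ever left the host:
        # the deadline expired while the job was still waiting for a free thread.
        return "%s: timed out after %.1fs in the queue" % (domain, waited)
    time.sleep(LOOKUP_TIME)  # stand-in for the actual DNS round trip
    return "%s: resolved" % domain

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    queued_at = time.monotonic()
    jobs = [pool.submit(resolve, "site-%d.example" % i, queued_at) for i in range(30)]
    for job in jobs:
        print(job.result())
```

With 3 workers chewing through 30 queued lookups at 0.8 s each, only the first nine or so resolve before the 2 s deadline; everything behind them "times out" without the pretend DNS call ever being made, which matches the sniffer showing far fewer requests on the wire than errors in the log.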

I've spent some more time investigating this in various ways and have come up with some observations that ideally should be documented somewhere, to save others time.

With 15 spiders:

  • The DNS timeouts are a result of REACTOR_THREADPOOL_MAXSIZE being too low. I upped it to 25 and now I'm not seeing any DNS errors that don't appear to be legit. So I'd suggest mentioning that in the section of the docs that covers multiple spiders, and noting on the DNS_TIMEOUT setting that it relates directly to REACTOR_THREADPOOL_MAXSIZE and is actually a timeout that starts as soon as the lookup enters that pool.

  • I was still getting tons of regular TIMEOUT errors for sites that respond within a few seconds in a browser. DOWNLOAD_TIMEOUT is 50. There doesn't seem to be any "threadpool"-like setting for this that I can see. About 70% of requests were timing out. Looking into it, I had about 800 connections open at once! That is far higher than my CONCURRENT_REQUESTS setting, so I'm guessing that value is NOT pooled but is per spider. I'm now dividing that value myself before passing it to the crawlers (see the sketch after this list). I would suggest this should be documented too, along with the other things in my post above. I've now set CONCURRENT_REQUESTS to 128, and with the dividing I'm no longer seeing any invalid timeouts.
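For reference, a minimal sketch of how those two adjustments could be wired into the snippet from the top of this issue (scrapy_settings and MySpiderClass are the placeholders from that snippet; 25 and 128 are just the values that happened to work for me):

```
from scrapy.crawler import CrawlerProcess

NUM_SPIDERS = 15
TOTAL_CONCURRENT_REQUESTS = 128   # the overall budget I actually want

# Assuming scrapy_settings is a plain dict of settings, as in the snippet above.
scrapy_settings.update({
    # Enough threads that 15 spiders' DNS lookups aren't stuck queuing for a worker.
    'REACTOR_THREADPOOL_MAXSIZE': 25,
    # CONCURRENT_REQUESTS is applied per crawler, not shared, so divide it up front.
    'CONCURRENT_REQUESTS': max(1, TOTAL_CONCURRENT_REQUESTS // NUM_SPIDERS),
})

process = CrawlerProcess(scrapy_settings)
for _ in range(NUM_SPIDERS):
    process.crawl(MySpiderClass)
process.start()
```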

Using the above, my 15 spiders can fairly easily crawl and process 1,000 pages a minute, or about one per second per spider. This is a great little feature of Scrapy, and I think lots of people would benefit from it if it were clearer in the docs. The fact that some settings are global and some per-spider should definitely be made clear, ideally with an explicit list of which are which in the multiple-spiders section. Hopefully these comments will provide a helpful starting point.

Semi-related note: having tried it, the blog post linked in my issue uses a very old version of Scrapy and doesn't work in the current version (the Crawler API has changed considerably).

I am having the same issue: 5 spiders work perfectly in a CrawlerProcess, but if I add more than that, I start getting the following error:

(failed x times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): : Connection to the other side was lost in a non-clean fashion: Connection lost.

Even though all the spiders work fine when run separately.
This is my settings file:
```
BOT_NAME = 'news'

SPIDER_MODULES = ['news.spiders']
NEWSPIDER_MODULE = 'news.spiders'

LOG_LEVEL = 'DEBUG'
GENERIC_CRAWLER_MAX_REPEAT_COUNT = 3000
DEPTH_LIMIT = 5

REACTOR_THREADPOOL_MAXSIZE = 100
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 72
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 0
RANDOMIZE_DOWNLOAD_DELAY = True
COOKIES_ENABLED = False
DNS_TIMEOUT = 600
DOWNLOAD_TIMEOUT = 180
DNSCACHE_ENABLED = True
FEED_EXPORT_FIELDS = ['source', 'title', 'author',
                      'date', 'description', 'url', 'request_url', 'body', 'images']

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Connection": "keep-alive",
    'Upgrade-Insecure-Requests': '1',
    'accept-encoding': 'deflate, br',
    'cache-control': 'max-age=0',
}

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 100,
}
```
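Based on the comments above, I wonder whether CONCURRENT_REQUESTS = 72 being applied per crawler is the problem: six or more spiders would be trying to hold several hundred connections at once, which could explain the dropped connections. A sketch of one thing I could try (the spider names below are hypothetical placeholders for my actual news spiders):

```
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical names, standing in for the news spiders being run together.
SPIDER_NAMES = ['spider_a', 'spider_b', 'spider_c', 'spider_d', 'spider_e', 'spider_f']

settings = get_project_settings()   # loads the settings.py shown above
# Split the 72-request budget across the crawlers instead of multiplying it.
settings.set('CONCURRENT_REQUESTS', max(1, 72 // len(SPIDER_NAMES)))

process = CrawlerProcess(settings)
for name in SPIDER_NAMES:
    process.crawl(name)   # crawl() accepts a spider name once project settings are loaded
process.start()
```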
