The URL http://www.tehnosad.ru/subcategory/?id=584&producers[]=524 cannot be fetched with Scrapy.
But HTTPie and Chrome fetch it fine.
How can I fix this?
$ scrapy fetch --headers 'http://www.tehnosad.ru/subcategory/?id=584&producers[]=524'
2017-06-02 13:49:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: remains)
2017-06-02 13:49:44 [scrapy.utils.log] INFO: Overridden settings: {'RETRY_TIMES': 5, 'SPIDER_MODULES': ['remains.spiders'], 'FEED_URI': './%(name)s.csv', 'HTTPCACHE_EXPIRATION_SECS': 600, 'RETRY_HTTP_CODES': [400, 408, 420, 500, 502, 503, 504], 'BOT_NAME': 'remains', 'FEED_FORMAT': 'csv', 'AUTOTHROTTLE_ENABLED': True, 'HTTPCACHE_STORAGE': 'scrapy.contrib.httpcache.FilesystemCacheStorage', 'NEWSPIDER_MODULE': 'remains.spiders', 'HTTPCACHE_DIR': './httpcache', 'DOWNLOADER_CLIENTCONTEXTFACTORY': 'remains.ssl_context.CustomClientContextFactory'}
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled item pipelines:
['remains.pipelines.RemainsPipeline']
2017-06-02 13:49:44 [scrapy.core.engine] INFO: Spider opened
2017-06-02 13:49:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:49:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:49:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:49:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:50:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:51:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>: max redirections reached
2017-06-02 13:51:54 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-02 13:51:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 9196,
'downloader/request_count': 21,
'downloader/request_method_count/GET': 21,
'downloader/response_bytes': 14284,
'downloader/response_count': 21,
'downloader/response_status_count/301': 21,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 2, 6, 51, 54, 206787),
'log_count/DEBUG': 21,
'log_count/INFO': 9,
'memusage/max': 50880512,
'memusage/startup': 49733632,
'scheduler/dequeued': 21,
'scheduler/dequeued/memory': 21,
'scheduler/enqueued': 21,
'scheduler/enqueued/memory': 21,
'start_time': datetime.datetime(2017, 6, 2, 6, 49, 44, 629809)}
2017-06-02 13:51:54 [scrapy.core.engine] INFO: Spider closed (finished)
Hi @tonal , there's an open PR about safe characters in w3lib.
I can see that Firefox and Chrome do NOT percent-encode `[` and `]`. So there's definitely something to fix in w3lib and `safe_url_string()` (which is used in `scrapy.http.Request`).
Thank you for bringing attention to the issue again!
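
To illustrate why the request above loops, here is a minimal sketch using only the standard library's `urllib.parse.quote`. It is not w3lib's actual implementation: the two safe-character sets and the `encode_query` helper are hypothetical and chosen just to show the difference between escaping `[` and `]` and leaving them alone. The first form matches the `%5B%5D` URLs in the redirect log; the second matches what the browsers send.

```python
from urllib.parse import quote

# Query string from the URL in this issue.
QUERY = "id=584&producers[]=524"

# Hypothetical safe-character sets, for illustration only.
# They are NOT the sets w3lib actually uses.
SAFE_WITHOUT_BRACKETS = "=&"
SAFE_WITH_BRACKETS = "=&[]"


def encode_query(query, safe):
    """Percent-encode every character that is neither unreserved nor in `safe`."""
    return quote(query, safe=safe)


# Brackets escaped: the form seen in the 301 redirect loop.
escaped = encode_query(QUERY, SAFE_WITHOUT_BRACKETS)

# Brackets kept: the form Firefox and Chrome send.
browser_like = encode_query(QUERY, SAFE_WITH_BRACKETS)

print(escaped)       # id=584&producers%5B%5D=524
print(browser_like)  # id=584&producers[]=524
```

Presumably the server redirects the escaped form back to itself because it expects the literal brackets, which would explain the repeated 301 responses until the redirect limit is reached.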
The issue is fixed when using the current master branch of https://github.com/scrapy/w3lib.
I'll close this when we release w3lib v1.18.
w3lib v1.18 has been released. Upgrading to it solves the issue.
$ scrapy version -v
Scrapy : 1.4.0
lxml : 3.8.0.0
libxml2 : 2.9.3
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.18.0
Twisted : 17.5.0
Python : 3.6.0+ (default, Feb 24 2017, 17:40:01) - [GCC 6.2.0 20161005]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2g 1 Mar 2016)
Platform : Linux-4.8.0-59-generic-x86_64-with-debian-stretch-sid
$ scrapy fetch --headers 'http://www.tehnosad.ru/subcategory/?id=584&producers[]=524'
2017-08-03 15:43:00 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-08-03 15:43:00 [scrapy.core.engine] INFO: Spider opened
2017-08-03 15:43:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-03 15:43:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-03 15:43:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.tehnosad.ru/subcategory/?id=584&producers[]=524> (referer: None)
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: nginx
< Date: Thu, 03 Aug 2017 13:51:12 GMT
< Content-Type: text/html; charset=WINDOWS-1251
< X-Powered-By: PHP/5.3.22
< Set-Cookie: PHPSESSID=fje13jbdushm7o76qsrk7dpob0; path=/
< Set-Cookie: thnsd-uid=045F3F9E-A541-BD0F-187E-9679649002FD; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: first=yes; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: thnsd-region=1-3-5898; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: gwoDiscountEnable=0; expires=Thu, 10-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=613; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
2017-08-03 15:43:02 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-03 15:43:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 248,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 50301,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 3, 13, 43, 2, 306480),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'memusage/max': 48222208,
'memusage/startup': 48222208,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 8, 3, 13, 43, 0, 200276)}
2017-08-03 15:43:02 [scrapy.core.engine] INFO: Spider closed (finished)
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Last-Modified: Thu, 03 Aug 2017 12:43:01 GMT
< Vary: Accept-Encoding
< Content-Language: ru