Scrapy: Doesn't fetch URL

Created on 2 Jun 2017  ·  3 Comments  ·  Source: scrapy/scrapy

The URL http://www.tehnosad.ru/subcategory/?id=584&producers[]=524 cannot be fetched with Scrapy.
But httpie and Chrome fetch it OK.
How can I fix this?

$ scrapy fetch --headers 'http://www.tehnosad.ru/subcategory/?id=584&producers[]=524'
2017-06-02 13:49:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: remains)
2017-06-02 13:49:44 [scrapy.utils.log] INFO: Overridden settings: {'RETRY_TIMES': 5, 'SPIDER_MODULES': ['remains.spiders'], 'FEED_URI': './%(name)s.csv', 'HTTPCACHE_EXPIRATION_SECS': 600, 'RETRY_HTTP_CODES': [400, 408, 420, 500, 502, 503, 504], 'BOT_NAME': 'remains', 'FEED_FORMAT': 'csv', 'AUTOTHROTTLE_ENABLED': True, 'HTTPCACHE_STORAGE': 'scrapy.contrib.httpcache.FilesystemCacheStorage', 'NEWSPIDER_MODULE': 'remains.spiders', 'HTTPCACHE_DIR': './httpcache', 'DOWNLOADER_CLIENTCONTEXTFACTORY': 'remains.ssl_context.CustomClientContextFactory'}
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.throttle.AutoThrottle']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-02 13:49:44 [scrapy.middleware] INFO: Enabled item pipelines:
['remains.pipelines.RemainsPipeline']
2017-06-02 13:49:44 [scrapy.core.engine] INFO: Spider opened
2017-06-02 13:49:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:49:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:49:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:49:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:50:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:50:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-02 13:51:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524> from <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>
2017-06-02 13:51:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.tehnosad.ru/subcategory/?id=584&producers%5B%5D=524>: max redirections reached
2017-06-02 13:51:54 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-02 13:51:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 9196,
 'downloader/request_count': 21,
 'downloader/request_method_count/GET': 21,
 'downloader/response_bytes': 14284,
 'downloader/response_count': 21,
 'downloader/response_status_count/301': 21,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 2, 6, 51, 54, 206787),
 'log_count/DEBUG': 21,
 'log_count/INFO': 9,
 'memusage/max': 50880512,
 'memusage/startup': 49733632,
 'scheduler/dequeued': 21,
 'scheduler/dequeued/memory': 21,
 'scheduler/enqueued': 21,
 'scheduler/enqueued/memory': 21,
 'start_time': datetime.datetime(2017, 6, 2, 6, 49, 44, 629809)}
2017-06-02 13:51:54 [scrapy.core.engine] INFO: Spider closed (finished)
bug

All 3 comments

Hi @tonal , there's an open PR about safe characters in w3lib.
I can see that Firefox and Chrome do NOT percent-encode [ and ]. So there's definitely something to fix in w3lib and safe_url_string() (which is used in scrapy.http.Request)
Thank you for bringing attention to the issue again!
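
To illustrate the encoding difference described above, here is a minimal sketch using only the standard library's `urllib.parse.quote`. It is not w3lib's implementation: the two safe-character sets are hypothetical and chosen only to show what happens when `[` and `]` are or are not treated as safe. The redirect loop in the log is presumably the result of the server answering the `%5B%5D` form with a redirect to the bracket form, which then gets re-encoded.

```python
from urllib.parse import quote

# Hypothetical safe-character sets, for illustration only.
# These are NOT the constants w3lib uses internally.
SAFE_WITHOUT_BRACKETS = ":/?=&"
SAFE_WITH_BRACKETS = SAFE_WITHOUT_BRACKETS + "[]"

url = "http://www.tehnosad.ru/subcategory/?id=584&producers[]=524"

# Brackets are not in the safe set, so they become %5B and %5D,
# matching the URL that appears in the redirect log.
encoded = quote(url, safe=SAFE_WITHOUT_BRACKETS)

# Brackets are in the safe set, so the URL is left as typed,
# matching what Firefox and Chrome send.
preserved = quote(url, safe=SAFE_WITH_BRACKETS)

print(encoded)
print(preserved)
```

With brackets outside the safe set, the first result is the `producers%5B%5D=524` form seen in the log; with brackets inside it, the URL passes through unchanged.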

The issue is fixed when using the current master branch of https://github.com/scrapy/w3lib .
I'll close this when we release w3lib v1.18

w3lib v1.18 has been released. Upgrading to it solves the issue.

$ scrapy version -v
Scrapy    : 1.4.0
lxml      : 3.8.0.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.5.0
Python    : 3.6.0+ (default, Feb 24 2017, 17:40:01) - [GCC 6.2.0 20161005]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.8.0-59-generic-x86_64-with-debian-stretch-sid
$ scrapy fetch --headers 'http://www.tehnosad.ru/subcategory/?id=584&producers[]=524'
2017-08-03 15:43:00 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-08-03 15:43:00 [scrapy.core.engine] INFO: Spider opened
2017-08-03 15:43:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-03 15:43:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-03 15:43:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.tehnosad.ru/subcategory/?id=584&producers[]=524> (referer: None)
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: nginx
< Date: Thu, 03 Aug 2017 13:51:12 GMT
< Content-Type: text/html; charset=WINDOWS-1251
< X-Powered-By: PHP/5.3.22
< Set-Cookie: PHPSESSID=fje13jbdushm7o76qsrk7dpob0; path=/
< Set-Cookie: thnsd-uid=045F3F9E-A541-BD0F-187E-9679649002FD; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: first=yes; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: thnsd-region=1-3-5898; expires=Tue, 30-Jan-2018 13:43:01 GMT; path=/; domain=.tehnosad.ru
< Set-Cookie: gwoDiscountEnable=0; expires=Thu, 10-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=613; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
< Set-Cookie: ROOT_CATEGORY_ID=584; expires=Fri, 04-Aug-2017 13:43:01 GMT; path=/
2017-08-03 15:43:02 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-03 15:43:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 248,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 50301,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 3, 13, 43, 2, 306480),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 48222208,
 'memusage/startup': 48222208,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 3, 13, 43, 0, 200276)}
2017-08-03 15:43:02 [scrapy.core.engine] INFO: Spider closed (finished)
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Last-Modified: Thu, 03 Aug 2017 12:43:01 GMT
< Vary: Accept-Encoding
< Content-Language: ru
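
For anyone checking whether their environment already has the fix, a small sketch of a version comparison follows. The 1.18 threshold comes from this thread; the helper name `has_bracket_fix` is hypothetical, and the parsing assumes a plain `major.minor[.patch]` version string.

```python
def has_bracket_fix(version: str) -> bool:
    """Return True if a w3lib version string is 1.18 or newer.

    Assumes a plain numeric 'major.minor[.patch]' string; pre-release
    suffixes such as '1.18rc1' are not handled in this sketch.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 18)


print(has_bracket_fix("1.18.0"))  # version shown by `scrapy version -v` above
print(has_bracket_fix("1.17.0"))
```

The version string itself can be taken from the `w3lib` line of `scrapy version -v`, as in the output above.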