Scrapy: Is there a way to pass meta information into CrawlSpider's requests?

Created on 24 Apr 2014 · 2 Comments · Source: scrapy/scrapy

I want to pass along some meta information that belongs to the original start_url's request, but I cannot find a way to reference that original request from within process_request.

I'm currently doing this by overriding _requests_to_follow in my CrawlSpider subclass and passing meta=response.meta into Request (example below).

Is there a better way? If not, is a better way (like a parameter in the Rule constructor) wanted?

    # Requires: from scrapy.http import HtmlResponse, Request
    def _requests_to_follow(self, response):
        # Debug output: record the meta that arrived with this response.
        self.f.write('REQUESTS RESPONSE META: %s' % response.meta)
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            # Skip links already produced by an earlier rule for this response.
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                # The override's change: forward the response's meta to the new request.
                r = Request(url=link.url, callback=self._response_downloaded, meta=response.meta)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

All 2 comments

@kvnn You should ask on the Scrapy Google group. We don't currently have this feature, but I can tell you that passing along the whole meta from the response isn't good practice. Some middlewares, such as retry, write the number of retries into meta and don't delete it themselves, so subsequent requests won't be retried, or will be retried fewer times. Meta also carries things like the downloader slot, cookies, etc.
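
To illustrate the concern, here is a minimal sketch in plain Python (no Scrapy import needed) of carrying over only chosen keys rather than the whole meta. The helper name `pick_meta` and the `category` key are hypothetical and not part of Scrapy; `retry_times` and `download_slot` are keys that Scrapy's own components store in meta.

```python
def pick_meta(response_meta, keys):
    """Return a new dict holding only the whitelisted keys from response_meta."""
    return {key: response_meta[key] for key in keys if key in response_meta}


# Example meta as it might look on a response that was retried twice.
response_meta = {
    "category": "books",             # hypothetical user-defined key
    "retry_times": 2,                # written by the retry middleware
    "download_slot": "example.com",  # set by the downloader
}

# Only the user-defined key is carried to the follow-up request.
carried = pick_meta(response_meta, keys=("category",))
print(carried)
```

The resulting dict could then be passed as `meta=` when building the follow-up Request, so the retry counter and downloader slot of the parent response are not inherited.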

Thanks Nicolas. I've changed the override back to its original but added this line below the original meta.update line:

r.meta['original_meta'] = response.meta

That works for now.
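
For anyone reading later, a rough sketch of what that one-line change does to the meta, using plain dicts in place of real Request and Response objects. The function name `build_follow_meta` and the `category` key are made up for illustration; `rule` and `link_text` are the keys set in the method quoted in the question.

```python
def build_follow_meta(response_meta, rule_index, link_text):
    """Build meta for a follow-up request, nesting the parent's meta under one key."""
    meta = {"rule": rule_index, "link_text": link_text}
    # Keep the parent's meta in its own namespace instead of merging it in,
    # so keys such as retry_times do not appear at the top level.
    meta["original_meta"] = dict(response_meta)
    return meta


parent_meta = {"category": "books", "retry_times": 2}
follow_meta = build_follow_meta(parent_meta, rule_index=0, link_text="Next page")

print(follow_meta["original_meta"]["category"])
print("retry_times" in follow_meta)
```

One thing to watch: on the next hop the response's meta already contains `original_meta`, so the nesting gets one level deeper with every followed link unless the existing value is reused.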

Cheers,
