Scrapy: Is there a way to pass meta information into CrawlSpider's requests?

Created on 24 Apr 2014 · 2 Comments · Source: scrapy/scrapy

I want to pass along some meta information that belongs to the original start_url's request, but I cannot find a way to reference that original request from within process_request.

I'm currently doing this by overriding _requests_to_follow in my CrawlSpider subclass and passing meta=response.meta into Request (example below).

Is there a better way? If not, is a better way (like a parameter in the Rule constructor) wanted?

    # Assumes: from scrapy.http import HtmlResponse, Request
    def _requests_to_follow(self, response):
        # Debug output; self.f is presumably a file handle opened elsewhere in the spider.
        self.f.write('REQUESTS RESPONSE META: %s' % response.meta)
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                # The only change to the stock method: forward the parent response's meta.
                r = Request(url=link.url, callback=self._response_downloaded, meta=response.meta)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

kvnn

Most helpful comment

Thanks Nicolas. I've changed the override back to its original but added this line below the original meta.update line:

r.meta['original_meta'] = response.meta

That works for now.

Cheers,

kvnn on 24 Apr 2014
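One thing to watch with that line: on pages deeper than the first hop, response.meta already contains an original_meta key, so the stored value nests one level deeper with every followed link. Below is a minimal sketch of a flattening variant, assuming the goal is to have the start URL's meta available unchanged at every depth. The helper name `start_meta` and the sample keys are hypothetical, and plain dicts stand in for response.meta.

```python
def start_meta(response_meta):
    """Return the start request's meta, however deep the crawl is.

    On the first hop there is no 'original_meta' key yet, so the
    response's own meta is the start request's meta.
    """
    return response_meta.get("original_meta", response_meta)


# Plain dicts stand in for Scrapy's response.meta (illustration only).
start = {"category": "books"}  # hypothetical meta set on the start_url request
hop1 = {"rule": 0, "link_text": "next", "original_meta": start_meta(start)}
hop2 = {"rule": 0, "link_text": "more", "original_meta": start_meta(hop1)}
```

In the overridden method, this would presumably read `r.meta['original_meta'] = start_meta(response.meta)`.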

All 2 comments

@kvnn You should ask on the Scrapy Google group. We don't currently have this feature, but I can tell you that passing along the whole meta from the response isn't good practice. Some middlewares, such as retry, write the number of retries into meta and never delete it themselves, so subsequent requests that inherit that meta won't be retried, or will be retried fewer times. The same applies to things like the downloader slot, cookies, etc.
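One way to act on this advice is to copy only the keys the spider itself set, rather than the whole dict. This is a rough sketch, not a Scrapy feature: `inherit_meta`, `INHERITED_KEYS`, and the `category` and `source_id` keys are hypothetical names, and a plain dict stands in for response.meta.

```python
# Keys the spider itself put into meta and wants to inherit.
# These names are hypothetical; substitute your own.
INHERITED_KEYS = ("category", "source_id")


def inherit_meta(response_meta, keys=INHERITED_KEYS):
    """Copy only the whitelisted keys that are present in response_meta."""
    return {k: response_meta[k] for k in keys if k in response_meta}


# A plain dict standing in for response.meta after a retried download.
response_meta = {
    "category": "books",             # set by the spider
    "retry_times": 2,                # written by the retry middleware
    "download_slot": "example.com",  # downloader bookkeeping
    "cookiejar": 1,                  # cookie handling
}
child_meta = inherit_meta(response_meta)
```

The result could then be passed as `meta=` when building the Request in `_requests_to_follow`, so that retry counters, the downloader slot, and cookie state are not carried into the new request.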