I want to pass along some meta information that belongs to the original start_url's request. I cannot find a way to reference that original request from within process_request.
I'm currently doing this by overriding _requests_to_follow in my CrawlSpider subclass and passing meta=response.meta into Request (example below).
Is there a better way? If not, would a better way (such as a parameter in the Rule constructor) be welcome?
    def _requests_to_follow(self, response):
        self.f.write('REQUESTS RESPONSE META: %s' % response.meta)
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded, meta=response.meta)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
@kvnn You should ask on the Scrapy Google group. We don't currently have this feature, but I can tell you that passing along the whole meta from the response isn't good practice. Some middlewares, such as retry, write the number of retries into meta, and the middleware itself does not clear it, so subsequent requests that inherit it won't be retried, or will be retried fewer times. meta also holds things like the downloader slot, cookies, etc.
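One way to avoid carrying middleware state forward is to copy only the keys you own instead of the whole meta dict. The following is a minimal sketch of that idea using plain dicts so it runs without Scrapy. The helper name pick_meta and the source_id key are hypothetical; retry_times and download_slot stand in for keys that Scrapy's middlewares and downloader write.

```python
def pick_meta(meta, keys):
    """Return a new dict holding only the whitelisted keys present in meta."""
    return {k: meta[k] for k in keys if k in meta}


# Stand-in for response.meta after a retried download.
response_meta = {
    "source_id": 42,                 # hypothetical user-supplied key
    "retry_times": 2,                # written by the retry middleware
    "download_slot": "example.com",  # set by the downloader
}

# Only the user's own key is carried to the next request.
carried = pick_meta(response_meta, ["source_id"])
print(carried)
```

In the override above, this would correspond to passing meta=pick_meta(response.meta, [...]) to Request rather than meta=response.meta.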
Thanks, Nicolas. I've reverted the override to the original implementation and added this line below the original meta.update line:
r.meta['original_meta'] = response.meta
That works for now.
Cheers,
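One caveat with this workaround: on the second hop, response.meta already contains an original_meta entry, so assigning response.meta again would nest it one level deeper each time. A minimal sketch of one way around that, assuming the goal is to always reach the start request's meta, is shown below with plain dicts standing in for Request.meta. The helper name carry_original_meta and the source_id key are hypothetical.

```python
def carry_original_meta(response_meta):
    """Return the first request's meta, whether or not it was already carried."""
    # If an earlier hop stored original_meta, reuse it; otherwise this
    # response belongs to the start request and its meta is the original.
    return response_meta.get("original_meta", response_meta)


start_meta = {"source_id": 42}  # hypothetical meta set on the start request

# First hop: response.meta is the start request's meta.
hop1_meta = {"rule": 0, "link_text": "a",
             "original_meta": carry_original_meta(start_meta)}

# Second hop: response.meta is hop1_meta, which already carries original_meta.
hop2_meta = {"rule": 1, "link_text": "b",
             "original_meta": carry_original_meta(hop1_meta)}

print(hop2_meta["original_meta"])
```

The stored value is a reference to the original dict, not a copy, so any state a middleware wrote into the start request's meta is still visible under original_meta; it is just kept out of the top-level keys that middlewares read.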