Scrapy: How should I save data to disk while crawling?

Created on 1 Jun 2016 · 3 Comments · Source: scrapy/scrapy

Sorry, this is not a programming error.

As far as I know, Scrapy buffers the crawled data in memory and only saves it to disk when the program exits.

My machine has 8 GB of memory, and the data I am crawling may be 8-10 GB.

How can I configure Scrapy to save data to disk while crawling, so that it avoids running out of memory?

There are 2,200,000 pages to crawl. How can I make Scrapy save the data to disk after every 10,000 pages crawled?

Most helpful comment

(Not an expert in Scrapy here, but...)

As far as I know, Scrapy does not buffer crawled data and does not save any data to disk on its own.

There are Feed Exporters, which can write data to a file or to standard output. (I am not sure whether they write each item as soon as it is available or keep everything buffered until the spider closes. Anyway...) Unless you use Feed Exporters, you have to write your own mechanism to save data to disk (a file, a database, whatever).

Let's say you have a pipeline that saves data to a file. As soon as an item arrives in the pipeline, it gets written to the file. If each page generates an item, Scrapy will never use 8 GB of RAM (probably not even 500 MB!), and your file will grow while Scrapy is running.
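A minimal sketch of such a pipeline (the output path `items.jl` and the plain-dict items are assumptions for this example, not Scrapy defaults):

```python
# Minimal sketch of an item pipeline that appends each item to a
# JSON Lines file as soon as it arrives, so memory use stays flat.
# The output path "items.jl" is an assumption for this example.
import json

class JsonLinesWriterPipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write one JSON object per line, then return the item so any
        # later pipelines still receive it.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

To enable it you would register it in your project's `ITEM_PIPELINES` setting, e.g. `ITEM_PIPELINES = {'myproject.pipelines.JsonLinesWriterPipeline': 300}` (the module path here is hypothetical).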

(If I am saying something wrong, someone will correct me soon!)

All 3 comments


Sorry, let me update the issue. For now, I just need to save the data to files:

$ scrapy crawl PROJECTNAME -o output.json -t json
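(If I remember correctly) the JSON feed exporter wraps all items in a single JSON array, while the JSON Lines exporter writes one complete JSON object per line as each item is scraped, which I believe makes it the safer choice for very large outputs:

```shell
# JSON Lines: one item per line, appended as the crawl runs
# (PROJECTNAME is the spider name from the command above)
scrapy crawl PROJECTNAME -o output.jl -t jsonlines
```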

@GoingMyWay you could also use ITEM_PIPELINES to add items to a database. Here's something I wrote about how to do it using SQLAlchemy
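The linked write-up uses SQLAlchemy; as a stripped-down sketch of the same idea using only the standard library's `sqlite3` module (the table and column names `pages`, `url`, and `title` are made up for illustration):

```python
# Hypothetical pipeline that commits each item to SQLite as it arrives,
# so nothing accumulates in memory. Table/column names are assumptions.
import sqlite3

class SqlitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert and commit each item immediately, keeping memory flat.
        self.conn.execute(
            "INSERT INTO pages (url, title) VALUES (?, ?)",
            (item.get("url"), item.get("title")),
        )
        self.conn.commit()
        return item
```

As with any pipeline, it would be registered via `ITEM_PIPELINES` in the project settings; a real database (PostgreSQL, MySQL, ...) would swap in through SQLAlchemy the same way.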

