Sorry, it is not a programming error.
As far as I know, Scrapy crawls data into a buffer and saves it to disk when the program exits.
My machine has 8 GB of memory, and the data I am crawling may be 8-10 GB.
How can I configure Scrapy to save data while crawling, so that it avoids a memory error?
There are 2,200,000 pages to crawl. How can I make Scrapy save the data to disk after every 10,000 pages crawled?
(Not an expert in Scrapy here, but...)
As far as I know, Scrapy does not crawl data into a buffer and does not save any data to disk on its own.
There are Feed Exporters, which write data to the standard output. (I am not sure whether they emit data as soon as it is available or keep it in a buffer until the spider closes. Anyway...) Unless you use Feed Exporters, you have to write your own mechanism to write data to disk (a file, a database, whatever).
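For example, a feed export can be enabled in `settings.py`. This sketch uses the legacy single-feed settings (newer Scrapy versions configure this through a `FEEDS` dictionary instead), so adjust to your version:

```python
# settings.py -- legacy single-feed settings (pre-Scrapy-2.1 style)
FEED_FORMAT = "jsonlines"  # one JSON object per line, written incrementally
FEED_URI = "output.jl"
```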
Let's say you have a pipeline that saves data to a file. As soon as an item arrives in the pipeline, it gets written to the file. If each page generates one item, Scrapy will never use 8 GB of RAM (probably not even 500 MB!), and your file will grow while Scrapy is running.
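Here is a minimal sketch of such a pipeline (the class name, the output filename, and the `myproject` module path are placeholders; adapt them to your project):

```python
# pipelines.py -- minimal sketch: write each item to disk as it arrives,
# one JSON object per line, so memory use stays flat regardless of page count.
import json

class JsonLinesWriterPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Flush and close the file when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize and write the item immediately instead of buffering it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Enable it in `settings.py`:

```python
ITEM_PIPELINES = {
    "myproject.pipelines.JsonLinesWriterPipeline": 300,
}
```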
(If I am saying something wrong, someone will correct me soon!)
Sorry, let me update the issue: to start with, I need to save the data to files.
```
$ scrapy crawl PROJECTNAME -o output.json -t json
```
@GoingMyWay you could also use ITEM_PIPELINES to insert items into a DB. Here's something I wrote about how to do it using SQLAlchemy.
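For reference, a rough sketch of a SQLAlchemy-backed pipeline (the `Page` model, its columns, and the SQLite URL are assumptions for illustration; it also assumes SQLAlchemy 1.4+ for `sqlalchemy.orm.declarative_base`):

```python
# pipelines.py -- rough sketch of an ITEM_PIPELINES entry that stores
# items in a database via SQLAlchemy instead of holding them in memory.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Page(Base):
    # Hypothetical table/columns; replace with your own schema.
    __tablename__ = "pages"
    id = Column(Integer, primary_key=True)
    url = Column(String)
    title = Column(String)

class SQLAlchemyPipeline:
    def open_spider(self, spider):
        # Create the table (if needed) and a session factory once per run.
        engine = create_engine("sqlite:///items.db")  # assumed DB URL
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # Persist each item as soon as it arrives.
        session = self.Session()
        try:
            session.add(Page(url=item.get("url"), title=item.get("title")))
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```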