Scrapy: How should I save data to disk while crawling?

Created on 1 Jun 2016 · 3 Comments · Source: scrapy/scrapy

Sorry, this is not a programming error.

As far as I know, Scrapy buffers the crawled data in memory and only saves it to disk when the program exits.

My machine has 8 GB of memory, and the data I am crawling may be 8-10 GB.

How can I configure Scrapy to save data to disk while crawling, so that it avoids running out of memory?

There are 2,200,000 pages to crawl. How can I make Scrapy save the data to disk after every 10,000 pages crawled?

Most helpful comment

(Not an expert in Scrapy here, but...)

As far as I know, Scrapy does not buffer crawled data and does not save any data to disk on its own.

There are Feed Exporters, which can write data to a file or to standard output. (I am not sure whether they write each item as soon as it is available or keep everything buffered until the spider closes. Anyway...) Unless you use Feed Exporters, you have to write your own mechanism to save data to disk (a file, a database, whatever).

Let's say you have a pipeline that saves data to a file. As soon as an item arrives in the pipeline, it gets written to the file. If each page generates an item, Scrapy will never use 8 GB of RAM (probably not even 500 MB!), and your file will grow while Scrapy is running.
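A minimal sketch of such a pipeline (the output path `items.jl` and the plain-dict items are assumptions for this example, not Scrapy defaults):

```python
# Minimal sketch of an item pipeline that appends each item to a
# JSON Lines file as soon as it arrives, so memory use stays flat.
# The output path "items.jl" is an assumption for this example.
import json

class JsonLinesWriterPipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write one JSON object per line, then return the item so any
        # later pipelines still receive it.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

To enable it you would register it in your project's `ITEM_PIPELINES` setting, e.g. `ITEM_PIPELINES = {'myproject.pipelines.JsonLinesWriterPipeline': 300}` (the module path here is hypothetical).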

(If I am saying something wrong, someone will correct me soon!)

All 3 comments


Sorry, let me update the issue. For now, I just need to save the data to files:

$ scrapy crawl PROJECTNAME -o output.json -t json
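(If I remember correctly) the JSON feed exporter wraps all items in a single JSON array, while the JSON Lines exporter writes one complete JSON object per line as each item is scraped, which I believe makes it the safer choice for very large outputs:

```shell
# JSON Lines: one item per line, appended as the crawl runs
# (PROJECTNAME is the spider name from the command above)
scrapy crawl PROJECTNAME -o output.jl -t jsonlines
```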

@GoingMyWay you could also use ITEM_PIPELINES to add items to a database. Here's something I wrote about how to do it using SQLAlchemy
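The linked write-up uses SQLAlchemy; as a stripped-down sketch of the same idea using only the standard library's `sqlite3` module (the table and column names `pages`, `url`, and `title` are made up for illustration):

```python
# Hypothetical pipeline that commits each item to SQLite as it arrives,
# so nothing accumulates in memory. Table/column names are assumptions.
import sqlite3

class SqlitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert and commit each item immediately, keeping memory flat.
        self.conn.execute(
            "INSERT INTO pages (url, title) VALUES (?, ?)",
            (item.get("url"), item.get("title")),
        )
        self.conn.commit()
        return item
```

As with any pipeline, it would be registered via `ITEM_PIPELINES` in the project settings; a real database (PostgreSQL, MySQL, ...) would swap in through SQLAlchemy the same way.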

