Cheerio: out of memory when using cheerio in crawler

Created on 17 Mar 2016 · 19 Comments · Source: cheeriojs/cheerio

Hi,
I'm using cheerio to parse HTML pages in a simple crawler, shown below. The process quickly runs out of memory after processing a few dozen pages, even though my computer has more than 4GB of free memory. I notice that cheerio has a load operation. Do I need to unload the page explicitly, or is there some way to make cheerio release the memory once processing finishes?

var cheerio = require('cheerio');
var request = require('request');


function parseSpecificRoom(url)
{
    request({uri: url}, function(err, resp, body) {
        var $ = cheerio.load(body);
        var price = $('.house-price').text();
        var pay = $('.pay-method').text();
        var type = $('.house-type').text().replace(/\s/g, '');
        var location = $('.xiaoqu').text().replace(/\s/g, '');
        var phone = $('.tel-num').text().replace(/\s/g, '');
        console.log(price + ', ' + pay + ', ' + type + ', ' + location + ', ' + phone)
    });
}

function parsePage(index)
{
    request({uri:'http://sz.58.com/chuzu/pn' + index}, function(err, resp, body) {
        var $ = cheerio.load(body);
        var zufang = $('#infolist').children('table').eq(1).children('tr');
        zufang.each(function(i, elem) {
            var url = $(this).children().eq(1).children().eq(0).attr('href');
            parseSpecificRoom(url)
        });
    });
}

for(var i = 1; i < 100; i++)
    parsePage(i);

All 19 comments

I am also observing a memory leak when using cheerio.

Interestingly, the leaking object in my case is a WeakMap inside the lodash dependency. If you capture two heap snapshots and compare them, you will probably notice that the top entry under (array) is the internal data of lodash's metaMap. I tried to write a self-contained script that triggers the leak with a WeakMap, but unfortunately I didn't get any promising results.

A leak in a WeakMap most likely indicates a bug in Node.js or even in the V8 engine. If it can be triggered under Chrome/Chromium, then it would be certain that it's a V8 bug.

In the meantime, I think this issue should be reported to the Node.js project too.
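Before comparing full heap snapshots, a cheaper first check is to log how much the V8 heap grows around a unit of work. The sketch below uses Node's built-in process.memoryUsage(); the helper names heapUsedMB and measureHeapGrowth and the stand-in workload are illustrative assumptions, not something from this thread.

```javascript
// Hypothetical helper: report the V8 heap currently in use, in megabytes.
function heapUsedMB() {
    return process.memoryUsage().heapUsed / (1024 * 1024);
}

// Record heap usage before and after a unit of work and return the difference.
function measureHeapGrowth(work) {
    var before = heapUsedMB();
    work();
    var after = heapUsedMB();
    return { before: before, after: after, delta: after - before };
}

var result = measureHeapGrowth(function () {
    // Stand-in workload; in a crawler this would be one parse of a page.
    var junk = [];
    for (var i = 0; i < 10000; i++) junk.push({ index: i });
});
console.log('heap delta (MB): ' + result.delta.toFixed(2));
```

A single reading says little, because the GC may or may not have run in between. A delta that keeps climbing over many pages is the signal worth following up with snapshots.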

That's not necessarily a leak.

@alvinhochun lodash has been a problem for me in a lot of projects. I had to send PRs to 7 different projects because of a memory leak in an old version. I managed to change a lot of projects to use native functions instead of lodash, and cheerio basically uses 3 functions: _.each, _.defaults and _.extend.

One BIG problem with lodash is that it is VERY HEAVY. But we can require only the functions we need; that way, the ~500kb the module takes up in the require cache (for each version, yes, this is a problem) goes down to around ~30kb. We can also replace the ones we can with native functions; _.each is an example.
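As a rough illustration of the native replacements mentioned above, the three helpers can be approximated as follows. This is a sketch, not cheerio's actual code: the defaults function is a hypothetical stand-in that only looks at own properties, and Object.assign likewise copies own properties only, so neither is an exact match for the lodash versions in every edge case.

```javascript
// _.each over an array -> Array.prototype.forEach
var seen = [];
['a', 'b'].forEach(function (item, i) {
    seen.push(i + ':' + item);
});

// _.extend(target, source) -> Object.assign(target, source)
var extended = Object.assign({ a: 1 }, { b: 2 });

// _.defaults(target, source): only fill in keys that are undefined on target.
// Hypothetical stand-in, own properties only.
function defaults(target, source) {
    Object.keys(source).forEach(function (key) {
        if (target[key] === undefined) target[key] = source[key];
    });
    return target;
}

var options = defaults({ xmlMode: true }, { xmlMode: false, decodeEntities: true });
console.log(seen, extended, options);
```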

But we need to consider that cheerio creates a lot of objects for the "DOM", and if your HTML is big, the parsed result will be even bigger. One way I managed to work around the problem was exposing the GC and cleaning up memory after I finish using the $ function; that way the entire DOM previously held in memory is freed. I have a crawler too, and this kind of app doesn't have time to wait on the GC's goodwill :P

@qdk0901 You can use this: https://simonmcmanus.wordpress.com/2013/01/03/forcing-garbage-collection-with-node-js-and-v8/
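A minimal sketch of that workaround, assuming the script is started with node --expose-gc (without the flag, global.gc is undefined, hence the guard). The names collectGarbage and handlePage, and the extract callback, are hypothetical; in the crawler above, extract would be the part that calls cheerio.load and reads the fields.

```javascript
// Run with: node --expose-gc crawler.js   (crawler.js is a placeholder name)
// Without that flag global.gc is undefined, so guard the call.
function collectGarbage() {
    if (typeof global.gc === 'function') {
        global.gc();
        return true;
    }
    return false;
}

function handlePage(body, extract) {
    // extract() parses the page and returns only the small values we need.
    var data = extract(body);
    // The parsed DOM was local to extract(), so nothing references it here
    // and a forced collection can reclaim it.
    collectGarbage();
    return data;
}

var out = handlePage('<p>hi</p>', function (html) {
    return html.length;
});
console.log(out);
```

Forcing a full collection after every page costs CPU time, so whether this is a net win probably depends on page size and crawl rate.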

I'll work on these changes and send a PR in a few days :)
Maybe we will see an improvement

@luanmuniz from what I saw in the heap snapshots, it does look like there is a memory leak somewhere, since taking a heap snapshot forces a GC automatically.

I tried tracing calls and I think it is caused by _.bind.

But it is possible that I misread something. I'll try diagnosing this more tonight.

One problem we ran into when using cheerio in a crawler with low memory requirements was this V8 string copying issue. As far as I am aware this has still not been "fixed" in V8. I'm not sure it ever will be, because it sounds like an either/or optimization, and they have chosen the route that ends up causing this issue.

You can see the cheerio issue here: https://github.com/cheeriojs/cheerio/issues/263
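The usual explanation of this kind of issue is that a short substring can keep the whole parent string alive, so holding on to a few extracted values may retain every downloaded page body. A hedged sketch of one workaround is to force a fresh copy of each value before storing it. The detach helper below is hypothetical, and the Buffer round trip is just one way to get a string with its own storage; whether it is needed at all depends on the V8 version in use.

```javascript
// Hypothetical helper: return a copy of str that does not share storage
// with the (possibly much larger) string it was sliced from.
function detach(str) {
    return Buffer.from(str, 'utf8').toString('utf8');
}

// Build a large fake page body with one small value of interest inside it.
var body = '<html>' + new Array(1000).join('<p>filler</p>') +
    '<b class="price">1200</b></html>';
var start = body.indexOf('1200');

// Only the detached copy is kept; the large body can then be collected.
var price = detach(body.slice(start, start + 4));
console.log(price);
```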

We are also still using 0.19.0 in our web crawler because of weird issues we were seeing in 0.20.0 that we could never solve. Our crawlers have, at best, 500MB of RAM available to them.

Well, in my case I did not keep any copies of the strings after a request completes. It seems that the problem really is the GC not running, because V8 doesn't account for the system memory limit (the system limit is 500MB but V8 uses 1.4GB by default, iirc), so simply passing --max_old_space_size to node seems to have "resolved" it.

Ah yes. We have that set as well. We have ours at 460MB. We use the following V8 flags when running (Node 5.6.0): --expose-gc --optimize_for_size --max_old_space_size=460 --gc_interval=100

Those seem to do pretty well for us, but obviously they should be tuned to your particular system.
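To confirm that a limit such as --max_old_space_size=460 is actually being applied, the effective heap limit can be read from inside the process with Node's built-in v8 module. A small sketch; heapLimitMB is an illustrative name, and the value reported will vary with the Node version and the flags passed.

```javascript
var v8 = require('v8');

// heap_size_limit is reported in bytes; convert to megabytes.
function heapLimitMB() {
    return v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
}

console.log('V8 heap limit: ' + Math.round(heapLimitMB()) + ' MB');
```

Running the same script with and without the flag should show whether the limit changed.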

Hi all,
In the end I replaced cheerio with parse5 plus XML tooling (xmlserializer, xmldom, xpath) as below. Memory usage stabilizes at around 300MB, and the crawler no longer runs out of memory.

var request = require('request');
var xpath = require('xpath');
var parse5 = require('parse5');
var xmlser = require('xmlserializer');
var dom = require('xmldom').DOMParser;

function getXmlDoc(html)
{
    var document = parse5.parse(html);
    var xhtml = xmlser.serializeToString(document);
    var doc = new dom().parseFromString(xhtml);

    return doc;
}

function extractNodeValue(path, doc)
{
    var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    var node = select(path, doc);
    if (node.length > 0)
        return node[0].nodeValue.replace(/\s/g, '');

    return '';
}

function parseSpecificRoom(url)
{
    request({uri: url}, function(err, resp, body) {

        var doc = getXmlDoc(body);

        var price = extractNodeValue("//x:*[@class='house-price']/text()", doc);
        var pay = extractNodeValue("//x:*[contains(@class, 'pay-method')]/text()", doc);
        var type = extractNodeValue("//x:*[contains(@class, 'house-type')]/text()", doc);
        var location = extractNodeValue("//x:*[contains(@class, 'xiaoqu')]/text()", doc);
        var phone = extractNodeValue("//x:*[contains(@class, 'tel-num')]/text()", doc);
        console.log(price + ', ' + pay + ', ' + type + ', ' + location + ', ' + phone)
    });
}

function parsePage(index)
{
    request({uri:'http://sz.58.com/chuzu/pn' + index}, function(err, resp, body) {
        var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
        var doc = getXmlDoc(body);
        var zufang = select("//x:div[@id='infolist']/x:table/x:tbody/x:tr//x:a[@class='t']/@href", doc);
        for (var i = 0; i < zufang.length; i++)
            parseSpecificRoom(zufang[i].value);
    });
}

for (var i = 0; i < 100; i++)
    parsePage(i);

@qdk0901 Yeah, parse5 produces a MUCH lighter parse result than htmlparser2. But making the change does seem tricky.

The problem with htmlparser2 is the next and prev objects, which store almost everything again inside them. Maybe we can study the impact of changing htmlparser2 to something less redundant.

@qdk0901 There is also a cheerio fork called whacko that uses parse5 instead of htmlparser2. It's probably easier to use than converting all selectors to XPath.

Going forward, cheerio should switch to parse5, as it fixes a lot of problems and development of htmlparser2 has stalled.

@fb55 Hi, thank you very much. I tried whacko; it's much better and as easy to use as cheerio.

cheerio is a giant mem leak. "implementation of core jQuery designed specifically for the server" - something like that absolutely cannot be used on a server. Simply download any YouTube page and try to load it and run a couple of selectors in a loop. RAM will balloon to gigabytes.
@fb55 I tried whacko, and it doesn't have memory issues in my case.

@pps83 Thanks for mentioning whacko. I replaced the import (since they have the same API) and it works. Memory usage is stable.

Actually, I ran into the "out of memory" error recently and tried replacing cheerio with whacko, which solved my problem.
Thanks a lot, but will cheerio solve this problem in the future?

@mike442144 I'm working on this

@luanmuniz Cool, we are looking forward to your solution.

@luanmuniz Any update?

@mike442144 Yes and no. The problem is in the parser and I have worked a lot on it, but it's hard to change the parser and still maintain the XML features. There is a discussion in #863, and they don't want XML support to be removed. You should follow #863; I'll probably post updates there. I'm still trying to work on it, but I don't know how long it's going to take :(

I thought about 2 solutions: make a fork, like whacko, or write my own version of cheerio from scratch. I'm working on the second one, but unfortunately it's not a quick job :(

@luanmuniz Thanks very much, let me follow #863

Hi Guys, I think this can be closed, right? I think all the related issues are closed now.
