PhantomJS 2.0 crashes when crawling ~170 URLs sequentially

Created on 11 Sep 2015 · 4 comments · Source: ariya/phantomjs

Windows 8.1 64-bit, precompiled binary from the website.

I'm running a test with 1.9.8 at the moment; it crashes every time somewhere between 2000 and 4000 URLs. Still better than 170 URLs with 2.0.

Error messages so far with the same script and the same links, after running it ~20 times:

PhantomJS has crashed. Please read the crash reporting guide at http://phantomjs.org/crash-reporting.html and file a bug report at https://github.com/ariya/phantomjs/issues/new. Unfortunately, no crash dump is available. (Is %TEMP% (C:\Users\XXX\AppData\Local\Temp) a directory you cannot write?)

PhantomJS has crashed. Please read the crash reporting guide at http://phantomjs.org/crash-reporting.html and file a bug report at https://github.com/ariya/phantomjs/issues/new. Please attach the crash dump file: C:\Users\XXX\AppData\Local\Temp\a4fd6af6-1244-44d3-8938-3aabe298c2fa.dmp

QThread::start: Failed to create thread ()

Crash dumps:
https://www.dropbox.com/s/i3qi5ed33mbblie/500%20links%20-a4fd6af6-1244-44d3-8938-3aabe298c2fa.dmp?dl=1
https://www.dropbox.com/s/najdz9fhdexvav1/500%20links-%2095ebab5c-859b-40e9-936b-84967471779b.dmp?dl=1
https://www.dropbox.com/s/1d2t8rtev85yf96/500%20links%20-%20d450c8e1-9728-41c7-ba52-dfef466f0222.dmp?dl=1

Script:

console.log('Hello, world!');
var fs = require('fs');
var stream = fs.open('500sitemap.txt', 'r');
var webPage = require('webpage');
var i = 1;
var hasFound = [];
var hasonLoadFinished = [];

function handle_page(line) {
    var page = webPage.create();
    page.settings.loadImages = false;

    // Watch outgoing requests for the tracking call and pull its JSON payload.
    page.onResourceRequested = function(requestData, request) {
        var match = requestData.url.match(/example\.de\/ac/g);
        if (match != null) {
            hasFound[line] = true;
            var targetString = decodeURI(JSON.stringify(requestData.url));
            var klammerauf = targetString.indexOf("{");
            var jsonobjekt = targetString.substr(klammerauf, (targetString.indexOf("}") - klammerauf) + 1);
            var targetJSON = decodeURIComponent(jsonobjekt);
            var t = JSON.parse(targetJSON);
            console.log(i + "   " + t + "       " + t['id']);
            request.abort();
        }
    };

    // Fires once per page; guard against multiple invocations for the same URL.
    page.onLoadFinished = function(status) {
        if (!hasonLoadFinished[line]) {
            hasonLoadFinished[line] = true;
            if (!hasFound[line]) {
                console.log(i + " :NOT FOUND: " + line);
                console.log("");
            }
            i++;
            setTimeout(page.close, 200);
            nextPage();
        }
    };

    // Open the page after the handlers are installed so early requests are not missed.
    page.open(line, function() {});
}

function nextPage() {
    var line = stream.readLine();
    if (!line) {
        var end = Date.now();
        console.log("");
        console.log(((end - start) / 1000) + " seconds");
        phantom.exit(0);
        return;
    }
    hasFound[line] = false;
    hasonLoadFinished[line] = false;
    handle_page(line);
}

var start = Date.now();
nextPage();
Labels: Bug, Crash, Need code, stale


All 4 comments

I'm sorry we took so long to get back to you about this.

With regret, I don't know if we're going to be able to do anything about this. It's probably some kind of memory leak or memory corruption in the interaction between Qt and Webkit, and we don't have the manpower to debug it. (I have debugged things like this before - it can take _days_ of investigation.)

Personally, I would recommend quitting and restarting the PhantomJS process after every page load. I use PhantomJS as a scraper that way and it works well.

If _you_ have time to debug this and work up a patch, though, we would be glad to take it.
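
For anyone who wants to try that route, here is a minimal sketch (not from the original thread) of the per-page restart approach: a Node.js driver that spawns a fresh phantomjs process for every URL. The single-URL PhantomJS script check_one.js is a hypothetical helper that loads one page, runs the check, and calls phantom.exit().

var fs = require('fs');
var execFile = require('child_process').execFile;

// Read the URL list once in the driver; each URL gets its own PhantomJS process.
var urls = fs.readFileSync('500sitemap.txt', 'utf8').split('\n').filter(Boolean);

function runNext(i) {
    if (i >= urls.length) {
        console.log('done');
        return;
    }
    // 'check_one.js' is a hypothetical single-URL PhantomJS script.
    execFile('phantomjs', ['check_one.js', urls[i]], function (err, stdout) {
        if (err) {
            console.log((i + 1) + ' FAILED: ' + urls[i]);
        } else {
            process.stdout.write(stdout);
        }
        runNext(i + 1); // move on only after the previous process has exited
    });
}

runNext(0);

Because every page load runs in its own process, whatever memory Qt/WebKit leaks is reclaimed by the OS as soon as that process exits.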

I ran into a similar issue that, after solving, I realized was the same as yours.

The issue is not actually that page.close() fails to free memory; it is some magical nonsense going on with setTimeout. When you call page.close() from within a setTimeout callback, it does not work properly. The workaround is to double-wrap the page.close call.

// bad
setTimeout(function() {
  page.close();
}, 1000);

// works for whatever reason
setTimeout(function() {
  setTimeout(function() {
    page.close();
  }, 1);
}, 1000);

In your specific case, try this:

// change this
setTimeout(page.close, 200);

// to this
setTimeout(function() {
  setTimeout(function() {
    page.close();
  }, 1);
}, 200);

@mz3 Thanks, that hack worked for me 👍

Due to our very limited maintenance capacity (see #14541 for more details), we need to prioritize our development focus on other tasks. Therefore, this issue will be automatically closed. In the future, if we see the need to attend to this issue again, then it will be reopened. Thank you for your contribution!

