PhantomJS 2.0 crashes when crawling ~170 URLs sequentially

Created on 11 Sep 2015 · 4 comments · Source: ariya/phantomjs

Windows 8.1 64-bit, precompiled binary from the website.

I'm running a test with 1.9.8 at the moment; it crashes every time somewhere between 2000 and 4000 URLs. Still better than 170 URLs with 2.0.

Error messages so far with the same script and the same links, after running it ~20 times:

PhantomJS has crashed. Please read the crash reporting guide at http://phantomjs.org/crash-reporting.html and file a bug report at https://github.com/ariya/phantomjs/issues/new. Unfortunately, no crash dump is available. (Is %TEMP% (C:\Users\XXX\AppData\Local\Temp) a directory you cannot write?)

PhantomJS has crashed. Please read the crash reporting guide at http://phantomjs.org/crash-reporting.html and file a bug report at https://github.com/ariya/phantomjs/issues/new. Please attach the crash dump file: C:\Users\XXX\AppData\Local\Temp\a4fd6af6-1244-44d3-8938-3aabe298c2fa.dmp

QThread::start: Failed to create thread ()

Crash dumps:
https://www.dropbox.com/s/i3qi5ed33mbblie/500%20links%20-a4fd6af6-1244-44d3-8938-3aabe298c2fa.dmp?dl=1
https://www.dropbox.com/s/najdz9fhdexvav1/500%20links-%2095ebab5c-859b-40e9-936b-84967471779b.dmp?dl=1
https://www.dropbox.com/s/1d2t8rtev85yf96/500%20links%20-%20d450c8e1-9728-41c7-ba52-dfef466f0222.dmp?dl=1

Script:

console.log('Hello, world!');
var fs = require('fs');
var stream = fs.open('500sitemap.txt', 'r');
var webPage = require('webpage');
var i = 1;
var hasFound = [];
var hasonLoadFinished = [];

function handle_page(line) {
    var page = webPage.create();
    page.settings.loadImages = false;

    // Watch outgoing requests for the tracking call and pull its JSON payload.
    page.onResourceRequested = function(requestData, request) {
        var match = requestData.url.match(/example\.de\/ac/g);
        if (match != null) {
            hasFound[line] = true;
            var targetString = decodeURI(JSON.stringify(requestData.url));
            var klammerauf = targetString.indexOf("{");
            var jsonobjekt = targetString.substr(klammerauf, (targetString.indexOf("}") - klammerauf) + 1);
            var targetJSON = decodeURIComponent(jsonobjekt);
            var t = JSON.parse(targetJSON);
            console.log(i + "   " + t + "       " + t['id']);
            request.abort();
        }
    };

    // Fires once per page; guard against multiple invocations for the same URL.
    page.onLoadFinished = function(status) {
        if (!hasonLoadFinished[line]) {
            hasonLoadFinished[line] = true;
            if (!hasFound[line]) {
                console.log(i + " :NOT FOUND: " + line);
                console.log("");
            }
            i++;
            setTimeout(page.close, 200);
            nextPage();
        }
    };

    // Open the page after the handlers are installed so early requests are not missed.
    page.open(line, function() {});
}

function nextPage() {
    var line = stream.readLine();
    if (!line) {
        var end = Date.now();
        console.log("");
        console.log(((end - start) / 1000) + " seconds");
        phantom.exit(0);
        return;
    }
    hasFound[line] = false;
    hasonLoadFinished[line] = false;
    handle_page(line);
}

var start = Date.now();
nextPage();
Labels: Bug, Crash, Need code, stale


All 4 comments

I'm sorry we took so long to get back to you about this.

With regret, I don't know if we're going to be able to do anything about this. It's probably some kind of memory leak or memory corruption in the interaction between Qt and Webkit, and we don't have the manpower to debug it. (I have debugged things like this before - it can take _days_ of investigation.)

Personally, I would recommend quitting and restarting the PhantomJS process after every page load. I use PhantomJS as a scraper that way and it works well.

If _you_ have time to debug this and work up a patch, though, we would be glad to take it.
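
For anyone who wants to try that route, here is a minimal sketch (not from the original thread) of the per-page restart approach: a Node.js driver that spawns a fresh phantomjs process for every URL. The single-URL PhantomJS script check_one.js is a hypothetical helper that loads one page, runs the check, and calls phantom.exit().

var fs = require('fs');
var execFile = require('child_process').execFile;

// Read the URL list once in the driver; each URL gets its own PhantomJS process.
var urls = fs.readFileSync('500sitemap.txt', 'utf8').split('\n').filter(Boolean);

function runNext(i) {
    if (i >= urls.length) {
        console.log('done');
        return;
    }
    // 'check_one.js' is a hypothetical single-URL PhantomJS script.
    execFile('phantomjs', ['check_one.js', urls[i]], function (err, stdout) {
        if (err) {
            console.log((i + 1) + ' FAILED: ' + urls[i]);
        } else {
            process.stdout.write(stdout);
        }
        runNext(i + 1); // move on only after the previous process has exited
    });
}

runNext(0);

Because every page load runs in its own process, whatever memory Qt/WebKit leaks is reclaimed by the OS as soon as that process exits.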

I ran into a similar issue that, after solving, I realized was the same as yours.

The issue is not actually that page.close() fails to free memory; it is some magical nonsense going on with setTimeout. When you call page.close() from within a setTimeout callback, it does not work properly. The workaround is to double-wrap the page.close call.

// bad
setTimeout(function() {
  page.close();
}, 1000);

// works for whatever reason
setTimeout(function() {
  setTimeout(function() {
    page.close();
  }, 1);
}, 1000);

In your specific case, try this:

// change this
setTimeout(page.close, 200);

// to this
setTimeout(function() {
  setTimeout(function() {
    page.close();
  }, 1);
}, 200);

@mz3 Thanks, that hack worked for me 👍

Due to our very limited maintenance capacity (see #14541 for more details), we need to prioritize our development focus on other tasks. Therefore, this issue will be automatically closed. In the future, if we see the need to attend to this issue again, then it will be reopened. Thank you for your contribution!

