It was suggested that I bring this issue up for PhantomJS as well. Here's the most recent thread:
The issue I'm having is that, for some webpages I try to scrape, getElementsInfo and evaluate (using document.querySelectorAll) return null for any query.
The current page I'm seeing this problem with is www.solesociety.com. I've tried using the selectors *, form, and form#mini_search_form for this page and all return null.
Example code I used:
var casper = require('casper').create();
casper.start('http://www.amazon.com');
casper.then(function() {
    var elements = this.getElementsInfo("*");
    casper.echo('Num elements: ' + elements.length);
});
casper.run();
Current casper version is 1.1.0-DEV
Current phantom version 1.9.1
n1k0 commented an hour ago
Scratch that, I reproduced your issue using the url you suggested in the description. I have no idea what's going on here.
n1k0 commented an hour ago
For the record, this native PhantomJS script fails as well:
var page = require('webpage').create();
page.open("http://www.solesociety.com", function() {
    console.log(page.evaluate(function() {
        return document.querySelector("*").length;
    }));
    phantom.exit();
});
Gives:
$ phantomjs test.js
null
querySelector returns a single node or null. You want to be using querySelectorAll.
CC @n1k0
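To illustrate the point above, here is a minimal sketch of counting matches with querySelectorAll and returning only the number from page.evaluate. The helper name countMatches and its optional doc parameter are assumptions made for illustration (doc only exists so the function can be exercised outside a browser); the guard on phantom just keeps the snippet from running the PhantomJS part elsewhere.

```javascript
// Hypothetical helper: counts the elements matching a selector.
// `doc` is an illustrative parameter; inside the page it is omitted
// and the global `document` is used instead.
function countMatches(selector, doc) {
    doc = doc || document;
    // querySelectorAll returns a NodeList, which has a length;
    // querySelector returns a single element (or null), which does not.
    return doc.querySelectorAll(selector).length;
}

// Only run the PhantomJS part when actually inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.solesociety.com', function () {
        // Extra arguments to page.evaluate are passed to the function.
        // The return value is a plain number, so it serializes cleanly.
        console.log(page.evaluate(countMatches, '*'));
        phantom.exit();
    });
}
```

Since only a number crosses from the page into the Phantom context, nothing here depends on DOM nodes surviving serialization.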
OK, I tested with querySelectorAll and it does give results. I observed something interesting, however, when pulling the forms from the same website: while three forms are returned by querySelectorAll("form"), two of the three are null when tested.
Code used:
var page = require('webpage').create();
page.open("http://www.solesociety.com", function() {
    var forms = page.evaluate(function() {
        return document.querySelectorAll("form");
    });
    console.log('Num Forms: ' + forms.length);
    for(var i = 0; i < forms.length; ++i) {
        if(forms[i]) {
            console.log('Form exists');
            console.log('form id: ' + forms[i].id);
        }
    }
    phantom.exit();
});
Output:
Form exists
form id: invite-friends-form
Is it known what could cause this? When I queried one of the two null forms directly, it worked as expected. I tested this with document.querySelector("form#search_mini_form").
Please read the docs. You can only return JSON-serializable objects from page.evaluate. Returning complex objects like DOM objects will not only not work but also usually results in unexpected/irrational behavior.
I tested each of the three forms individually, multiple times, using document.querySelector("form#[insert id]"). The three form ids are search_mini_form, newsletterSignupForm, and invite-friends-form. All three returned a valid object rather than null, which implies that they are all serializable. Why, then, would querySelectorAll("form") return a list of length 3 with two of the three entries being null?
See the previously mentioned "unexpected behavior" note. Elements cannot be reliably serialized when transferring them between different owner documents (in this case, a WebPage instance and the Phantom outer context), period. Grab the data you need from them (e.g. href values from anchors, etc.) and return just that to the Phantom context. If you need to do advanced manipulation, do it within the owner document (WebPage instance) via your page.evaluate.
This is not something that is up for debate, this is battle-tested advice from a 2+ year veteran user [and contributor].
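As a rough sketch of the advice above (not an official recipe), the forms can be reduced to plain, JSON-serializable objects inside the page before anything is returned. The helper name collectFormInfo, its optional doc parameter, and the choice of attributes are assumptions made for illustration.

```javascript
// Hypothetical helper: turns every <form> into a plain object.
// `doc` is an illustrative parameter; inside the page it is omitted
// and the global `document` is used instead.
function collectFormInfo(doc) {
    doc = doc || document;
    var forms = doc.querySelectorAll('form');
    // A NodeList is not an array, so borrow Array.prototype.map.
    return Array.prototype.map.call(forms, function (form) {
        // getAttribute avoids form controls named "id" or "action"
        // shadowing the properties of the same name on the form.
        return {
            id: form.getAttribute('id'),
            action: form.getAttribute('action'),
            method: form.getAttribute('method')
        };
    });
}

// Only run the PhantomJS part when actually inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.solesociety.com', function () {
        // Only strings and nulls cross the boundary, never DOM nodes.
        var forms = page.evaluate(collectFormInfo);
        console.log('Num Forms: ' + forms.length);
        forms.forEach(function (form) {
            console.log('form id: ' + form.id);
        });
        phantom.exit();
    });
}
```

Anything that needs the live elements (submitting, clicking, filling fields) would stay inside the function passed to page.evaluate, with only the results returned.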
Likely related to https://github.com/ariya/phantomjs/issues/11632#issuecomment-35297337
I have the same problem with the querySelectorAll function when I try to select five different spans with random id attributes. Four of the five objects come back null, and I can't click on them.
Due to our very limited maintenance capacity (see #14541 for more details), we need to prioritize our development focus on other tasks. Therefore, this issue will be automatically closed. In the future, if we see the need to attend to this issue again, then it will be reopened. Thank you for your contribution!