Jsdom: querySelectorAll() with attribute selector which tests value is slow for URI-based values

Created on 19 May 2016  路  22Comments  路  Source: jsdom/jsdom

This is more specific issue instead of #677.

1) Run this script to get a test page:

'use strict';

const fs = require('fs');

const html =  fs.openSync('test.html', 'w');

fs.writeSync(html,
  '\uFEFF<!doctype html><html><head><meta charset="UTF-8"><title></title></head><body>\n\n',
  null, 'utf8');

let i = 100000;
while (i--) {
  fs.writeSync(html, `<a title='${i}.html' href='${i}.html'></a>\n`, null, 'utf8');
}

fs.writeSync(html, '\n</body></html>\n', null, 'utf8');

You will get 100000 a elements with title and href attributes with identical values.

2) Run these four scripts one by one. Do not combine all the four tests in one script because of memoization in the querySelectorAll() which affects performance of consecutive calls. Compare the outputs.

require('jsdom').env({file: 'test.html', done: (err, window) => {
  let hrStart;
  let hrEnd;

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[title]').length);
  hrEnd = process.hrtime(hrStart);
console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
}});
0.318559525 s
require('jsdom').env({file: 'test.html', done: (err, window) => {
  let hrStart;
  let hrEnd;

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[title*=".html"]').length);
  hrEnd = process.hrtime(hrStart);
  console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
}});
0.559967377 s
require('jsdom').env({file: 'test.html', done: (err, window) => {
  let hrStart;
  let hrEnd;

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[href]').length);
  hrEnd = process.hrtime(hrStart);
  console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
}});
0.34831147 s
require('jsdom').env({file: 'test.html', done: (err, window) => {
  let hrStart;
  let hrEnd;

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[href*=".html"]').length);
  hrEnd = process.hrtime(hrStart);
  console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
}});
9.444107183 s

As you can see, the cause of slowness is not the value testing by itself, but the testing of the href value.

3) Run the second and the fourth scripts with profiling (see this guide). You can see this hot spot in the --prof-process output for the fourth script (it is missing in the output for second script):

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
  10422   57.9%  C:\Program Files\nodejs\node.exe
   8311   79.7%    C:\Program Files\nodejs\node.exe
   2868   34.5%      LazyCompile: *URLStateMachine ...\node_modules\whatwg-url\lib\url-state-machine.js:423:25
   2861   99.8%        LazyCompile: *exports.resolveURLToResultingParsedURL ...\node_modules\jsdom\lib\jsdom\living\helpers\document-base-url.js:51:42
   2861  100.0%          LazyCompile: setTheURL ...\node_modules\jsdom\lib\jsdom\living\nodes\HTMLHyperlinkElementUtils-impl.js:261:19
   2859   99.9%            LazyCompile: *getAttribute ...\node_modules\nwmatcher\src\nwmatcher-noqsa.js:321:13

In the getAttribute function of the nwmatcher-noqsa.js attributes are dealing with differently if they match the ATTR_URIDATA list. It seems to be the turning point. However I can't trace all the parsing chain to URLStateMachine because I lack for understanding of all the jsdom complexity and components connections.

Is there any possible reason for this slowing-down additional parsing calls chain in the mere string match testing?

All 22 comments

Hrm it's weird that getAttribute would call into nwmatcher. Something there has to be off...

For the rest of the stack: URLs are parsed lazily, so when you execute the CSS selector, it'll have to resolve URLs.

Woops, read the stack trace the wrong way around.

So yeah, basically what I wrote above. When you execute the selector, all URLs have to be parsed. This should only occur once though, so if you removed the memoize of the selector, you'll still see a similiar reduction in runtime.

@Sebmaster It seems the memoization does not help in this case if the match string is changed:

require('jsdom').env({file: 'test.html', done: (err, window) => {
  let hrStart;
  let hrEnd;

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[href*=".html"]').length);
  hrEnd = process.hrtime(hrStart);
  console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);

  hrStart = process.hrtime();
  console.log(window.document.querySelectorAll('a[href*="ht"]').length);
  hrEnd = process.hrtime(hrStart);
  console.log(`${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
}});
9.438445258 s

100000
9.0343929 s

Huh, then we're probably not caching the URL resolve results correctly. Need to check if we have to roll our own there or it's a bug in the implementation if we're not following the spec correctly.

In the Chrome Canary with the same file:

let pStart;

pStart = performance.now();
console.log(window.document.querySelectorAll('a[href*=".html"]').length);
console.log(`${(performance.now() - pStart) / 1e3} s`);

pStart = performance.now();
console.log(window.document.querySelectorAll('a[href*="ht"]').length);
console.log(`${(performance.now() - pStart) / 1e3} s`);
0.05266999999992549 s
100000
0.021925000000046567 s

Don't forget to create a benchmark if you have a fix, to avoid regressions: https://github.com/tmpvar/jsdom/tree/master/benchmark ;p

I don't think there's any spec-compliance issue here.

Currently we do URL parsing lazily, without caching, for things like the href property. But I'm not sure why that should impact our querySelector performance, since querySelector only cares about attribute values, not about resolved hrefs. Maybe nwmatcher is incorrectly using element.href?

The profile above seems to show nwmatcher's getAttribute calling jsdom's internal setTheURL, which doesn't make any sense to me at all. nwmatcher shouldn't have any access to jsdom internals.

In jsdom 9.2.1 performance is twice better (~ 4 sec, still very slow compared with other selector types) and the top of profile stack is changed. These hot spots are not in the simple 'a[href]' query:

   8638   55.7%  C:\Program Files\nodejs\node.exe
   6042   69.9%    C:\Program Files\nodejs\node.exe
    660   10.9%      LazyCompile: *URLStateMachine ...\node_modules\whatwg-url\lib\url-state-machine.js:427:25
    658   99.7%        LazyCompile: *module.exports.basicURLParse ...\node_modules\whatwg-url\lib\url-state-machine.js:1121:41
    658  100.0%          LazyCompile: *getAttribute ...\node_modules\nwmatcher\src\nwmatcher-noqsa.js:321:13
    622   94.5%            LazyCompile: *<anonymous> :1:10
     36    5.5%            Function: ~<anonymous> :1:10

    369    6.1%      LazyCompile: Join native array.js:131:14
    369  100.0%        LazyCompile: *InnerArrayJoin native array.js:343:24
    369  100.0%          LazyCompile: *join native array.js:356:19
    369  100.0%            LazyCompile: ~serializeURL ...\node_modules\whatwg-url\lib\url-state-machine.js:1040:22

Looks like nwmatcher does an object check on the requested attribute, which triggers the url resolve algorithm :(

I see a couple potential fixes here:

  • Fix nwmatcher to not do that
  • Add a URL resolution cache to jsdom, which all instances of reflectURLAttribute would use. A simple LRU cache of, say, 50 items would make sense to me.

The issue should have been fixed by this commit:

https://github.com/dperini/nwmatcher/commit/1e30d83

please confirm and close.

I use Node 7 beta now. All the tests in the issue just hang with CPU loaded.

  1. Can anybody test it with the last Node 6 version and confirm it fixed?
  2. Can anybody test it with the Node 7 beta and reproduce the tests to be hung?

I've tried with Node 6.8.0 portable and the first test hangs as well. jsdom 9.6.0.

Tests work for me, takes like 5 minutes for all of them though.

That's a lot longer than the seconds the OP had... not great... I wonder if there's some infinite loop due to lines like these. But then I'm not sure why it would work after 5 minutes :-/. I won't be able to really investigate until this weekend... help appreciated.

Oh sorry, I was refering to our normal test suite. Testing the first test case with a simple html string with a link takes like 0.01s. I don't have a large HTML file handy.

Ah OK! Is that with master nwmatcher then?

It seems the cause is not in the querySelectorAll. If I comment out the line with querySelectorAll the test still works more than 5 minutes and I cant wait till the end. May be the cause is the parsing the html itself with the 100000 links. However this is downgrade.

Just try this test with the file produced by the code in the first post:

require('jsdom').env({file: 'test.html', done: (err, window) => {
  console.log('Parsed.');
}});

It hangs.

OK. That sounds like probably a separate issue. I am surprised jsdom was ever able to handle such a large document, but I guess there might have been a regression recently. Let's open a new issue to discuss and diagnose, and leave this for confirming that querySelector is faster with @dperini's changes.

OK. I've reduced the number of elements to 1000 and here is the results:

0.038502006 s

1000
0.041310866 s

1000
0.033337641 s

1000
0.10319168 s

The last is 3 times slower now instead of 30 times slower then. Is it representative enough to close the issue?

Yeah, that sounds good to me :). Thanks very much for testing!!

Was this page helpful?
0 / 5 - 0 ratings