jsdom is a great tool for web scraping. However, textContent is a very inconvenient way to get readable text for html2text conversion.
There is a wonderful article about how useful the neglected innerText is in many cases:
http://perfectionkills.com/the-poor-misunderstood-innerText/
The author suggests getSelection().toString() as a workaround, albeit a very slow one, but getSelection is not implemented in jsdom yet.
Could you consider implementing innerText in jsdom? The author has explored it thoroughly and has even added a simple spec at the end.
It is also a pity that the rangy Selection and innerText library is not compatible with jsdom: https://github.com/timdown/rangy/issues/348
So, innerText is not standard, and not implemented in at least one major engine (Firefox). Without a standard, I don't think we should implement it.
Looks like there's some movement in this whole thing with a draft spec here. See also all the references. There are no issues on the repo though, so I wonder how complete it already is / how quick progress will be.
Firefox has implemented it: https://bugzilla.mozilla.org/show_bug.cgi?id=264412
WHATWG seems to approve: https://github.com/whatwg/compat/issues/5#issuecomment-168049752
From the spec, it seems like we can't implement innerText properly without basic layout support.
Yeah, this is not really going to be implementable in jsdom anyway, without a lot of infrastructure work... nobody get their hopes up :(.
As to the layout support requirement: https://github.com/rocallahan/innerText-spec/issues/2
Is there any plan to implement it because of WHATWG adoption?
Yeah... Although the spec requires a lot of stuff jsdom doesn't have, around CSS boxes :(. Not sure what to do.
Is there any lib for this to plug along with jsdom?
@domenic care to drop some knowledge on why this is such an infrastructure overhaul? We thought the 800lb gorilla in the room would leave low-key, but it looks like it's not going anywhere. As you know, I have been wrapping my head around the innards of jsdom. Where in the repo would be a good place for a jsdom newb to start reviewing code?
Thanks in advance 🙏 /cc @vsemozhetbyt
The primary issue is the fact that innerText leans on the layout engine for guidance, and jsdom has no layout engine. See https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute and http://perfectionkills.com/the-poor-misunderstood-innerText/ . From the second link:
Notice how innerText almost precisely represents exactly how text appears on the page. textContent, on the other hand, does something strange — it ignores newlines created by <br> and around styled-as-block elements (<span> in this case). But it preserves spaces as they are defined in the markup.
Still out of scope and no workaround?
Apparently the spec says:
If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.
I think a workaround would then be to simply return textContent.
We implement enough CSS that I don't think that applies. We just don't implement the layout parts...
Hi guys, any news on this one?
Just use headless chrome :)
@domenic from that spec that @coreh mentioned:
https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute
If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element.
https://html.spec.whatwg.org/multipage/rendering.html#being-rendered
An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.
If jsdom doesn't implement the layout parts, doesn't that mean "not being rendered" applies?
This message is for anyone reaching this github thread that just wants a way to get their tests passing without changing their function implementations.
copypasta for the top of your test files:
const { JSDOM } = require('jsdom');

// Expose JSDOM Element constructor
global.Element = (new JSDOM()).window.Element;
// 'Implement' innerText in JSDOM: https://github.com/jsdom/jsdom/issues/1245
Object.defineProperty(global.Element.prototype, 'innerText', {
  get() {
    return this.textContent;
  },
});
Naturally, caveats from the above discussion apply.
In case anyone else is running into this issue, I took it one step further and used the sanitize-html package to get basically what the browser is doing. Note that I did not import the JSDOM setup, as I found it wasn't needed when putting this in my Jest setup file; if you're not using Jest, you'll want to use the global.Element = (new JSDOM()).window.Element setup that @bennypowers recommended:
const sanitizeHtml = require('sanitize-html');

Object.defineProperty(global.Element.prototype, 'innerText', {
  get() {
    return sanitizeHtml(this.textContent, {
      allowedTags: [], // remove all tags and return text content only
      allowedAttributes: {}, // allow no attributes
    });
  },
  configurable: true, // make it so that it doesn't blow chunks on re-running tests with things like --watch
});
I had a similar need but wanted to go slightly further than just using textContent. Again, this won't be an accurate representation of what browsers actually do, especially with respect to elements hidden by CSS, but it's good enough for my use case:
function innerText(el) {
  el = el.cloneNode(true) // can skip if mutability isn't a concern
  el.querySelectorAll('script,style').forEach(s => s.remove())
  return el.textContent
}
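Building on the helper above, here is a rough sketch that also emits line breaks for br elements and around a handful of block-level tags. It decides purely by tag name, so no layout engine is needed, but it ignores CSS entirely (display, visibility, white-space) and flattens pre formatting. The names approximateInnerText, BLOCK_TAGS and SKIP_TAGS are made up for illustration, and the tag lists are an assumption rather than anything taken from the spec or from jsdom.

```javascript
// Hypothetical helper: a tag-name-based approximation, NOT the spec algorithm.
// Assumed (incomplete) list of tags treated as block-level.
const BLOCK_TAGS = new Set([
  'ADDRESS', 'ARTICLE', 'ASIDE', 'BLOCKQUOTE', 'DIV', 'FOOTER', 'HEADER',
  'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'LI', 'MAIN', 'NAV', 'OL', 'P', 'PRE',
  'SECTION', 'TABLE', 'TR', 'UL',
]);
// Assumed list of tags whose text is never shown to the user.
const SKIP_TAGS = new Set(['SCRIPT', 'STYLE', 'TEMPLATE', 'NOSCRIPT']);

function approximateInnerText(root) {
  const parts = [];

  (function walk(node) {
    if (node.nodeType === 3) {
      // Text node: collapse whitespace runs the way normal flow text would.
      parts.push(node.nodeValue.replace(/\s+/g, ' '));
      return;
    }
    if (node.nodeType !== 1) return; // skip comments and other node types

    const tag = node.tagName.toUpperCase();
    if (SKIP_TAGS.has(tag)) return;
    if (tag === 'BR') {
      parts.push('\n');
      return;
    }

    const isBlock = BLOCK_TAGS.has(tag);
    if (isBlock) parts.push('\n');
    for (const child of node.childNodes) walk(child);
    if (isBlock) parts.push('\n');
  })(root);

  return parts
    .join('')
    .replace(/ +/g, ' ')     // merge spaces from adjacent text nodes
    .replace(/ ?\n ?/g, '\n') // drop spaces next to line breaks
    .replace(/\n+/g, '\n')    // collapse consecutive line breaks
    .trim();
}
```

It expects an element as the root, so with jsdom it could be called on something like document.body. It only reads nodeType, tagName, nodeValue and childNodes, so the input does not need to be cloned.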
What a pity!
Apparently the spec says:
If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.
I think a workaround would be then to simply return textContent.
We implement enough CSS that I don't think that applies. We just don't implement the layout parts...
@domenic please consider a more liberal interpretation of the spec
textContent is explicitly allowed as fallback, when application of CSS rules is too expensive
also, innerText is specified as getter and setter
Given that I am the spec editor, I can state with certainty that "when application of CSS rules is too expensive" is not what the spec is saying.
.. that was my interpretation of "if the user agent is a non-CSS user agent"
what's the difference between a "CSS user agent" and a "non-CSS user agent"?
what about:
a CSS user agent can "apply CSS rules" and output the result (graphic or textual)
a non-CSS user agent is too dumb to "apply CSS rules"
We implement enough CSS that I don't think that applies.
what do you mean? window.getComputedStyle?
a fallback to textContent is still better than not implementing a standard interface
Maybe we can just use the textContent value to replace the result of innerText while running tests with jsdom. For example:
describe('mytest', () => {
  beforeAll(() => {
    Object.defineProperty(HTMLElement.prototype, 'innerText', {
      get() {
        return this.textContent;
      }
    });
  });
  it('should ok', () => {
    // test assertions
  });
});
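Since innerText is specified as both a getter and a setter, a slightly fuller version of the same fallback might look like the sketch below. installInnerTextFallback is a made-up name, not a jsdom API. It only mirrors textContent, so none of the layout-dependent behavior is reproduced, and the setter does not turn newlines into br elements the way browsers do.

```javascript
// Hypothetical helper: aliases innerText to textContent on a given prototype.
// Returns true if the fallback was installed, false if innerText already exists.
function installInnerTextFallback(proto) {
  // Leave a real implementation alone if the environment ever gains one.
  if ('innerText' in proto) return false;

  Object.defineProperty(proto, 'innerText', {
    get() {
      return this.textContent;
    },
    set(value) {
      this.textContent = value;
    },
    configurable: true, // allows redefinition, e.g. across --watch re-runs
    enumerable: true,
  });
  return true;
}
```

In a Jest setup file this could be called once as installInnerTextFallback(HTMLElement.prototype). The existence check means calling it again is a no-op instead of an error.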