Cheerio: Doing .text() of a <script> brings empty string on 1.0.0-rc.2

Created on 6 Jul 2017 · 16Comments · Source: cheeriojs/cheerio

Hi, I've been doing .text() of tags for scraping purposes, and the latest release candidate breaks this. jQuery 1~3 and cheerio ~0 both works fine.

To Do

Source

aletorrado

👍1

Most helpful comment

I spent a lot of time wondering why I wasn't able to get the text() of a <script type="application/ld+json"> and finally found this issue. It should be documented somewhere as a breaking change is it's not reverted.

I agree with @fb55, I was confused as both innerText and textContent would return the content of the script (at least in Chrome).

I'm okay using html() instead if this behavior is more aligned with the jQuery API, as long as the breaking change is documented.

FdezRomero on 23 Aug 2017

👍11

All 16 comments

Thanks for the report! This is a tricky problem.

The updated behavior was merged to master (i.e. not just the 1.0.0 release branch) in gh-1018. Judging from the discussion in that patch (and in bug reports gh-986 and gh-)924), it seems to have been motivated by comparisons with anecdotal evidence of observed jQuery behavior. Missing from those discussions is evidence from the jQuery documentation of $.fn.text, but the docs do indeed support this:

The .text() method cannot be used on form inputs or scripts. To set or get
the text value of input or textarea elements, use the .val() method. To
get the value of a script element, use the .html() method.

Interestingly, this is not observable from Chrome or Firefox--the method returns the text contents of script elements in both of those browsers. The caveat is likely in place due to practical constraints for Internet Explorer. IE versions 6 through 8 do not provide text contents for script tags. I imagine the jQuery team opted to recommend using html because smoothing over IE's behavior had negative performance implications.

From this perspective, the change is technically valid. To promote portability between Cheerio and jQuery and between all web browsers, Cheerio should similarly recommend that consumers use html for this use case.

That said, such a policy definitely feels like a lowest common denominator solution, one that accounts for web browsers that are over eight years old and have negligible market share. It's also important to note that jQuery itself no longer supports Internet Explorer releases prior to version 9. (That detail tends to suggest a jQuery documentation change. I've opened an issue to discuss this with the jQuery
maintainers.)

So with all of that said, I'm leaning towards reverting gh-1018. What do you think, @fb55?

jugglinmike on 29 Jul 2017

This is an interesting case. I was assuming that .text() was an alias for innerText, when it is using textContent under the hood. I'm fine with reverting #1018, even though the current behaviour is probably more useful when scraping pages (as both tags contents usually aren't visible).

fb55 on 1 Aug 2017

A bit bias ;) but the current behavior does support the specifications in the jQuery docs as you mentioned.

Having both options is best? Users wanting to scrape script tags can use .html() and users wanting to ignore script tags can use .text(). Reverting #1018 will give users only one option.

ghost on 14 Aug 2017

I agree with @fb55, I was confused as both innerText and textContent would return the content of the script (at least in Chrome).

I'm okay using html() instead if this behavior is more aligned with the jQuery API, as long as the breaking change is documented.

FdezRomero on 23 Aug 2017

👍11

I was also confused, that the text() method on script elements returns empty string. As for jQuery 3.1 the jQuery(script).text() returns the textContent of the element.

tenbits on 22 Jan 2019

👍2

Honestly I think I just pulled out every strand of hair on my head head while trying to figure why I kept getting an empty string!

tegaguru on 14 Jan 2020

Let's revert #1018, the current behavior seems pretty surprising.

fb55 on 12 Sep 2020

👍1

Before publishing that fix, please see if enzyme's test suite passes with it :-)

ljharb on 12 Sep 2020

@ljharb Do you have a suspicion that things might break? Would love to hear another take on this 😄

fb55 on 12 Sep 2020

It’s possible it’d only be conceptual and not actual breakage, but I’d hope enzyme has enough test coverage that it’d notice the change.

ljharb on 12 Sep 2020

Next steps:

[ ] Create a PR reverting this change.
[ ] Test that PR against enzyme's test suite.
[ ] If failing tests in enzyme, discuss with @ljharb.
[ ] Assuming all good, update our tests, document breaking change, merge.

matthewmueller on 22 Sep 2020

👍1

@ljharb The last commit on the enzyme master branch is two years old, even though tags still seem to be pushed. What is the best source for enzyme's source code, so we can run the test suite?

fb55 on 20 Dec 2020

@fb55 thats the date the commit was authored; i landed it a week ago. The master branch is quite up to date.

ljharb on 20 Dec 2020

Ah, interesting 😄 Thanks for clarifying!

fb55 on 20 Dec 2020

I decided to punt on this for now. For reference, enzyme actually does not have a test covering it.

fb55 on 21 Dec 2020

@fb55 that is true; the second checkbox in the list above never happened. I can certainly try to add a test, or at least open a PR/branch with one, if that would help you fix this in cheerio?

ljharb on 22 Dec 2020

Was this page helpful?

0 / 5 - 0 ratings