Cheerio: Doing .text() of a <script> brings empty string on 1.0.0-rc.2

Created on 6 Jul 2017  路  16Comments  路  Source: cheeriojs/cheerio

Hi, I've been doing .text() of tags for scraping purposes, and the latest release candidate breaks this. jQuery 1~3 and cheerio ~0 both works fine.

To Do

Most helpful comment

I spent a lot of time wondering why I wasn't able to get the text() of a <script type="application/ld+json"> and finally found this issue. It should be documented somewhere as a breaking change is it's not reverted.

I agree with @fb55, I was confused as both innerText and textContent would return the content of the script (at least in Chrome).

I'm okay using html() instead if this behavior is more aligned with the jQuery API, as long as the breaking change is documented.

All 16 comments

Thanks for the report! This is a tricky problem.

The updated behavior was merged to master (i.e. not just the 1.0.0 release branch) in gh-1018. Judging from the discussion in that patch (and in bug reports gh-986 and gh-)924), it seems to have been motivated by comparisons with anecdotal evidence of observed jQuery behavior. Missing from those discussions is evidence from the jQuery documentation of $.fn.text, but the docs do indeed support this:

The .text() method cannot be used on form inputs or scripts. To set or get
the text value of input or textarea elements, use the .val() method. To
get the value of a script element, use the .html() method.

Interestingly, this is not observable from Chrome or Firefox--the method returns the text contents of script elements in both of those browsers. The caveat is likely in place due to practical constraints for Internet Explorer. IE versions 6 through 8 do not provide text contents for script tags. I imagine the jQuery team opted to recommend using html because smoothing over IE's behavior had negative performance implications.

From this perspective, the change is technically valid. To promote portability between Cheerio and jQuery and between all web browsers, Cheerio should similarly recommend that consumers use html for this use case.

That said, such a policy definitely feels like a lowest common denominator solution, one that accounts for web browsers that are over eight years old and have negligible market share. It's also important to note that jQuery itself no longer supports Internet Explorer releases prior to version 9. (That detail tends to suggest a jQuery documentation change. I've opened an issue to discuss this with the jQuery
maintainers
.)

So with all of that said, I'm leaning towards reverting gh-1018. What do you think, @fb55?

This is an interesting case. I was assuming that .text() was an alias for innerText, when it is using textContent under the hood. I'm fine with reverting #1018, even though the current behaviour is probably more useful when scraping pages (as both tags contents usually aren't visible).

A bit bias ;) but the current behavior does support the specifications in the jQuery docs as you mentioned.

Having both options is best? Users wanting to scrape script tags can use .html() and users wanting to ignore script tags can use .text(). Reverting #1018 will give users only one option.

I spent a lot of time wondering why I wasn't able to get the text() of a <script type="application/ld+json"> and finally found this issue. It should be documented somewhere as a breaking change is it's not reverted.

I agree with @fb55, I was confused as both innerText and textContent would return the content of the script (at least in Chrome).

I'm okay using html() instead if this behavior is more aligned with the jQuery API, as long as the breaking change is documented.

I was also confused, that the text() method on script elements returns empty string. As for jQuery 3.1 the jQuery(script).text() returns the textContent of the element.

Honestly I think I just pulled out every strand of hair on my head head while trying to figure why I kept getting an empty string!

Let's revert #1018, the current behavior seems pretty surprising.

Before publishing that fix, please see if enzyme's test suite passes with it :-)

@ljharb Do you have a suspicion that things might break? Would love to hear another take on this 馃槃

It鈥檚 possible it鈥檇 only be conceptual and not actual breakage, but I鈥檇 hope enzyme has enough test coverage that it鈥檇 notice the change.

Next steps:

  • [ ] Create a PR reverting this change.
  • [ ] Test that PR against enzyme's test suite.
  • [ ] If failing tests in enzyme, discuss with @ljharb.
  • [ ] Assuming all good, update our tests, document breaking change, merge.

@ljharb The last commit on the enzyme master branch is two years old, even though tags still seem to be pushed. What is the best source for enzyme's source code, so we can run the test suite?

@fb55 thats the date the commit was authored; i landed it a week ago. The master branch is quite up to date.

Ah, interesting 馃槃 Thanks for clarifying!

I decided to punt on this for now. For reference, enzyme actually does not have a test covering it.

@fb55 that is true; the second checkbox in the list above never happened. I can certainly try to add a test, or at least open a PR/branch with one, if that would help you fix this in cheerio?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

chenweiyj picture chenweiyj  路  5Comments

AlbertoElias picture AlbertoElias  路  4Comments

misner picture misner  路  3Comments

becush picture becush  路  3Comments

unicrus picture unicrus  路  4Comments