Cheerio: $.html() encoding apostrophes

Created on 6 Dec 2013 · 25 Comments · Source: cheeriojs/cheerio

Having quotes inside quotes and calling .html() on it encodes the inner quotes. jQuery's .html() correctly returns the desired output (it does not replace ' with &apos;).

console.log(require('cheerio').load('<div ng-include="\'views/main.html\'"></div>').html());
// <div ng-include="&apos;views/main.html&apos;"></div>
โŒ Bug

All 25 comments

Okay, why exactly is this an issue?

It effectively replaces the "'file'" with "&apos;file&apos;". The inside single quotes are required and valid.

Why are they required? Every browser's HTML parser will output the same result, it won't make any difference.

No offense, but this is a valid issue. cheerio should not replace the syntax the user wrote. Yes, a parser would display it as "'file'" to the user if I were outputting it to a browser's display. I'm not. I'm using it to parse the HTML and then work with it. It should keep everything in the ng-include as entered and treat the whole value as a string.

That specific syntax is required when using AngularJS: http://docs.angularjs.org/api/ng.directive:ngInclude

It's not the job of a parser to preserve the original document. .html() returns an HTML representation of the parsed document, which doesn't have to be equal to the original document.

As long as the source code is parsed, all entities will be resolved (example).
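To illustrate the point about entities being resolved on parse, here is a minimal sketch. `decodeBasicEntities` is a hypothetical helper written for this example (it is not cheerio's or any browser's implementation) and it only handles five named entities. It shows that the original attribute value and the serialized one decode to the same string:

```javascript
// Hypothetical helper, not part of cheerio: decodes a handful of named
// entities in a single pass, roughly as a parser would for an attribute value.
const ENTITY_MAP = { apos: "'", quot: '"', amp: '&', lt: '<', gt: '>' };

function decodeBasicEntities(text) {
  return text.replace(/&(apos|quot|amp|lt|gt);/g, (match, name) => ENTITY_MAP[name]);
}

const original = "'views/main.html'";
const serialized = "&apos;views/main.html&apos;";

// Both forms resolve to the same attribute value once parsed.
console.log(decodeBasicEntities(original) === decodeBasicEntities(serialized)); // true
```

The difference only matters for consumers that read the serialized text without parsing it as HTML, which is the case several commenters describe.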

You are right about the entities being resolved. I forgot that it does that. However, the issue is still valid. It shouldn't change the source code. According to cheerio's description ("Fast, flexible, and lean implementation of core jQuery designed specifically for the server."), it should behave like jQuery core, which doesn't do this.

Lastly, even if it does replace the quote, it shouldn't use &apos;: http://stackoverflow.com/questions/2083754/why-shouldnt-apos-be-used-to-escape-single-quotes

I agree that cheerio could be smarter about replacing entities. Anyway, we're living in an HTML5 world, &apos; is part of HTML5 and can be used (HTML4 browsers implemented it anyway, so the point is invalid).

I'll jump in on this one. In my case I am appending an image tag with a src URL with query params. In theory, it should not matter that it replaces the & with &amp;. But it does matter in this case: the output is an email, not a browser, and the &amp; is being interpreted as written, so my query params come in as amp;qname. I have a workaround, which is to unescape the output of .html(), but really I should not have to in this case, and I have some concerns about what may happen if a legitimately escaped string comes in and we unescape it. Anyway, food for thought.
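One way to limit the risk described above is to unescape only inside URL attributes rather than across the whole output. The following is a rough sketch under that assumption: `unescapeAmpInUrlAttributes` is a hypothetical helper, the URL is made up, and the regular expression only handles double-quoted src and href values.

```javascript
// Hypothetical post-processing step for the string returned by .html().
// Only &amp; inside double-quoted src/href values is turned back into &,
// so escaped text elsewhere in the document is left alone.
function unescapeAmpInUrlAttributes(html) {
  return html.replace(/\b(src|href)="([^"]*)"/g, (match, name, value) =>
    `${name}="${value.replace(/&amp;/g, '&')}"`);
}

const serialized =
  '<p>Tom &amp; Jerry</p><img src="http://example.com/t.gif?a=1&amp;qname=2">';

console.log(unescapeAmpInUrlAttributes(serialized));
// <p>Tom &amp; Jerry</p><img src="http://example.com/t.gif?a=1&qname=2">
```

This does not remove the concern entirely: a URL that was deliberately double-escaped would still be altered, so it is a mitigation rather than a fix.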

Okay, e-mails are a valid issue, especially with &amp; in src and href attributes (that also applies to regular websites).

I am having the same issue. Basically, I am reading the HTML, doing some pre-processing, and writing it back to the same file. A quick fix would be really appreciated.

As mentioned in http://stackoverflow.com/questions/2083754/why-shouldnt-apos-be-used-to-escape-single-quotes, &apos; causes errors in IE8, as I just found out.

I've also hit this as an issue when working with emails and adding tracking parameters to links. Any help working towards a fix would be greatly appreciated.

Happy to muck in and work on this if there are any pointers as to where to start?

I'm facing that issue too since I'm also using Angular. @eddiemonge did you find a decent workaround?

same here

Same problem here with ng-include.

Similar with backbone/rendr: <div data-fetch_summary="{&quot;collection&quot;:5}"> -> <div data-fetch_summary="{&#x26;quot;collection&#x26;quot;:5}">
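The output above looks like double encoding: the ampersand of an already-encoded &quot; gets encoded again. A minimal sketch of how that can happen, assuming the serializer encodes attribute text that was never decoded (`encodeAttr` and `decodeQuotAmp` are hypothetical helpers for illustration, not cheerio's code):

```javascript
// Hypothetical attribute encoder: escapes & first, then double quotes.
function encodeAttr(text) {
  return text.replace(/&/g, '&#x26;').replace(/"/g, '&quot;');
}

// Hypothetical single-pass decoder for the two entities used here.
function decodeQuotAmp(text) {
  return text.replace(/&(quot|amp);/g, (match, name) => (name === 'quot' ? '"' : '&'));
}

const raw = '{&quot;collection&quot;:5}';

// Encoding the still-encoded text escapes the & of every &quot; again.
console.log(encodeAttr(raw));
// {&#x26;quot;collection&#x26;quot;:5}

// Decoding first and then encoding round-trips cleanly.
console.log(encodeAttr(decodeQuotAmp(raw)));
// {&quot;collection&quot;:5}
```

Whichever side skips its half of the decode/encode pair, the symptom is the same, so decoding and encoding need to be switched on or off together.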

Submitted PR #499

Seems fixed to me in latest.

@eddiemonge I'm running cheerio 0.17 and when I run your test I still get encoded apostrophe characters.

console.log(require('cheerio').load('<div ng-include="\'views/main.html\'"></div>').html());
<div ng-include="&apos;views/main.html&apos;"></div>

The encoded characters inside attributes seem to break htmlmin, which is why I'd like to leave them in if possible.

Have you tried to run it with decodeEntities: false?

console.log(require('cheerio').load('<div ng-include="\'views/main.html\'"></div>', {decodeEntities: false}).html());

@alexindigo I've met the same issue and have tried {decodeEntities: false} with the same usage you suggest, unfortunately it did not work...

@dickeylth It does work for me. If you can share your code, maybe we'd be able to point out the issue. Thank you.

This is still an issue introduced by /fb55/entities. A PR solving this issue was made in 2015 and remains unmerged. Scenario here: Corporate email that contains stuff parsed by cheerio, &apos; renders as plaintext in Outlook environment. What should we do about it? As a cheerio user, I do not have control over whether entities does HTML4 or HTML5 encoding.
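For environments that do not recognise &apos;, one possible stopgap is to rewrite it as the numeric reference &#39;, which HTML4 also accepts. The sketch below post-processes the string returned by .html(); `aposToNumeric` is a hypothetical helper and not an option offered by cheerio or entities.

```javascript
// Hypothetical post-processing step: swap the named entity for the
// numeric character reference for the apostrophe.
function aposToNumeric(html) {
  return html.replace(/&apos;/g, '&#39;');
}

const serialized = '<div ng-include="&apos;views/main.html&apos;"></div>';

console.log(aposToNumeric(serialized));
// <div ng-include="&#39;views/main.html&#39;"></div>
```

Text that was escaped as &amp;apos; is not affected, because the pattern only matches an ampersand directly followed by apos;.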

I had style="font-family: 'Roboto';" and { decodeEntities: false } fixed the &apos; problem.

The problem with &apos; is that it comes from the XML/XHTML standard and is not part of HTML 4.
