Cheerio: decodeEntities false code tag

Created on 14 May 2018  ยท  34Comments  ยท  Source: cheeriojs/cheerio

const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  decodeEntities: false
});
console.log($.html());

output

<html><head></head><body><code><script src></code></body></html>

you can see that the string &lt;script src&gt; has been unescaped

Most helpful comment

This is the biggest problem of the project

All 34 comments

It seems that you need to set it to true.

$ = cheerio.load(`<code>&lt;script src&gt;=></code>`, {   decodeEntities: true }); console.log($.html());
<html><head></head><body><code>&lt;script src&gt;=&gt;</code></body></html>

@Jasonlhy if i set true, the chinese will be decoded, eg:

const $ = cheerio.load(`<p>ๅ“ˆๅ“ˆๅ“ˆๅ“ˆ</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($.html());

output:

<html><head></head><body><p>&#x54C8;&#x54C8;&#x54C8;&#x54C8;</p><code>&lt;script src&gt;</code></body></html>

After a little bit of research and testing. Parser5 (The HTML parser) does not support the option: decodeEntities now, the result from parser5 is default decoded (unescaped).

Both cases will print out ๅ“ˆๅ“ˆๅ“ˆๅ“ˆ.

const $ = cheerio.load(`<p>ๅ“ˆๅ“ˆๅ“ˆๅ“ˆ</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());
const $ = cheerio.load(`<p>ๅ“ˆๅ“ˆๅ“ˆๅ“ˆ</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

I think the reason of this problem is the serializer, it is located at https://github.com/cheeriojs/dom-serializer/blob/master/index.js

It seems to change all decoded entities (Including Chinese text) back to encoded one if decodeEntities set to true

function renderText(elem, opts) {
  var data = elem.data || '';

  // if entities weren't decoded, no need to encode them back
  if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
    data = entities.encodeXML(data);
  }

  return data;
}

This is the biggest problem of the project

I had documents containing html like <pre>&lt;span&gt;a = 1&lt;/span&gt;</pre> (i.e. code samples of html itself) and wanted to process them with cheerio before displaying sending them to the browser. decodeEntities: false obviously was not working due to the problem reported in this issue, and decodeEntities: true was encoding far more than < and >.

I kept decodeEntities: false and used https://www.npmjs.com/package/patch-package to generate and automatically apply this patch to dom-serializer upon every install:

```diff
diff --git a/node_modules/dom-serializer/index.js b/node_modules/dom-serializer/index.js
index 4f41e8f..8f9f14f 100644
--- a/node_modules/dom-serializer/index.js
+++ b/node_modules/dom-serializer/index.js
@@ -136,6 +136,8 @@ function renderText(elem, opts) {
data = entities.encodeXML(data);
}

  • data = data.replace(//g, '>')

  • return data;
    }
    ```

Since that only affects text nodes, it should not mess with anything, but caveat emptor.

I wonder if the clean solution here is replacing that encodeXML with something more context-aware, i.e. encode depending on the document type and charset.

This issue still exists now and I am crazy because I am building a static site generator with cheerio. This problem behaves in two ways:

  • First I am using marked.js and it will encode symbols like < in code blocks (<pre></pre>) into &lt, that's what expected. However, if I set decodeEntities: false, cheerio will replace &lt back with <, which is ridiculous!!! Why parser decode &lt into <? I think it should not change this.
  • But if I set decodeEntities: true, all CJK chars will be replaced into HTML entities, which is hard to deal with because some JavaScript codes needs to get string length, HTML entities is OK to render but not easy to process.

Can anybody locate the problem and fix it? Thanks a lot.

After reading codes for 3 hours I believe I find where problem begins.

@bard I think your patch is not correct, not only innerHTMLs but attrs also has this problem.

The problem is because cheerio is switching from htmlparser2 to parse5, and decodeEntities is an option to htmlparser2, but not an option to parse5, in version 1.0.0-rc3, they did not pass this option to parse5 too.

If we use htmlparser2 with decodeEntities: false, all things works correctly. However if you install cheerio with npm i -s cheerio, you got 1.0.0-rc3, which uses parse5.

parse5 will parse &lt; into < when it reading input, while htmparser2 with decodeEntities: false won't. And when we get result with .html(), cheerio will use dom-serializer to render strings.

dom-serializer supposes parser understanding decodeEntities, suppose we have &lt; as input, if decodeEntities is true, dom-serializer should get &, else it should get <. When we are using htmlparser2, it works. With decodeEntities: false, it looks like &lt;(input) -> &lt;(parsed) -> &lt;(before and after serialized). With decodeEntities: true, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> &lt; (after serialized).

But whether decodeEntities is true or not, parse5 always parse &lt; into <, if we set it to true, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> &lt; (after serialized), however, chars like Chinese and Japanese are ALSO encoded, which is BAD. But if we set it to false, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> < (after serialized because decodeEntities is false), which is totally a mess.

dom-serializer is right, it has no fault in its behavior. There are two ways to fix:

  • Stop use parse5, let us use htmlparser2 again by install 0.22.0 with npm i -s [email protected] (code in master branch is still 0.22.0, confusing). @fb55 @jugglinmike @matthewmueller
  • Make parse5 able to understand decodeEntites, because sometimes we want to keep &lt; or &amp; in innerHTML, always un-escape them will cause a mess. @inikulin

We'd better not modify dom-serializer, because it's a parse5 issue. parse5 does NOT always un-escape &amp;, it replace &amp in innerHTML into &, but won't replace &amp; in attributes.

It's a little bit long because it's complex, but please read and fix it, we cannot always lock version to 0.22.0.

Based on https://github.com/jsdom/jsdom/issues/2501 and https://github.com/inikulin/parse5/issues/261#issuecomment-401389295 it's unlikely that this will be handled in parse5.

The second thread suggests that it's possible to grab the raw, undecoded text from the input data based on node position, though that would require modifying dom-serializer, too.

@bard dom-serializer is used to generate strings from dom, not dom from string. We shouldn't touch it because it works correctly.

due to https://github.com/inikulin/parse5/issues/75 we know parse5 always decode entities, so decodeEntites is useless, and then we should add another option for encode entities or not, problems are cheerio currently support both parse5(for HTML) and htmlparser2(for XML)...

Edit: I got a better idea, we just need to remove entities re-encode in dom-serializer when working with parse5, I will send a pr.

Edit: However, this will break compatibility, needs an version bump...

Edit: PR is here https://github.com/cheeriojs/dom-serializer/pull/80

Thanks @AlynxZhou. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0.

I send a PR to dom-serializer to fix 1.0.0, however no one replied...

On Mon, Jun 3, 2019, 08:55 Iqbal Ahmed notifications@github.com wrote:

Thanks @AlynxZhou https://github.com/AlynxZhou. Great find. Your 1st
Solution worked for me, but you are right that we can't be permanently be
bound to version 0.22.0.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/cheeriojs/cheerio/issues/1198?email_source=notifications&email_token=ACANVR42P4FOIMW62TLNKKDPYRTWVA5CNFSM4E7WNJK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYBZXQ#issuecomment-498080990,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACANVR7VPU5TRUT4COB57HDPYRTWVANCNFSM4E7WNJKQ
.

@fb55 @jugglinmike this seems to be causing some folks pain. If i recall, we were going to continue supporting both dom-serializer and parse5. do we have a snippet that we can point people to for the old behavior?

This is because parse5 ignores decodeEntity, I modified dom-serializer to fix this, but it needs a version bump.

I wonder what breaks exactly with decoded entities. The only thing that comes to my mind is that resulting pages are served in encoding other than UTF-8. Otherwise, nowadays there is no reason to prefer HTML entities (with an exception for safety-related HTML escaping) over code points.

@inikulin The issue is serialization. Some people have entities as plain text in their HTML or want to maintain the encoding style they've used.

@fb55 But for people who don't want HTML entities in their files, cheerio 1.0.0 is totally broken because it cannot handle things correctly (decodeEntities false, &lt; becomes <, decodeEntities true, CJK chars becomes entities). The problem is not in cheerio, it's because parse5 always decode entities while parsing. Using decodeEntities to decide whether we should encode is not a good choice, because we cannot control whether decode entities when using parse5, we'd better not to encode entities, just escape HTML is enough. cheerio is expected to be a parser, not encoder, encoding all entities makes it really hard for further string editing.

I hope you can consider my PR https://github.com/cheeriojs/dom-serializer/pull/80, I am using cheerio in my tools and I cannot always stick to 0.22.0.

One month passed and this problem still exists:

Screenshot from 2019-08-07 14-30-46

I am afraid that as a Chinese user I have to keep using cheerio 0.22.0 in my project forever because it seems no one will take a look on the reason I found.

@AlynxZhou this workaround should help. I've been using it in production since April with no ill effects.

@claviska Thank you. But what I want is not a workaround, I know I can solve it by a simple hook. However, the problem is a bug in fact, it is because new library's behavior is different from the old one, and it needs to be fixed by remove unsupported option, it's complex and needs a change of logic, workaround can only hide the problem, but the problem still exists and may break other part.

Anyway thank you for your help.

@AlynxZhou I'd like to see it fixed too, but this is open source so our options are to:

  1. Contribute a fix
  2. Use a workaround
  3. Wait patiently
  4. Use an old version that probably won't get updated ๐Ÿ˜•

What I've proposed can be used as a separate module that you can easily remove once the bug is properly fixed. If you want to use the latest version without this encoding issue, it's probably your best bet.

Cheers

@claviska I have send a PR to dom-serializer which removed decodeEntities because parser5 always do decoding and this options is useless. However, still no one replied.

To unblock, you can use the old parser:

const cheerio = require('cheerio')
const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  _useHtmlParser2: true,
  decodeEntities: false,
})
console.log($.html())
// <code>&lt;script src&gt;</code>

This is undocumented and may be removed when we go to 1.0.0. However, since I'm telling people they can use it, we'll probably need to support it. I hope this is okay, @jugglinmike @fb55.


@AlynxZhou thanks for taking the time to share your findings. Unfortunately, it's not easy for me to tell if the snippet above solves your problem or not. Can you please reframe your problem in the following way:

  1. Create a minimal code example that illustrates the problem.
  2. What you expect to see.
  3. What you actually see.

This will help us debug your issue. You can refer to the original poster for a decent example of this.

@matthewmueller No need to check via code, I can tell you this can definitely solve problem, however, as you said, _useHtmlParser2 is undocumented and may be removed, so I am afraid that I can't use such a unstable feature in my project to fix bugs.

Such snippet is just let 1.0.0 works like 0.22.0, to solve problem we need to consider differences between parse5 and htmlparser2, and I need to say again, encode all things into XML entities is a BAD option and we may consider to drop it because parse5 always decode input.

It's highly dependent on your use case, which is why we need to provide an option to switch. parse5 is a better supported parser at this point and I believe it was @fb55's idea to switch to it over his own parser.

FYI, for the final public API I'd probably try and have a parser interface like this:

const parser = {
  parse(code: string): AST
}

cheerio.load("...", { parser: parser })

What do you think @jugglinmike @fb55 ?

@matthewmueller Sorry, I am reading code again now and find that the problem has been fixed by using parse5's serializer instead of dom-serializer, so I am trying to test it, I am REALLY happy to see that it may be fixed!

UPDATE: I am sure this is fixed in v1.0.0 branch and now waiting this to be updated to npm happily! If you want to test I write my code here:

const cheerio = require('cheerio')

const $ = cheerio.load('<p>&lt;ๅ“ˆๅ“ˆๅ“ˆ</p>')

$('body').html() // output -> '<p>&lt;ๅ“ˆๅ“ˆๅ“ˆ</p>'

'<p><ๅ“ˆๅ“ˆๅ“ˆ</p>' or you get XML entities means it was not solved, but in my test v1.0.0 works correct!

(https://github.com/cheeriojs/cheerio/blob/c635bea08f72f6670977674d40d44a5edd7f4a31/lib/static.js#L106 is the code that fixed the problem.)

So, #1307 fixed this issue?

I think it was fixed before this, you may do a test.

I previously had an issue with embedding xml code block in my blog. I can confirm it's fixed when I use v1.0.0 branch.

OP @huanz can you confirm?

@matthewmueller Any plan for releasing 1.0.0? Or could you please consider to release another rc version to fix this problem? We really want 1.0.0 branch because it fixed this bug. Thanks!

I can confirm that the problem is fixed on the 1.0.0 branch ๐Ÿ‘ I installed it this way:

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

This is not ideal and hope that a new RC will soon be released.

is be fixed ๏ผŸ I also have this problem

const $ = cheerio.load('<p>&lt;ๅ“ˆๅ“ˆๅ“ˆ</p>')

console.log($('body').html())
npm i [email protected]

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

is be fixed ๏ผŸ I also have this problem

const $ = cheerio.load('<p>&lt;ๅ“ˆๅ“ˆๅ“ˆ</p>')

console.log($('body').html())
npm i [email protected]

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

two years past and they still don't update version on npm, so you will still face this problem if you install 1.0.0-rc3 from npm. You need to install 1.0.0 branch from github.

[email protected] is now on npm. Hoping this is resolved now!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

trevorfrese picture trevorfrese  ยท  4Comments

Canop picture Canop  ยท  3Comments

askie picture askie  ยท  4Comments

miguelmota picture miguelmota  ยท  3Comments

Tetheta picture Tetheta  ยท  3Comments