Cheerio: decodeEntities false code tag

Created on 14 May 2018 · 34Comments · Source: cheeriojs/cheerio

const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  decodeEntities: false
});
console.log($.html());

output

<html><head></head><body><code><script src></code></body></html>

you can see that the string <script src> has been unescaped

Source

huanz

Most helpful comment

This is the biggest problem of the project

senwzz on 7 Aug 2018

👍16

All 34 comments

It seems that you need to set it to true.

$ = cheerio.load(`<code>&lt;script src&gt;=></code>`, {   decodeEntities: true }); console.log($.html());
<html><head></head><body><code>&lt;script src&gt;=&gt;</code></body></html>

Jasonlhy on 29 May 2018

@Jasonlhy if i set true, the chinese will be decoded, eg:

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($.html());

output:

<html><head></head><body><p>&#x54C8;&#x54C8;&#x54C8;&#x54C8;</p><code>&lt;script src&gt;</code></body></html>

huanz on 29 May 2018

After a little bit of research and testing. Parser5 (The HTML parser) does not support the option: decodeEntities now, the result from parser5 is default decoded (unescaped).

Both cases will print out 哈哈哈哈.

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

I think the reason of this problem is the serializer, it is located at https://github.com/cheeriojs/dom-serializer/blob/master/index.js

It seems to change all decoded entities (Including Chinese text) back to encoded one if decodeEntities set to true

function renderText(elem, opts) {
  var data = elem.data || '';

  // if entities weren't decoded, no need to encode them back
  if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
    data = entities.encodeXML(data);
  }

  return data;
}

Jasonlhy on 29 May 2018

This is the biggest problem of the project

senwzz on 7 Aug 2018

👍16

I had documents containing html like <pre><span>a = 1</span></pre> (i.e. code samples of html itself) and wanted to process them with cheerio before displaying sending them to the browser. decodeEntities: false obviously was not working due to the problem reported in this issue, and decodeEntities: true was encoding far more than < and >.

I kept decodeEntities: false and used https://www.npmjs.com/package/patch-package to generate and automatically apply this patch to dom-serializer upon every install:

```diff
diff --git a/node_modules/dom-serializer/index.js b/node_modules/dom-serializer/index.js
index 4f41e8f..8f9f14f 100644
--- a/node_modules/dom-serializer/index.js
+++ b/node_modules/dom-serializer/index.js
@@ -136,6 +136,8 @@ function renderText(elem, opts) {
data = entities.encodeXML(data);
}

data = data.replace(//g, '>')
return data;
}
```

Since that only affects text nodes, it should not mess with anything, but caveat emptor.

I wonder if the clean solution here is replacing that encodeXML with something more context-aware, i.e. encode depending on the document type and charset.

bard on 10 Apr 2019

👍1

This issue still exists now and I am crazy because I am building a static site generator with cheerio. This problem behaves in two ways:

First I am using marked.js and it will encode symbols like < in code blocks (<pre></pre>) into &lt, that's what expected. However, if I set decodeEntities: false, cheerio will replace &lt back with <, which is ridiculous!!! Why parser decode &lt into <? I think it should not change this.
But if I set decodeEntities: true, all CJK chars will be replaced into HTML entities, which is hard to deal with because some JavaScript codes needs to get string length, HTML entities is OK to render but not easy to process.

Can anybody locate the problem and fix it? Thanks a lot.

AlynxZhou on 19 May 2019

After reading codes for 3 hours I believe I find where problem begins.

@bard I think your patch is not correct, not only innerHTMLs but attrs also has this problem.

The problem is because cheerio is switching from htmlparser2 to parse5, and decodeEntities is an option to htmlparser2, but not an option to parse5, in version 1.0.0-rc3, they did not pass this option to parse5 too.

If we use htmlparser2 with decodeEntities: false, all things works correctly. However if you install cheerio with npm i -s cheerio, you got 1.0.0-rc3, which uses parse5.

parse5 will parse < into < when it reading input, while htmparser2 with decodeEntities: false won't. And when we get result with .html(), cheerio will use dom-serializer to render strings.

dom-serializer supposes parser understanding decodeEntities, suppose we have < as input, if decodeEntities is true, dom-serializer should get &, else it should get <. When we are using htmlparser2, it works. With decodeEntities: false, it looks like <(input) -> <(parsed) -> <(before and after serialized). With decodeEntities: true, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized).

But whether decodeEntities is true or not, parse5 always parse < into <, if we set it to true, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized), however, chars like Chinese and Japanese are ALSO encoded, which is BAD. But if we set it to false, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized because decodeEntities is false), which is totally a mess.

dom-serializer is right, it has no fault in its behavior. There are two ways to fix:

Stop use parse5, let us use htmlparser2 again by install 0.22.0 with npm i -s [email protected] (code in master branch is still 0.22.0, confusing). @fb55 @jugglinmike @matthewmueller
Make parse5 able to understand decodeEntites, because sometimes we want to keep < or & in innerHTML, always un-escape them will cause a mess. @inikulin

We'd better not modify dom-serializer, because it's a parse5 issue. parse5 does NOT always un-escape &, it replace &amp in innerHTML into &, but won't replace & in attributes.

It's a little bit long because it's complex, but please read and fix it, we cannot always lock version to 0.22.0.

AlynxZhou on 19 May 2019

👍4

Based on https://github.com/jsdom/jsdom/issues/2501 and https://github.com/inikulin/parse5/issues/261#issuecomment-401389295 it's unlikely that this will be handled in parse5.

The second thread suggests that it's possible to grab the raw, undecoded text from the input data based on node position, though that would require modifying dom-serializer, too.

bard on 19 May 2019

@bard dom-serializer is used to generate strings from dom, not dom from string. We shouldn't touch it because it works correctly.

AlynxZhou on 19 May 2019

due to https://github.com/inikulin/parse5/issues/75 we know parse5 always decode entities, so decodeEntites is useless, and then we should add another option for encode entities or not, problems are cheerio currently support both parse5(for HTML) and htmlparser2(for XML)...

Edit: I got a better idea, we just need to remove entities re-encode in dom-serializer when working with parse5, I will send a pr.

Edit: However, this will break compatibility, needs an version bump...

Edit: PR is here https://github.com/cheeriojs/dom-serializer/pull/80

AlynxZhou on 19 May 2019

Thanks @AlynxZhou. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0.

propattern on 3 Jun 2019

I send a PR to dom-serializer to fix 1.0.0, however no one replied...

On Mon, Jun 3, 2019, 08:55 Iqbal Ahmed notifications@github.com wrote:

Thanks @AlynxZhou https://github.com/AlynxZhou. Great find. Your 1st
Solution worked for me, but you are right that we can't be permanently be
bound to version 0.22.0.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/cheeriojs/cheerio/issues/1198?email_source=notifications&email_token=ACANVR42P4FOIMW62TLNKKDPYRTWVA5CNFSM4E7WNJK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYBZXQ#issuecomment-498080990,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACANVR7VPU5TRUT4COB57HDPYRTWVANCNFSM4E7WNJKQ
.

AlynxZhou on 3 Jun 2019

@fb55 @jugglinmike this seems to be causing some folks pain. If i recall, we were going to continue supporting both dom-serializer and parse5. do we have a snippet that we can point people to for the old behavior?

matthewmueller on 3 Jun 2019

This is because parse5 ignores decodeEntity, I modified dom-serializer to fix this, but it needs a version bump.

AlynxZhou on 3 Jun 2019

I wonder what breaks exactly with decoded entities. The only thing that comes to my mind is that resulting pages are served in encoding other than UTF-8. Otherwise, nowadays there is no reason to prefer HTML entities (with an exception for safety-related HTML escaping) over code points.

inikulin on 27 Jun 2019

@inikulin The issue is serialization. Some people have entities as plain text in their HTML or want to maintain the encoding style they've used.

fb55 on 27 Jun 2019

@fb55 But for people who don't want HTML entities in their files, cheerio 1.0.0 is totally broken because it cannot handle things correctly (decodeEntities false, < becomes <, decodeEntities true, CJK chars becomes entities). The problem is not in cheerio, it's because parse5 always decode entities while parsing. Using decodeEntities to decide whether we should encode is not a good choice, because we cannot control whether decode entities when using parse5, we'd better not to encode entities, just escape HTML is enough. cheerio is expected to be a parser, not encoder, encoding all entities makes it really hard for further string editing.

I hope you can consider my PR https://github.com/cheeriojs/dom-serializer/pull/80, I am using cheerio in my tools and I cannot always stick to 0.22.0.

AlynxZhou on 2 Jul 2019

One month passed and this problem still exists:

Screenshot from 2019-08-07 14-30-46

I am afraid that as a Chinese user I have to keep using cheerio 0.22.0 in my project forever because it seems no one will take a look on the reason I found.

AlynxZhou on 7 Aug 2019

@AlynxZhou this workaround should help. I've been using it in production since April with no ill effects.

claviska on 7 Aug 2019

@claviska Thank you. But what I want is not a workaround, I know I can solve it by a simple hook. However, the problem is a bug in fact, it is because new library's behavior is different from the old one, and it needs to be fixed by remove unsupported option, it's complex and needs a change of logic, workaround can only hide the problem, but the problem still exists and may break other part.

Anyway thank you for your help.

AlynxZhou on 8 Aug 2019

@AlynxZhou I'd like to see it fixed too, but this is open source so our options are to:

Contribute a fix
Use a workaround
Wait patiently
Use an old version that probably won't get updated 😕

What I've proposed can be used as a separate module that you can easily remove once the bug is properly fixed. If you want to use the latest version without this encoding issue, it's probably your best bet.

Cheers

claviska on 8 Aug 2019

@claviska I have send a PR to dom-serializer which removed decodeEntities because parser5 always do decoding and this options is useless. However, still no one replied.

AlynxZhou on 9 Aug 2019

👍2

To unblock, you can use the old parser:

const cheerio = require('cheerio')
const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  _useHtmlParser2: true,
  decodeEntities: false,
})
console.log($.html())
// <code>&lt;script src&gt;</code>

This is undocumented and may be removed when we go to 1.0.0. However, since I'm telling people they can use it, we'll probably need to support it. I hope this is okay, @jugglinmike @fb55.

@AlynxZhou thanks for taking the time to share your findings. Unfortunately, it's not easy for me to tell if the snippet above solves your problem or not. Can you please reframe your problem in the following way:

Create a minimal code example that illustrates the problem.
What you expect to see.
What you actually see.

This will help us debug your issue. You can refer to the original poster for a decent example of this.

matthewmueller on 18 Aug 2019

@matthewmueller No need to check via code, I can tell you this can definitely solve problem, however, as you said, _useHtmlParser2 is undocumented and may be removed, so I am afraid that I can't use such a unstable feature in my project to fix bugs.

Such snippet is just let 1.0.0 works like 0.22.0, to solve problem we need to consider differences between parse5 and htmlparser2, and I need to say again, encode all things into XML entities is a BAD option and we may consider to drop it because parse5 always decode input.

AlynxZhou on 19 Aug 2019

It's highly dependent on your use case, which is why we need to provide an option to switch. parse5 is a better supported parser at this point and I believe it was @fb55's idea to switch to it over his own parser.

FYI, for the final public API I'd probably try and have a parser interface like this:

const parser = {
  parse(code: string): AST
}

cheerio.load("...", { parser: parser })

What do you think @jugglinmike @fb55 ?

matthewmueller on 19 Aug 2019

@matthewmueller Sorry, I am reading code again now and find that the problem has been fixed by using parse5's serializer instead of dom-serializer, so I am trying to test it, I am REALLY happy to see that it may be fixed!

UPDATE: I am sure this is fixed in v1.0.0 branch and now waiting this to be updated to npm happily! If you want to test I write my code here:

const cheerio = require('cheerio')

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

$('body').html() // output -> '<p>&lt;哈哈哈</p>'

'<p><哈哈哈</p>' or you get XML entities means it was not solved, but in my test v1.0.0 works correct!

(https://github.com/cheeriojs/cheerio/blob/c635bea08f72f6670977674d40d44a5edd7f4a31/lib/static.js#L106 is the code that fixed the problem.)

AlynxZhou on 19 Aug 2019

So, https://github.com/cheeriojs/cheerio/pull/1307 fixed this issue?

curbengh on 19 Aug 2019

So, #1307 fixed this issue?

I think it was fixed before this, you may do a test.

AlynxZhou on 19 Aug 2019

I previously had an issue with embedding xml code block in my blog. I can confirm it's fixed when I use v1.0.0 branch.

OP @huanz can you confirm?

curbengh on 20 Aug 2019

@matthewmueller Any plan for releasing 1.0.0? Or could you please consider to release another rc version to fix this problem? We really want 1.0.0 branch because it fixed this bug. Thanks!

AlynxZhou on 26 Sep 2019

👍6

I can confirm that the problem is fixed on the 1.0.0 branch 👍 I installed it this way:

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

This is not ideal and hope that a new RC will soon be released.

michaelperrin on 2 Aug 2020

👍1

is be fixed ？ I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())

npm i [email protected]

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

superhuai on 6 Dec 2020

is be fixed ？ I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())

npm i [email protected]

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

two years past and they still don't update version on npm, so you will still face this problem if you install 1.0.0-rc3 from npm. You need to install 1.0.0 branch from github.

AlynxZhou on 7 Dec 2020

[email protected] is now on npm. Hoping this is resolved now!

fb55 on 21 Dec 2020

Was this page helpful?

0 / 5 - 0 ratings