Cheerio: Maintain tag attribute quote characters

Created on 16 Jun 2015 · 14 Comments · Source: cheeriojs/cheerio

Cheerio changes attributes with single quotes into double quotes.

var cheerio = require('cheerio');
// uses 'single quotes'
var $ = cheerio.load('<div attr=\'value\'></div>')
$.html();
// => <div attr="value"></div>
// has "double quotes"

Single quotes are useful to me, as I use JSON in HTML attributes for widget settings.

<div data-settings='{ "option": true }'></div>

With the default options, this is serialized with HTML entities (possibly breaking the JSON) as

<div data-settings="{ &quot;option&quot;: true }"></div>

Setting decodeEntities: false leaves the quotes unencoded, which breaks the HTML

<div data-settings="{ "option": true }"></div>

Ideally, cheerio would preserve which quote character is used. I understand this is an edge case, so I'm reporting it in case others run into it. Similar to #460

All 14 comments

This definitely won't break JSON within browsers. IMHO this won't be fixed.
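
The claim above can be checked without a browser: an HTML parser decodes &quot; back to " when it reads an attribute value, so the JSON that reaches a script is unchanged. A minimal sketch, assuming a hand-written decodeAttribute helper (hypothetical; it stands in for the parser's entity decoding and handles only a few named entities):

```javascript
// Hypothetical stand-in for the entity decoding an HTML parser performs
// on attribute values. Only a few named entities are handled here.
function decodeAttribute(raw) {
  return raw
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // decode &amp; last so "&amp;quot;" stays "&quot;"
}

// Attribute value as it appears in the serialized markup.
const encodedValue = '{ &quot;option&quot;: true }';
const settings = JSON.parse(decodeAttribute(encodedValue));
console.log(settings.option);
```

This only covers consumers that parse the HTML. It says nothing about tools that read the markup as raw text.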

@fb55 is there any chance of this being fixed?

The HTML my app has to consume is horribly written, and I have no control over fixing it. Not only does it contain raw JSON inside a div tag, but most of the href attributes on anchors only work because of single quotes, e.g. href='javascript: dostuff("asdf");'. As in this issue, the tag breaks when the single quotes are replaced with double quotes and decodeEntities: false is used.

+1. I also store JSON in HTML attributes and would like to second keeping single quotes, because it makes HTML with embedded JSON much easier to manipulate and read.

+1

Gentlemen, any update on this one? Do you intend to fix this or not at all? It's more than 1 year later and the task is still open.

@fb55, any update on this? The problem is really important: we can't store JSON in meta tags.

Switching to parse5 could fix this (it's another open issue). As I said
before, not sure if this will be fixed with the current architecture as it
doesn't break anything.

– Felix

ok @fb55, thank you very much for the feedback.

I fixed this.
Steps:

  1. Find the file node_modules/dom-serializer/index.js
  2. Locate line 68 and change

else {
  output += key + "='" + (opts.decodeEntities ? entities.encodeXML(value) : value) + "'";
}

to

else {
  if (/[^\\]\"/.test(value)) {
    output += key + "='" + (opts.decodeEntities ? entities.encodeXML(value) : value) + "'";
  } else {
    output += key + '="' + (opts.decodeEntities ? entities.encodeXML(value) : value) + '"';
  }
}
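
To see which values the patched branch would wrap in single quotes, the test from the patch above can be run on its own. A small sketch; the sample values are made up for illustration, and the pattern is the same one written without the redundant escape before the quote:

```javascript
// The test used in the patch: a double quote preceded by any character
// other than a backslash.
const needsSingleQuotes = (value) => /[^\\]"/.test(value);

const samples = [
  '{ "option": true }', // quote preceded by a space
  'value',              // no double quote at all
  '"',                  // quote is the first character, nothing precedes it
  '\\"',                // quote preceded by a backslash
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), needsSingleQuotes(sample));
}
```

Only the first sample passes the test. Because the pattern requires a character before the quote, a value whose only double quote is its first character is not detected and would still be written with double quotes.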

Thanks, @gaecom.
I've slightly modified your solution.

    if (opts.plainQuotes && /[^\\]\"/.test(value)) {
      if (opts.decodeEntities) {
        value = entities.encodeXML(value);
        if (opts.plainQuotes) { value = value.replace(/&quot;/g, '"'); }
      }
      output += key + "='" + value + "'";
    } else {
      if (opts.decodeEntities) {
        value = entities.encodeXML(value);
        if (opts.plainQuotes) { value = value.replace(/&apos;/g, "'"); }
      }
      output += key + '="' + value + '"';
    }

This variant works with decodeEntities: true.
It does some redundant work by encoding &quot; and then replacing it back, but that is acceptable for the sake of readability.
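
The same logic can be tried outside dom-serializer. The sketch below is a stand-in under stated assumptions, not the library's code: encodeXML is a simplified local helper (the real one comes from the entities package and covers more characters), formatAttribute is a hypothetical name, and plainQuotes is the option introduced in the snippet above, not an existing cheerio option.

```javascript
// Simplified local stand-in for entities.encodeXML (assumption: only the
// five XML special characters are escaped).
function encodeXML(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;' };
  return value.replace(/[&<>"']/g, (ch) => map[ch]);
}

// Hypothetical helper mirroring the branch logic of the modified patch.
function formatAttribute(key, value, opts) {
  const useSingle = Boolean(opts.plainQuotes) && /[^\\]"/.test(value);
  if (opts.decodeEntities) {
    value = encodeXML(value);
    if (opts.plainQuotes) {
      // Restore whichever quote character is not used as the delimiter.
      value = useSingle
        ? value.replace(/&quot;/g, '"')
        : value.replace(/&apos;/g, "'");
    }
  }
  return useSingle ? key + "='" + value + "'" : key + '="' + value + '"';
}

const opts = { decodeEntities: true, plainQuotes: true };
console.log(formatAttribute('data-settings', '{ "option": true }', opts));
console.log(formatAttribute('title', "it's", opts));
```

With plainQuotes left unset, the helper falls back to double-quote delimiters with &quot; escapes, which matches the behaviour reported at the top of this issue.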

I also have this issue, and the proposed solutions don't work for my use case. The HTML parsed by cheerio (via Inky) will also be parsed later by Twig, and the quotes need to remain the same. Here is an HTML sample that fails:

{{ "The content may have simple quote like « l' » or « d' », as it is the case for <a href='http://my.link'>tags</a>."|trans }}

We are having this issue while implementing the new AMP pages by Google. One of the parameters requires JSON inside an attribute, like so:

<amp-ad rtc-config='{"urls": ["url1",...]}'>

The single quotes get converted to double quotes, which invalidates the HTML. JSON strings must be delimited by double quotes, so swapping double and single quotes doesn't work.

Please cheerio, we need you.

SOS. Does anyone have a fix for this without modifying external node_modules?

I just had this problem and fixed it by adding decodeEntities: false to the options when loading the HTML:

this.$ = cheerio.load(myHtml, {decodeEntities: false})
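
Another route that avoids touching node_modules is to post-process the string returned by $.html(). The sketch below is a rough, regex-based assumption rather than anything cheerio provides: restoreSingleQuotes is a hypothetical helper, and it can also rewrite text that merely looks like an attribute (inside a script or a comment, for example), so treat it as a starting point only.

```javascript
// Hypothetical post-processing step for serialized HTML: attributes whose
// value contains &quot; are rewritten to use single-quote delimiters.
function restoreSingleQuotes(html) {
  return html.replace(
    /(\s[^\s"'<>\/=]+)="([^"]*&quot;[^"]*)"/g,
    (match, name, value) => {
      const rewritten = value
        .replace(/'/g, '&#39;')   // protect literal single quotes first
        .replace(/&quot;/g, '"'); // then restore the double quotes
      return name + "='" + rewritten + "'";
    }
  );
}

// Markup in the form reported at the top of this issue.
const output = '<div data-settings="{ &quot;option&quot;: true }"></div>';
console.log(restoreSingleQuotes(output));
```

Attributes without &quot; in their value are left untouched, so ordinary double-quoted attributes keep their delimiters.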