Hi Cheerio Team
I'm trying to solve the issue with self-closing tags.
This code:
const cheerio = require('cheerio')
const $ = cheerio.load('<div data-message="msg"></div>', {
  xmlMode: true,
  decodeEntities: false,
})
console.log($.xml())
Outputs: <div data-message="msg"/>
But I need the following output: <div data-message="msg"></div>
Also I have to keep cheerio options xmlMode: true and decodeEntities: false
Is there any way to tell cheerio to close tags in such cases?
Also facing this issue with Automattic/juice :(
Even though I'm rather new to cheerio, I've already been struggling quite a bit with those self-closing tags - and I think now that I can help here.
First, let me state a few assumptions I am making:
- [...] npm install cheerio gets you at the moment, and your example can be reproduced with it.
- [...] <link />, <meta />, <br/>, and some more. [...] <script> (!). Of course something like <div/> isn't allowed either.
- [...]
  a) the static methods $.xml, $.html and $.text will be deprecated.
  b) cheerio.load('<div class="foo"></div>') will no longer give you a representation of only that fragment, but will rather behave like a browser: it'll put a <!DOCTYPE html><html><head></head><body>...</body></html> around it.
  c) instance.html(selector), instance.xml(selector) and instance.text(selector) will also be deprecated. So getting (just) the outerHTML of a particular tag will be somewhat trickier.

To be clear: these are assumptions of mine. I am not entirely sure about all of them, particularly those w.r.t. the differences to the upcoming v1.0.0. Please do correct me wherever I might be wrong!
As for 2) and 3), the actual output in the example is valid XML but invalid HTML. So maybe one shouldn't expect to get a non-self-closing, empty <div></div> from .xml(). But .html() definitely should NOT return invalid HTML, all the more so if the input was valid in the first place. However, in this example, it makes no difference which one you call; the output is the same for both: <div/>.
The workaround to enforce empty <div></div> rather than self-closing <div/> (and the like): put an empty text node inside!
This can be achieved by $('div').filter((i, e) => !e.children.length).text('')
The .filter(...) ensures that we don't overwrite the contents of any non-empty <div>.
Of course this should be applied to all empty tags that are not allowed to be self-closing in HTML.
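To make "all empty tags that are not allowed to be self-closing" a bit more concrete, here is a rough sketch of the same idea on plain objects, so it runs without cheerio installed. The node shape (type, name, children) only imitates what cheerio exposes on elements, and padEmptyElements and VOID_ELEMENTS are hypothetical names of mine, not part of any library. The set lists the HTML void elements, which are the only tags that may go without an end tag.

```javascript
// HTML void elements: these never take an end tag, so they are left alone.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
])

// Hypothetical helper: walks a tree of plain node objects and gives every
// empty, non-void element an empty text child (the same effect as .text('')).
function padEmptyElements(node) {
  if (node.type !== 'tag') return node
  if (node.children.length === 0) {
    if (!VOID_ELEMENTS.has(node.name)) {
      node.children.push({ type: 'text', data: '' })
    }
    return node
  }
  // Non-empty elements keep their contents; only descend into them.
  node.children.forEach(padEmptyElements)
  return node
}

const tree = {
  type: 'tag',
  name: 'div',
  children: [
    { type: 'tag', name: 'script', children: [] },
    { type: 'tag', name: 'br', children: [] },
    { type: 'tag', name: 'p', children: [{ type: 'text', data: 'hi' }] },
  ],
}

padEmptyElements(tree)
console.log(tree.children.map((c) => `${c.name}: ${c.children.length}`))
// prints [ 'script: 1', 'br: 0', 'p: 1' ]
```

With cheerio itself, the same VOID_ELEMENTS check could presumably go into the .filter(...) callback next to !e.children.length, selecting '*' instead of 'div'.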
There's quite a bit more to say about this, e.g. when <script> tags must be empty (but not self-closing), or how to actually enforce self-closing tags like <meta .../> even when they were (incorrectly) either empty (<meta...></meta>) or simply left open (just <meta...>)...
Also TODO: 4) and 5)
Yes, I'm also having this problem. In both HTML and XML mode, tags with empty contents are made self-closing (including divs). Putting a space in between fixes the issue, but that isn't always possible.
Hi Cheerio Team,
This completely breaks <script> tags by self-closing them.
What is the fix here? A solution like "add an empty text node" is not a solution, it's an awful crutch.
I see this issue, https://github.com/cheeriojs/cheerio/issues/366, which was "politely" closed and probably pointed to the same problem 4 (!) years ago.
If you have to use xmlMode, you can disable self-closing tags using the selfClosingTags option:
const $ = cheerio.load('<div data-message="msg"></div>', {
  xmlMode: true,
  decodeEntities: false,
  selfClosingTags: false,
})
console.log($.xml())
Outputs: <div data-message="msg"></div>.
Note that this will disable parsing self-closing tags as well.
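For anyone wondering why the flag changes the output, here is a toy model of the decision a serializer has to make for an empty element in XML mode. This is only a sketch under my own assumptions and NOT cheerio's actual implementation: render and the plain-object node shape are hypothetical, and attribute values are not escaped. It reproduces the two outputs quoted in this thread, plus the empty-text-node workaround.

```javascript
// Toy serializer over plain node objects (not cheerio's real code).
function render(node, opts = {}) {
  if (node.type === 'text') return node.data
  // Attribute values are written as-is, without escaping, for brevity.
  const attrs = Object.entries(node.attribs || {})
    .map(([key, value]) => ` ${key}="${value}"`)
    .join('')
  const isEmpty = node.children.length === 0
  // Assumed rule: in xmlMode an empty element is self-closed unless
  // selfClosingTags is explicitly set to false.
  if (isEmpty && opts.xmlMode && opts.selfClosingTags !== false) {
    return `<${node.name}${attrs}/>`
  }
  const inner = node.children.map((child) => render(child, opts)).join('')
  return `<${node.name}${attrs}>${inner}</${node.name}>`
}

const div = {
  type: 'tag',
  name: 'div',
  attribs: { 'data-message': 'msg' },
  children: [],
}

console.log(render(div, { xmlMode: true }))
// prints <div data-message="msg"/>
console.log(render(div, { xmlMode: true, selfClosingTags: false }))
// prints <div data-message="msg"></div>

// The empty-text-node workaround has the same visible effect in this model,
// because the element no longer counts as empty.
const padded = { ...div, children: [{ type: 'text', data: '' }] }
console.log(render(padded, { xmlMode: true }))
// prints <div data-message="msg"></div>
```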