Cheerio: Cheerio produces self-closing tags when element content is empty

Created on 12 Oct 2018 · 5 Comments · Source: cheeriojs/cheerio

Hi Cheerio Team

I'm trying to solve the issue with self-closing tags.
This code:

const cheerio = require('cheerio')
const $ = cheerio.load('<div data-message="msg"></div>',
  {
    xmlMode: true,
    decodeEntities: false,
  })

console.log($.xml())

Outputs: <div data-message="msg"/>

But I need the following output: <div data-message="msg"></div>
Also I have to keep cheerio options xmlMode: true and decodeEntities: false

Is there any way to tell cheerio to close tags in such cases?

dmrzn

👍6

Most helpful comment

Even though I'm rather new to cheerio, I've already been struggling quite a bit with those self-closing tags - and I think now that I can help here.

First, let me state a few assumptions I am making:

1) The exact version used should be stated. I think we're talking about v1.0.0-rc.2 here, since that's what npm install cheerio gets you at the moment, and your example can be reproduced with it.
2) XML allows any tag to be self-closing.
3) By contrast, HTML only allows a few tags to be self-closing:
   - allowed: <link />, <meta />, <br/>, and some more
   - NOT allowed: most other tags, in particular <script> (!). Of course something like <div/> isn't allowed either.
4) As I understand it, the API will change substantially with v1.0.0 (NOT yet in v1.0.0-rc.2). In particular:
   a) the static methods $.xml, $.html and $.text will be deprecated.
   b) cheerio.load('<div class="foo"></div>') will no longer give you a representation of just that fragment, but will behave like a browser instead: it'll put a <!DOCTYPE html><html><head></head><body>...</body></html> around it.
   c) instance.html(selector), instance.xml(selector) and instance.text(selector) will also be deprecated. So getting (just) the outerHTML of a particular tag will be somewhat trickier.
5) There seem to be a number of issues revolving around the problem at hand - they should cross-reference each other, IMHO.

To be clear: these are assumptions of mine. I am not entirely sure about all of them, particularly those regarding the differences to the upcoming v1.0.0. Please do correct me wherever I might be wrong!

As for 2) and 3), the actual output in the example is valid XML but invalid HTML. So maybe one shouldn't expect to get a non-self-closing, empty <div></div> from .xml(). But .html() definitely should NOT return invalid HTML, all the more so if the input was valid in the first place. However, in this example, it makes no difference which one you call; the output is the same for both: <div/>.

The workaround to enforce an empty <div></div> rather than a self-closing <div/> (and the like): put an empty text node inside!
This can be achieved by $('div').filter((i, e) => !e.children.length).text('')
The .filter(...) ensures that we don't overwrite the contents of any non-empty <div>.

Of course this should be applied to all empty tags that are not allowed to be self-closing in HTML.
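
To illustrate the "all empty tags that are not allowed to be self-closing" part, here is a minimal sketch that takes a different route from the text-node trick: it post-processes the serialized string instead of touching the DOM. The names expandSelfClosing and VOID_ELEMENTS are made up for this example and are not part of cheerio. The set lists the void elements from the HTML spec, which are left alone. It is regex-based, so treat it as a rough illustration only: it would also rewrite tag-like text inside comments or CDATA sections.

```javascript
// Hypothetical helper, NOT part of cheerio: works on the serialized string only.
// Void elements per the HTML spec; these may legitimately stay as <br/>, <meta/>, ...
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'source', 'track', 'wbr',
]);

function expandSelfClosing(markup) {
  // Matches <name attrs/>; quoted attribute values are consumed as whole units,
  // so a "/>" inside an attribute value does not end the tag early.
  return markup.replace(
    /<([A-Za-z][\w:-]*)((?:"[^"]*"|'[^']*'|[^'">])*?)\s*\/>/g,
    (match, name, attrs) =>
      VOID_ELEMENTS.has(name.toLowerCase())
        ? match                               // leave <br/>, <link/>, ... as they are
        : `<${name}${attrs}></${name}>`       // <div/> becomes <div></div>
  );
}

console.log(expandSelfClosing('<div data-message="msg"/>'));
```

It would be applied to whatever $.xml() or $.html() returns, e.g. expandSelfClosing($.xml()).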

There's quite a bit more to say about this, e.g. when <script> tags must be empty (but not self-closing), or how to actually enforce self-closing tags like <meta .../> even when they were (incorrectly) either empty (<meta...></meta>) or simply left open (just <meta...>)...

Also TODO: 4) and 5)

meisl on 19 Oct 2018

👍4

All 5 comments

Also facing this issue with Automattic/juice :(

raja-s on 17 Oct 2018

Yes, I'm also having this problem: in both HTML and XML mode, tags with empty contents are made self-closing (including divs). Putting an empty space in between fixes the issue but isn't always possible.

Steelcow85 on 11 Dec 2018

Hi Cheerio Team,

This completely breaks <script> tags by self-closing them.
What is the fix here? A "solution" like "add an empty text node" is not a solution, it's an awful crutch.

I see this issue https://github.com/cheeriojs/cheerio/issues/366, which was "politely" closed, probably pointing to the same problem 4 (!) years ago.

slopen on 12 Mar 2020

If you have to use xmlMode, you can disable self-closing tags using the selfClosingTags option:

const $ = cheerio.load('<div data-message="msg"></div>',
  {
    xmlMode: true,
    decodeEntities: false,
    selfClosingTags: false,
  })

console.log($.xml())

Outputs: <div data-message="msg"></div>.

Note that this will disable parsing self-closing tags as well.