Cheerio: Output of German umlauts is wrong

Created on 24 Dec 2013 · 2 Comments · Source: cheeriojs/cheerio

If I'm scraping a German webpage containing special characters like Ä, ä, Ö, ö, Ü, ü or ß, the output is wrong (it displays � or other garbage characters).
Seems like cheerio doesn't support UTF-8 or ISO-8859-1.

EDIT: It's not an issue with cheerio itself.

All 2 comments

So, what is the issue then? I'm having it right now.

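# Assumption: fs is promisified with Bluebird, which is where readFileAsync comes from
Promise = require 'bluebird'
fs = Promise.promisifyAll require 'fs'
cheerio = require 'cheerio'
fs.readFileAsync('survey_logic_file.html', {encoding: 'UTF-8'}).then (rawhtml) ->
  fs.writeFileSync 'raw.html', rawhtml        # the input, written back unchanged
  $ = cheerio.load rawhtml
  fs.writeFileSync 'cheerio.html', $.html()   # the same markup after a cheerio round trip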

After this code runs, raw.html is correct while cheerio.html is broken.

raw.html:

<h3>Wofür werden die Ergebnisse der Umfrage benutzt?</h3>

cheerio.html:

<h3>Wof&#xFC;r werden die Ergebnisse der Umfrage benutzt?</h3>

If $.xml() is used, the problem disappears. Unfortunately, it seems that xml() cannot be used on single nodes/selections?
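
For anyone comparing the two files: the `&#xFC;` in cheerio.html is an HTML numeric character reference for U+00FC (ü), so a browser renders both lines the same way even though the bytes differ. A minimal sketch of that equivalence follows; `decodeNumericEntities` is a hypothetical helper written for illustration and is not part of cheerio.

```javascript
// Hypothetical helper (not a cheerio API): turns numeric character references
// such as &#xFC; (hex) or &#252; (decimal) back into the characters they name.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, ref) => {
    const codePoint = ref[0].toLowerCase() === 'x'
      ? parseInt(ref.slice(1), 16)   // hexadecimal form, e.g. xFC
      : parseInt(ref, 10);           // decimal form, e.g. 252
    return String.fromCodePoint(codePoint);
  });
}

const cheerioOutput = '<h3>Wof&#xFC;r werden die Ergebnisse der Umfrage benutzt?</h3>';
const rawInput = '<h3>Wof\u00FCr werden die Ergebnisse der Umfrage benutzt?</h3>';

// Compare the decoded cheerio output with the raw input line.
console.log(decodeNumericEntities(cheerioOutput) === rawInput); // prints true
```

The sketch only handles numeric references; named entities such as `&uuml;` would need a lookup table.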

Well, for reference, if someone else finds this:
It seems that you need to set decodeEntities: false in the options passed to cheerio.load to get the correct behavior.
