If I'm scraping a German webpage containing special characters like Ä, ä, Ö, ö, Ü, ü or ß, the output is wrong (it displays � or other garbage).
Seems like cheerio doesn't support UTF-8 or ISO-8859-1.
EDIT: It's not an issue with cheerio itself.
So, what is the issue then? I'm having it right now.
fs.readFileAsync('survey_logic_file.html', {encoding: 'UTF-8'}).then (rawhtml) ->
  fs.writeFile 'raw.html', rawhtml
  $ = cheerio.load rawhtml
  fs.writeFile 'cheerio.html', $.html()
In this code, raw.html is written correctly, while cheerio.html is broken.
raw.html:
<h3>Wofür werden die Ergebnisse der Umfrage benutzt?</h3>
cheerio.html:
<h3>Wof&#xFC;r werden die Ergebnisse der Umfrage benutzt?</h3>
If $.xml() is used, the problem disappears. Unfortunately, it seems that xml() cannot be used on single nodes/selections?
Well, for reference, in case someone else finds this:
It seems that you need to set decodeEntities: false to get the correct behavior.