This code produces an incorrect Unicode character instead of the right single quote (U+2019):
var request = require('request');
var cheerio = require('cheerio');
request('https://www.google.com/finance?q=NYSE%3ASGL', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.dir(response.headers['content-type']);
    var $ = cheerio.load(body);
    console.log($('.companySummary').text().match(/The Fund.s/g)[0]);
  }
});
Expected:
'text/html; charset=utf-8'
The Fund’s
Actual:
'text/html; charset=utf-8'
The Fund�s
The incorrect character has the code point U+FFFD.
Using cheerio 0.17.0 on Windows. With the same invocation, the Unicode text on this Japanese page is output correctly:
'use strict';
var request = require('request');
var cheerio = require('cheerio');
request('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.dir(response.headers['content-type']);
    var $ = cheerio.load(body);
    console.log($('.mw-headline').text());
  }
});
Seeing the exact same thing when parsing text from:
http://google.com/movies?near=90504&mid=4cc6e9bd8d940bc1&date=8
The ® symbol is parsed as � when printing out text.
Also on Chinese websites.
Any news on this issue? I'm also running into characters getting turned into �s.
Try using request.get({ uri: baseURI, encoding: 'binary' }, function ...). It solved my problem, but don't ask me why it works. I solved the problem thanks to this topic:
https://github.com/request/request/issues/118
So it's not an issue with cheerio but with the encoding and the request module.
It would be great if someone could provide a test case that doesn't depend on an HTTP request (which might use a different encoding, which is way out of cheerio's scope). Closing this until that's the case.
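As a possible test case that does not depend on an HTTP request, here is a minimal sketch using only Node's built-in Buffer. It rests on an assumption that nobody in this thread has confirmed: that the page body contains the quote as the single byte 0x92 (its windows-1252 encoding) rather than as UTF-8. The variable names are made up for illustration.

```javascript
'use strict';

// ASSUMPTION: the body holds "The Fund’s" with the quote stored as the
// single byte 0x92 (windows-1252), not as the UTF-8 sequence E2 80 99.
var legacyBytes = Buffer.concat([
  Buffer.from('The Fund', 'ascii'),
  Buffer.from([0x92]),
  Buffer.from('s', 'ascii')
]);

// Decoding as UTF-8: a lone 0x92 is not valid UTF-8, so it is replaced
// with U+FFFD.
var asUtf8 = legacyBytes.toString('utf8');

// Decoding as 'binary' (latin1): every byte maps to one code point, so
// nothing is replaced, but 0x92 becomes U+0092 rather than U+2019.
var asBinary = legacyBytes.toString('binary');

// For comparison: the same text encoded as real UTF-8 round-trips cleanly.
var utf8Bytes = Buffer.from('The Fund\u2019s', 'utf8');
var roundTrip = utf8Bytes.toString('utf8');

// The quote sits at index 8, right after the 8 characters of "The Fund".
console.log(asUtf8.charCodeAt(8).toString(16));    // fffd
console.log(asBinary.charCodeAt(8).toString(16));  // 92
console.log(roundTrip.charCodeAt(8).toString(16)); // 2019
```

If the assumption holds, it would explain why encoding: 'binary' makes the � go away: the bytes are no longer decoded as UTF-8. The character is then U+0092 rather than U+2019, so the text would probably still need converting from windows-1252 to come out fully correct.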
Please provide a complete solution.