Cheerio: Incorrect handling of UTF-8 encoding on Google Finance page

Created on 13 Aug 2014  ·  6 Comments  ·  Source: cheeriojs/cheerio

This code produces an incorrect Unicode character instead of the right single quote (U+2019):

var request = require('request');
var cheerio = require('cheerio');

request('https://www.google.com/finance?q=NYSE%3ASGL', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.dir(response.headers['content-type']);
    var $ = cheerio.load(body);
    console.log($('.companySummary').text().match(/The Fund.s/g)[0]);
  }
});

Expected:

'text/html; charset=utf-8'
The Fund’s

Actual:

'text/html; charset=utf-8'
The Fund�s

The incorrect character has the code point U+FFFD.

Using cheerio 0.17.0 on Windows. With the same invocation, the text of this Japanese page is decoded correctly:

'use strict';
var request = require('request');
var cheerio = require('cheerio');

request('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.dir(response.headers['content-type']);
    var $ = cheerio.load(body);
    console.log($('.mw-headline').text());
  }
});
Parser-specific

All 6 comments

Seeing the exact same thing when parsing text from:
http://google.com/movies?near=90504&mid=4cc6e9bd8d940bc1&date=8
The ® symbol is parsed as � when printing out text.

Also on Chinese websites.

Any news on this issue? I'm also running into characters getting turned into �s.

Try using request.get({ uri: baseURI, encoding: 'binary' }, function ...). It solved my problem, but don't ask me why it works. I solved it thanks to this topic:

https://github.com/request/request/issues/118

So it's not an issue with cheerio but with encoding and the request module.
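
A possible explanation for why the workaround helps, sketched without any HTTP request. This assumes (it is not confirmed anywhere in this thread) that the page body contains single-byte characters such as 0x92, the right single quote in Windows-1252, even though the header says charset=utf-8. By default request decodes the body with toString() as UTF-8, and a lone 0x92 byte is not valid UTF-8, so it would become U+FFFD. In Node, 'binary' is an alias for 'latin1', which maps every byte to exactly one code point and therefore never produces U+FFFD:

```javascript
// Hypothetical body bytes: "The Fund" + 0x92 + "s".
// ASSUMPTION: the server sent the quote as the single byte 0x92.
const bytes = Buffer.concat([
  Buffer.from('The Fund', 'ascii'),
  Buffer.from([0x92]),
  Buffer.from('s', 'ascii'),
]);

// Default behaviour: decode as UTF-8. 0x92 alone is invalid, so it is replaced.
const asUtf8 = bytes.toString('utf8');

// encoding: 'binary' behaviour: each byte becomes the code point of the same value.
const asLatin1 = bytes.toString('latin1');

console.log(asUtf8.charCodeAt(8).toString(16));   // fffd
console.log(asLatin1.charCodeAt(8).toString(16)); // 92

// The ® case mentioned above: 0xAE is ® in Latin-1, but invalid as lone UTF-8.
const regUtf8 = Buffer.from([0xae]).toString('utf8');
const regLatin1 = Buffer.from([0xae]).toString('latin1');
console.log(regUtf8 === '\ufffd', regLatin1 === '\u00ae'); // true true
```

Note that under this assumption 'binary' only avoids U+FFFD; the quote comes out as U+0092, not U+2019. Getting the real character would likely mean requesting the body as a Buffer (encoding: null) and decoding it with a Windows-1252 decoder, for example a library such as iconv-lite.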

It would be great if someone could provide a test case that doesn't depend on an HTTP request (which might use a different encoding, which is way out of cheerio's scope). Closing this until that's the case.

Please give a complete solution.
