Node-fetch: charset detection not working

Created on 21 Nov 2018 · 4 Comments · Source: node-fetch/node-fetch

The URL https://www.aksam.com.tr/guncel/baskan-erdogan-48inci-muhtarlar-toplantisinda-konusuyor/haber-795519 has:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9" />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />

However, res.textConverted() returns:
Cumhurba�kan� Recep Tayyip Erdo�an, Be�tepe'de 48'inci Muhtarlar Toplant�s�'nda konu�tu. Erdo�an'�n a��klamalar�ndan sat�r ba�lar� ��yle:

I see that convertBody in body.js is supposed to detect the charset. I think it fails because Content-Type is written in mixed case on this page, while the regex in body.js only matches lowercase. The preview string (str) should be lowercased before matching.
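The case-sensitivity concern can be illustrated with a small sketch. This is not the actual convertBody code from body.js: sniffMetaCharset is a hypothetical helper, and both the 1024-byte preview size and the regex are assumptions chosen for illustration. With the i flag the match is case-insensitive, so the preview string does not need to be lowercased first.

```javascript
// Hypothetical helper, NOT the node-fetch implementation.
// Scans the start of an HTML document for a charset declaration in a <meta> tag.
function sniffMetaCharset(buffer) {
  // Assumption: inspecting the first 1024 bytes is enough to reach the <meta> tags.
  // latin1 maps every byte to one character, so the ASCII markup stays intact.
  const preview = buffer.subarray(0, 1024).toString('latin1');

  // Matches both <meta charset="..."> and
  // <meta http-equiv="Content-Type" content="text/html; charset=...">.
  // The i flag makes tag and attribute names case-insensitive.
  const match = /<meta[^>]*?charset\s*=\s*["']?([^"'\s;/>]+)/i.exec(preview);

  // Normalize the label to lowercase; return null when nothing is declared.
  return match ? match[1].toLowerCase() : null;
}

const html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9" />';
console.log(sniffMetaCharset(Buffer.from(html))); // iso-8859-9
```

When a page declares more than one charset, as this one does, the sketch returns the first declaration it finds.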

Node-specific

bittlingmayer

All 4 comments

Thx for the report.

The detection does support extracting both uppercase and lowercase; whether the encoding package handles it properly is another matter.
I am not certain that the stated content-type/page encoding is correct. Either way, I recommend not relying on auto-detection if you are dealing with legacy webpages, as this API was never part of the Fetch Spec.
We intend to deprecate textConverted and replace it with a separate module in v3, as encoding detection is guesswork that shouldn't be part of node-fetch.
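One way to avoid auto-detection, sketched here under assumptions, is to read the raw bytes and decode them with a charset the caller already knows. decodeBody is a hypothetical helper, not part of node-fetch, and the sketch assumes a Node.js build with full ICU data so that TextDecoder accepts legacy labels such as windows-1254.

```javascript
// Hypothetical helper: decode raw bytes with a caller-supplied charset
// instead of guessing the encoding from the page content.
function decodeBody(bytes, charset) {
  // TextDecoder is a global in modern Node.js. It throws a RangeError
  // when the label is not supported by the current build.
  return new TextDecoder(charset).decode(bytes);
}

// "Erdoğan" encoded as windows-1254: byte 0xF0 is "ğ" in that code page.
const sample = Uint8Array.from([0x45, 0x72, 0x64, 0x6f, 0xf0, 0x61, 0x6e]);
console.log(decodeBody(sample, 'windows-1254')); // Erdoğan
```

With a fetch response, the bytes could come from res.arrayBuffer(), for example decodeBody(await res.arrayBuffer(), 'windows-1254').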

bitinn on 21 Nov 2018

Hi @bitinn,
I saved the page and removed the encoding declaration from the HTML. Chrome successfully auto-detected the encoding, but Safari failed (I guess it doesn't auto-detect at all).

minas90 on 22 Nov 2018

In the new charset encoding detection system, we've replaced the regex system with an actual HTML parser and the conversion system with a more up-to-date package. https://github.com/Richienb/fetch-charset-detection