Node-fetch: charset detection not working

Created on 21 Nov 2018  ·  4 Comments  ·  Source: node-fetch/node-fetch

The URL https://www.aksam.com.tr/guncel/baskan-erdogan-48inci-muhtarlar-toplantisinda-konusuyor/haber-795519 has:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9" />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />

However res.textConverted() returns
<p>Cumhurba�kan� Recep Tayyip Erdo�an, Be�tepe'de 48'inci Muhtarlar Toplant�s�'nda konu�tu.</p> <p><strong><em>Erdo�an'�n a��klamalar�ndan sat�r ba�lar� ��yle:</em></strong></p>

I see that convertBody in body.js is supposed to detect the charset. I think the cause is that Content-Type in the meta tag is capitalized, while the regex in body.js only matches lowercase. The preview string should be lowercased before matching.
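To make the case-sensitivity concern concrete, here is a minimal sketch of charset sniffing that ignores case. The helper name sniffMetaCharset and its regex are hypothetical and are not the code in body.js; they only illustrate that a case-insensitive match makes lowercasing the preview string unnecessary.

```javascript
// Hypothetical helper for illustration only; not node-fetch's actual implementation.
function sniffMetaCharset(preview) {
  // The `i` flag lets the tag name, attribute names, and the value match in any case.
  const match = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(preview);
  // Normalize the label so callers can compare it against known charset names.
  return match ? match[1].toLowerCase() : null;
}

const html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9" />';
console.log(sniffMetaCharset(html));
```

A regex like this still only guesses: it takes whatever a meta tag claims and cannot tell whether the bytes really use that encoding.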

Node-specific

All 4 comments

Thx for the report.

  • The detection does support both uppercase and lowercase; whether the encoding package handles this charset properly is another matter.
  • I am not certain the page's declared content-type/encoding is correct. Either way, I recommend not relying on auto-detection when dealing with legacy webpages, as this API was never part of the Fetch spec.
  • We intend to deprecate the textConverted API in v3 and replace it with a separate module, as encoding detection is guesswork that shouldn't be part of node-fetch.
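One way to follow the advice about not relying on auto-detection is to read the raw bytes and decode them with a charset you choose yourself. The sketch below assumes a Node.js build with full ICU, so that TextDecoder knows windows-1254. The helper name decodeBody is hypothetical, and the sample bytes are hand-written rather than taken from the page in question.

```javascript
// Hypothetical helper: decode raw response bytes with a caller-chosen charset.
function decodeBody(bytes, charset) {
  // `fatal: true` throws on undecodable input instead of silently emitting U+FFFD.
  return new TextDecoder(charset, { fatal: true }).decode(bytes);
}

// "Erdoğan" encoded as windows-1254: ğ is the single byte 0xF0.
const sample = Uint8Array.from([0x45, 0x72, 0x64, 0x6f, 0xf0, 0x61, 0x6e]);
console.log(decodeBody(sample, 'windows-1254'));
```

With node-fetch, the bytes would come from res.arrayBuffer() instead of a hand-written array, for example decodeBody(await res.arrayBuffer(), 'windows-1254').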

Hi @bitinn,
I saved the page and removed the encoding declaration from the HTML. Chrome successfully auto-detected the encoding, but Safari failed (I guess it doesn't auto-detect at all).

In the new charset detection system, we've replaced the regex with an actual HTML parser, and the conversion code with a more up-to-date package: https://github.com/Richienb/fetch-charset-detection

The PR was merged - time to close this.
