VS Code: Encoding wrongly guessed as Windows 1252

Created on 2 Sep 2017 · 54 Comments · Source: microsoft/vscode

Upstream issue: https://github.com/aadsm/jschardet/issues/48

VSCode Version: Version 1.15.1
OS Version: Windows 10.0.15063

Steps to Reproduce:

My VS Code settings.json:

    "files.encoding": "utf8",
    "files.autoGuessEncoding": true,

Create two txt files and make sure they are saved as UTF-8.

test1.txt

Created on: 2017年9月2日
测

test2.txt

Created on: 2017年9月2日
测试

Reopen the files: the encoding of test1.txt is guessed as Windows 1252, while test2.txt is guessed as UTF-8.

Reproduces without extensions: Yes

file-guess-encoding insiders-released on-release-notes upstream upstream-issue-linked verification-needed verified

maooyer

👍64

All 54 comments

I have the same problem with utf8 and iso88591.

rodolfomuller on 11 Sep 2017

👍4

I had the same problem.
However, it may have been fixed in 1.17.

Please try Insiders.
https://code.visualstudio.com/insiders/

sou-lab on 5 Oct 2017

Well, I tried on 1.17 and it still happens. In my case, a blank txt file containing a word with accents, even when saved as UTF-8, still reopens as Western 1252 or ISO 8859-2.

codecrafting-io on 5 Oct 2017

😕2

Sorry, it is not fixed in 1.17.0.
It still happens for me too 😢
However, Insiders guesses the encoding as UTF-8.

sou-lab on 6 Oct 2017

Still not fixed in 1.23. Test case:

    #!/bin/sh

    foo() {
        echo "starting …"
    }

The ellipsis symbol … makes VS Code guess cp1252.

Yanpas on 19 May 2018

Any updates? I'm still getting this issue today.

std4453 on 14 Jun 2018

👍9

Could the fix for issue https://github.com/Microsoft/vscode/issues/23997 have caused a regression?

Yanpas on 19 Jun 2018

The code from latin1prober.js:113 looks very suspicious

    this.getCharsetName = function() {
        return "windows-1252";
    }

It's the cause of the problem.

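For those of you who are interested in debugging, here is a snippet:

    // Debug helper: print the detector's guess for one file.
    // Assumes this file is saved in the src folder, next to init.js.
    const { detect } = require("./init");
    const fs = require("fs");

    // First command-line argument is the file to inspect.
    const fname = process.argv[2];
    if (!fname) {
        console.error("usage: node tst.js <file>");
        process.exit(1);
    }

    const buf = fs.readFileSync(fname);
    console.log(detect(buf));

Place it in src/tst.js. To run it, call node tst.js somefile from the src folder.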

Yanpas on 19 Jun 2018

😄1

Same issue here... autoGuessEncoding is disabled and yet VS Code still attempts to guess encoding as Windows-1252 even though that causes invalid characters because it's actually UTF-8...

chylex on 6 Aug 2018

It happens to me when there is a copyright symbol in the file. VSCode incorrectly guesses Windows-1252, which shows an invalid character next to the copyright symbol.

elmonty on 10 Oct 2018

👍4

The library VS Code is using seems to be abandoned: no activity for a year. This means that until someone forks it, we will keep experiencing this issue.

Yanpas on 6 Nov 2018

😕1

Hi,
Any news on this subject? I'm using version 1.28.2, and every time I open a Windows-1252 encoded file, VS Code opens it as Windows-1251, messing up all the non-standard characters.

cesarliws on 12 Nov 2018

VS Code can do lots of things, but can't handle INTL in 2018. Shame.

AlexandreGagner on 23 Nov 2018

Same problem. It has corrupted my files a few times.

Yuki-Nagato on 15 Dec 2018

Same here.
I save as UTF-8, and after reopening the file it is detected as EUC-KR.
So "ä" becomes "채".

Mordef on 28 Jan 2019

👍1

I think this is probably related to issue #64931 as well. There seem to be a lot of unfixed problems related to _file-encoding_ in this _text editor_.

phgmacedo on 29 Jan 2019

I had to disable autoGuessEncoding to prevent VSC from guessing that the attached Markdown file is UTF-8, when it was actually saved as Windows-1251.
Tarefas.zip

JulioNobre on 15 Feb 2019

👍1

Me too. I have many problems with automatic encoding detection. This seems to be a bug that was reported a long time ago. Is there any news about a fix?

ottopic on 27 Feb 2019

👍5

Still not fixed in 1.32.3

Antecer on 25 Mar 2019

@bpasero is there any news about it? It's frustrating...

ottopic on 25 Mar 2019

I just tweeted to VSC about it, before Googling :) VSC 1.32.3 on Win10 64bit. Same issue.

jozsefk9 on 28 Mar 2019

😕2

Incredible... Almost two years ago I gave up using VSCode because of this bug, and now, when I decided to give it another try, I've discovered that it still happens... I think I'll stick with phpStorm :/

VSCode 1.32.3 (linux and macOS)

simonardejr on 3 Apr 2019

😕3 👍2

Yep. Encoding bugs prevent me from using VSCode as my editor in entire classes of programming languages and text editing tasks.

One would think that text editing would be core functionality to a text editor, but I guess in VSCode it is a dependency...

phgmacedo on 3 Apr 2019

👍3

Why is Microsoft ignoring this issue? Why doesn't @bpasero answer our questions or pass the ticket to a colleague?

ottopic on 3 Apr 2019

👍7

At the very least, until encoding guessing is fixed (and properly unit-tested), the autoGuessEncoding setting should be disabled by default.

JulioNobre on 3 Apr 2019

👍1

files.autoGuessEncoding is already disabled by default.

The core problem is that people sometimes open files in various encodings and do not want to switch encoding manually. I guess the difficulty is identifying which ANSI encoding is used. UTF-8 (with or without BOM) and UTF-16 (LE/BE) should be easy to distinguish. So I have a suggestion:

If a file is not encoded in Unicode, VS Code can try to decode it in the local ANSI encoding first. If the result is unintelligible, try other encodings. This would work most of the time. It's just an idea; I don't know how detection currently works. @bpasero

For example, the local ANSI encoding under the zh-CN region (meaning the OS region setting, not the current interface language of VS Code) should be GB18030 (code page 54936).

https://docs.microsoft.com/windows/desktop/intl/code-page-identifiers
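The order of checks proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how VS Code or its detection library work: `pickEncoding` and its `ansiFallback` parameter are hypothetical names, the returned labels are illustrative, and strict decoding with `TextDecoder` and `fatal: true` assumes a Node.js build with ICU support.

```javascript
// Sketch only: settle the Unicode cases first, then fall back to a region default.
// `ansiFallback` is a hypothetical parameter, e.g. "gb18030" on a zh-CN system.
function pickEncoding(buf, ansiFallback) {
  // 1. Byte order marks identify the encoding directly.
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return "utf8bom";
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return "utf16le";
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return "utf16be";

  // 2. No BOM: accept UTF-8 only if every byte sequence is well-formed.
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(buf);
    return "utf8";
  } catch {
    // 3. Not Unicode: use the region default instead of a statistical guess.
    return ansiFallback;
  }
}

const utf8Sample = Buffer.from("测", "utf8");   // bytes e6 b5 8b
const gbkSample = Buffer.from([0xb2, 0xe2]);    // "测" as encoded in GBK/GB18030

console.log(pickEncoding(utf8Sample, "gb18030")); // utf8
console.log(pickEncoding(gbkSample, "gb18030"));  // gb18030
```

UTF-16 without a BOM is not covered by this sketch, and the fallback is only as good as the region setting it is taken from.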

byyxx128 on 3 Apr 2019

VSCODE 1.33.0 - Still not fixed. Corrupted a project because of it and caused a big loss of time...
Instead of seeing "replace(sValue, "£","£");"
I see replace(sValue, "Â£","£"); OR replace(sValue, "ï¿½","£");
EVEN IF I SELECT CORRECT ENCODING THAT WAS USED BY OLD DEVS
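One plausible reading of the two garbled forms above, sketched in Node.js: "Â£" is what the UTF-8 bytes of "£" look like when decoded as a single-byte Western encoding, and "ï¿½" is the UTF-8 form of the replacement character U+FFFD decoded the same way. The snippet uses Node's `latin1` (ISO 8859-1), which agrees with Windows-1252 for the byte values involved. Whether this is what actually happened to the project above is an assumption.

```javascript
// "£" is U+00A3; UTF-8 stores it as the two bytes c2 a3.
const pound = Buffer.from("£", "utf8");
console.log(pound.toString("hex")); // c2a3

// Decoding those two bytes with a single-byte encoding yields two characters.
const misread = pound.toString("latin1");
console.log(misread); // Â£

// The replacement character U+FFFD is ef bf bd in UTF-8; misread the same way:
const replacement = Buffer.from("\uFFFD", "utf8").toString("latin1");
console.log(replacement); // ï¿½
```

If that reading is right, the second form means the original byte had already been replaced by U+FFFD before the file was saved, so reopening with a different encoding cannot bring it back.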

betrin on 11 Apr 2019

VSCode 1.35.0 Still not fixed...

kouhei on 8 Jun 2019

😕12

VSCode 1.37.1 Still not fixed...

ottopic on 24 Aug 2019

😕7

This is the same issue as #64931, which was closed because the bug is "upstream".

The message is very clear: VS Code is the only text editor where text editing bugs are not a priority.

phgmacedo on 10 Oct 2019

😕3

Excited to see this issue isn't abandoned. With so many encodings besides Unicode in the world, maybe it is really hard, or even impossible, to fix.

So I have an idea. Most users only ever encounter a limited number of specific encodings. Besides Unicode (UTF-8, UTF-16, etc., with or without BOM), most of the rest are used for specific languages.

e.g. GBK for simplified Chinese, JIS for Japanese, etc.

It should be easy to identify whether a file is encoded in Unicode (by the file header? Sorry, I don't really know the principles). Then, if the file is not in Unicode, first try to decode it with the region-corresponding encoding. If that doesn't work, try to guess the encoding.

P.S. Windows has a setting for the default non-Unicode encoding (Notepad calls it ANSI). That is what I mean by "region-corresponding encoding".

This should eliminate a lot of incorrect guesses.

byyxx128 on 18 Oct 2019

👍3

The problem still exists in 1.37.1: I save a file containing Chinese text as UTF-8, and when I reopen it, it is detected as windows1252 and the Chinese text turns into unrecognizable characters.

liushiqi9 on 20 Nov 2019

👍2

@liushiqi9 Don't worry. This bug will never be fixed because VS Code is from Microsoft.
Microsoft's software is buggy, broken, expensive, and always trying some new way to spy on you or show you more ads.

Microsoft VS Code is the only text editor that I know of where editing text is not a priority feature

phgmacedo on 20 Nov 2019

👎4

From my own experience, while some encodings are easily recognizable, Windows 1252 is surely not one of them because, depending on the content, its binary representation is exactly the same as that of many other encodings.

That's why, in my opinion, trying to achieve a universal, 100% accurate encoding-guessing mechanism is a plain dead end.

In this case, any solution will always be a short blanket. If we pull on one side, we uncover on the other.

P.S.: I have been enjoying Visual Studio Code and have had no further wrong encoding guesses since I disabled the autoGuessEncoding option and started using Reopen with Encoding (accessible through View menu > Command Palette > Change File Encoding).

As a final note on @pedrogarcia's last comment: on an open-source project like this, it's completely unfair to blame Microsoft for not fixing something that anyone else can fix, including him ;-)

JulioNobre on 20 Nov 2019

On the other hand, there's always the possibility of auto-guessing only unambiguous encodings and presenting a list of ambiguous candidate encodings, so the user can pick the one most appropriate for their context.
A solution like this would keep the benefit of exact guesses while preventing misrecognitions.
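A minimal sketch of that split, assuming Node.js with ICU support for strict decoding. The `classify` function, the result shape, and the candidate list are all hypothetical and only illustrate the idea; treating well-formed multi-byte UTF-8 as unambiguous is itself an assumption, based on such sequences rarely occurring by accident in legacy-encoded text.

```javascript
// Hypothetical list an editor might offer the user; the labels are illustrative.
const AMBIGUOUS_CANDIDATES = ["windows1252", "iso88591", "windows1251", "gb18030", "euckr"];

function classify(buf) {
  // Bytes below 0x80 read the same in UTF-8 and in ASCII-compatible code pages,
  // so a pure-ASCII file can be opened as UTF-8 without risk.
  if (buf.every((b) => b < 0x80)) {
    return { ambiguous: false, encoding: "utf8" };
  }
  try {
    // Strict decode: throws on any malformed UTF-8 sequence.
    new TextDecoder("utf-8", { fatal: true }).decode(buf);
    return { ambiguous: false, encoding: "utf8" };
  } catch {
    // High bytes that are not valid UTF-8: do not guess, let the user choose.
    return { ambiguous: true, candidates: AMBIGUOUS_CANDIDATES };
  }
}

// UTF-8 text like test1.txt from the original report.
console.log(classify(Buffer.from("Created on: 2017年9月2日\n测", "utf8")));

// "coração" stored as Windows-1252 bytes (ç = e7, ã = e3).
console.log(classify(Buffer.from([0x63, 0x6f, 0x72, 0x61, 0xe7, 0xe3, 0x6f])));
```

The first call reports an unambiguous UTF-8 file; the second returns the candidate list instead of committing to a single legacy encoding.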

JulioNobre on 20 Nov 2019

👍1

Dear everyone,
This is an upstream issue caused by https://github.com/aadsm/jschardet/issues/48, which means VS Code cannot fix it by itself. However, they are working on it in #84503.
Thanks to https://github.com/microsoft/vscode/issues/33720#issuecomment-555932447 for the clarification.

byyxx128 on 20 Nov 2019

👍2

while some encodings are easily recognizable, Windows 1252 is surely not one of them, because, depending on the content, its binary representation is exactly the same as many other encodings

In my experience, with the exact same text file, I made a point to download all other text editors that I found, and every single editor that I tested got the encoding right, except VS Code.

phgmacedo on 20 Nov 2019

@pedrogarcia we all understand why this seems so easy to solve.
As @byyxx128 already pointed out, in issue #84503, which I was not aware of until today, the VS Code team and its community have already hammered on this issue. After reading it carefully, I really think they have done a good job and are on the right track to reach a good compromise. Let's cross our fingers and hope they are successful.

JulioNobre on 20 Nov 2019

Hello everyone. As https://github.com/microsoft/vscode/issues/84503#issuecomment-552989637 points out, VS Code does not yet have enough reason to support changes to the encoding detection tool.

The criteria at https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests include:

Has the community at large expressed interest in this functionality? I.e. has it gathered more than 10 up-votes or more than 10 comments? This criterion alone covers more than 650 of the 2850 open feature requests as of right now, October 9th, 2019.

https://github.com/microsoft/vscode/issues/84503 does not have enough up-votes yet, so I ask everyone to vote for it and let VS Code and the community support this change.

sunbohong on 21 Nov 2019

Lol @microsoft that the community has to upvote a feature request to have their text editor properly detect encoding and edit text.

phgmacedo on 21 Nov 2019

OMG, it's hard to believe that this problem has existed for so long. 1.40.2: still not fixed...

MxDany on 28 Nov 2019

How to upvote?

JulioNobre on 28 Nov 2019

@JulioNobre I get that it may be ambiguous to identify the exact encoding, but something has to be done, even if it does not involve Microsoft code directly. Try creating a blank txt file with the Windows-1252 encoding and writing the word "coração". Now reopen the file, and you will see that even for something apparently simple and created by Code itself, the guessed encoding is still wrong. I tried to reproduce this in multiple text editors, and none of them opened the file with the wrong encoding. I also tested with ISO 8859-1: same issue. So this is a problem, because if the application offers the option to save in multiple encodings, it should at least reopen a file it created with the same encoding; otherwise it should not offer those encoding options.

codecrafting-io on 17 Dec 2019

👍1

VSCode 1.41.0 Still not fixed...

cv0cv0 on 17 Dec 2019

👍5

still not fixed, version 1.42.1

x4m3 on 21 Mar 2020

although i got it working with a workaround:

in my settings.json file i disabled guessing the file encoding and i force files to be encoded in utf-8:

    "files.encoding": "utf8",
    "files.autoGuessEncoding": false

x4m3 on 21 Mar 2020

"files.autoGuessEncoding" can also be disabled per language, which mitigates the problem. It would be helpful to be able to do per file extension.

phgmacedo on 21 Mar 2020

although i got it working with a workaround:

in my settings.json file i disabled guessing the file encoding and i force files to be encoded in utf-8:

This only makes sense if your whole project is encoded as UTF-8.
autoGuessEncoding is useful precisely for projects with files in mixed encodings.