Vscode: Wrong guess encoding as Windows 1252

Created on 2 Sep 2017  ·  54 Comments  ·  Source: microsoft/vscode

Upstream issue: https://github.com/aadsm/jschardet/issues/48

  • VSCode Version: Version 1.15.1
  • OS Version: Windows 10.0.15063

Steps to Reproduce:

  1. The settings.json of my vscode
    "files.encoding": "utf8",
    "files.autoGuessEncoding": true,
  1. Create two txt files and make sure they are saved as UTF-8

test1.txt

Created on: 2017年9月2日
测

test2.txt

Created on: 2017年9月2日
测试
  1. Reopen the files: the guessed encoding of test1.txt is Windows 1252, while the guessed encoding of test2.txt is UTF-8.
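
To see what such a wrong guess does to the text, here is a minimal Node sketch. It is only an illustration, not what VS Code or jschardet runs internally. It assumes Node's built-in "latin1" codec (ISO 8859-1) as a stand-in for Windows 1252; the two differ only in the 0x80-0x9F range.

```javascript
// "测" (U+6D4B), the last character of test1.txt, encoded as UTF-8.
const utf8Bytes = Buffer.from("\u6d4b", "utf8");
console.log(utf8Bytes.toString("hex")); // e6b58b

// Assumption: "latin1" stands in for Windows 1252 here.
// Every byte becomes one character, so a single CJK character turns into three.
const misread = utf8Bytes.toString("latin1");
console.log(misread.length); // 3

// Nothing is lost as long as the file is not saved in the wrong encoding:
// re-encoding the misread text as latin1 gives back the original bytes.
const roundTrip = Buffer.from(misread, "latin1");
console.log(roundTrip.equals(utf8Bytes)); // true
```

Because any byte sequence is readable as a single-byte encoding, the misread text never raises a decoding error; it just looks wrong.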


Reproduces without extensions: Yes

file-guess-encoding insiders-released on-release-notes upstream upstream-issue-linked verification-needed verified

All 54 comments

I have the same problem with UTF-8 and ISO 8859-1.

I had the same problem.
However, it may have been fixed in 1.17.

Please try Insiders.
https://code.visualstudio.com/insiders/

Well, I tried on 1.17 and it still happens. In my case, a blank txt file containing a word with accents, even when saved as UTF-8, still reopens as Western 1252 or ISO 8859-2.

Sorry, it is not fixed in 1.17.0.
It still happens to me too😢
However, in Insiders the guessed encoding is UTF-8.

Still not fixed in 1.23.
Test case:

#!/bin/sh

foo() {
    echo "starting …"
}

The ellipsis symbol … makes VS Code guess cp1252.

Any updates? I'm still getting this issue today.

Could the fix for https://github.com/Microsoft/vscode/issues/23997 have caused a regression?

The code at latin1prober.js:113 looks very suspicious:

    this.getCharsetName = function() {
        return "windows-1252";
    }

It's the cause of the problem.

For those of you who are interested in debugging here is a snippet:

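// Debug helper: print the detector's guess for a file given on the command line.
const detect = require("./init").detect
const fs = require("fs")

// process.argv[0] is node and [1] is this script, so [2] is the file to inspect.
const fname = process.argv[2]
const buf = fs.readFileSync(fname)

console.log(detect(buf))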

Place it in src/tst.js. To run it, call node tst.js somefile from the src folder.

Same issue here... autoGuessEncoding is disabled and yet VS Code still attempts to guess encoding as Windows-1252 even though that causes invalid characters because it's actually UTF-8...

It happens to me when there is a copyright symbol in the file. VSCode incorrectly guesses Windows-1252, which shows an invalid character next to the copyright symbol.

The library VS Code is using seems to be abandoned: no activity for a year. This means that until someone forks it, we will keep experiencing this issue.

Hi,
Any news on this subject? I'm using version 1.28.2, and every time I open a Windows-1252 encoded file, VS Code opens it as Windows-1251, messing up all the non-standard characters.

VS Code can do lots of things, but can't handle INTL in 2018. Shame.

Same problem. It has corrupted my files a few times.

Same here.
I save as UTF-8, and after reopening the file it is detected as EUC-KR.
So "ä" becomes "채".

I think this is probably related to issue #64931 as well. There seem to be a lot of unfixed problems related to _file-encoding_ in this _text editor_.

I had to disable autoGuessEncoding to prevent VS Code from guessing the attached Markdown file as UTF-8 when it was actually saved as Windows-1251.
Tarefas.zip

Me too. Many problems with encoding auto-detection. This seems to be a very old reported bug. Is there any news about a fix?

Still not fixed in 1.32.3

@bpasero is there any news about it? It's frustrating...

I just tweeted to VSC about it, before Googling :) VSC 1.32.3 on Win10 64-bit. Same issue.

Incredible... Almost two years ago I gave up using VSCode because of this bug, and now that I decided to give it another try, I discovered that it still happens... I think I'll stick with phpStorm :/

VSCode 1.32.3 (linux and macOS)

Yep. Encoding bugs prevent me from using VSCode as my editor in entire classes of programming languages and text editing tasks.

One would think that text editing would be core functionality to a text editor, but I guess in VSCode it is a dependency...

Why should Microsoft ignore this issue? Why doesn't @bpasero answer our questions or pass the ticket to a colleague?

At the very least, until encoding guessing is fixed (and properly unit-tested), the autoGuessEncoding setting should be disabled by default.

files.autoGuessEncoding is already disabled by default.

The core problem is that people sometimes open files in various encodings and do not want to switch the encoding manually. I guess the difficulty is identifying which ANSI encoding is used. UTF-8 (with or without BOM) and UTF-16 (LE/BE) should be easy to distinguish. So I have a suggestion:

If a file is not encoded in Unicode, VS Code could try to decode it in the local ANSI encoding first and, if the result is unintelligible, try other encodings. That would work most of the time. This is just an idea; I don't know how detection currently works. @bpasero

For example, the local ANSI encoding for the zh-CN region (taken from the OS region setting, not from the current interface language of VS Code) should be GB18030 (code page 54936).

https://docs.microsoft.com/windows/desktop/intl/code-page-identifiers
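
The order proposed above could be sketched roughly like this in Node. This is a hedged illustration only, not how VS Code or jschardet behaves. The names isValidUtf8 and pickEncoding are hypothetical, and the localAnsiEncoding argument is assumed to be supplied by the caller from the OS region setting. The "is the result intelligible?" check from the proposal is not attempted here.

```javascript
// Hypothetical helper: strict UTF-8 validation using the standard TextDecoder.
// With { fatal: true }, decode() throws on any invalid byte sequence.
function isValidUtf8(buf) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(buf);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: prefer UTF-8 when the bytes validate,
// otherwise fall back to the caller-supplied regional encoding label.
function pickEncoding(buf, localAnsiEncoding) {
  if (isValidUtf8(buf)) {
    return "utf8";
  }
  return localAnsiEncoding;
}

// Valid UTF-8 wins regardless of the regional default.
console.log(pickEncoding(Buffer.from("\u6d4b\u8bd5", "utf8"), "gb18030")); // utf8

// These bytes start with a continuation byte, so they cannot be UTF-8.
console.log(pickEncoding(Buffer.from([0xb2, 0xe2, 0xca, 0xd4]), "gb18030")); // gb18030
```

The design choice here is that validation is used only to accept or reject UTF-8. For single-byte encodings every byte sequence validates, so the fallback has to come from context such as the region setting rather than from the bytes themselves.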

VSCODE 1.33.0 - Still not fixed. Corrupted a project because of it and caused a big loss of time...
Instead of seeing "replace(sValue, "£","£");"
i see replace(sValue, "£","£"); OR replace(sValue, "�","£");
EVEN IF I SELECT CORRECT ENCODING THAT WAS USED BY OLD DEVS

VSCode 1.35.0 Still not fixed...

VSCode 1.37.1 Still not fixed...

This is the same issue as #64931, which was closed because the bug is "upstream".

The message is very clear: VS Code is the only text editor where text editing bugs are not a priority.

Excited to see this issue isn't abandoned. Because there are so many encodings besides Unicode in the world, maybe it is really hard, or even impossible, to fix.

So I have an idea. Most users only ever encounter a limited number of specific encodings. Besides Unicode (UTF-8, UTF-16, etc., with or without BOM), most of the rest are used for specific languages.

e.g. GBK for simplified Chinese, JIS for Japanese, etc.

It should be easy to identify whether a file is encoded in Unicode (by the file header? Sorry, I don't really know the principles). Then, if the file is not in Unicode, first try to decode it with the region-corresponding encoding. If that doesn't work, try to guess the encoding.

P.S. Windows has a setting for the default non-Unicode encoding (Notepad calls it ANSI). The "region-corresponding encoding" is like this.

This should greatly reduce the number of incorrect guesses.
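
The "file header" mentioned above is the byte order mark (BOM). A minimal sketch of BOM sniffing in Node follows; detectBom is a hypothetical name, and the returned labels are just Node-style encoding names chosen for illustration. Files without a BOM (which includes most UTF-8 files) return null and still need another detection step.

```javascript
// Hypothetical helper: return an encoding label if the buffer starts with a
// known Unicode byte order mark, or null when no BOM is present.
function detectBom(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return "utf8"; // EF BB BF
  }
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    return "utf16be"; // FE FF
  }
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    // FF FE; note that a UTF-32 LE BOM (FF FE 00 00) starts the same way,
    // which this sketch does not distinguish.
    return "utf16le";
  }
  return null;
}

console.log(detectBom(Buffer.from([0xef, 0xbb, 0xbf, 0x41]))); // utf8
console.log(detectBom(Buffer.from("no bom here", "utf8")));    // null
```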

The problem still exists in 1.37.1: I save a file containing Chinese text as UTF-8, and when I reopen it, it is detected as Windows 1252 and the Chinese text turns into unrecognizable characters.

@liushiqi9 Don't worry. This bug will never be fixed because vs code is from Microsoft.
Microsoft's software is buggy, broken, expensive, and always trying some new way to spy on you or show you more ads.

Microsoft VS Code is the only text editor that I know of where editing text is not a priority feature

From my own experience, while some encodings are easily recognizable, Windows 1252 is surely not one of them because, depending on the content, its binary representation is exactly the same as in many other encodings.

That's why, in my opinion, trying to achieve a universal, 100% accurate encoding-guessing mechanism is a dead end.

In this case, any solution will always be a short blanket: if we pull on one side, we uncover the other.

P.S.: I have been enjoying Visual Studio Code and have had no further wrong encoding guesses since I disabled the autoGuessEncoding option and started using Reopen with Encoding (accessible through View menu > Command Palette > Change File Encoding).

As a final note on @pedrogarcia's last comment: in an open-source project like this, it's completely unfair to blame Microsoft for not fixing something that anyone else can fix, including him ;-)

On the other hand, there's always the possibility of auto-guessing only unambiguous encodings and presenting a list of ambiguous candidate encodings so the user can pick the most suitable one for their context.
A solution like this would keep the exact guesses while preventing misrecognitions.

Dear everyone,
This is an upstream issue caused by https://github.com/aadsm/jschardet/issues/48, which means VS Code cannot fix it by itself. However, they are working on it in #84503.
Thanks for https://github.com/microsoft/vscode/issues/33720#issuecomment-555932447 clarification.

while some encodings are easily recognizable, Windows 1252 is surely not one of them, because, depending on the content, its binary representation is exactly same as many other encodings

In my experience, with the exact same text file, I made a point to download all other text editors that I found, and every single editor that I tested got the encoding right, except VS Code.

@pedrogarcia we all understand why this seems so easy to solve.
As @byyxx128 already pointed out, in issue #84503, which I was not aware of until today, the VS Code team and its community have already hammered on this issue. After reading it carefully, I really think they have done a good job and are on the right track to achieve a good compromise. Let's cross our fingers and hope they are successful.

Hello everyone, as https://github.com/microsoft/vscode/issues/84503#issuecomment-552989637 points out, VS Code still doesn't have enough reasons to support changes to the encoding detection tool.

And as https://github.com/microsoft/vscode/wiki/Issue-Grooming#out-of-scope-feature-requests points out:

Has the community at large expressed interest in this functionality? I.e. has it gathered more than 10 up-votes or more than 10 comments? This criterion alone covers more than 650 of the 2850 open feature requests as of right now, October 9th, 2019.

https://github.com/microsoft/vscode/issues/84503 doesn't have enough up-votes yet,

so I ask everyone to vote for https://github.com/microsoft/vscode/issues/84503 so that VS Code and the community support this change.

Lol @microsoft that the community has to upvote a feature request to have their text editor properly detect encoding and edit text.

OMG, hard to believe that this problem has existed for so long. 1.40.2: still not fixed...

How to upvote?

@JulioNobre I get that it may be ambiguous to identify the exact encoding, but something has to be done, even if it does not involve Microsoft code directly. Try creating a blank txt file with the Windows-1252 encoding and write the word "coração". Now reopen the file, and you will see that, even for something apparently simple and created by Code itself, the guessed encoding is still wrong. I tried to reproduce this in multiple text editors, and none of them opened the file with the wrong encoding. I also tested with ISO 8859-1: same issue. This is a problem, because if the application offers the option to save in multiple encodings, it should at least reopen a file it created with the same encoding; otherwise it should not offer those encoding options.
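
The "coração" case can be reproduced at the byte level in Node. This is a minimal sketch using Node's built-in "latin1" codec, on the assumption that it is an acceptable stand-in here: ISO 8859-1 and Windows-1252 assign the same bytes to "ç" and "ã". It only shows what the file's bytes look like under two readings; it says nothing about how jschardet scores them.

```javascript
// "coração" written with escapes so the example does not depend on how this
// source file itself is encoded: ç is U+00E7, ã is U+00E3.
const word = "cora\u00e7\u00e3o";

// One byte per character in a single-byte Western encoding.
const singleByte = Buffer.from(word, "latin1");
console.log(singleByte.toString("hex")); // 636f7261e7e36f

// Reading those bytes as UTF-8 produces replacement characters (U+FFFD),
// because E7 and E3 are lead bytes that are not followed by continuation bytes.
const asUtf8 = singleByte.toString("utf8");
console.log(asUtf8.includes("\ufffd")); // true

// Reading them back with the encoding they were written in restores the word.
console.log(singleByte.toString("latin1") === word); // true
```

So for this file a detector can rule out UTF-8 from the bytes alone, but choosing among the single-byte candidates still requires statistics or context.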

VSCode 1.41.0 Still not fixed...

Still not fixed in version 1.42.1.

However, I got it working with a workaround:

In my settings.json file I disabled encoding guessing and forced files to be encoded in UTF-8:

    "files.encoding": "utf8",
    "files.autoGuessEncoding": false

"files.autoGuessEncoding" can also be disabled per language, which mitigates the problem. It would be helpful to be able to do per file extension.

However, I got it working with a workaround:

In my settings.json file I disabled encoding guessing and forced files to be encoded in UTF-8:

This only makes sense if your whole project is encoded as UTF-8.
autoGuessEncoding is useful precisely for projects with files in mixed encodings.

Still experiencing this problem.
Forced to reopen with UTF-8 every time I open the file.
At least it should try to save my choices.
Annoyed.

People, on the 29th of May we have had this issue open for 1000 days.

Celebrate... :cake:

1002 today..

I think this newest release solves this issue: https://github.com/aadsm/jschardet/releases/tag/v2.2.1

@bpasero

We can pick up a new version for July, as we are currently closing for June endgame.
