Vscode: Allow to configure a list of encodings to use when guessing

Created on 26 Oct 2017 · 24 Comments · Source: microsoft/vscode

Setting files.autoGuessEncoding=true doesn't work well in some circumstances.

I think it would be good to add a setting like files.forceEncoding="encode1:encode2,encode3:encode4".

That way a guess of 'encode1' can be forced to 'encode2'. I think that would be a solution for wrong encoding detection.

feature-request file-guess-encoding

All 24 comments

Yes, I totally agree, because auto guess is so weak.
Adding a candidate list may be better!
For me, and maybe for many Chinese coders, only UTF-8 and GB18030 are commonly met, but auto-guess gives me Windows 1532??? I think it is easier to detect within the user's candidate encodings.

I agree. In my environment we have files in two encodings, UTF-8 and Windows1251 (the most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
We definitely need a setting like

files.detectEncodings=["utf8","windows1251"]

So instead of just "true", you can specify which encodings you want it to detect from. As far as I know, encoding detection works based on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think it is possible to implement: just filter the list of possible encodings down to those the user selected.
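The filtering idea above could look roughly like the sketch below. It assumes the detector can return several candidates with a confidence score each, which the thread does not confirm for the library VS Code uses. `DetectionCandidate`, `pickAllowedEncoding` and the confidence numbers are made up for illustration.

```typescript
// Hypothetical shape of one detection candidate (not a real VS Code or jschardet type).
interface DetectionCandidate {
  encoding: string;
  confidence: number; // 0..1, higher means more probable
}

// Normalize names so that "Windows-1251" and "windows1251" compare equal.
function normalizeEncodingName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Return the most probable candidate that the user allowed, or undefined if none is allowed.
function pickAllowedEncoding(
  candidates: DetectionCandidate[],
  allowed: string[],
): string | undefined {
  const allowedSet = new Set(allowed.map(normalizeEncodingName));
  let best: DetectionCandidate | undefined;
  for (const candidate of candidates) {
    if (!allowedSet.has(normalizeEncodingName(candidate.encoding))) {
      continue; // the user never expects this encoding, ignore it
    }
    if (best === undefined || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best?.encoding;
}

// Invented confidences: the top guess is wrong, but the right one is in the list.
const picked = pickAllowedEncoding(
  [
    { encoding: "maccyrillic", confidence: 0.62 },
    { encoding: "windows-1251", confidence: 0.58 },
    { encoding: "windows-1252", confidence: 0.31 },
  ],
  ["utf8", "windows1251"],
);
console.log(picked); // "windows-1251"
```

When nothing in the candidate list is allowed, the function returns undefined and the caller would have to decide on a default.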

Verification: There is now a files.guessableEncodings setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.
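As a rough illustration of the behaviour that explanation describes (not the actual VS Code code), the rule can be sketched as a small function. `applyGuessableEncodings` is a hypothetical name, and treating an empty list as "no restriction" is an assumption.

```typescript
// Sketch only: keep the guessed encoding if it is in the user's list,
// otherwise use the default encoding.
function applyGuessableEncodings(
  guessed: string | undefined,
  guessable: string[],
  defaultEncoding: string,
): string {
  if (guessed === undefined) {
    return defaultEncoding; // the detector had no opinion
  }
  if (guessable.length === 0) {
    return guessed; // assumption: an empty list means no restriction
  }
  const wanted = guessed.toLowerCase();
  const isListed = guessable.some((e) => e.toLowerCase() === wanted);
  return isListed ? guessed : defaultEncoding;
}

console.log(applyGuessableEncodings("gbk", ["gbk"], "utf8")); // "gbk"
console.log(applyGuessableEncodings("windows1252", ["gbk"], "utf8")); // "utf8"
```

Note that the list only filters a single guess; it cannot turn a wrong guess into the right one.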

Update: I decided to rename the setting to files.guessableEncodings

@bpasero With these settings:

    "files.autoGuessEncoding": true,
    "files.guessableEncodings": [
      "gbk"
    ]

I still get this file as UTF-8. It is in gbk encoding with two Chinese characters.

foo.txt

@octref you have to use a file that jschardet can detect properly. In your case it tells me:

[image: jschardet detection result]

So it makes sense that UTF-8 is used.

To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt with CP1252 encoding!

@bpasero I see, the logic is

  • Guessed encoding is not in files.guessableEncodings
  • Fall back to utf-8

But I would argue this doesn't solve the users' problem. Let's say the user has a bunch of files that they know are gbk encoded, but jschardet could have guessed any of these:

[image: list of encodings jschardet could guess]

If the user wants all of these files to be opened as gbk, this setting would not work for them.
The original request is more about being able to set fallbacks. For example,

  • If guessed encoding is gb2312, gb18030, fall back to gbk.
  • Otherwise, fall back to utf-8.

A setting like this would be more useful:

{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
}
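A minimal sketch of how such a mapping might be resolved, assuming the detector returns a single encoding name. `files.encodingAssociations` is only the proposal above, and `resolveAssociatedEncoding` is a hypothetical helper, not an existing VS Code function.

```typescript
// Target encoding -> guessed encodings that should be opened as the target.
type EncodingAssociations = Record<string, string[]>;

function resolveAssociatedEncoding(
  guessed: string,
  associations: EncodingAssociations,
  fallback: string = "utf-8",
): string {
  const wanted = guessed.toLowerCase();
  for (const target of Object.keys(associations)) {
    // Assumption: a guess that already equals the target is kept as the target.
    if (target.toLowerCase() === wanted) {
      return target;
    }
    if (associations[target].some((alias) => alias.toLowerCase() === wanted)) {
      return target;
    }
  }
  return fallback; // everything else falls back
}

const associations: EncodingAssociations = {
  gbk: ["gb2312", "gb18030"],
  cp950: ["big5hkscs"],
};

console.log(resolveAssociatedEncoding("GB2312", associations)); // "gbk"
console.log(resolveAssociatedEncoding("big5hkscs", associations)); // "cp950"
console.log(resolveAssociatedEncoding("windows-1252", associations)); // "utf-8"
```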

Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).

@bpasero in the implementation from the original pull request, the encoding falls back to the first one in the list instead of utf-8. It was definitely not a great solution. I think @octref's solution will resolve the issue.

It looks like @JasonJunMa and @phobos2077 both made different suggestions and the current solution is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-344534006 while https://github.com/Microsoft/vscode/issues/36951#issue-268634326 is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-425162895

Since we are late for the endgame and the feature is not clear, I will remove it from the release until we have figured out what the best solution is.

I'm facing this issue too, since a large portion of the codebase I work with is encoded in windows-1251 but is often guessed as maccyrillic.

From my point of view, @octref's solution is bulletproof, but it requires a user to learn which encoding jschardet will guess for pretty much each file in the codebase and to fine-tune preferences every time a new false positive encoding is guessed. I think this behaviour can be implemented as a temporary solution.

From my point of view the best solution is to fork jschardet and make it return a list of possible encodings with a probability for each encoding. Then we can add a new setting (something like files.preferableEncodings) which holds a list of encodings, and if an encoding from this list passes a certain threshold (which may also be configurable) it is chosen for opening the file instead of the most probable one. I think this solution will cover most of the cases, but if not, a user can fall back to the files.encodingAssociations setting proposed by @octref.
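The threshold idea could be sketched as follows. Everything here is hypothetical: `ScoredGuess`, `choosePreferredEncoding`, the default threshold of 0.2 and the sample confidences are invented, and the sketch assumes a detector that returns every candidate with a confidence.

```typescript
// Hypothetical detector output: one entry per possible encoding.
interface ScoredGuess {
  encoding: string;
  confidence: number; // 0..1
}

function choosePreferredEncoding(
  guesses: ScoredGuess[],
  preferable: string[],
  threshold: number = 0.2,
): string | undefined {
  // Take the first preferred encoding (in the user's order) that passes the threshold.
  for (const preferred of preferable) {
    const hit = guesses.find(
      (g) => g.encoding.toLowerCase() === preferred.toLowerCase(),
    );
    if (hit !== undefined && hit.confidence >= threshold) {
      return hit.encoding;
    }
  }
  // No preferred encoding is plausible enough: use the most probable guess overall.
  let top: ScoredGuess | undefined;
  for (const g of guesses) {
    if (top === undefined || g.confidence > top.confidence) {
      top = g;
    }
  }
  return top?.encoding;
}

const sampleGuesses: ScoredGuess[] = [
  { encoding: "maccyrillic", confidence: 0.6 },
  { encoding: "windows-1251", confidence: 0.5 },
];

console.log(choosePreferredEncoding(sampleGuesses, ["utf-8", "windows-1251"], 0.4)); // "windows-1251"
console.log(choosePreferredEncoding(sampleGuesses, ["utf-8", "windows-1251"], 0.55)); // "maccyrillic"
```

A lower threshold makes the preferred list win more often, at the cost of occasionally forcing a preferred encoding onto a file that really is something else.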

@bpasero @octref what do you think about this solution: use the same setting as was previously implemented (a single list of encodings), but with the last one on the list used as a fallback? It makes sense in terms of my original suggestion (narrow down the list of possible encodings to only the ones you need). But it is not as flexible as @octref's suggestion.

Edit: noticed this was already suggested before... How about this:

  • Allow user to specify either a list of strings (last one is used as fallback), OR an associative array like @octref has suggested?

This should be easier to set up for most cases (like my case), but at the same time flexible enough for more complicated cases.

I use only 2 encodings: UTF-8 and Windows-1250 (Central European ANSI code page).
I set Auto Guess Encoding to true.
The problem is that Visual Studio Code incorrectly detects Windows-1250 as ISO 8859-2, and some letters are not displayed correctly.
What should I set in files.guessableEncodings, and where, to use Windows-1250 (Polish letters)?

I have the same use case as @Tomek-PL: we only use utf-8, windows-1250 or windows-1252. Files get detected as ISO 8859-7, which renders characters incorrectly.

Neither files.restrictGuessedEncodings nor files.guessableEncodings works.

Click "upvote" in the first post. This will increase the chance that someone will take care of it

Hi all,
What I need is just like fileencodings in vim (see https://vim.fandom.com/wiki/Working_with_Unicode).
It just gives vim an ordered list of encodings to try. I think it can resolve most ambiguous encoding detection, as I haven't gotten garbled text when using vim with the correct setting.

For example, I only use GB18030 and UTF8, so I set the following in .vimrc

fileencodings=gb18030,utf8

I think it is trivial to implement. @octref's logic is a bit complex, and in my view it may not be needed.
@bpasero's implementation may be OK if the guess list is tried in the order it is defined (but I haven't seen that implementation in a VS Code release).
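For illustration, the "try each encoding in order" idea can be sketched with the standard TextDecoder in strict mode. This is not how VS Code guesses encodings. `firstDecodableEncoding` is a hypothetical helper, and it assumes a runtime whose TextDecoder knows the listed labels (for example Node.js built with full ICU).

```typescript
// Return the first encoding in the list that decodes the bytes without errors.
function firstDecodableEncoding(
  bytes: Uint8Array,
  candidates: string[],
): string | undefined {
  for (const label of candidates) {
    try {
      // fatal: true makes decode() throw on byte sequences that are invalid for this encoding.
      new TextDecoder(label, { fatal: true }).decode(bytes);
      return label;
    } catch {
      // Invalid bytes for this encoding (or an unsupported label): try the next one.
    }
  }
  return undefined;
}

// UTF-8 bytes decode cleanly with the first candidate.
const utf8Bytes = new TextEncoder().encode("你好");
console.log(firstDecodableEncoding(utf8Bytes, ["utf-8", "gb18030"])); // "utf-8"

// The same two characters in GBK/GB18030 are not valid UTF-8, so the next candidate is tried.
const gbBytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3]);
console.log(firstDecodableEncoding(gbBytes, ["utf-8", "gb18030"]));
```

The order matters in this scheme: an encoding that accepts almost any byte sequence will match first if it is listed first, so permissive encodings are usually better placed late in the list.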

Overall, we may

  1. Make sure of the needs (I recommend the vim implementation)
  2. Have somebody implement it
  3. Merge it into a release

Just a rough suggestion; forgive me if it is wrong or a bother. Thanks.

I just wanna say, the general issue here is that VSCode guesses encodings that are - from a human perspective - unlikely to appear in the user's environment.

I like @memeda's approach with the ordered list, that way you can specify what's most likely and VSCode takes that into account when guessing. It's just teaching the tool what's common sense to the user.

Think like humans would interact:

  • "Hey Josh, what's the encoding of the project?"
  • "It's mixed, most likely X but some files are in Y or Z"

That's IMHO the smartest and most user-friendly way.

I am also patiently waiting for this feature, at work we only use Windows-1252 and UTF-8, but VS Code keeps guessing Greek or maccyrillic or whatever.

Please click up-vote to this thread. This will increase the chance that someone will take care of it

Why is this still open? It's so annoying. The solution from 2.5 years ago would have been great...

@bpasero
We need this https://github.com/microsoft/vscode/issues/36951#issuecomment-600964911

I am currently not able to catch up on this, but if someone can come up with a reasonable PR that includes the outcome of the discussions we had, then I can try to review it, time permitting.

It is issue grooming month and I am looking into this issue to understand the latest thinking. There are different proposals here, but I think my initial attempt showed that something like Vim's fileencodings config will not work, because of this case:

Let a user configure fileencodings: "gbk", "utf8". Let the user open a gbk file that jschardet wrongly detects as something else. Now we would use utf8 and not gbk because that other encoding is not in the list and also not wanted.

Bottom line, unless jschardet changes to a different model or we switch to another encoding guessing library, I do not really see how VSCode can solve this?

PS: I would like to merge https://github.com/microsoft/vscode/issues/84503 and this issue into one as I think they are very similar.

I think the ideal way is for the chardet lib itself to guess within a certain range of encodings.
Otherwise, if the lib can return a list of guesses with confidence values, filter it by the user's "fileencodings" setting.
When the lib can only return one result and it is not in "fileencodings", which seems to be the current case, then without changing the lib maybe show a notice saying the guess failed? It doesn't really solve the problem, but it's better than now.
