Notepad3 (64-bit) v5.19.1205.2695 BETA
Windows 10 Enterprise 1809 x64
I have a number of text files containing very few non-ASCII UTF-8 characters, sometimes only one. A sample is attached. On line 22, the word should read "Précis:" (with an accented e).
Using a pristine copy of the above Notepad3 version to open the sample file, Notepad3 detects the encoding as Windows-1258 and the accented character gets messed up:
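For illustration, the mis-decoding can be reproduced outside Notepad3. This is a minimal Python sketch, assuming the sample stores "é" as the UTF-8 byte pair C3 A9; only the word itself is taken from the report, the variable names are illustrative.

```python
# Illustrative only: decode the UTF-8 bytes of "Précis:" with the wrong code page.
word = "Précis:"
raw = word.encode("utf-8")          # 'é' becomes the two bytes C3 A9

as_utf8 = raw.decode("utf-8")       # correct interpretation
as_cp1258 = raw.decode("cp1258")    # what a Windows-1258 guess produces

print(raw.hex(" "))                 # hex dump of the stored bytes
print(as_utf8)
print(as_cp1258)                    # the single 'é' turns into two unrelated characters
```

Because Windows-1258 is a single-byte code page, each of the two UTF-8 bytes is shown as its own character, which is why one accented letter turns into two wrong ones.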

I went back to older Notepad3 versions:
UCHARDET='WINDOWS-1258' Conf=0% || CED='Unicode (UTF-8)' (reliable)
UCHARDET='WINDOWS-1258' Conf=72% || CED='Unicode (UTF-8)' (reliable)
UCHARDET='WINDOWS-1258' Conf=72% || CED='Unicode (UTF-8)' (reliable)

All other subsequent Notepad3 builds I tested detected the encoding incorrectly. I also tested pristine versions of the latest builds of Notepad2 and Notepad2-mod; they detected the encoding correctly.
NB: all testing was done with pristine copies of each version, with no file encoding history.
I'm hoping this is not too much of an edge case, and that including the sample can improve Notepad3's encoding detection accuracy.
Hello @craigo-
Please be patient: for personal reasons, @RaiKoHoff is taking a break. 😞😞
PS: this seems to be related to the issue reported in #1831 👍
All good.
I did read #1831 before posting, but thought it sufficiently different to open a separate issue. But if the root cause is the same, happy to close this issue and roll my findings into #1831, if you prefer?
No, please don't; your complete analysis is valuable too! 💯
Here is a short list of text editors that open the file with correct UTF-8 detection: 😃
Hello @RaiKoHoff ,
I would like to add to this issue a well-known text file, build_np3portableapp.cmd, encoded in UTF-8 with ONLY ONE non-ASCII character ("delims=¶") on line 33 of this shortened batch file.
In attachment: build_np3portableapp.zip
Too few characters for reliable detection ...
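To make the "too few characters" point concrete, here is a small Python sketch that counts how many bytes in a buffer fall outside 7-bit ASCII. The sample line is a hypothetical stand-in for line 33 of the batch file, not its actual content; only the `delims=¶` fragment comes from the report, and `non_ascii_bytes` is an illustrative helper name.

```python
def non_ascii_bytes(data: bytes) -> int:
    """Count bytes with the high bit set, i.e. outside 7-bit ASCII."""
    return sum(1 for b in data if b >= 0x80)

# Hypothetical stand-in line; only 'delims=¶' is taken from the report.
sample = 'for /f "delims=¶" %%a in (file.txt) do echo %%a\n'.encode("utf-8")

count = non_ascii_bytes(sample)
print(f"{count} non-ASCII byte(s) out of {len(sample)}")
```

A statistical detector has only those few bytes as evidence, so any guess it makes about the code page rests on very little data.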
Hello @RaiKoHoff ,
I understand your plausible explanation, but why can all those text editors (https://github.com/rizonesoft/Notepad3/issues/1848#issuecomment-569692566) do it?
Hi @hpwamr : maybe they focus on UTF-8 and don't have an advanced encoding detection.
Just set [Settings2] AnalyzeReliableConfidenceLevel=80
which means that only detection results with a confidence of 80% or higher are used,
and:

to tune Notepad3 more toward UTF-8 detection results (most files will then be opened as UTF-8 if they contain valid UTF-8).
If your experience with this is good, we can make these the default settings ...
I will try "Use UTF-8 as fallback on detection failure" for a while
and experiment with AnalyzeReliableConfidenceLevel.
First test with the current default "AnalyzeReliableConfidenceLevel=50"
New experimental test with:
UseDefaultForFileEncoding=true
SaveRecentFiles=false
DevDebugMode=1
AnalyzeReliableConfidenceLevel=66
With #1147 as a reference, I'm trying to understand Notepad3's encoding detection process. Can you please confirm my thinking regarding the scenario below?

With the following default settings:
...and the following non-default settings:
Is this what should happen?
Therefore, shouldn't Notepad3 use the OS codepage ANSI (CP-1252) and open the file using that encoding? Because it looks like it is using UCD's detection regardless of it being classified as unreliable. (Not that it would help in this instance, since the file is not CP-1252 either...)
Edit: the above is done with Notepad3 (64-bit) v5.20.114.2707 BETA. If I do the same using the earlier Notepad3 (64-bit) v5.20.113.2703 BETA, I get this:

In this instance:
In any case, I'm not sure how Notepad3 has opened as UTF-8 instead, and therefore displays the accented e correctly!
Hello @craigo- ,
Are you sure about the test version number, Notepad3 (64-bit) v5.20.113.2703 BETA? I cannot reproduce your second picture, in which the "é" is displayed correctly. 😕
@craigo- : First of all, the Analyze debug message in the title bar (reliability) has since been fixed.
Your description is absolutely correct for former versions of Notepad3. In the meantime, we decided to "modernize" Notepad3 to internally prefer UTF-8 over the local system code page:
If the analysis result is not reliable, check whether UTF-8 encoding would be valid for this file. If it would be valid, prefer UTF-8 over the local ANSI code page.
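The fallback rule can be sketched in Python as follows. This is not Notepad3's code: `choose_encoding`, its parameters, and `cp1252` as a stand-in for the local ANSI code page are hypothetical names used for illustration only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def choose_encoding(data: bytes, detected: str, reliable: bool,
                    ansi_code_page: str = "cp1252") -> str:
    """Hypothetical sketch: trust a reliable detection; otherwise prefer
    UTF-8 when the bytes are valid UTF-8, else use the ANSI code page."""
    if reliable:
        return detected
    if is_valid_utf8(data):
        return "utf-8"
    return ansi_code_page

utf8_sample = "Précis:".encode("utf-8")
ansi_sample = "Précis:".encode("cp1252")   # a lone 0xE9 byte is not valid UTF-8

print(choose_encoding(utf8_sample, "cp1258", reliable=False))
print(choose_encoding(ansi_sample, "cp1258", reliable=False))
```

The validity check is cheap and has few false positives on real text, because byte sequences from single-byte code pages rarely happen to form well-formed UTF-8.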
Discussion: would it be better to use the local system's ANSI code page?
Compromise: use UTF-8 over ANSI if valid _and_ 'Open 7-bit ASCII files in UTF-8 mode' is checked?
My opinion is: "UTF-8" is really the way of the future, just look at Linux and all XML documents and also all modern websites that have adopted it unconditionally.
This seems essential to me in the case of mixed multilingual documents (European, Cyrillic and Asian).
After thinking about it a little, I think having an option for this would be a good idea, so:
Using the Open 7-bit ASCII files in UTF-8 mode option expresses the wish to lean toward UTF-8 rather than the ANSI CP (pure 7-bit ASCII files can be encoded as UTF-8 or ANSI, since both share the same representation of the 128 ASCII characters), so I will use this switch to select UTF-8 over the local ANSI CP as the detection fallback.
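The point about pure ASCII files can be checked with a short Python sketch; `cp1252` is used here only as an assumed example of a local ANSI code page.

```python
ascii_text = "".join(chr(c) for c in range(128))  # all 128 ASCII code points

as_utf8 = ascii_text.encode("utf-8")
as_ansi = ascii_text.encode("cp1252")   # cp1252 stands in for the local ANSI code page

print(as_utf8 == as_ansi)               # identical bytes for pure ASCII content

# A single non-ASCII character is enough to make the two encodings diverge.
print("é".encode("utf-8"), "é".encode("cp1252"))
```

For a pure ASCII file, the choice between UTF-8 and ANSI therefore only matters once the user types a non-ASCII character and saves, which is why the option reads as a statement of preference.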
Are you sure of the test version number Notepad3 (64-bit) v5.20.113.2703 BETA, because I can not reproduce your second picture with the correct displaying of the "é". 😕
Yes, I'm sure. I have just repeated the test with a new pristine copy of build 2703 and the results are the same.
Note that I did change the [Settings2] "AnalyzeReliableConfidenceLevel" to 80. Leaving it at the default (50?) results in the file being opened with the incorrect Windows-1258 encoding:

(Note that the debug info in build 2703 incorrectly classifies the UCD result as reliable; this appears to be fixed in build 2707.)
Another possibility is your Recent (History) file list: is there a persistent entry there? Cf. #1884.
@craigo- : First of all: The Analyze Debug Msg in the titlebar (reliability) has been fixed meanwhile.
Thanks for that, it was a bit confusing!
Discussion: Better use local system's ANSI Code-Page?
Well, it wouldn't have helped in this case :smile:
Compromise: Use UTF-8 over ANSI, if valid _and_ 'Open 7-bit ASCII files in UTF-8 mode' is checked?
It sounds like this would get to the correct result.
My opinion is: "UTF-8" is really the way of the future, just look at Linux and all XML documents and also all modern websites that have adopted it unconditionally.
This seems essential to me in the case of mixed multilingual documents (European, Cyrillic and Asian).
I agree. I have been changing Notepad{2|2-mod|3} to default to UTF-8 encoding for years; I was happy to see it become Notepad3's default some time ago (one less configuration change for me).
Using the Open 7-bit ASCII files in UTF-8 mode option expresses the wish to go more to UTF-8 than to ANSI CP
As long as we have:
I'll be happy. Seems like the UI could stand a refresh?
Note that I did change the [Settings2] "AnalyzeReliableConfidenceLevel" to 80. Leaving it at the default (50?) results in the file being opened with the incorrect Windows-1258 encoding:
Hello @craigo- , --> "AnalyzeReliableConfidenceLevel=80"
Now I understand why I could not reproduce it with Build_2703. 😉
With "AnalyzeReliableConfidenceLevel=80", a UCHARDET detection of "Conf=72%" is NOT accepted, and Notepad3 falls back to its "Default encoding (new file):" setting (UTF-8 in a pristine installation). 😃
For info: since Build_2707, the default confidence level is "AnalyzeReliableConfidenceLevel=66".
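As a rough Python sketch of the threshold behaviour described here (not Notepad3's implementation; `accept_detection` and its parameters are hypothetical names, and the use of `>=` for the comparison is an assumption):

```python
def accept_detection(detected: str, confidence: int, threshold: int,
                     default_encoding: str = "utf-8") -> str:
    """Hypothetical sketch: keep the detector's guess only if its
    confidence (in percent) reaches the configured threshold.
    Whether the real comparison is >= or > is an assumption."""
    if confidence >= threshold:
        return detected
    return default_encoding

# UCHARDET reported WINDOWS-1258 with Conf=72% for the sample file.
for threshold in (50, 66, 80):
    print(threshold, accept_detection("WINDOWS-1258", 72, threshold))
```

Under these assumptions, thresholds of 50 and 66 both let the 72% WINDOWS-1258 guess through, and only 80 rejects it in favour of the default encoding.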
Hello @craigo- ,
With Build_2709, the new default confidence level is "AnalyzeReliableConfidenceLevel=70".
Feel free to test the BETA version "Notepad3Portable_5.20.116.2709_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2709_BETA.paf.exe.7z.
Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").
Your comments and suggestions are always welcome... 😃
Build 2709 out-of-box encoding detection:

Open 7-bit ASCII files in UTF-8 mode ON (2nd tickbox) now results in the file encoding being assumed to be UTF-8

Not having read the code, is the above ordering how Notepad3's encoding detection works in practice? If it is (or even if it is not), I'm wondering if the ordering of the encoding options could better reflect this, i.e. the tickboxes are 'processed in order'. This is what I was meaning above regarding a UI refresh. For example:
With "AnalyzeReliableConfidenceLevel=80", a UCHARDET detection of "Conf=72%" is NOT accepted and Notepad3 returns to its "Encoding Default" (which in pristine = UTF-8). 😃
Is this right? (I think this refers to my Step 6, 2nd tickbox.) Does it then use the Default Encoding (whatever that is set to), or does it _always_ use UTF-8? Or is this describing what happens in Step 3 (1st tickbox) if the setting is enabled?
My understanding is that the "Default encoding" is determined by "Default encoding (new file):"
As a test, please change "Default encoding (new file):" from UTF-8 to ANSI (CP-1252); a detection failure will then fall back to Windows-1252 instead of UTF-8!
Ah, but that's cheating... Setting Default Encoding to "ANSI (CP-1252)" then sets _and enforces_ "Use as fallback on detection failure". Not sure why this is...
A better test is to set Default Encoding to something like "Western European (DOS-850)":

The final encoding choice is UTF-8.
The behavior should be
(if Perform ANSI Code Page detection and Use reliable detection results only is active):
Use reliable det..." is deactivated):
Use as fallback on detection failure is active:
Default Encoding (new file) setting!
Open 7-bit ASCII files in UTF-8 mode is active:
(Maybe Use reliable detection results only should be correlated (sub point) with Perform ANSI Code Page detection and be ghosted, if detection is deactivated).
Hello @craigo- ,
Feel free to test the BETA version "Notepad3Portable_5.20.131.2720_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.131.2720_BETA.paf.exe.7z.
Your comments and suggestions are always welcome... 😃
Notepad3 (64-bit) v5.20.131.2720 BETA & v5.20.202.2721 BETA:
The modified UCHARDET threshold results in the correct encoding type being used for this file. Thanks.
NB: A very small change request... In the case of an unreliable UCD result, could you please add a space between the word 'reliable' and the open bracket - to match the display for a reliable result?


Hello @RaiKoHoff , the mini correction is done. 😉
Thanks, @hpwamr.
We're done here 😄
Hello @craigo- ,
Feel free to test the BETA version "Notepad3Portable_5.20.204.2722_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.204.2722_BETA.paf.exe.7z.
Your comments and suggestions are welcome... 😃