UTF-8 detection is broken since commit: 63b09def6ab3b6671311bac16438447113202e56
+ enh: optimized UTF-8 enoding detection
My UTF-8 files are detected as ANSI (CP-1252).
Reverting to commit 48e8e2f3e90851bfdac3333bc0ed6815ed046063 fixes the problem.
To reproduce:
[Recent Files] in the .ini config file or through the GUI (File> Recent (History)... > Discard).Related to #614.
Please provide some example files for testing (preferred .zip package).
I'm attaching a test UTF-8 encoded file (utf-8.txt). I'm also including my Notepad3.ini just in case.
issue#618.zip
If you open utf-8.txt with Notepad3 63b09def6ab3b6671311bac16438447113202e56 or newer, the encoding it's not detected correctly.
Hello @RaiKoHoff ,
I confirm the issue of @omega32 馃
utf-8.txt with Notepad3 (64-bit) v4.18.512.992
utf-8.txt with Notepad3 (64-bit) TinyExpr v4.18.821.1066
I am currently reworking the Unicode, UTF-8 and ANSI-CodePage detection stuff.
Thanks for testing - please test development beta version _TinyExpr_4.18.822.1067.
Short explanation of encoding detection settings:

By default, Notepad3 uses the Window System's ANSI CodePage for file encoding ( e.g. Western European (Windows-1252) - which is Code Page 1252). Which is set at the _top_ of the selection list-box (dependent on Systems regional settings).
I the file has been loaded before and is still recorded in Notepad3's the file history, the encoding of the file history record (recorded in history, when file has been closed) is enforced.
If you open a file (and Skip ANSI Code Page detection is checked ON = default) this Systems's ANSI Code Page is used to encode the file - except it is Unicode UTF-8/16.
In case of Unicode files, the encoding should be detected correctly (except Skip UNICODE detection is checked).
In case of only 7-bit characters in file, the file can be encoded as ASCII/ANSI CodePage or UTF-8, the displayed result will be the same.
Use Open 7-bit ASCII files in UTF-8 mode switch to force UTF-8 instead of default ANSI CodePage in this case.
If you have a file with unknown encoding, you can try to use the ANSI Code Page detection (remove the Skip ANSI Code Page detection) to identify the right encoding.
Default Encoding (new file): _obvious_ Use as fallback on detection failure:Skip ANSI Code Page detection : _see explanation above_Skip UNICODE detection : IMO there are only very rare cases, where this option is needed.Open 7-bit ASCII files in UTF-8 mode : _see explanation above_Open 8-bit *.nfo/diz files in DOS-437 mode : -> ANSI ArtDon't parse encoding file tags : this flag wil skip the encoding enforcement by the tag in the file.// encoding:windows-1257 as a first line of a file@hpwamr : ... worth a documentation entry ?
commit e0a5713605a66f1be69df058a9893724b0e28a7b (build 1067) fixes the problem. 馃憤
tested with latest master commit as well and UTF-8 encoding detection seems to be working again. thanks
Hello @omega32, @RaiKoHoff,
Tested with: Notepad3 (64-bit) TinyExpr v4.18.822.1067
Issue corrected. 馃憤 鉂わ笍
As far as I am concerned, this issue may be closed....
I would like to test the changed detection algorithm against some more UTF-8 files (w/o Signature aka BOM). Do you know a download source in the web?
Ed.: Found some kind of artificial test documents https://github.com/bits/UTF-8-Unicode-Test-Documents. Tests of printable files are satisfying to me, real life files would be better ... 馃槈
Hello @RaiKoHoff,
In attachment 2 small files in UTF-8 w/o signature (and to easily check: "FRHED" a little hex editor) 馃槃
Updated: TXT_utf-8_W/O_signature.zip
Hello,
I may have found an UTF-8 file that is not detected properly [utf8-detection-bug-02.zip]. It was detected properly in commits prior to 63b09def6ab3b6671311bac16438447113202e56. But not in the most recent version.
There may be a repository out-there with example UTF-8 files with edge cases to use as a base to automate some testing... but I haven't found any.
@omega32 : Please test development beta version _TinyExpr_4.18.923.1117 (PR https://github.com/rizonesoft/Notepad3/pull/698)
I relaxed the conditions for being UTF-8 (CED Analyzer: allow less reliable UTF-8 detection results)
Hello @omega32, @RaiKoHoff,
Tested with: Notepad3 (64-bit) TinyExpr v4.18.923.1117
The issue with [utf8-detection-bug-02.zip] seems corrected. 馃憤
As far as I am concerned, this issue may be closed....
Thanks. Detection seems to be working better now. It's correctly detecting both my ANSI and UTF-8 files so far. I'll be testing it and reporting back if I find any issues.
@omega32 : thank you for testing - your help is highly appreciated.
Most helpful comment
@omega32 : Please test development beta version _TinyExpr_4.18.923.1117 (PR https://github.com/rizonesoft/Notepad3/pull/698)
I relaxed the conditions for being UTF-8 (CED Analyzer: allow less reliable UTF-8 detection results)