Notepad3: German umlaute often an issue

Created on 9 Mar 2020 · 21 Comments · Source: rizonesoft/Notepad3

Sample 1: https://s19.directupload.net/images/200309/6lot9a54.png

Sample 2: https://s19.directupload.net/images/200309/h6gatgos.png
Instead of the German umlauts, the program shows the wrong characters.

This happens only with some text documents, not with all of them. I've tested different versions of the program.
Instead of a German ä, it shows xE4. With UTF-8 it seems OK, but with ANSI or ISO-8859-1, which should also work, the program shows only a ? instead of the German umlauts.

Edited: corrected the URLs so that the pictures are shown

configuration encoding detection

All 21 comments

@hpwamr *lol* really...? He should change the encoding! Or the program should search for a keyword such as "chcp %codingnumber%" and automatically recode to the right code page number, as I suggested in the past. That would give the right umlauts in German, as in "für, über and Brücke" ("for, over and bridge" in English), and also the special characters in a batch file, such as ASCII symbols and so on.

@PDirk ... Tell me, Dirk, have you considered that a text file
can be in different kinds of encoding, such as UTF-8 or UTF-16? In your case it should be:
=> File => Encoding => Default setting => pull-down menu = Central European (Windows-1250)

850 or 1250, so that the umlauts are displayed.

If you set it that way, then always apply it via Ctrl+Shift+F (recode as default) and save, your umlauts should be displayed properly.

It is different if you worked that way but the encoding is not stored in the text file and the file is then opened with a different encoding, so the umlauts are not displayed correctly. That is the problem I have too. That is why I said a while ago that the program should perhaps look for chcp and the encoding number,
such as "chcp 850" or, in your case, "chcp 1250". You should be able to write it in
any line, where it is found, and the text document is then opened in exactly this encoding. This helps with text files and with cmd/batch files that need exactly the encoding they were written for (for encodings, see Wikipedia; all encodings have numbers that can be called with chcp on the command line or at the WinNT prompt).
Hence my basic need to have something like this in cmd/batch files: to be able to type "chcp %codingnumber%" somewhere, so that when the text document is opened it is decoded exactly as stated somewhere in the file.

But let's stick to English; I just wanted to tell you this on the side.
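
The `chcp` idea described above could look roughly like the following. This is only a sketch of the suggestion, not an existing Notepad3 feature; the function name `find_chcp_codec` and the 4 KB scan limit are assumptions made for illustration.

```python
import codecs
import re

# Matches a "chcp <number>" directive as it would appear in a batch file.
CHCP_PATTERN = re.compile(rb"\bchcp\s+(\d{3,5})\b", re.IGNORECASE)


def find_chcp_codec(raw: bytes, scan_limit: int = 4096):
    """Return a Python codec name for the first 'chcp NNN' found, else None."""
    match = CHCP_PATTERN.search(raw[:scan_limit])
    if match is None:
        return None
    number = int(match.group(1))
    # Code page 65001 is UTF-8; other numbers are tried as 'cpNNN'.
    name = "utf-8" if number == 65001 else f"cp{number}"
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None  # unknown code page: leave the decision to normal detection


# A batch file saved in code page 850 that announces its own encoding.
batch = "@echo off\r\nchcp 850\r\necho für, über, Brücke\r\n".encode("cp850")
codec = find_chcp_codec(batch)
decoded = batch.decode(codec) if codec else None
print(codec)
```

The directive itself is pure ASCII, so it can be found before the encoding of the rest of the file is known.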

( @RaiKoHoff )

Oops, @ghost is gone ...

@hpwamr : We like UTF-8 everywhere, but we may have a lot of users who are not familiar with file encodings and still use their system's code page (ANSI blabla) for their text files. Maybe we should switch off "Open 7-bit ASCII files in UTF-8 mode." by default.
This means that if the encoding detector is not sure, it prefers the local ANSI code page over UTF-8.
(This is the good old (12-year-old) Notepad2 behavior.)

image

@RaiKoHoff,
Being a pioneer is not easy!
Maybe the general public is not yet ready for UTF-8?
OK, let's try your proposal as the new default and see the reactions!

I will be sure to keep an eye on this, being one of the users who has fully transitioned to UTF-8! In particular, my recent files with a small number of UTF8-encoded accented characters. I'm hoping this change won't affect their encoding selection.

Question: wasn't that setting overloaded to do something else at some recent point?

In particular, my recent files with a small number of UTF8-encoded accented characters. I'm hoping this change won't affect their encoding selection.

It will be an interesting check.

Feel free to test the RC version "Notepad3Portable_5.20.309.4_RC3.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.309.4_RC3.paf.exe.7z.7z

Note: "Notepad3Portable RC" can be used in "2 flavors", see with or without extension ".7z".

Your comments and suggestions are welcome...

OK, let's try your proposal as the new default and see the reactions!

Hello @RaiKoHoff , @craigo-

NOT sure it's a good idea!

  • Remember Recent Files --> "Unchecked"
  • Et le développement continue! --> save the file as "UTF-8" and exit
  • Et le dÃ©veloppement continue! --> reopened the file (it is now in ANSI (CP-1252))!!!
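
The garbled text in the last step is what appears when UTF-8 bytes are interpreted as CP-1252. A minimal Python sketch of the effect (an illustration only, not Notepad3 code):

```python
original = "Et le développement continue!"

# Bytes as written to disk when the file is saved as UTF-8.
utf8_bytes = original.encode("utf-8")

# What an editor shows if it wrongly interprets those bytes as CP-1252:
# the two-byte sequence for "é" turns into two separate characters.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Nothing is lost on disk: decoding with the right encoding restores the text.
restored = misread.encode("cp1252").decode("utf-8")
print(restored)
```

The file content is unchanged in both cases; only the interpretation chosen when opening it differs.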

Tiny examples with only a few distinguishing characters are not good tests.

Take your example and save it as ANSI CP-1252. Reopen - et voilà - it says Windows CP-1250.
BUT: It is still correct (ANSI CP1250 and CP-1252 have the same encoding fo

image

With this option OFF, this example works perfectly for both UTF-8 and ANSI CP-1252:

image
image

Even this is not German (some German "Umlauts")

The rule of thumb is: if users have more UTF-8 files, they should have this option checked (ON);
if they use more local ANSI code page files, they should have this option unchecked (OFF).
The question for the default setting is: which users are the majority?

@craigo- :

Question: wasn't that setting overloaded to do something else at some recent point?

Yes, the core meaning is for pure ASCII files: they can be interpreted as ANSI or UTF-8 (no signature), and it does not matter which, because ANSI and UTF-8 use the same encoding for the first 128 characters (7-bit).
This option was meant to set the user's preference (ANSI or UTF-8).
The double meaning is: if the user prefers one over the other, then when the encoding analyzer is not sure whether to choose ANSI or UTF-8, chances are good that this option points in the right direction.

@ghost i hope it helps now :)

With this option OFF, this example works perfectly for both UTF-8 and ANSI CP-1252:

Hello @RaiKoHoff ,

Maybe it works because your "AnalyzeReliableConfidenceLevel=92%" was changed to 66%?

https://en.wikipedia.org/wiki/Code_page
@hpwamr
Isn't UTF-8 8-bit?
code page 65000 = UTF-7 Unicode
code page 65001 = UTF-8 Unicode
and US-ASCII is 7-bit:
cp 20127 = US-ASCII (7-bit)

best regards
Blacky
https://de.wikipedia.org/wiki/Zeichensatztabelle

@blackcrack : UTF-8 is a multi-byte character set (MBCS) that can encode Unicode characters, using one to four bytes per character.

image
Source: https://de.wikipedia.org/wiki/UTF-8#Zul%C3%A4ssige_Bytes_und_ihre_Bedeutung

The English Wikipedia pages don't have the same nice graphic ...
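
As a small illustration of the multi-byte point (a sketch for this thread, not taken from Notepad3), the following Python snippet prints how many bytes UTF-8 needs for a few characters and compares "ä" in ISO-8859-1 and UTF-8:

```python
samples = ["A", "ä", "ß", "€", "😀"]

# Number of bytes each character occupies when encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, length in utf8_lengths.items():
    print(f"{ch!r}: {length} byte(s) in UTF-8")

# The same umlaut is one byte (0xE4) in ISO-8859-1 / CP-1252,
# but a two-byte sequence (0xC3 0xA4) in UTF-8.
latin1_a_umlaut = "ä".encode("iso-8859-1")
utf8_a_umlaut = "ä".encode("utf-8")
print(latin1_a_umlaut.hex(), utf8_a_umlaut.hex())

# Pure 7-bit ASCII text is byte-for-byte identical in both encodings.
ascii_identical = "Notepad3".encode("ascii") == "Notepad3".encode("utf-8")
print(ascii_identical)
```

This is why a pure ASCII file gives a detector nothing to decide on: the bytes are the same either way.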

@hpwamr : Let us analyze the following case with a confidence level of 92%:

image

The encoding detection (analysis) is not sure (for CP-1252: 80% < 92%), so its answer is CP-1252 but not reliable. The fallback is then used:

  • if Open 7-bit ASCII files in UTF-8 mode. is checked: it chooses UTF-8, which is WRONG.
  • if Open 7-bit ASCII files in UTF-8 mode. is _NOT_ checked: it chooses my OS code page, which is CP-1252, which is CORRECT.

So what to do?
The best approach is the rule of thumb from the comment above.
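
The fallback described in this comment can be sketched in Python as follows. This only models the behavior as explained here (with the option's double meaning still in place); the function and parameter names are hypothetical and are not Notepad3's actual code.

```python
def choose_encoding(detected, confidence, reliable_level,
                    ascii_as_utf8, locale_codepage):
    """Pick an encoding from a detector guess plus the fallback rule.

    detected        -- encoding guessed by the detector (e.g. "CP-1252")
    confidence      -- detector confidence in percent
    reliable_level  -- threshold above which the guess is trusted
    ascii_as_utf8   -- state of "Open 7-bit ASCII files in UTF-8 mode"
    locale_codepage -- the system's locale ANSI code page
    """
    if confidence >= reliable_level:
        return detected  # reliable guess: use it as is
    # Unreliable guess: the option decides the direction of the fallback.
    return "UTF-8" if ascii_as_utf8 else locale_codepage


# The case from the comment: CP-1252 detected with 80%, threshold 92%.
option_on = choose_encoding("CP-1252", 80, 92, True, "CP-1252")
option_off = choose_encoding("CP-1252", 80, 92, False, "CP-1252")
print(option_on, option_off)
```

Lowering the threshold to 66%, as mentioned earlier in the thread, would make the 80% guess count as reliable, so the option would no longer matter for this file.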

@hpwamr :
I slightly increased the preference for the locale ANSI code page.
Feel free to test the RC version "Notepad3Portable_5.20.310.1_RC3.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.310.1_RC3.paf.exe.7z.7z

I think a confidence level of 92% is very high; the power of UCHARDET gets lost.
To find a good level:
Leave Open 7-bit ASCII files in UTF-8 mode. ON (the UTF-8 lovers' default) and
create some test files for your locale ANSI code page (CP-1252, I assume) containing different Belgian/French text (é à ..., maybe German ßäüöÄÖÜ, etc.), maybe batch files with French comments, etc., and watch the detected confidence level (the goal being to get the correct CP-1252 instead of UTF-8).
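
Such test files could be generated with a few lines of Python. This is only a sketch: the sample text, the file names and the helper `write_samples` are made up for illustration and are not part of the Notepad3 test suite.

```python
import tempfile
from pathlib import Path

# Batch-style sample with French and German accented characters.
SAMPLE = (
    "REM Ceci est un commentaire: é à è ç\r\n"
    "REM Deutsch: ß ä ü ö Ä Ö Ü\r\n"
)


def write_samples(directory):
    """Write the same text once as CP-1252 and once as UTF-8 (no BOM)."""
    directory = Path(directory)
    paths = {}
    for label, codec in (("cp1252", "cp1252"), ("utf8", "utf-8")):
        path = directory / f"sample_{label}.cmd"
        path.write_bytes(SAMPLE.encode(codec))
        paths[label] = path
    return paths


paths = write_samples(tempfile.mkdtemp())
for label, path in paths.items():
    print(label, path, path.stat().st_size, "bytes")
```

Opening both files in the editor and comparing the reported encoding and confidence shows how the detector handles the same text in each encoding.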

I think a confidence level of 92% is very high; the power of UCHARDET gets lost.

Hello @RaiKoHoff ,
I'm just back at home, I will test v310.1 asap.

In the test, we also have to take into account the 20 UTF-8 test files in Notepad3\test\test_files\encoding\UTF-8\

  • if Open 7-bit ASCII files in UTF-8 mode. is checked, it's OK: all 20/20 files are correctly displayed!
  • if Open 7-bit ASCII files in UTF-8 mode. is NOT checked: only 13/20 files are correct (7 files are NOT correctly displayed, because they are NOT recognized as UTF-8)!

Hello @RaiKoHoff ,
My opinion is that "Open 7-bit ASCII files in UTF-8 mode" must remain checked!

Here are 2 examples with a summary of the lines with special characters.

2020-03-10_181201

2020-03-10_181426

My analysis of one file matches that of @hpwamr...

Notepad3 (64-bit) v5.20.310.1 RC3:

(Note that UCD's encoding guess is incorrect in any case.)

output

Hello @RaiKoHoff , @craigo- ,

This text, Summary special chars in UTF-8.txt, is definitely a UTF-8 text (confirmed by Notepad++, VS Code 2019, EditPadLite, Notepad2-zufuliu, etc.)!!!

2020-03-11_072402

And unchecking "Open 7-bit ASCII files in UTF-8 mode" should NOT display it as an ANSI text!!!

2020-03-11_072514

For the latest betas, I removed the double meaning of Open 7-bit ASCII files in UTF-8 mode.
It now has its original meaning: how to interpret pure (US-)ASCII, either as the system's locale ANSI code page or as UTF-8.
To tune the detection back toward the system's locale ANSI code page (when UCHARDET thinks it is this ANSI CP but is not sure), you can raise the confidence vote with the parameter [Settings2] LocaleAnsiCodePageAnalysisBonus=33.
This parameter gives a bonus on top of UCHARDET's confidence level if the analyzed ANSI CP is the same as the current system's locale ANSI code page (not if they are different):

  • 0 = no bonus
  • 50 = add half of the difference between the confidence and 100% as a bonus
    (example for 50: if the confidence is 72%, then, in case of the same ANSI CP:
    (100%-72%) * (50/100) = 14 on top: 72% -> 86%)
  • 100 = 100% confidence that UCHARDET is right, if it detects the current system's ANSI CP.
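
The bonus arithmetic described above can be written as a small Python function. This is a sketch derived from the description and the worked example only; the function name and signature are assumptions, not Notepad3's implementation.

```python
def apply_locale_bonus(confidence, bonus, same_as_locale):
    """Raise a detector confidence (in percent) by the locale ANSI bonus.

    confidence     -- UCHARDET confidence in percent (0..100)
    bonus          -- value of LocaleAnsiCodePageAnalysisBonus (0..100)
    same_as_locale -- True if the detected ANSI code page equals the
                      system's locale ANSI code page
    """
    if not same_as_locale:
        return float(confidence)  # no bonus for a different code page
    # Add the given share of the remaining distance to 100%.
    return confidence + (100.0 - confidence) * (bonus / 100.0)


print(apply_locale_bonus(72, 0, True))     # no bonus
print(apply_locale_bonus(72, 50, True))    # half of the remaining 28 points
print(apply_locale_bonus(72, 100, True))   # full confidence
print(apply_locale_bonus(72, 50, False))   # different code page: unchanged
```

With the suggested value of 33, a 72% guess for the locale code page rises by about 9 points, to roughly 81%.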

This text, Summary special chars in UTF-8.txt, is definitely a UTF-8 text (confirmed by Notepad++, VS Code 2019, EditPadLite, Notepad2-zufuliu, etc.)!!!

Hello @RaiKoHoff ,
Test with Notepad3 (64-bit) v5.20.313.3 RC3:
Now the UTF-8 file correctly remains UTF-8 when the encoding option "Open 7-bit ASCII files in UTF-8 mode" is unchecked!

Hello @ghost ,
As far as I'm concerned, I think you (the requester) can close this issue...
