Notepad-plus-plus: Unicode characters inconsistently cannot be displayed in Notepad++

Created on 18 Sep 2017 · 53 Comments · Source: notepad-plus-plus/notepad-plus-plus


Description of the Issue


In a Notepad++ document that is encoded as UTF-8 (no BOM), many Unicode characters are not displayed; a hollow square appears in their place. If a displayable Unicode character is added to a line containing undisplayable Unicode characters, the undisplayable ones suddenly appear. Removing the "good" one makes the others revert to hollow squares. A simple example:

☆◬⊗⊠⋆⧆⨂

Paste that line into NP++ and you will see all the characters. Remove the leading star ☆ and the others become squares. Restore the star and the others reappear.

Steps to Reproduce the Issue

  1. Create a UTF-8 (no BOM) text file. (This is the only hard part of the procedure.)
  2. Copy and paste the following string into the UTF-8 (no BOM) Notepad++ document: ☆◬⊗⊠⋆⧆⨂
  3. All of those characters will display properly.
  4. Delete the leading star ☆.
  5. The other characters become hollow squares.
  6. Restore the ☆ and the other characters reappear.
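For step 1, a minimal sketch of producing such a file outside the editor is shown here. It assumes the test line consists of the code points U+2606, U+25EC, U+2297, U+22A0, U+22C6, U+29C6 and U+2A02; the helper names `reproLine` and `writeReproFile` are hypothetical and exist only for this illustration.

```cpp
#include <fstream>
#include <string>

// The test line from this report, spelled out as UTF-8 bytes so that the
// encoding of this source file cannot alter it.
// Assumed code points: U+2606 U+25EC U+2297 U+22A0 U+22C6 U+29C6 U+2A02.
inline std::string reproLine() {
    return "\xE2\x98\x86" "\xE2\x97\xAC" "\xE2\x8A\x97" "\xE2\x8A\xA0"
           "\xE2\x8B\x86" "\xE2\xA7\x86" "\xE2\xA8\x82";
}

// Hypothetical helper: writes the line to `path` in binary mode and
// deliberately does not emit the EF BB BF byte order mark, so the result
// is a "UTF-8 (no BOM)" file.
inline bool writeReproFile(const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out << reproLine() << "\r\n";
    return static_cast<bool>(out);
}
```

Opening the resulting file in Notepad++ and continuing with step 4 should then reproduce the behavior, provided the editor detects the file as UTF-8.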

Expected Behavior


All of the characters should always appear.

Actual Behavior


They only appear if an always-acceptable Unicode character is on the same line. If an always-acceptable Unicode character is in the document but not on the same line, certain Unicode characters, such as, but not limited to, the ones shown above, will not be displayed properly.

Debug Information


Notepad++ v7.5.1 (32-bit)
Build time : Aug 29 2017 - 02:35:41
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : ComparePlugin.dll mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll NppTextFX.dll PluginManager.dll SpellChecker.dll


This occurs with characters from many of the Unicode blocks.

accepted

All 53 comments

I was able to reproduce this as well.

I tested with the Default Style in Style Configurator set to Courier New, Consolas, Arial and Times New Roman. The file was a TXT file and I tested encoding in UTF-8, UTF-8 BOM, UCS-2 BE BOM and UCS-2 LE BOM. All of them showed the same result.

I believe this issue would happen any time you enter a character that is NOT contained in the selected font and then add/remove on the same line a character which IS contained in the selected font.

IMHO seems like something not quite right with the font-substitution routines. This was in a TXT file encoded with

Debug Information

Notepad++ v7.5.9 (64-bit)
Build time : Oct 14 2018 - 15:19:55
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll

I believe this issue would happen any time you enter a character that is NOT contained in the selected font

This may be, but it also appears to occur in other circumstances as well. For example, "⎷" (U+23B7 "RADICAL SYMBOL BOTTOM") is present in Consolas, Courier New, DejaVu Sans Mono, and Lucida Console, but if you put that in a new text file, it won't show up with any of those fonts.

it also appears to occur in other circumstances as well

Cannot confirm this example on my system.

The U+23B7 character is not in my _DejaVu Sans Mono_. There is U+23AE, followed by U+23CE.

The U+23B7 character is not in my _Courier New_ either. There is U+2321, followed by U+2500.

Same for my _Consolas_, U+2321 followed by U+2460; same for my _Lucida Console_, which ends at U+0433.

So this example seems not to poke a hole into the theory, that only characters unavailable in the current font are affected.

The U+23B7 character is not in my DejaVu Sans Mono.

The same. There is something magical about ☆, ⇒ (line 1 in my gif), 🙌, ✓, ☛ (but not ☞): when one of them is present on the same line, it fixes the issue?? But even when it is fixed, you can break it again if
a) you highlight the bracket
b) the fixing character is on the same side of the bracket!

I will paste my funny gif and close my issue #8305 about the ∈ symbol.

notepadcorruption

Duplicate of #442, #671, #675, #813, #870, #1621, #3458, #4056, #4086, #4490, #5513, #8305, #8756, and maybe many more.

So, we need to test Notepad++ 7.6.6, as it was good (?) in #4490. All of this, brackets included, is also described in this comment, except that it wrongly says the fixing character only works before the affected ones; after them also works. https://github.com/notepad-plus-plus/notepad-plus-plus/issues/1621#issuecomment-260655014

And, yes it totally does not happen in MS Gothic. Strange.

It might be, somehow, related to SCI_SETTECHNOLOGY configuration.

font_issue

@ValZapod
but I don't see the issue with the larger autocompletion box.
Maybe this was already fixed with the scintilla version used by npp.

@Ekopalypse

with the larger autocompletion box.

You mean brackets? I think you are not using the Courier New font? It is bad in it, and good in DejaVu Sans Mono. You can try the example from #442

@Ekopalypse,
the SCI_SETTECHNOLOGY approach looks very promising on my system too.

I added an execute(SCI_SETTECHNOLOGY, <n>); call to ScintillaEditView::init to test it.

Technology 0:

Techno-0

Technologies 1, 2 and 3:

Techno-1

Both screenshots show the same file automatically loaded after start, the only thing I did was moving the cursor to the right bracket of the two marked brackets.

I used _Courier New_ here.

The new technologies seem to size the substituted chars better than technology 0. _Edit:_ But it has nothing to do with "fixed font" anymore; the substitutions seem to have quite variable widths.

I found that this symbol ↘ (U+2198) is rendered in Consolas (which does not have this symbol) as a ? in a square, not just a plain square. But apart from that it is all the same.

@ValZapod

You mean brackets? I think you are not using the Courier New font? It is bad in it, and good in DejaVu Sans Mono. You can try the example from #442

No, I mean the screenshot and discussion you linked to
https://trac.wxwidgets.org/attachment/ticket/17804/17804-SetTechnology1.png

There it has been reported that the words in an autocompletion box are bigger using DirectWrite
instead of the default. I'm using the RobotoMono font.

@ValZapod @Uhf7
I don't seem to be able to get the results you get with the Courier New font.
The ∈ is always displayed, so I assume that something else has an additional effect.

Downloaded 7.8.6 x64 and did a retest

font_issue2

The "bracket issues" doesn't seem to happen for me. (??)

@Ekopalypse

I can make the ∈ visible with the _Courier New_ font now too, using technology 0 and some hand-configured font linking, which looks actually a little ill:

Techno-0fs

What I did: There is a registry entry

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink

Under it, there are many multi-string values, named after fonts. There was no value named Courier New. So I created a multi-string value named Courier New and copied the data of the Lucida Sans Unicode value into it. A shot in the dark. Immediately afterwards, nothing improved, but after I rebooted my system, the missing characters became visible. And: if I move the cursor to the famous right bracket, then the small ∈ mutates into a big ∈ instead of an empty frame. So it certainly depends on the font linking setup too. And if I could set up the font linking in such a way that the font-linked ∈ looks the same as the "normal" ∈ (wherever it comes from), then everything would be fine with technology 0.

@Uhf7 Is not set for me.
image

The "bracket issues" doesn't seem to happen for me. (??)

@Ekopalypse The font never actually changed, you need to press on Enable global font. Windows 7?? It is EOL...

@Ekopalypse

then certainly another trick exists. Running out of thoughts here. On my system, the 32 bit version works exactly like the 64 bit version.

A major difference is the system itself: I use
OS Name : Windows 10 Pro (64-bit)
OS Version : 1607
OS Build : 14393.576,

you use Windows 7. I'll try it on my Windows 7 system ...

Ok, under Windows 7, the ∈ works fine, with and without bracket highlighting, but some other characters don't. Using technology 0.

Techno-0-w7

With the 64 bit version of Npp, the same characters are missing.

Using technology 1:

Techno-0-w7-t1

What else could you ask for? Looks perfect to me.

Technology 1, Windows 7, 64 bit version of Npp:

Techno-0-w7-t1-64

A dream, if you ask me. Compared to the current state.

@Uhf7 Now remove ☆ and ⇒. Oops. Can't we somehow force the rendering that is used when you type in ☆?

@ValZapod

The font never actually changed, you need to press on Enable global font

No, it was started with these settings. I switched to global override to show that no other font is defined. If global override is NOT checked, the default setting takes precedence.
Yes, I still use Windows 7, why not? I don't do any "mission critical transaction" with Windows OS anyway.

Can't we somehow force the rendering that is used when you type in ☆?

Npp has no setting for this yet. What you can do is use one of the scripting language plugins,
like PythonScript or LuaScript; even NppExec can be used to set the technology to DirectWrite.

@Uhf7 - hmm :-D what should I say - Windows 10 broke it :-(
Thanks for testing and btw. thank you for your contributing work. Much appreciated.

@ValZapod

I see no negative effect when removing the ☆, if I use technology 1:

Techno-0-w7-t1-64-2

or the ⇒:

Techno-0-w7-t1-64-3

I am certainly starting to get mixed up with my screenshot names here, which is why I don't post any pictures of re-inserting the ☆ and the ⇒ successfully, but for me, with technology 1 everything works fine. Under Windows 7 and under Windows 10, I believe.

My knowledge of fonts, rendering, directwriting... refers to what I have posted.
I can't say for sure if it's a problem in Scintilla or Windows or the font used or ...,
but if using DirectWrite offers a way to either work around or solve this, then I would
vote to add something to the settings that would allow the user to set it.

@Ekopalypse

using a plugin or NppExec to make the characters display correctly is not exactly what I call a solution of the problem. It's a possible work-around, but wouldn't it be nice when the characters are displayed correctly without additional actions?

And, @ValZapod, how long would it take until we have a new Scintilla version? (Perhaps, the Scintilla developers will say: Use technology 1 or higher! That would be interesting)

I would feel better if Npp itself switched the technology to a working one. Maybe it can be included in the configuration somehow, so that there is a safe fallback if the technology switch doesn't work on some systems.

@Uhf7 - 100% correct :-D

https://sourceforge.net/p/scintilla/bugs/1393/ is our bug. But there is this problem about brackets... That is not there.

An issue close to this one is #2287. It is the same problem they describe there, existing since 2016, and it is solved there by setting the technology to DirectWrite with the help of a plug-in.

Thank you for that solution, but this is something for insiders. For a new user, or for a user who just edits files without caring about development, this solution is very hard to find.

So I would fully support what @jefflomax said in #2287:

Notepad++ should support ligatures out of the box, not thru hacks or adding plugins users neither need nor want.

So I will try to push it to master now, with a PR. If we do not do this now, the next people will come along in two years and waste their time testing it again and again and again.

Found an old Windows Vista in my virtual machine park, the following screenshots support the necessity to make the DirectWrite technology feature configurable. That Scintilla can load Direct2D does not mean automatically that this produces better results on old systems.

Vista, Technology 0, Courier New
Techno-0-Vista

Vista, Technology 1, Courier New
Techno-1-Vista-1

Vista, Technology 0, DejaVu Sans Mono
Techno-0-Vista-DejaVu

Vista, Technology 1, DejaVu Sans Mono
Techno-1-Vista-DejaVu

That is just because there was no support for Unicode that far in Vista?

Maybe. But Unicode itself was already there under Vista. What bugs me more is that technology 1 under this Vista seems to wreck "normal" characters near the ∈ character, sometimes.

technology 1 under this Vista seems to wreck "normal" characters nearby the ∈ character, sometimes

Screenshot?

@ValZapod

You wrote 2 days ago

https://sourceforge.net/p/scintilla/bugs/1393/ is our bug.

The Unicode character U+25C6 (◆) displays in Npp with and without DirectWrite technology. Even in Windows 7.

So I cannot verify that this is exactly "our" bug. And it was 2012. And he used Windows XP. And I'm sure there are many effects which can lead to empty frames instead of correct characters. I simply don't believe that it's promising to go to them and ask them to fix exactly this issue now.

Screenshot?

The second screenshot of my Vista screenshots, headlined "Vista, Technology 1, Courier New".
Most "normal" characters in line 3 don't look like _Courier New_ anymore.

So I cannot verify that this is exactly "our" bug. And it was 2012. And he used Windows XP.

Okay, maybe open another issue?? Maybe also lets try @nyamatongwe?

Valerii Zapodovnikov:

Okay, maybe open another issue?? Maybe also lets try @nyamatongwe?

For Scintilla bug #1393, text shaping for East Asian text can be influenced by the locale used so displaying in a Japanese context may differ from a Chinese context. There are other bugs about this like https://sourceforge.net/p/scintilla/bugs/2027/.

Problems with displaying particular symbol characters may be different. They seem to occur when the specified font does not include some characters so Windows tries to use glyphs from backup fonts. Scintilla does not have much control over this.

For GDI (technology 0) you could try experimenting with the font creation setup call in SetLogFont inside win32/PlatWin.cxx. It is possible that the lfQuality and lfCharSet parameters will influence the behaviour.

DirectWrite was originally implemented for Windows Vista but that early version had some problems and DirectWrite has improved over time. Applications could default to using DirectWrite from Windows 7 if there are too many problems with Vista or add an option that users can select. Some people prefer GDI’s less anti-aliased (blurry) text.
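The policy suggested here (honor an explicit user option, otherwise default to DirectWrite from Windows 7 and keep GDI on Vista) could be sketched roughly as follows. This is only an illustration: `WindowsVersion` and `chooseTechnology` are hypothetical names, not part of Scintilla or Notepad++, and the two constants are local copies mirroring Scintilla's SC_TECHNOLOGY_DEFAULT (0) and SC_TECHNOLOGY_DIRECTWRITE (1).

```cpp
// Local copies of two Scintilla technology values
// (SC_TECHNOLOGY_DEFAULT and SC_TECHNOLOGY_DIRECTWRITE in Scintilla.h).
constexpr int kTechnologyDefault = 0;      // GDI
constexpr int kTechnologyDirectWrite = 1;  // DirectWrite

// Hypothetical holder for the OS version; Vista is 6.0, Windows 7 is 6.1.
struct WindowsVersion {
    int major;
    int minor;
};

// Hypothetical helper: returns the value that would be passed to
// SCI_SETTECHNOLOGY. `userChoice` is -1 when the user has not set the
// option; an explicit 0 or 1 always wins, so GDI remains a safe fallback.
inline int chooseTechnology(WindowsVersion os, int userChoice) {
    if (userChoice == kTechnologyDefault || userChoice == kTechnologyDirectWrite) {
        return userChoice;
    }
    const bool win7OrLater = os.major > 6 || (os.major == 6 && os.minor >= 1);
    return win7OrLater ? kTechnologyDirectWrite : kTechnologyDefault;
}
```

With this sketch, an unset option yields GDI on Vista and DirectWrite on Windows 7 or later, while users who prefer GDI's rendering can still force technology 0.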

Neil

@nyamatongwe Wow. Paste ⊗⊠⋆⧆⨂ in your notepad3, it will get broken! Nice, I will open an issue there. P.S. Or it is not yours? https://github.com/rizonesoft/Notepad3/issues/2404

Wow. Paste ⊗⊠⋆⧆⨂ in your notepad3, it will get broken! Nice, I will open an issue there. P.S. Or it is not yours?

Nope, the owner of Notepad3 is "Derick Payne" 😉

Wow, look here @Uhf7 https://github.com/rizonesoft/Notepad3/issues/2404#issuecomment-640456912 this is genius. One can choose the technology to draw with.

@ValZapod - seems you misunderstood most of the thread.
This is what I suggested 13 days ago and what @Uhf7 is working on.

Well, it will be without a UI?

I don't think so, if you check his PR then you will see that he added it to the preference dialog.

Well, 4 variants vs 2 and need to restart to preserve plugin behaviours...
Also maybe try this advice from Sci author?

For GDI (technology 0) you could try experimenting with the font creation setup call in SetLogFont inside win32/PlatWin.cxx. It is possible that the lfQuality and lfCharSet parameters will influence the behaviour.

Too much noise for my taste, I'm out.

@ValZapod I saw the UI already, but had no real opinion about it, because it doesn't belong to this project, so it does not help me here. My opinion regarding the technology settings in the screenshot: two options too many. The difference in text rendering is between the Windows GDI TextOut function on one side and the DirectWrite equivalent on the other side. The rest is about how to bring the rendering result of DirectWrite to the screen.

Somebody just proposed a patch for this! https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8756#issuecomment-679320347

Change

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/84430809df2f6607a3bc5b9f05866149842f2bd9/scintilla/win32/PlatWin.cxx#L407

to

        auto TLen = text.length();
        if(TLen>1)TLen++;
        if (0) { //unicodeMode
            tlen = static_cast<int>(UTF16FromUTF8(text, buffer, TLen));
        } else {
            tlen = ::MultiByteToWideChar(codePage, 0, text.data(), static_cast<int>(TLen),
                buffer, static_cast<int>(TLen));
        }

Valerii Zapodovnikov:

Somebody just proposed a patch for this! #8756 (comment) https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8756#issuecomment-679320347

We already know that changing the text changes the presentation of this bug. The patch is not a fix.

Neil

This "fix" is at least a hint where the problem comes from: It comes directly from the Windows GDI text output functions for wide characters. I did some experiments based on this information.

The Windows GDI functions, which are used by Scintilla and which do not work correctly, are:

  • ExtTextOutW
  • GetTextExtentPoint32W
  • GetTextExtentExPointW

The common error of these functions seems to be that they use squares instead of characters for some 'bad' Unicode characters, as long as there is no 'good' Unicode character in the text string.

I have no list of 'good' or 'bad' Unicode characters; these are only terms I invented here. But I can name two 'good' Unicode characters: 0x0000 and 0x200B. If one of those two characters is in the text, all other Unicode characters are displayed correctly. The 0x0000 character has been used by @KnIfER for the "fix". Unfortunately, it has a width when we use it with the Windows GDI functions.

So I went for the 0x200B character (zero width space) in my experiments. A possible fix is to silently append the 0x200B character to all text strings passed to the Windows functions mentioned above. Then they produce the correct character widths and the correct output.

To make this experiment fly without additional text copy operations, I modified the TextWide class in a sneaky way. The VarBuffer is now one character longer than the actual text and this additional character is the Zero width space. tlen remains as it is, to avoid any behavior modifications.

class TextWide : public VarBuffer<wchar_t, stackBufferLength> {
public:
    int tlen;   // Using int instead of size_t as most Win32 APIs take int.
    TextWide(std::string_view text, bool unicodeMode, int codePage=0) :
        VarBuffer<wchar_t, stackBufferLength>(text.length() + 1) {
        if (unicodeMode) {
            tlen = static_cast<int>(UTF16FromUTF8(text, buffer, text.length()));
        } else {
            // Support Asian string display in 9x English
            tlen = ::MultiByteToWideChar(codePage, 0, text.data(), static_cast<int>(text.length()),
                buffer, static_cast<int>(text.length()));
        }
        buffer[tlen] = 0x200b;  // zero width space stored just past the visible text; tlen does not count it
    }
};

After modifying the TextWide class this way, I can use tlen+1 as character count for the ExtTextOutW call and for all GetTextExtentPoint32W calls, to smuggle in the 'good' Unicode character.
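The buffer layout described above can be illustrated independently of the Windows API with a small portable sketch. It is not the Scintilla code: `PaddedText`, `padWithZeroWidthSpace` and `gdiCount` are hypothetical names, and the input is assumed to be UTF-16 already.

```cpp
#include <string>

// Hypothetical stand-in for the modified TextWide: the buffer holds the
// visible text plus one trailing U+200B, while tlen still reports only
// the visible length so that existing callers behave as before.
struct PaddedText {
    std::u16string buffer;
    int tlen;
};

inline PaddedText padWithZeroWidthSpace(const std::u16string& text) {
    PaddedText p{text, static_cast<int>(text.size())};
    p.buffer.push_back(u'\u200B');  // the 'good' character, zero width space
    return p;
}

// Character count a caller would hand to the GDI text functions in order
// to include the trailing zero width space.
inline int gdiCount(const PaddedText& p) {
    return p.tlen + 1;
}
```

For a two-character input, `tlen` stays 2 while the buffer holds three code units, and `gdiCount` returns 3.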

What remains here is the GetTextExtentExPointW call in SurfaceGDI::MeasureWidths. Here, I had to increase the size of the poses buffer, and I had to set the result parameter fit to the actual length of the text. This can be done without side effects, because the maxWidthMeasure parameter is equal to INT_MAX, so I assume that all characters always fit into this width.

    const TextWide tbuf(text, unicodeMode, codePage);
    TextPositionsI poses(tbuf.tlen + 1);
    if (!::GetTextExtentExPointW(hdc, tbuf.buffer, tbuf.tlen + 1, maxWidthMeasure, &fit, poses.buffer, &sz)) {
        // Failure
        return;
    }
    fit = tbuf.tlen;

This experimental fix runs on my system without assertions in debug mode and displays the correct characters using the Windows GDI functions.

I don't know whether such a solution would be accepted by Scintilla, but perhaps there is someone who wants to try it this way too ...

squares instead of characters for some 'bad' Unicode characters

Those are not squares. They are the font's .notdef glyphs (not .null, U+0000). You can see them in FontLab. That is why Consolas shows not just a square, but a "?" in a square! https://docs.microsoft.com/en-us/typography/opentype/otspec170/recom#shape-of-notdef-glyph

Here's another way to reproduce this, from #3747 originally reported with #813

Open a new Notepad++ file, set the encoding to UTF-8 and paste these symbols (Double Arrow Unicode characters) on the first empty line:
⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙
Position the cursor before the last two characters and enter a newline, like this:
⇐⇑⇒⇓⇔⇕⇖⇗
⇘⇙
The last two characters should turn into blocks.
notepad unicode character corruption

This comment has a nice video showing the issue
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/5513#issuecomment-482701890

Why did you close it if #8756 is still open and this issue is still not fixed??? Did you report it upstream?
