Terminal: CP437 "low ASCII" characters not working in the v2 console

Created on 2 May 2018 · 16 comments · Source: microsoft/terminal

  • Your Windows build number:

Microsoft Windows [Version 10.0.16299.371]

  • What you're doing and what's happening:

If you look at the full character set for code page 437, there are a number of useful symbols in the "control" range (for example, the symbols for card suits), which were commonly used in text-based games from the DOS era. I was just trying to use some of these recently, and noticed that they no longer work in the new console.

I thought at first this was a problem with the Consolas font, since they do work with the old Raster fonts, but I then tried using the Consolas font in the legacy console, and was surprised to find that they work there too. So there's obviously something the old console is doing to support those characters that is missing in the v2 console.

Below you can see a simple test case comparing the v2 console with Raster fonts (left), the legacy console with the Consolas font (middle), and the v2 console with the Consolas font (right).

[screenshot: v2 console with Raster fonts (left), legacy console with Consolas (middle), v2 console with Consolas (right)]

In case it's not obvious, the characters being echoed are entered with the Ctrl key, i.e. Ctrl+B, Ctrl+D, etc.

  • What's wrong / what should be happening instead:

When code page 437 is selected (or one of the regional equivalents) I would expect the characters in the control range to display their correct symbol as defined by the code page (assuming of course that the character isn't actually functioning as a control character).

Area-Output Area-Rendering Issue-Task Needs-Tag-Fix Product-Conhost Resolution-Fix-Committed

All 16 comments

This is by design. Legacy mode is required to run those old DOS things. The console team is trying to move away from that old stuff and toward the new, such as getting UTF-8 instead of UTF-16, somewhat implementing xterm, etc., due to the feedback they've gotten from the community. We don't want the old stuff, so if you need it, that's what legacy mode is for.

I mean, as the person who pulled it out, it's by design at that point in time because it made my life more difficult in terms of the other changes I was trying to make to update the rest of the console code.

The way this worked was that it kept an in-memory display canvas of those symbols out of the raster font and tried to stitch them into every other font that used them.

If there were enough demand, we could theoretically re-add it somehow. Probably by attempting to remap them to a Unicode codepoint for each of those symbols behind a checkbox that lets you turn it on or off. But I'm pretty sure demand overall is low as this is the first report I've heard of this in the 3 to 4 years since I took it out. For now, I'd advise using the legacy mode to get at this behavior.

But to an extent, @DarthSpock is right. Implementing UTF-8 is currently a higher priority for us than maintaining or reimplementing support for this particular feature. If more people vocally want this than UTF-8, we can switch gears. :P But I doubt that.

There is one character used on e.g. Swedish keyboards that has code 21 decimal in CP437 (i.e. after chcp 437): §, and that one works in the non-legacy console without a raster font. I think you could do the same hack for the whole 0-31 range. In DOSBox 0.74, Ctrl+U, Ctrl+V, Ctrl+W and Alt+21, Alt+22, Alt+23 (entered on the numeric keypad) give §, ▬, and ↨.

This also affects CP850 etc.: "code points 1–31 and 127 (01–1Fhex and 7Fhex) may be either ASCII control characters or code page 437 graphics, depending on context." https://en.wikipedia.org/wiki/Code_page_850#Character_set

All CP437 characters have a corresponding UTF-8 character, see
https://en.wikipedia.org/wiki/Code_page_437#Characters

Alt+127 works in the non-legacy console in CP437 and CP850 and gives ⌂.
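
For reference, a partial sketch of that correspondence as a table (an excerpt only; the full mapping is in the Wikipedia article linked above, and the struct/array names here are invented for illustration):

```cpp
// Excerpt of the CP437 "glyph" interpretation of the control range, expressed
// as Unicode codepoints (see the Wikipedia table above for the complete set).
struct Cp437Glyph { unsigned char code; wchar_t glyph; };

constexpr Cp437Glyph kCp437LowGlyphs[] = {
    { 0x03, L'\u2665' }, // ♥ heart suit
    { 0x04, L'\u2666' }, // ♦ diamond suit
    { 0x05, L'\u2663' }, // ♣ club suit
    { 0x06, L'\u2660' }, // ♠ spade suit
    { 0x15, L'\u00A7' }, // § section sign (the Swedish-keyboard example above)
    { 0x16, L'\u25AC' }, // ▬ black rectangle
    { 0x17, L'\u21A8' }, // ↨ up down arrow with base
    { 0x7F, L'\u2302' }, // ⌂ house (the Alt+127 example above)
    // ... the rest of 0x01-0x1F map similarly (smileys, arrows, music notes).
};
```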

I'm also for xterm and UTF-8. WSLtty (WSL Terminal) supports entering CP437 characters 1-31 using Alt+1 to Alt+31, even though UTF-8 is selected, but this might collide with acting as an xterm.

@miniksa: Based on your response, I think you may not have understood the limited conditions under which I was expecting this to work. So just to clarify:

  1. I wasn't expecting these characters to work when UTF-8 was selected as the code page - I only expected them to work if the selected code page had actually defined symbols for those code points. They don't work in the legacy console when UTF-8 is selected either.

  2. I also wasn't expecting them to work with fonts that don't have those glyphs. But Consolas certainly does, and I expect most other fonts would too, considering all those characters are included in WGL4. So I wasn't expecting font fallback to the raster fonts - I assumed it was simply a matter of mapping those code points to existing symbols in the active font, much like you'd be doing for the rest of the characters in the code page.

That said, I can understand if you're just saying this isn't a priority at the moment. I can also accept that it may be a lot more complicated than it appears to an outsider. I just wanted to be sure you hadn't misinterpreted my request.

Sorry I think I'm miscommunicating or mixing the two things here. Let me try a bulleted list of things to see if that helps.

  • I didn't mean to imply that the low ASCII characters ever worked in UTF-8 or that you expected it. My apologies.
  • The console has never had true UTF-8 support. It still doesn't. There's a lot of pieces that still don't work right despite setting 65001 and using some of our APIs.
  • My description above is in relation to how the feature that used to exist was implemented. It was implemented in the arguably "wrong" way as a strange stitching of raster font glyphs into TrueType fonts. This happened before my time and was a headscratcher to stare down.
  • The arguably "right" implementation is what you describe: to remap the codepoints to where they already exist in the TrueType fonts. If they're in the fonts, use them. If they're not, the font can display its replacement character and the user can choose a font with them if they want to see them.
  • The "wrong" implementation itself was a liability in working on other features that we've added in the last several releases of Windows. So I took it out. We knew when I took it out that it would impact a few people and that the stated response would be "use legacy mode if you want that feature".
  • Replacing it with the "right" implementation ranks too low on our list of priorities to perform at this time.
  • I am giving "implementing proper UTF-8 support" as an example of a work item that is very high on our list of priorities and is currently being worked on. Completely orthogonal to this feature of "(re)implement drawing low ASCII characters in 437 by remapping them to the correct codepoints" which is very low on our priority list.
  • To raise it up on our priority list, we would need a massive outcry on Github, Feedback Hub, or through other channels to signal that we should raise the feature up on our list. We haven't seen any real signal at all that people noticed it was gone until this thread. We'll probably go bump the rank a few points in response to this thread, but it isn't really a "big" deal yet and has a simple workaround (versus some of our other ranked features).
  • It is a touch more complicated internally than it appears to an outsider. It would be way more clear to you what happened here if you could see the pile of source code and our task board ranking.

We'd love to have time to implement everything that people are concerned with, but the team and I do not. I just want to be honest: this was removed from the new console on purpose, we believe the legacy console is a sufficient workaround for now, fixing this exists on our backlog, we know about it, we have an idea of how we'd fix it, but it is currently ranked too low relative to other features.

I hope this helps clarify it.

I must admit that, since NTVDM isn't supported on x64, old DOS apps aren't supported there anyway, and I don't see many Win32 CUI apps that would be impacted.

There is a workaround for impacted apps: use the Unicode (UTF-16) version of the console API. Apps trying to render ANSI art should simply #define UNICODE and use the proper code points anyway, since box-drawing and block-element characters might not map to the current user codepage either, instead of changing the user's codepage or expecting the whole world to use CP 437.
Codebases using CP 437 internally could stay as they are and simply replace their print function with a combination of MultiByteToWideChar and WriteConsoleW to easily become independent of the console codepage.
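
A minimal sketch of that suggestion, assuming a CP437-encoded buffer and the standard output handle (the PrintCp437 helper name is invented, not an existing API):

```cpp
#include <windows.h>
#include <string>

// Sketch: convert a CP437 buffer to UTF-16 and write it with WriteConsoleW,
// so the output no longer depends on the console's current codepage.
void PrintCp437(const char* text, int length)
{
    // MB_USEGLYPHCHARS asks for the OEM glyphs (card suits, arrows, etc.)
    // for the low range instead of treating those bytes as control characters.
    const int wideLen = MultiByteToWideChar(437, MB_USEGLYPHCHARS, text, length, nullptr, 0);
    if (wideLen <= 0)
        return;

    std::wstring wide(static_cast<size_t>(wideLen), L'\0');
    MultiByteToWideChar(437, MB_USEGLYPHCHARS, text, length, &wide[0], wideLen);

    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide.c_str(),
                  static_cast<DWORD>(wide.size()), &written, nullptr);
}
```

Note that WriteConsoleW only works when stdout is an actual console handle; redirected output would need WriteFile with an explicit encoding instead.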

Outputting these characters in the v2 console with CP 437 selected:
[image: C0 graphic characters]

Codepage 437 is a superset of ASCII. It also includes an alternate mapping for the legacy PC glyphs, which you can get instead of ASCII control characters by specifying the MultiByteToWideChar flag MB_USEGLYPHCHARS.

This would be a lot less painful if the Legacy Console setting wasn't global!

While looking into another issue, I came across something in the code which I think is the reason why these characters aren't displaying "correctly". In the WriteCharsLegacy method in _stream.cpp, there is actually some code that tries to handle the conversion of control characters to their OEM glyphs (see here). The problem is that the CharType returned by GetStringTypeW is a combination of bit flags - not a single value - so it can't be compared directly with C1_CNTRL. This can easily be fixed by changing that condition to a proper flag test:

if (WI_IsFlagSet(CharType, C1_CNTRL))
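
For illustration, a self-contained sketch (not the actual conhost code; the IsControlChar helper is invented) of why the equality comparison fails while the flag test succeeds:

```cpp
#include <windows.h>

// GetStringTypeW reports a combination of C1_* flags, so a control character
// typically comes back as C1_CNTRL | C1_DEFINED (sometimes plus C1_BLANK or
// C1_SPACE), and never as C1_CNTRL alone.
bool IsControlChar(wchar_t ch)
{
    WORD charType = 0;
    GetStringTypeW(CT_CTYPE1, &ch, 1, &charType);

    // Broken: direct equality almost never matches.
    // return charType == C1_CNTRL;

    // Correct: test the individual bit, which is what WI_IsFlagSet does.
    return (charType & C1_CNTRL) != 0;
}
```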

I'd love to provide a PR for this if nobody has any objections.

@miniksa knows best here, but that sounds like the right fix to me!

I'm only moderately terrified that it is, in fact, in WriteCharsLegacy. That function is ... one of the worst parts of conhost, and one of our biggest compatibility concerns. I'd wait for Michael to report on whether this is sane or insane, or whether that behavior is now codified and set in stone and can never be changed. :smile:

if (WI_IsFlagSet(CharType, C1_CNTRL))

In short, this is probably fine to fix.

However, I would personally feed a few characters through WriteCharsLegacy under the debugger and assert that your theory is correct first (that multiple flags coming back are what the problem is) before making the change.

I am mildly terrified, less than Dustin, because it is freaking WriteCharsLegacy which is the spawn of hell and I fear some sort of regression in it.

In long, why is it fine to fix?

For reference, this particular segment of code (https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/_stream.cpp#L514-L539) appears to only be used when the codepoint is < 0x20 or == 0x7F (https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/_stream.cpp#L408) and ENABLE_PROCESSED_OUTPUT is off (https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/_stream.cpp#L320).

I looked back at the console v1 code and this particular section had a divergence for "Western" countries and "Far East" countries (a geopolitically-charged term, but what it was, nonetheless.)

For "Western" countries, we would unconditionally run all the characters through MultiByteToWideChar with MB_USEGLYPHCHARS without the C1_CNTRL test and move the result into the buffer.

For "Eastern" countries, we did the C1_CNTRL test and then if true, we would run through MultiByteToWideChar with MB_USEGLYPHCHARS. Otherwise, we would just move the original character into the buffer and call it a day.

Note in both of these, there is a little bit of indirection before MultiByteToWideChar is called through some other helper methods like ConvertOutputToUnicode, but that's the effective conversion point, as far as I can tell. And that's where the control characters would turn into acceptable low ASCII symbols.

When we took over the console codebase, this variation between "Western" and "Eastern" countries was especially painful because conhost.exe would choose which one it was in based on the Codepage for Non-Unicode Applications set in the Control Panel's Regional > Administrative panel and it could only be changed with a reboot. It wouldn't even change properly when you chcp to a different codepage. Heck, chcp would refuse to switch into many codepages. There was a block in place to prevent going to an "Eastern" codepage if you booted up in a "Western" codepage. There was also a block preventing you from going between "Eastern" codepages, if I recall correctly.

In modernizing, I decided a few things:

  1. What's good for the "Far East" should be good for the rest of the world. CJK languages that encompassed the "Far East" code have to be able to handle "Western" text as well even if the reverse wasn't true.
  2. We need to scrub all usages of "Far East" from the code. Someone had already started that and replaced them with "East Asia", except they left behind the shorthand "FE" prefixing dozens of functions, which made it hard to follow the code. It took us months to realize "FE" and "East Asia" were the same thing.
  3. It's obnoxious that the way this was handled was to literally double-define every output function in the code base to have two definitions, compile them both into the conhost, then choose to run down the SB_ versions or the FE_ versions depending on the startup Non-Unicode codepage (roughly the pattern sketched just after this list). It was a massive pile of complex pre-compilation #ifdefs and #elses that would sometimes surround individual lines in the function bodies. Gross.
  4. The fact that the FE_ versions of the functions were way slower than the SB_ ones was unacceptable even for the same output of Latin-character text.
  5. Anyone should be free to switch between any codepage they want at any time and restricting it based on a value from OS startup or region/locale is not acceptable in the modern world.
  6. I concluded by all of the above that I was going to tank/delete/remove the SB_ versions of everything and force the entire world to use the FE_ versions as truth. I would fix the FE_ versions to handle everything correctly, I would fix the performance characteristics of the FE_ versions so they were only slower when things were legitimately more complicated and never otherwise, I would banish all usage of "Far East", "East Asia", "FE_", and "SB_" from the codebase, and codepages would be freely switchable.
  7. Oh. Also, the conhost used to rewrite its entire backing buffer into whatever your current codepage was whenever you switched codepages. I changed that to always hold it as UTF-16.
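
Purely as an invented illustration of the double-definition pattern described in item 3 (none of these names exist in the real conhost sources):

```cpp
#include <cstdio>

static bool g_startedInFarEastCodepage = false; // fixed at boot, never revisited

// The same routine existed twice, selected by pre-compilation switches and a
// startup check rather than by the codepage actually in use at write time.
void SB_WriteChars(const char* text)
{
    // "Western" single-byte path: assumes one byte per glyph.
    std::printf("SB: %s\n", text);
}

void FE_WriteChars(const char* text)
{
    // "Far East" path: extra DBCS lead/trail-byte handling sprinkled through
    // the body behind #ifdefs, which also made it slower for Latin text.
    std::printf("FE: %s\n", text);
}

void WriteChars(const char* text)
{
    if (g_startedInFarEastCodepage)
        FE_WriteChars(text);
    else
        SB_WriteChars(text);
}
```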

Now, after that backstory. This is where the problem comes in. It looks like the code you're pointing to that didn't check flags and instead checked direct equality... is the way that it was ALWAYS done for the "Eastern" copy of the code. So it was ALWAYS broken for the "Eastern" codepages and country variants of the OS.

I don't know why the "Eastern" copy was checking C1_CNTRL at all in the first place. There is no documentation. I presume it has to do with Shift-JIS or GB-2312 or Unified Hangul or something having a conflict < 0x20 || == 0x7F. Or alternatively, it's because someone wrote the code naively thinking it was a good idea in a hurry and never tested it. Very possible and even probable.

Presuming CJK codepages have no conflict in this range for their DBCS codepages... we could probably remove the check with GetStringTypeW entirely and always run everything through ConvertOutputToUnicode. More risky than just the flag test change... but theoretically an option as well.

Thank you for taking the time to go into this in so much detail. I appreciate the amount of work that must have gone into refactoring that code, and I do understand your concern about breaking things. I have stepped through the function in the debugger to confirm that it was working the way I expected, but I'll try and do a lot more testing before I submit the PR.

Having now checked the character type for all the characters in the control range (0 to 31, and 127), I can confirm that they all have both the C1_CNTRL and C1_DEFINED flags set. A few also have C1_BLANK and/or C1_SPACE. So none of them would have matched on a direct comparison. That said, it seems that the C1_DEFINED flag didn't always exist (I believe it was added in XP), so it's possible a direct comparison with C1_CNTRL would actually have worked in the past (in most cases at least), so maybe that's the reason why this code was originally written the way it is.

Something that came up in testing, though, was that my suggested fix would actually change the behaviour of the null character. Because a null is now successfully matched as a C1_CNTRL, it would no longer fall through to the special-case handling that previously mapped it to a space, and that caused problems elsewhere in the code (see issue #1955). So to address that I've also had to reorder the code slightly - not quite the straightforward fix I'd thought it would be!

One other thing. You mention above that this only applies when ENABLE_PROCESSED_OUTPUT mode is off, but it's actually the other way around - it only applies when ENABLE_PROCESSED_OUTPUT mode is set. That seems wrong to me (I'd expect it to work in both cases), but that's the way the legacy console behaved as well, so if that's a bug it's a separate issue. And since the majority of code would be using the default mode (where ENABLE_PROCESSED_OUTPUT is set), I don't think it's that big a deal.
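
A minimal sketch, assuming a plain Win32 console program, of how one might toggle that mode and observe the difference (nothing here is taken from the PR itself):

```cpp
#include <windows.h>

int main()
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);

    DWORD mode = 0;
    GetConsoleMode(out, &mode);

    // The default console mode has ENABLE_PROCESSED_OUTPUT set; clear it to
    // compare against the non-processed path discussed above.
    SetConsoleMode(out, mode & ~ENABLE_PROCESSED_OUTPUT);

    SetConsoleOutputCP(437);
    const char heart = '\x03'; // CP437 glyph position for the heart suit
    DWORD written = 0;
    WriteConsoleA(out, &heart, 1, &written, nullptr);

    SetConsoleMode(out, mode); // restore the original mode
    return 0;
}
```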

Sorry, yes. I probably just mentally transposed the boolean state. And I agree. It's probably not that big of a deal.

Thanks for catching and identifying the null issue as well!

:tada: This issue was addressed in #1964, which has now been successfully released as Windows Terminal Preview v0.5.2661.0. :tada:
