Terminal: [Epic] COOKED READ and character conversion fun with all our different Convert* functions

Created on 29 Sep 2020 · 15Comments · Source: microsoft/terminal

[Claimed by miniksa]
This issue exists to resolve many issues around cooked read data processing including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in cmd.exe), and the general buffer grossness of mixmatching both char and wchar_t text within the same client-owned buffer space during waiting read operations.

Related issues

These issues are related and should be improved or resolved by this work:

#1503
#4551
#7589
#3786
#190
#5618
#1058
#4493

These issues should be easier to resolve after this work or be fixed by it:

#4975
#6555
#780
#1802

Uncategorized:

#4628

This is what I expect to do:

Start with _handlePostCharInputLoop and revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead move to just translating the text and storing the excess, much like an aliased command would do.
Realize TranslateUnicodeToOem has almost no usages anymore (and ConvertToOem) and could likely be converged with ConvertToA. Also that the fallback behavior of TranslateUnicodeToOem can be accomplished just by asking WC2MB to replace with the default character anyway.
Realize that if we don't need to know the CP's default character, then why are we loading it on startup? Oh, it's still because we need the lead byte table to know when something IsDBCSLeadByteConsole or not.
Realize that now that we don't use custom codepages anymore... there doesn't really seem to be a difference between IsDBCSLeadByteConsole and just calling the winnls.h exported IsDBCSLeadByteEx with the same codepage (and then not holding onto the CPInfo stuff at all.)
Doing the same thing for output codepage as input codepage in respect to dumping the load of that information and just asking winnls.h directly.
Look closely at CheckBisectStringA because it only seems to have one consumer that's really just checking if the final cell of a row IsDBCSLeadByteEx...
The ConvertInputToUnicode and ConvertOutputToUnicode are pretty darn close to ConvertToA and ConvertToW anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have ConvertToA and ConvertToW have those modes and run all the translations through those and use the safer and more sensible string-in-string-out pattern to translate everything.
If we're going all in on ConvertToA and ConvertToW... perhaps it's time for til::u8u16 and til::u16u8 to get their til::au16 and til::u16a variants brought in, have the flags added, and just have a unified way of converting things around here.
Can we also torch CharToWchar since its just translation of a short string (single character) but with the glyph for ctrl char substitution into a til::au16 with the glyph chars flag?

The expected effects are primarily:

Returning classic codepage or UTF-8 information out of ReadConsoleA, ReadFile and such should work correctly
We should eliminate a giant pile of static analysis suppressions
The cooked line should be able to just hold/display/convert emoji correctly and erase them correctly

More will show up in each of these headings as I discover it or we have feedback

Area-Input Impact-Correctness Issue-Bug Issue-Feature Issue-Task Product-Conhost Product-Conpty

Source

DHowett

Most helpful comment

@codeofdusk we like to reserve issue numbers that are easily remembered for future megathreads 😝

zadjii-msft on 30 Sep 2020

😄3

All 15 comments

Huh?

codeofdusk on 30 Sep 2020

@codeofdusk we like to reserve issue numbers that are easily remembered for future megathreads 😝

zadjii-msft on 30 Sep 2020

😄3

I expect (hope) that @eryksun has some things to contribute here, given how intimately he's reverse-engineered our stack
and how much of it he understands :smile:

DHowett on 5 Nov 2020

Research from today....

I'm writing with WriteConsoleInputW. I'm reading with ReadConsoleInputA off of a cooked read.

This is the scenario overview:
|Char Name | Char | wchar_t | 437 char | 932 char* |
|-|-|-|-|-|
|Greek Lowercase Alpha|α|0x03b1|0xe0|0x83 0xbf|
|Greek Lowercase Beta|β|0x03b2|0xe1|0x83 0xc0|
|Greek Lowercase Delta|δ|0x03b4|0xeb|0x83 0xc2|
|Greek Lowercase Epsilon|ε|0x03b5|0xee|0x83 0xc3|

In conhostv1, if you were half way through reading a multibyte sequence and changed the codepage, the remaining bytes are discarded for the next call in the new codepage. By contrast, you're normally allowed to read byte by byte, even in a multibyte codepage like Japanese (shift-jis 932) and we'll store the leftover trail byte for the next call. But it looks like there's precedent to chuck it if you change codepage mid way. It also gives you an extra "byte read" for funsies.
- SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
- WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
- SetConsoleCP to 932.
- ReadConsoleInputA for 3 bytes. Receive 0x83 0xbf 0x83. Reported 4 bytes read.
- SetConsoleCP to 437.
- ReadConsoleInputA for 490 bytes. Receive 0xee 0x0d 0x0a. Reported 3 bytes read.
- 0xc0 from beta discarded.. where'd the delta go? Also... thanks for reporting another byte read that you didn't give me...
Also in conhostv1... it looks like if you change codepage while reading.... it just flat out loses the next character to the void.
- SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
- WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
- SetConsoleCP to 932.
- ReadConsoleInputA for 2 bytes. Receive 0x83 0xbf. Reported 2 bytes read.
- SetConsoleCP to 437.
- ReadConsoleInputA for 490 bytes. Receive 0xeb 0xee 0x0d 0x0a. Reported 4 bytes read.
- Where'd the beta go?

miniksa on 6 Nov 2020

Current thoughts:

Giving leftover bytes of previous codepage after a switch on read continuation

I'm pretty sold that the v1 behavior is right here. If you changed the input codepage half way through reading a byte stream, I think it's easy to say you don't care about the left behind trailing bytes for DBCS or theoretically the rest of the UTF-8 sequence.

Losing a character when the codepage changes

This feels absolutely wrong on v1's part. We shouldn't be losing characters for switching a codepage. I'd like to figure out why, but I don't intend to preserve that.

Returning the read bytes as how many were pulled off the input queue, not how much of the given buffer was used

This feels really strange and feels like an ages old mistake in conhostv1. I'm not sure why you'd return lpNumberOfCharsRead to be longer than the initial size of the buffer given in nNumberOfCharsToRead for the ReadConsoleA API. That means that the caller now has extra work to realize that we didn't overrun their buffer but we're implying that there's a continuation that didn't fit. That's a really strange API pattern relative to other Read* sorts of APIs in Windows. But it now has sort of a 20 year precedent and I can imagine someone is depending on that amount read > number to read to realize that there are trailing bytes.

So I could see that one going either way. And for UTF-8 it could easily be 1-3 greater than the buffer size, not just the 1 greater that happens with a torn DBCS lead byte.

EDIT: I'm now more strongly favoring the number read should never exceed the buffer size. Because then what do we say on the return of the extra held bytes? Is the total count now absurdly larger than the total concatenated byte length? Is it zero or smaller than the length of buffer we filled? Yuck!

miniksa on 6 Nov 2020

Also in conhostv1... if you read 1 byte at a time off of Japanese codepage 932.... it loses the trailing character to the void every time.
- SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
- WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
- SetConsoleCP to 932.
- ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
- ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
- ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
- ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
- ReadConsoleInputA for 1 byte. Receive 0x0d. Reported 1 bytes read.
- ReadConsoleInputA for 1 byte. Receive 0x0a. Reported 1 bytes read.
- Where are my trailing bytes?!
Also in conhostv1... if you read 1 byte off and then read the rest in Japanese codepage 932... it loses the trailing character and doesn't stitch it to the next string. But it does present the rest of the string with a reasonable byte count.
- SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
- WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
- SetConsoleCP to 932.
- ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
- ReadConsoleInputA for 100 bytes. Receive 0x83 0xc0 0x83 0xc2 0x83 0xc3 0x0d 0x0a. Reported 8 bytes read.
- You lost one of my trailing bytes!

miniksa on 6 Nov 2020

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

miniksa on 6 Nov 2020

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

Doesn't look like it's a threading thing. I had to allow shared on the test screen buffer because cooked input is on and it needs a handle to the output buffer so it can echo. So an unshared test screen buffer won't work.

miniksa on 6 Nov 2020

I have a theory on this. I think if we look in TranslateUnicodeToOem which has been historically responsible for identifying and storing that trailing byte (this code is very similar in v1 console)..

https://github.com/microsoft/terminal/blob/015675d87c5902fe0391aa44cb5c76134f432b13/src/host/dbcs.cpp#L183-L202

I think the only time that it will actually manage to store the trailing byte in the INPUT_READ_HANDLE_DATA and not set AsciiDbcs[1] = 0; immediately is when the size of the buffer coming down is landing on an odd number (or has been offset by 1 somehow).

However, if we look in the method that is decoding the ReadConsole packet and deciding what to do... ServerReadConsole... which is arguably a lot cleaned up in v2 console but should have about the same logic as v1... GetAugmentedOutputBuffer is always doubling the user's specified buffer size... presumably because it's using it willy nilly as scratch space while it operates.

https://github.com/microsoft/terminal/blob/ae550e0969595f062b15c2ff5cc33d4afe8ebc3f/src/server/ApiDispatchers.cpp#L271-L276

So I'm not sure it's even possible to have the trailing byte get stored appropriately and be delivered later.

But I've only looked at this for cooked read. So...

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?
Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?
If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

miniksa on 6 Nov 2020

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?

Nope. Doesn't work. I tried zα and reading either 1 or 2 bytes. If you read 1, you get just the z. If you read 2, it sends back z then 0x83 0xbf and reports 3 bytes read, but only 2 ever fill the client app's buffer and it's just gone. I didn't find what in the driver/client servicing is preventing the overrun... but something is. And it's losing the extra bytes.

If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

I believe the answer is yes. conhost.exe should never be returning more bytes than the user buffer said it could hold. The fact that the driver/client is preventing an overrun by truncating isn't something we should rely on.

Also reading the client code, we should be returning the number of chars read as the number of the client application buffer filled. Not more than that. Something's saving the console system from the overrun, but we're being mean and inducing an overrun in the client.

miniksa on 10 Nov 2020

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

For cooked, raw....(ReadConsoleA) no. The buffer is overfilled/overdeclared in conhostv1 to the driver. It truncates it to the actual size of the user buffer, losing the trailing byte into space.

For direct...(ReadConsoleInputA) no. The buffer is filled with the trailing byte, but then the method early returns without telling the driver that the buffer has >0 bytes of payload to return.... again losing the trailing byte into space.

miniksa on 11 Nov 2020

For direct... (ReadConsoleInputA)... the no is with an exception. It's only no if it's asked for 1 record at a time. If it's anything greater than 1, it will successfully return the trail. That's pretty much the only way to successfully get a trailing byte back, as far as I can tell.

miniksa on 12 Nov 2020

Oh no. Oh no no. ReadConsoleA also has the problem where if you set it to raw read and ask it for more bytes than it has in storage... it just fills the buffer up with wchar_ts and early returns without translating them to the codepage you asked for.

miniksa on 13 Nov 2020

So here it is.

For conhostv1 (and thus as far back as we can tell for compatibility reasons):

The only A way to read input correctly is with ReadConsoleInputA and providing a buffer much larger than you ever expected to fill.
Otherwise, you must use the W versions of the ReadConsole* APIs to avoid pitfalls.

The pitfalls include:

Trailing bytes disappearing to space in one of two different ways.
Extra nulls being padded into the result.
Receiving UTF-16 data that is untranslated to the codepage you selected.

So my plan is to codify this in a series of tests against v1. Then add a switch in the tests for what the behavior SHOULD be for any app to actually be able to use these APIs (that is... take all those bugs/mistakes out.) And as a part of running the bugs out, I should also be lighting up UTF-8 on those APIs.

This is technically a compatibility break, but given that the APIs were pretty much useless for as far back as we can determine providing garbage data to any app unfortunate enough to attempt to use them... we're going to call that acceptable risk for now.

miniksa on 13 Nov 2020

To future self: It's not safe to go alone, take this:

%TAEF% .\bin\x64\debug\conhost.feature.tests.dll /select:"(@Name='*ReadCharByChar*' or @Name='*ReadChangeCodepageBetween*' or @Name='*ReadChangeCodepageIn*' or @Name='*ReadLeadTrail*')" /p:TestAsV1=true

miniksa on 13 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings