Terminal: How to write as UTF-8 to console?

Created on 27 Mar 2019 · 19 comments · Source: microsoft/terminal

Various blogs have mentioned that the internal buffer for the console now can represent full UTF-8. What is not clear to me is how, as a normal command line app, I can write/print/output UTF-8 to the console. Do I still use WriteConsole? But how do I signal that what I'm passing is UTF-8 and not UCS-2?

Issue-Question Product-Conhost

All 19 comments

Use SetConsoleOutputCP and/or SetConsoleCP to set CP_UTF8 which is 65001.

If you use the stream based APIs of the console like WriteConsole, WriteFile, ReadConsole, and ReadFile, it should work fine. If you start using more of the complicated APIs that use structured data like ReadConsoleOutput, it is probably going to have issues that you won't like for assorted reasons I don't have time to get into right now.
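For example, a minimal sketch of that setup in C (assuming Windows 10 1809 or later; error handling omitted for brevity):

```c
#include <windows.h>

int main(void)
{
    // CP_UTF8 == 65001
    SetConsoleCP(CP_UTF8);        // input side: ReadFile / ReadConsoleA
    SetConsoleOutputCP(CP_UTF8);  // output side: WriteFile / WriteConsoleA

    // UTF-8 bytes for "stream write, UTF-8 bytes: ✓" (U+2713) plus a newline.
    const char msg[] = "stream write, UTF-8 bytes: \xE2\x9C\x93\n";
    DWORD bytesWritten = 0;
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), msg,
              (DWORD)(sizeof(msg) - 1), &bytesWritten, NULL);
    return 0;
}
```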

But the buffers I pass to WriteConsole still have to be UCS-2 encoded in that case? Or can I pass a buffer that is actually UTF-8 encoded? I just tried the latter, and that doesn't seem to work.

WriteConsoleA with SetConsoleOutputCP set to CP_UTF8 should accept a UTF-8 encoded stream if your revision of Windows is high enough to contain the support. If you have 1809 or 1903, it should be fine. I don't know about 1803.
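Roughly like this, as a sketch (the string literal is just illustrative UTF-8; error handling omitted):

```c
#include <windows.h>

int main(void)
{
    SetConsoleOutputCP(CP_UTF8);

    // UTF-8 bytes for "héllo, 世界" followed by a newline.
    const char utf8[] = "h\xC3\xA9llo, \xE4\xB8\x96\xE7\x95\x8C\n";
    DWORD written = 0;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE),
                  utf8, (DWORD)(sizeof(utf8) - 1), &written, NULL);
    return 0;
}
```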

Ah, that works, thanks! I had been trying WriteConsoleW!

It would be nice if this was documented somewhere :)

You're welcome to drop an issue or send a PR to the docs site: https://github.com/MicrosoftDocs/Console-Docs

I won't. I'm happy to contribute to open source projects, but last time I checked I paid for my Windows license :)

That's fine. You do you.

Ok, one more question :) Are there any features or behaviors that only work if one uses WriteConsoleA with UTF-8 strings, or does it essentially not matter which API one uses?

One thing is that one can presumably represent some extra Unicode characters that can't be represented in UCS-2, right? Should one expect a performance difference? Any other differences worth keeping in mind?

Sorry, I'm not willing to keep answering your questions because you were flippant about your Windows licensing costs.

I'm sorry, I did not mean to be flippant at all! But I'm not willing to provide free labor for a product that Microsoft then charges a license fee for. Heck, I don't even mind that arrangement at all, I have no beef whatsoever with you guys charging for Windows, I'm a happy, willingly paying customer. But don't ask me to work for free for your commercial product.

And just to be even more clear: I of course also don't mind if you ask me to do something, I just won't do it. And sorry again for the less than ideal wording in my response above.

It's fine if you feel that way. I just want you to understand that we're the dev team who works on this thing and has minimal control over product decisions and business processes, including licensing, and your statements about "I've already paid for it" rub us severely the wrong way.

We're trying to build a community here of developers helping developers directly because we believe it's the most expedient way to help each other out, and that helping someone out should be done whether or not it is in someone's specific job description and regardless of how money is being exchanged. To us, a community where everyone can help out a little bit is a community where everyone's lives get a bit better.

We ask for your help, just as we ask for help from anyone in the community, not because we want free work from someone who has already paid. No one likes working for free. It's because we don't see the world the way an external person sees the world. We believe that your description of how this works for you, the problems you encountered while trying to use our software, or the bug ticket in your words on the appropriate tracker conveys a more accurate picture of the world than we are capable of as folks on the inside.

Given our limited time as folks on the inside with a ton of folks shouting at us from every angle, we prefer to work with people who appear at least mildly sympathetic to where our small dev team falls as cogs within a giant machine and are willing to help us help everyone in the mildest way. Your comments don't strike me that way, and as such...

I of course also don't mind if you ask me to do something, I just won't do it.

I'm highly sympathetic to your situation, and I think what you and your team are doing is awesome. I did not mean my comments to be confrontational at all; clearly that misfired, and I apologize for them. I do have a very limited time budget for this kind of stuff, and I generally devote that to open source projects. Please don't interpret that in any form as criticism of 1) what you do, 2) what your team does, 3) or even the general arrangement of how Microsoft sells Windows. As I wrote above, I'm perfectly happy to buy Windows and get an awesome product in return, no qualms with that in any form. I am a happy customer, and I think the kind of outreach and user engagement you are doing here at the console team is fantastic.

I'm interested in figuring out the answer to my question for the libuv project. That powers the console experience for nodejs and julia, and probably many others. Currently it takes UTF-8 encoded strings from "clients" (like julia and nodejs), converts them to UCS-2 and then calls the WriteConsoleW API. I suggested over there that one could just pass the UTF-8 strings directly (like you suggested above), and one question that came up was what one would gain from that, given that the existing implementation works, and that they want to continue to support older Windows versions and are worried about code complexity. So I'm simply trying to understand whether there are things that only work when one uses the UTF-8 API, or whether some things work better that way.
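For context, here is a rough sketch of the two code paths being compared. This is not libuv's actual code; the helper names write_via_w and write_via_a are made up for illustration, and error handling is elided:

```c
#include <windows.h>

static void write_via_w(HANDLE hOut, const char *utf8, int utf8Len)
{
    // Existing approach: convert UTF-8 to UTF-16, then call the wide API.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8Len, NULL, 0);
    WCHAR *wbuf = (WCHAR *)HeapAlloc(GetProcessHeap(), 0, wlen * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, utf8, utf8Len, wbuf, wlen);
    DWORD written = 0;
    WriteConsoleW(hOut, wbuf, (DWORD)wlen, &written, NULL);
    HeapFree(GetProcessHeap(), 0, wbuf);
}

static void write_via_a(HANDLE hOut, const char *utf8, int utf8Len)
{
    // Proposed approach: requires SetConsoleOutputCP(CP_UTF8) and a recent
    // Windows 10 build; no conversion needed in the client.
    DWORD written = 0;
    WriteConsoleA(hOut, utf8, (DWORD)utf8Len, &written, NULL);
}
```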

There is no additional functionality you gain by passing UTF-8 over the A API versus passing UTF-16 over the W API.

We will convert one or the other into whatever format is required for us to maintain the internal cellular storage. If the cellular storage improves release over release, then it will improve for transmissions on both API surfaces.

Hi miniksa, I'd just like to confirm something from your comment, since I've had to field this question on several occasions: In the past, the console (via W) only supported the UCS-2 subset of UTF-16, presumably for backwards compatibility. Are you saying there's a plan to change W to support full UTF-16 (or that it already does), or just that the UTF-8 path (codepage 65001) is intended to give the same behavior? If this is changing, I'd just note that Wikipedia currently says the opposite explicitly, so it would be great if the appropriate team at Microsoft could document it officially, and then fix the citation link at https://en.wikipedia.org/wiki/Win32_console#Windows_NT_and_Windows_CE

When it is officially supported and completed, we will document it. Until then, it's been a multi-release journey that is still incomplete and buggy. It is in progress. You should still stay below U+FFFF inside the UCS-2 boundary on both the A and W APIs until officially announced and documented.

We will be unable to fix the Wikipedia article as I believe their code of conduct prohibits the people who work on the thing or who are the thing to write the article about themselves. We would update docs.microsoft.com and likely our blog when the console is capable of crossing the UCS-2 boundary on our APIs, A or W.

OK, I understand, that's great to hear.

If you use the stream based APIs of the console like WriteConsole, WriteFile, ReadConsole, and ReadFile, it should work fine
[...]
WriteConsoleA with SetConsoleOutputCP set to CP_UTF8 should accept a UTF-8 encoded stream if your revision of Windows is high enough to contain the support. If you have 1809 or 1903, it should be fine. I don't know about 1803.

The console in 1803 does not support reading the input buffer as UTF-8. Specifically, non-ASCII characters are converted to null characters because it can't handle the multibyte encoding (e.g. "abcĀdef" is read as "abc\x00def"). Prior to Windows 10 it didn't even keep ASCII characters. The call would succeed with 0 bytes read, which looks like EOF.

Writing to the screen buffer works well in Windows 8+. However, on older versions, WriteConsoleA and WriteFile (to the console) mistakenly return the number of UCS-2 codes written instead of the number of bytes, which confuses buffered writers, including C FILE streams. The result is that subsequent writes appear as random characters after any write that contains non-ASCII characters.

All in all, if you're supporting Windows 7 still, then using UTF-8 in the standard Windows console is probably not an option. It will be good enough if you're just writing to the console directly via WriteConsoleA or WriteFile instead of using a buffered stream and don't need to read non-ASCII characters.
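To illustrate why the returned count matters, here is a hypothetical write-all helper (not from any real project). It advances its buffer by the count the API reports, which assumes that count is a byte count; that only holds reliably on Windows 8 and later with CP_UTF8:

```c
#include <windows.h>

// Writes 'len' bytes of UTF-8 to the console, retrying on partial writes.
static BOOL write_all_utf8(HANDLE hOut, const char *buf, DWORD len)
{
    while (len > 0) {
        DWORD written = 0;
        if (!WriteConsoleA(hOut, buf, len, &written, NULL))
            return FALSE;
        // Assumes 'written' is a byte count. On pre-Windows 8 consoles it
        // can be a UTF-16 code-unit count for non-ASCII UTF-8, so this loop
        // would re-send bytes it had already written and garble the output.
        buf += written;
        len -= written;
    }
    return TRUE;
}
```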
