Powershell: OSC sequences count into wrapping/truncation limits

Created on 8 Sep 2018  ·  36 Comments  ·  Source: PowerShell/PowerShell

ESC ] aka OSC (Operating System Command) sequences are used in terminals to control things like the terminal window title, or to insert hyperlinks.

From Wikipedia:

Starts a control string for the operating system to use, terminated by ST. In xterm, they may also be terminated by BEL. In xterm, the window title can be set by OSC 0;this is the window title BEL.

The sequence is followed by plain characters and terminated by ST/BEL. These plain characters are not shown to the user; they are interpreted by the terminal. Therefore they should not be counted for line wrapping or truncation, but PowerShell does count them.

This affects values in table format, group headers, etc.

Steps to reproduce

Run this in a PowerShell terminal:

@{ test = "`e]8;;https://github.com/PowerShell/PowerShell#this_is_a_very_long_link_to_show_the_problem_with_osc_sequences`aThis is the link text which should not be truncated`e]8;;`a" }

The OSC sequence used here as an example creates a hyperlink (works in iTerm2 and Gnome terminal, and will only show the link text in terminals without support).

Expected behavior

Output (make sure to resize the terminal so that the text fits on the screen):

Name                           Value
----                           -----
test                           This is the link text which should not be truncated

Actual behavior

Text gets truncated:

Name                           Value
----                           -----
test                           This is the link text wh...

The amount of text truncated depends on the terminal size of course, but the point is it should not get truncated at all if it fits on the screen.

If you use a string without an OSC sequence, it will not get truncated:

@{ test = "This is the link text which should not be truncated" }

What's worse is that, through truncation, PowerShell can cut off the escape sequence terminator, making the sequence apply to all following terminal output (e.g. everything becomes the hyperlink).

This affects values in tables as shown, but also group headers:

github.com/sourcegraph/sourcegraph-classic > app/node_modules/jest-cli/node_modu
les/optimist/example/usage-options.js


Line Preview
---- -------
  18 console.log('\n\nInspecting options');


github.com/sourcegraph/sourcegraph-classic > app/node_modules/jsfmt/node_m
odules/docopt/examples/any_options_example.coffee


Line Preview
---- -------
   1 doc = """Example of program which uses [options] shortcut in pattern.
   4   any_options_example.coffee [options
   6 Options:

Note the wrongly wrapped group headers (the links are also broken because of this).

Environment data

> $PSVersionTable
Name                           Value
----                           -----
PSVersion                      6.0.4
PSEdition                      Core
GitCommitId                    v6.0.4
OS                             Darwin 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
Issue-Bug Issue-Enhancement WG-Interactive-Console

All 36 comments

@iSazonov why is this Issue-Discussion and not Issue-Bug?

@felixfbecker I don't see the problem on Windows. On WSL the console freezes. I don't have macOS to test on.
We need more info. I don't know whether the issue is related to escapes.

I would bet the issue also happens in Linux, did you try a VM? I am almost certain it's caused by the escapes, because the point where truncation occurs is in direct relation to how long the link text inside the escape sequence is.
Builds are run on macOS/Linux, right? If there are unit tests for the truncation behaviour, maybe it would be possible to write a failing test?

@rjmholt Do you have time to look at the issue on Linux/macOS? Do escapes really affect this?

While I haven't dug deep, there is definitely _something_ going on with respect to escape sequences, and it affects _all_ platforms:

[pscustomobject] @{ test = "`e[32mgreen`e[m" } | ft -property @{ e='test'; width = 5 }

The above renders the string green in green, as expected, on Windows, macOS, Linux (Ubuntu).
That is, the length of the string is correctly recognized as 5, despite the presence of the escape sequences, which are seemingly ignored in _this_ case.

Things go awry if you force truncation, however, simply by lowering the width to 4:

[pscustomobject] @{ test = "`e[32mgreen`e[m" } | ft -property @{ e='test'; width = 4 }

This _should_ render "g...", and that's what it does _without_ escape sequences, but with them present what is rendered is the equivalent of "`e..." - i.e., the very first character in the string is blindly used, without regard to whether it is part of an escape sequence or not.

How that renders depends on the platform / terminal application, but it's definitely broken.

Because the partial escape sequence doesn't succeed in changing the color in this case, there are no aftereffects, but, as in @felixfbecker's example, if a complete _start_ escape sequence happens to be rendered, but not its complementary _end_ sequence, effects, such as color changes, _linger_ - and solving that problem generically sounds nontrivial.

For instance, the following will print gr... in green _and persistently switch to green output_, because the sequence that turns green _off_, `e[m, is not sent to the terminal:

[pscustomobject] @{ test = "`e[32mgreen, green grass of home`e[m" } | ft -property @{ e='test'; width = 10 }

I don't think the following is a valid terminating seq:

 "`e[m"

Try using:

"`e[0m"

@rkeithhill: The 0 is optional; try the following (on any supported platform):

"before `e[32mgreen`e[m after"

Yeah, now I see the line in the docs that I missed:

When no parameters are specified, it is treated the same as a single 0 parameter.

Good to know.

Yeah, truncated text including opening escapes but possibly truncating the closing escapes is another problem, but I would say only loosely related to this issue.

Your experiment above seems to show that _length_ is _not_ a problem for _color_ escape codes, but the issue description shows that it _is_ a problem for OSC sequences. Perhaps PowerShell recognizes and filters ESC[ sequences (which are terminated with m), but not ESC]8 sequences (which are terminated with ST/BEL).

Good point about the distinction between color codes and OSC sequences, @felixfbecker, and I think your guess about PowerShell not recognizing the latter is correct:

[pscustomobject] @{ test = "`e]8;;http://example.org`aexample.org`e]8;;`a" } | ft @{ e='test'; width = 11 }

The above breaks, even though 11 is exactly the length of example.org, and it only renders correctly with a minimum width of 41, which is the raw character count of the string _including_ escape sequences.
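For reference, the two counts can be checked with a short sketch. `Get-VisibleLength` is a hypothetical helper (not an existing cmdlet), and the regex is an assumption that covers only OSC sequences terminated by BEL or by ST (ESC \). It needs PowerShell 6+ for `e.

```powershell
# Hypothetical helper: counts only the characters a terminal would display.
# Assumption: only OSC sequences (ESC ] ... ended by BEL or ESC \) are stripped.
function Get-VisibleLength {
    param([string] $Text)
    $oscPattern = '\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)'
    ($Text -replace $oscPattern, '').Length
}

$link = "`e]8;;http://example.org`aexample.org`e]8;;`a"
$link.Length              # 41: raw character count, escape sequences included
Get-VisibleLength $link   # 11: just the link text "example.org"
```

The gap between 41 and 11 is exactly the 30 characters of the two OSC 8 sequences, which matches the minimum width observed above.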

I've opened a separate issue for the end-escape-sequence-not-getting-output problem: #7767

cc @SteveL-MSFT @TravisEz13

Given the support of VT100 in Windows Console, I expect more use of escape sequences. Looks like the formatting system needs to be more aware of this.

I think this would be a new feature not a fix to an existing feature. I think this would also probably require an RFC. I'm not sure embedding the raw control sequences is the way we would want to support this.

BTW, this kind of works for me.

@{ test = "`e]8;;https://github.com/PowerShell/PowerShell#this_is_a_very_long_link_to_show_the_problem_with_osc_sequences`aThis is the link text which should not be truncated`e]8;;`a" } | ft -AutoSize -Wrap

@TravisEz13 it may make sense to have some syntactic sugar making it easier for novice users to leverage vt100 escape sequences, but we can't prevent users from embedding them and I see it more and more already now that Win10 supports it.

The problem with this argument is this issue is expecting both us and the console to format this string (at least for this issue.) Our formatting code would have to effectively render this to be able to format it into a table, then send a modified string to the console, all while expecting us to honor the intent of the escape sequences. I think this particular item would be better implemented by sending some equivalent markdown to Show-Markdown (assuming it returns an object that the format commands can understand) and then rendering that with Format-Table.

Perhaps, with more style settings to override the control code we use to render links.

we can't prevent users from embedding them and I see it more and more already now that Win10 supports it.

I also think we should consider enhancing the PowerShell console host to support advanced rendering features like coloring. We need this for the help system, Markdown rendering, syntax coloring on the command line (PSReadLine), and so on.

The Markdown standard says nothing about colors. We would have to add our own custom enhancement. Or maybe it would be better to use an HTML renderer. HTML looks more promising for both the text console and GUI.

I don't see how Show-Markdown helps with this. It returns a string with VT sequences that is subject to the same wrapping and truncation problems.

It’s true that PowerShell needs to become smarter about escape sequences. PowerShell is _not_ just a byte stream shell like other shells, it is an object shell, and therefore also takes the responsibility of _formatting_ objects into byte streams. Part of that is wrapping and truncating strings. For that, you need to know the _user-visible_ length of a string as the terminal would render it, which means you need to be aware of VT sequences. It’s not difficult to detect them, because it’s not a context-free language like HTML with nested balanced tags that you would have to parse, it’s a simple state machine. You can probably detect them with regular expressions.
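To illustrate the "simple state machine" point, here is a minimal sketch that splits a string into visible text and escape tokens. `Split-VTString` is a hypothetical name, and the scanner is an assumption that recognizes only CSI (ESC [) and OSC (ESC ]) sequences; it is not how PowerShell's formatter works.

```powershell
# Hypothetical sketch: split a string into 'Text' and 'Escape' tokens.
# Assumption: only CSI (ESC [ ... final byte) and OSC (ESC ] ... BEL or ESC \) are recognized;
# a lone ESC that starts neither is treated as text.
function Split-VTString {
    param([string] $Text)

    $esc = [char] 0x1b
    $bel = [char] 0x07
    $tokens = [System.Collections.Generic.List[object]]::new()
    $i = 0
    while ($i -lt $Text.Length) {
        $start = $i
        $kind = 'Text'
        $next = if (($i + 1) -lt $Text.Length) { $Text[$i + 1] } else { [char] 0 }
        if (($Text[$i] -eq $esc) -and ($next -eq '[')) {
            # CSI: parameter/intermediate bytes 0x20-0x3F, then one final byte.
            $kind = 'Escape'
            $i += 2
            while (($i -lt $Text.Length) -and (([int] $Text[$i]) -ge 0x20) -and (([int] $Text[$i]) -le 0x3f)) { $i++ }
            if ($i -lt $Text.Length) { $i++ }
        }
        elseif (($Text[$i] -eq $esc) -and ($next -eq ']')) {
            # OSC: payload runs until BEL or ST (ESC \).
            $kind = 'Escape'
            $i += 2
            while (($i -lt $Text.Length) -and ($Text[$i] -ne $bel) -and ($Text[$i] -ne $esc)) { $i++ }
            if ($i -lt $Text.Length) {
                if ($Text[$i] -eq $bel) { $i++ }
                elseif ((($i + 1) -lt $Text.Length) -and ($Text[$i + 1] -eq '\')) { $i += 2 }
            }
        }
        else {
            # Plain text: runs until the next ESC.
            $i++
            while (($i -lt $Text.Length) -and ($Text[$i] -ne $esc)) { $i++ }
        }
        $tokens.Add([pscustomobject] @{ Kind = $kind; Value = $Text.Substring($start, $i - $start) })
    }
    $tokens
}

$parts = Split-VTString "`e]8;;http://example.org`aexample.`e[32morg`e[m`e]8;;`a"
$visible = 0
foreach ($p in $parts) { if ($p.Kind -eq 'Text') { $visible += $p.Value.Length } }
$visible   # 11: "example." plus "org"
```

Summing only the Text tokens gives the user-visible width; the Escape tokens are the parts a formatter would have to pass through without counting.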

Now, the second related problem is: how does PowerShell prevent a truncated string from opening VT sequences whose closing counterparts were in the truncated part, so that they leak into all following output? One way is that it could detect sequences and close them.

The other big alternative would be for PowerShell to not allow VT sequences in strings to be output raw, but sanitize/render them, e.g. replacing the ESC byte with the symbol “␛”. Since it’s important that we have the capability to output escapes, PowerShell would then have to add other capabilities to Format files to output color. For example, there could be an attribute sanitize=“false” on ScriptBlock to allow the output of escapes, and an element that allows to _always_ append a string even if the string gets truncated, which the user can use to specify closing sequences. This option would however break native commands that output VT sequences, because they output strings.

PowerShell could also introduce a new type, VTEncodedString, that can be returned in those scriptblocks and does not get sanitized, but the truncation/closed tag issue would remain.

It’s also possible that PowerShell adds tags to Format files to color parts, like . But that would not work for the contents returned by a scriptblock (imagine coloring something complex like a git diff).

Considering all options, I think PowerShell should bite the bullet, become aware of VT escapes and properly count/close them.

Also, this issue suggests that PowerShell _already_ has knowledge about escape sequences: https://github.com/PowerShell/PowerShell/issues/7570

Since PowerShell now has support for ignoring VT escape sequences (such as those for setting colors) to be ignored while calculating the widths of columns, this works (needs PowerShell 6 for `e):

It's just that it seems to only handle color sequences, but not OSC sequences.

Indeed; just to recap the simpler demonstration from above:

[pscustomobject] @{ test = "`e[32mgreen`e[m" } | ft -property @{ e='test'; width = 5 }

The fact that this renders correctly demonstrates that the escape sequences were ignored for the width (length) calculation.

@felixfbecker I agree with most of the points you make in https://github.com/PowerShell/PowerShell/issues/7744#issuecomment-423452538, but this calls out my point that this is a new feature that requires a design (which these comments start) and by our process an RFC to take comments on that design.

@mklement0 I took a look at the code and don't see anywhere where it's aware of vt100 escape sequences in format and output. I think your example happens to work by chance. If you change the width to 4, ideally it should have rendered g..., but instead renders nothing.

Interesting, @SteveL-MSFT - can you point us to the source code?

What it outputs with width 4 is "`e...", i.e., it blindly takes the first 4 chars., as previously stated - different terminal programs just happen to render that differently - on Windows nothing prints visibly, but that indeed suggests a lack of escape-sequence awareness at least with respect to _truncation_, as previously stated and reported in #7767.

@joeyaiello, can you please remove the OS-macOS label? While the symptoms differ, the issue at hand is definitely a problem on all supported platforms.

I really hope we can solve the problem for both width calculation and truncation (without having to resort to an encoded output form, as @felixfbecker mentions as an alternative), although the latter sounds challenging.

To summarize, the two related challenges are:

  • Only _printable_ (user-visible) characters in a VT-encoded string must be counted, for proper column-width / wrapping / truncation calculations

  • If truncation must be applied (based on the correct, printable character count) and that truncation - whose position must be calculated based on the printable characters - falls _inside_ a _pair_ of escape sequences, the closing half of the pair must be emitted too (unlike what I pondered in #7767, emitting some sort of generic _reset_ sequence, if available, is suboptimal)

    • The challenge is not only to recognize all those pairs, but that they can be _nested_ too; consider the following example:
[pscustomobject] @{ test = "`e]8;;http://example.org`aexample.`e[32morg`e[m`e]8;;`a" }

In terminals that support it, that renders a link, with part of the link _label_ - org - rendered in a different color. If truncation were to fall inside that colored part, you'd need _two_ closing sequences.

@mklement0 I haven't spent much time thinking how to fix this, but the relevant code should be in
ComplexWriter.cs. It seems that escape sequences should never be truncated, but we can avoid the complexity of calculating a closing sequence by always resetting the console between columns.

Thanks for the source-code link, @SteveL-MSFT.

but we can avoid the complexity of calculating a closing sequence by always resetting the console between columns.

That is certainly preferable to accidentally letting styles linger, and perhaps that's the best we can do, but note that properly paired closing sequences at least sometimes make for a better user experience: for instance, in the hyperlink example in the initial post, it would be nice if you still ended up with a working link even if the link _label_ gets truncated; using a generic reset sequence won't give you that - if there even _is_ such a generic sequence (at this point we only know what it is for _colors_ and styles such as underlining, namely `e[m)

The strategy used by DbgShell's custom, VT100 color-aware formatting engine (which predates conhost support) is to simply preserve all escape sequences when truncating (code here).

The simple solution would be to strip all VT100 escape codes with a regex before calculating string lengths. It won't work right when a cell spans multiple rows, but we can defer that as an edge case.

The simple solution would be to strip all VT100 escape codes with a regex before calculating string lengths.

It depends what you are doing. If you are going to leave escape sequences out of the content when displaying it as well, that's fine. But if you want to leave escape sequences in, and you want to truncate, it won't work.

Suppose you have a string with Length 31, and you strip escape sequences, and now its Length is 25. You need to fit it into a cell that is 20 chars wide... where do you cut the string? If you leave the sequences in, and just blindly omit the chars from str[20] on, you might end up with less than 20 chars of content, and possibly garbage at the end (if you chopped in the middle of an escape sequence).

So in general, my strategy is not only to leave escape sequences out when calculating content width, but also to be escape-sequence-aware when truncating strings. (So if I need to fit a string into a cell 20 chars wide, I might actually end up with a string of Length == 24, because the display width will be 20.)
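A minimal sketch of that strategy, under stated assumptions: `Limit-VTString` is a hypothetical name, the regex recognizes only CSI and OSC sequences, every character outside an escape sequence is assumed to be one cell wide, and no ellipsis is appended. It is not the DbgShell code, just an illustration of keeping every escape sequence while counting only visible characters.

```powershell
# Hypothetical sketch: truncate to a visible width while keeping every escape sequence.
# Assumptions: only CSI and OSC sequences are recognized; each other character is one cell wide.
function Limit-VTString {
    param([string] $Text, [int] $Width)

    $vt = [regex] '\x1b\[[\x20-\x3f]*[\x40-\x7e]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)'
    $result = [System.Text.StringBuilder]::new()
    $visible = 0
    $i = 0
    while ($i -lt $Text.Length) {
        $m = $vt.Match($Text, $i)
        if ($m.Success -and ($m.Index -eq $i)) {
            # Escape sequence at this position: always copied, never counted.
            [void] $result.Append($m.Value)
            $i += $m.Length
        }
        else {
            # Visible character: copied only while there is room left.
            if ($visible -lt $Width) {
                [void] $result.Append($Text[$i])
                $visible++
            }
            $i++
        }
    }
    $result.ToString()
}

$cut = Limit-VTString "`e[32mgreen, green grass of home`e[m" 10
$cut.Length   # 18: 10 visible characters plus 8 characters of escape sequences
```

Because the closing `e[m survives the cut, the color does not leak into later output, and the same holds for the closing OSC 8 sequence of a truncated hyperlink. The cost is that sequences from the truncated part are emitted even when they no longer wrap any text.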

I do like the idea of leaving the escape sequences while being _content_-length-aware, @jazzdelightsme, because it gives us the best of both worlds:

  • proper truncation based on the count of visible characters
  • without sacrificing the intended visual effect of the escape codes, such as rendering colors, or even hyperlinks, as in this case.

The caveats are:

  • There are many more types of escape sequences beyond SGR (colors, styles, ...) and OSC (e.g., hyperlinks) - see https://en.wikipedia.org/wiki/ANSI_escape_code#Escape_sequences:

    • They must _all_ be recognized in order to know what's visible content.

    • Ideally, some of them should still be _stripped_, because they are pointless as part of formatted output; for instance, retaining the escape sequence in @{ hi="`ec" } | Format-List is pointless, because `ec, the so-called RIS sequence, resets the entire terminal and clears the screen. The same goes for CSI sequences that control the cursor position, for instance.

      • That said, not handling these cases could be a documented limitation.

As an aside, @jazzdelightsme: Your code seems to handle SGR sequences only and uses a _single-character_ CSI representation (U+009b), which works fine with VT (Windows), as @oising points out, but not on _all_ Unix-like platforms, notably not on macOS; therefore, the _two-character_ sequence `e followed by [ is needed for cross-platform support.

As an aside, @jazzdelightsme: Your code seems to handle SGR sequences only and uses a _single-character_ CSI representation (U+009b), whereas VT processing always requires a _two-character_ sequence, `e followed by [.

As an aside to this, this isn't true. An 8-bit CSI "`u{009b}" or 7-bit CSI sequence "`e[" should work in VT, and in fact does work in VT. CSI is CSI - 8bit or 7bit data is the differentiator.

PS> "`u{009b}6n"
[40;1R

Thanks, @oising - indeed it works fine in VT, and at least on some Linux distros (I've tried Ubuntu 18.04), but _not on macOS_ (neither in Terminal.app nor in iTerm2.app), so for cross-platform support the two-char. sequence is a must - I've updated my previous comment accordingly.

Here's a simple test command:

# Prints 'This makes me blue.' with 'blue' in blue on Windows and Ubuntu, but not on macOS,
# where it prints '›34mblue›m'.
 "This makes me `u{9b}34mblue`u{9b}m."

# Works OK on all platforms: "`e[" substituted for "`u{9b}"
 "This makes me `e[34mblue`e[m."

Bash equivalent of the above:

# Doesn't work on macOS (even with Bash v4+)
# Note: Only works with *Unicode* escape sequence $'\u009b', not with $'\x9b'
# (SGR 34 selects a blue foreground; 33 would be yellow)
$ echo $'I am \u009b34mblue\u009bm.'

# OK on all platforms.
$ echo $'I am \e[34mblue\e[m.'

Sounds like someone should log a bug with Apple.

FYI, for the Unicode escape sequence, you don't have to use 4 digits like in C#. You can use just the required two:

"`u{9b}34mblue`u{9b}m."

Good point, @rkeithhill (example commands updated).
One more for the pile of asides, @oising:

Consistent cross-platform functionality would be nice (not holding my breath re Apple doing something about it), but I'm not sure it qualifies as a _bug_:

From https://en.wikipedia.org/wiki/ANSI_escape_code#Escape_sequences (footnote [18] refers to the ECMA-48 standard):

The standard says that in 8-bit environments these two-byte sequences can be merged into single C1 control code in the 0x80–0x9F range.[18]:

It's fair to assume that _8-bit environment_ refers to a _fixed-length single-byte encoding_ where these codes can be represented _as single bytes_.

By contrast, in the multi-byte-for-anything-beyond-the-7-bit-codepoint-range UTF-8 encoding, 0x9b is _invalid_ as a single byte; the _Unicode character_ that represents CSI, U+009b, must be represented as the _two-byte_ sequence 0xc2 0x9b.

Pragmatically speaking,

  • this obviously negates the benefit of using a single character, given that Unicode character sequence `e[ is also just 2 bytes in UTF-8 (0x1b 0x5b)
  • any processor of such escape sequences then needs to support _both_ representations (recognize both 0xc2 0x9b and 0x1b 0x5b as CSI).
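These byte sequences can be checked directly. The sketch below only uses the .NET `System.Text.Encoding.UTF8.GetBytes` method; the variable names are mine.

```powershell
# UTF-8 bytes of the single-character CSI (U+009B) and of the two-character form (ESC + '[').
$utf8 = [System.Text.Encoding]::UTF8
$c1Csi = $utf8.GetBytes([char[]] @(0x9b))
$escCsi = $utf8.GetBytes([char[]] @(0x1b, 0x5b))

($c1Csi | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' '    # 0xc2 0x9b
($escCsi | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' '   # 0x1b 0x5b
```

Both forms occupy two bytes in UTF-8, so the single-character form saves nothing on the wire.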

The flip side of the argument is that a dedicated CSI character clearly _is_ a part of Unicode, and should therefore be recognized as such.

The most recent edition of ECMA-48 is from June 1991, shortly _before_ the first volume of the Unicode standard was published, so, unsurprisingly, Unicode aspects are not covered.

Sure, I agree that it's amusing that the original benefits of using the single byte eight bit CSI have been nullified with the advent of UTF-8, but I digress; we're no longer working off teletypes, acoustic couplers or hayes smartmodems :) I think the point is that the high-bit flipped escape should work. The "benefits" of either scheme are moot these days, right?

I agree, @oising, but let me frame it a bit differently (that pile is getting awfully tall):

If we were to start today, in the age of Unicode, we'd _only_ support the _dedicated control characters_ (in the C1 range) that are part of the Unicode standard, irrespective of these characters' _encoding_ and how many bytes the encodings happen to require.

In other words, using the example at hand: Unicode-aware applications (and such a qualifier should by now be the equivalent of asking for unleaded gas at a filling station) would have to recognize _only_ U+009b as CSI (irrespective of its encoding) and wouldn't have to worry about _also_ recognizing the 7-bit legacy representation that involves _2 characters_ (ESC + [).

From that perspective alone my vote is for the macOS terminal applications to (also) recognize a UTF-8 encoded U+009b char. as CSI.


Re the _strict semantics_ of Unicode characters in the C1 control-character range (U+0080 - U+009f); from https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Unicode (emphasis added):

Unicode sets aside 65 code points for compatibility with ISO/IEC 2022. The Unicode control characters cover U+0000—U+001F (C0 controls), U+007F (delete), and U+0080—U+009F (C1 controls). Unicode only specifies semantics for U+001C—U+001F, U+0009—U+000D, and U+0085. The rest of the control characters are transparent to Unicode and their meanings are left to higher-level protocols.

In other words: at the level of Unicode itself, U+009b - its specific name notwithstanding - has no _intrinsic_ meaning.

Also (ibid., emphasis added):

Except for NEL [U+0085] these are almost never used (CSI is often used, but almost always by using the ESC,'[' 7-bit replacement). The C1 characters require 2 bytes to be encoded in UTF-8 (for instance CSI at U+009B is encoded as the bytes 0xC2, 0x9B in UTF-8). Thus the corresponding control functions are more commonly accessed using the equivalent two byte escape sequence intended for use with systems that have only 7-bit bytes.


Speaking pragmatically, again:

  • The 7-bit 2-character representation won't go away anytime soon.

  • `e[ is easier to type than `u{9b} (at least on a US keyboard).

This issue also shows when using ConvertTo-Markdown together with list views:

image

Notice how text wraps in unexpected places, is not indented correctly and text decorations persist in the whitespace instead of being cleared and reapplied.
