TypeScript: Paragraph separator character causes TS Server to get into bad state

Created on 21 Apr 2020 · 10 Comments · Source: microsoft/TypeScript


From https://github.com/microsoft/vscode/issues/95685

TypeScript Version: 3.9.0-dev.20200420


Search Terms:

  • invisible character
  • paragraph
  • diagnostics / errors

Code
For the JS

var ftext = [
"Und", "dann", "eines"];
var a = 1;

Note there is an invisible paragraph separator character after the opening [

  1. In the file, delete the space after the = on the second line.
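Because the character is invisible, it can help to write the sample with an explicit escape. This is a minimal sketch, assuming the paragraph separator is the only break between `[` and `"Und"` (which is what the line numbers in the traces suggest); `source` and `findParagraphSeparators` are hypothetical names, not part of the original report.

```typescript
// Hypothetical reconstruction of the sample file; the exact contents of the
// original file are an assumption based on the traces in this issue.
const PS = "\u2029"; // PARAGRAPH SEPARATOR

const source: string =
  "var ftext = [" + PS +
  '"Und", "dann", "eines"];\n' +
  "var a = 1;\n";

// Return the UTF-16 index of every paragraph separator in the text.
function findParagraphSeparators(text: string): number[] {
  const hits: number[] = [];
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0x2029) {
      hits.push(i);
    }
  }
  return hits;
}

console.log(findParagraphSeparators(source));
```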

Bug:
TS starts reporting bogus errors:

[image: Screen Shot 2020-04-20 at 3 31 24 PM]

It looks like the paragraph separator character messes up the parsing and causes the error line numbers to be off by one.

[Trace  - 22:31:41.88] <semantic> Event received: semanticDiag (0).
Data: {
    "file": "/Users/matb/projects/san/test.js",
    "diagnostics": [
        {
            "start": {
                "line": 2,
                "offset": 8
            },
            "end": {
                "line": 2,
                "offset": 12
            },
            "text": "Cannot find name 'dann'.",
            "code": 2304,
            "category": "error"
        },
        {
            "start": {
                "line": 2,
                "offset": 16
            },
            "end": {
                "line": 2,
                "offset": 21
            },
            "text": "Cannot find name 'eines'.",
            "code": 2304,
            "category": "error"
        }
    ]
}

All 10 comments

In JavaScript, Paragraph Separator is a line terminator; however, in the edits described above, VS Code says that the edits occur on line 2.

From TypeScript's perspective, line 2 starts immediately after the Paragraph Separator. Ultimately, editors need to account for each language's newline settings if they intend to use line/offset encodings for commands, so I don't think we can fix this on the TypeScript side.
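To illustrate the disagreement, here is a minimal sketch that counts lines under both definitions. It assumes the sample file contains only the paragraph separator between `[` and `"Und"`; `lineStarts`, `lineOf` and `sample` are hypothetical helpers written for this comment, not tsserver or VS Code APIs.

```typescript
// Line terminators as JavaScript defines them: LF, CR, LS, PS (CRLF counts once).
const JS_EOL = /\r\n|[\n\r\u2028\u2029]/;
// Line terminators as an editor that only knows CRLF, CR and LF sees them.
const EDITOR_EOL = /\r\n|[\n\r]/;

// Index at which each line starts under the given terminator definition.
function lineStarts(text: string, eol: RegExp): number[] {
  const re = new RegExp(eol.source, "g"); // fresh copy, no shared lastIndex
  const starts: number[] = [0];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    starts.push(m.index + m[0].length);
  }
  return starts;
}

// 1-based number of the line that contains the given string index.
function lineOf(text: string, index: number, eol: RegExp): number {
  return lineStarts(text, eol).filter((start) => start <= index).length;
}

// Assumed reconstruction of the file from the issue.
const sample = 'var ftext = [\u2029"Und", "dann", "eines"];\nvar a = 1;\n';
const at = sample.indexOf("var a");

console.log(lineOf(sample, at, EDITOR_EOL)); // line number in the editor's model
console.log(lineOf(sample, at, JS_EOL));     // line number in the JS/TS model
```

With this sample, the same statement is on line 2 in one model and line 3 in the other, which is the off-by-one described in the report.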

See also:

Here's the edit we send to TSServer when deleting the space after the = in a = 1:

[Trace  - 23:20:02.22] <syntax> Sending request: updateOpen (12). Response expected: yes. Current queue length: 0
Arguments: {
    "changedFiles": [
        {
            "fileName": "/Users/matb/projects/san/test.ts",
            "textChanges": [
                {
                    "newText": "",
                    "start": {
                        "line": 2,
                        "offset": 8
                    },
                    "end": {
                        "line": 2,
                        "offset": 9
                    }
                }
            ]
        }
    ],
    "closedFiles": [],
    "openFiles": []
}

The indexes are one-based, so they look correct to me.

@RyanCavanaugh @DanielRosenwasser Does this also match what you see?

Yeah, that's what I'm seeing as well, but if I understand correctly the edit needs to happen at line 3 because the paragraph separator is a line terminator.
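As a rough sketch of what "the edit needs to happen at line 3" means, a client could translate its own line/offset into the language's line model by going through an absolute string index. Everything here is hypothetical (`Position`, `toIndex`, `toPosition`, `editorToJs`), and the sample text is an assumed reconstruction of the file; this is not how VS Code or tsserver actually implement it.

```typescript
// Both fields are 1-based, matching the tsserver traces above.
interface Position {
  line: number;
  offset: number;
}

const JS_EOL = /\r\n|[\n\r\u2028\u2029]/; // JS/TS line terminators
const EDITOR_EOL = /\r\n|[\n\r]/;         // CRLF, CR, LF only

// Start index of every line under the given terminator definition.
function starts(text: string, eol: RegExp): number[] {
  const re = new RegExp(eol.source, "g");
  const out: number[] = [0];
  for (let m = re.exec(text); m !== null; m = re.exec(text)) {
    out.push(m.index + m[0].length);
  }
  return out;
}

// Convert a line/offset pair to an absolute string index.
function toIndex(text: string, pos: Position, eol: RegExp): number {
  return starts(text, eol)[pos.line - 1] + (pos.offset - 1);
}

// Convert an absolute string index back to a line/offset pair.
function toPosition(text: string, index: number, eol: RegExp): Position {
  const s = starts(text, eol);
  let line = 0;
  while (line + 1 < s.length && s[line + 1] <= index) {
    line++;
  }
  return { line: line + 1, offset: index - s[line] + 1 };
}

// Re-express an editor position in the JS/TS line model.
function editorToJs(text: string, pos: Position): Position {
  return toPosition(text, toIndex(text, pos, EDITOR_EOL), JS_EOL);
}

const text = 'var ftext = [\u2029"Und", "dann", "eines"];\nvar a = 1;\n';
console.log(editorToJs(text, { line: 2, offset: 8 }));
```

For the edit in the trace, the start position `{ line: 2, offset: 8 }` maps to line 3 with the same offset, since the only difference between the two models is the extra line break at the paragraph separator.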

VS Code does not consider the character a line break. Looks like VS does, though.

@alexdima should be able to say if this is intentional or not. If you have this character in your code and it's not inside a string, though, it's probably a bug.

This issue has been marked as 'External' and has seen no recent activity. It has been automatically closed for house-keeping purposes.

Currently, VS Code considers the following characters/sequences to be line terminators: CRLF, CR or LF.

I feel that this is a difficult topic to get wide agreement on, since we need to work with multiple programming languages, each one with its own definition of what a line terminator is. So, if a compiler creates a diagnostic at line 100, it appears that line number is defined according to the programming language's own specification.

From my reading of https://www.unicode.org/reports/tr14/tr14-32.html , Unicode defines a mandatory line break after the following:

  • LB4. Always break after hard line breaks (BK). In the BK class there are the following:

    • 000C - FORM FEED (FF)

    • 000B - LINE TABULATION (VT) - [OPTIONAL]

    • 2028 - LINE SEPARATOR (LS)

    • 2029 - PARAGRAPH SEPARATOR (PS)

  • LB5. Treat CR followed by LF, as well as CR, LF, and NL as hard line breaks:

    • 000D x 000A - CARRIAGE RETURN (CR) x LINE FEED (LF)

    • 000D - CARRIAGE RETURN (CR)

    • 000A - LINE FEED (LF)

    • 0085 - NEXT LINE (NEL)

Here I'm looking at the most popular languages used in VS Code for which I could find a specification or where the decision is not pushed out to specific implementations (looking at you, C++).

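It looks like the JS spec Table 33 defines the following as line terminators. I suppose TS uses the same:
[image: JS spec Table 33, line terminators]

Interestingly, the C# spec A.1.1 Line terminators defines the same line terminators:
[image: C# spec A.1.1, line terminators]

HTML however defines newlines as just CR, LF and CRLF:
[image: HTML newline definition]

Python also defines just CR, LF and CRLF:
[image: Python line terminator definition]

PHP also defines just CR, LF and CRLF:
[image: PHP line terminator definition]

Java defines LF, CR and CRLF:
[image: Java line terminator definition]

YAML section 3.1.4 Line Breaks uses the following:
[image: YAML section 3.1.4, line breaks]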

To sum up:

| char/seq | Unicode | JS | C# | HTML | Python | PHP | Java | YAML |
|----------|---------|----|----|------|--------|-----|------|------|
| CRLF | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CR | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LF | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LS | ✅ | ✅ | ✅ | | | | | ✅ |
| PS | ✅ | ✅ | ✅ | | | | | ✅ |
| NEL | ✅ | | | | | | | ✅ |
| FF | ✅ | | | | | | | |
| VT | ✅ | | | | | | | |

I am not sure what would be best, since it would be very difficult to make VS Code's end-of-line sequence dependent on the file's language, and changing things now might break language extensions which follow our definition.

@dbaeumer Is the end-of-line sequence something that is specified in the Language Server Protocol?

Actually, a straightforward way to tackle this would be for us to prompt users and ask for permission to "fix" their files. Most likely, LS, PS or NEL are inserted by accident in files by copy-pasting and they are unwanted there anyway. I've opened https://github.com/microsoft/vscode/issues/96142 to track this.
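A minimal sketch of what such a "fix" could look like, assuming the tool simply scans for LS, PS and NEL and, once the user agrees, replaces them. The helper names and the choice of replacement (a plain "\n" by default, or an empty string to drop the character) are assumptions for illustration, not what VS Code ships.

```typescript
// LS (U+2028), PS (U+2029) and NEL (U+0085): terminators that the editor
// does not treat as line breaks but some languages do.
const UNUSUAL_EOL = /[\u2028\u2029\u0085]/g;

// How many unusual terminators the text contains; a prompt could be shown
// to the user when this is greater than zero.
function countUnusualTerminators(text: string): number {
  const matches = text.match(UNUSUAL_EOL);
  return matches === null ? 0 : matches.length;
}

// Replace every unusual terminator. Whether to substitute "\n" or to remove
// the character entirely is a policy decision left to the caller.
function replaceUnusualTerminators(text: string, replacement: string = "\n"): string {
  return text.replace(UNUSUAL_EOL, replacement);
}

// Assumed reconstruction of the file from the issue.
const broken = 'var ftext = [\u2029"Und", "dann", "eines"];\nvar a = 1;\n';

if (countUnusualTerminators(broken) > 0) {
  const fixed = replaceUnusualTerminators(broken);
  console.log(JSON.stringify(fixed));
}
```

After the replacement, both the editor and the language service split the file at the same places, so line/offset positions agree again.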

The LSP supports \r, \n and \r\n as end-of-line sequences.
