Toml: Backslashes at end of line

Created on 1 Nov 2016 · 15 Comments · Source: toml-lang/toml

TOML faithfully replicates the decades-old multiline string trap:
(Edit: Nope, seems to be a bug in the toml npm package, and it shouldn't have been on the "v0.4.0 compliant" list.)

nodejs -p 'require("toml").parse(require("read-all-stdin-sync")());' <<<'
  msg1 = """hello \
    world"""
  msg2 = """hello \ 
    world"""
  '
{ msg1: 'hello world', msg2: 'hello \\ \n  world' }

That really doesn't fit the "O" in TOML.
Could we drop this tradition and boldly adopt a fault-tolerant rule that enables human readers to _visually_ determine what's in a string?

mk-pmb

All 15 comments

Use ''' instead of """? See the section on strings for more detail.

Could we drop this tradition and boldly adopt a fault-tolerant rule that enables human readers to visually determine what's in a string?

I don't understand your objection. Could you say more? Could you also address the downsides of your suggestion? What are the trade offs?

BurntSushi on 1 Nov 2016

With triple apostrophes, I get even more spaces, backslashes, and newlines in the text of my paragraphs. Here's a short one as an example; others are long enough to really warrant multi-line strings:

noClutter = """Iis igitur est difficilius \
  satis facere, qui se Latina scripta \
  dicunt contemnere."""

hasClutter = """Iis igitur est difficilius \    
  satis facere, qui se Latina scripta \ 
  dicunt contemnere."""

evenMoreClutter1 = '''Iis igitur est difficilius \
  satis facere, qui se Latina scripta \
  dicunt contemnere.'''

evenMoreClutter2 = '''Iis igitur est difficilius \  
  satis facere, qui se Latina scripta \ 
  dicunt contemnere.'''

If your browser displays this similarly to mine, it's not "obvious" (the O in TOML) where in hasClutter the clutter occurs or, without the name, that there is any clutter at all.

One way to solve this would be to merge lines if the backslash is the last printing character in a line, trimming not just the whitespace at start of next line but also in the line with the backslash.

The change will only affect files that had whitespace at the end of lines, which I consider bad style anyway. In cases where that whitespace was accidental, my first-idea method will silently fix the accident. In cases where people rely on whitespace at EOL, I can only recommend using a less fragile style, e.g. adding a custom padding character and stripping it at runtime. It may take some more CPU cycles, but the files won't break as easily when someone('s editor) "cleans" them up.
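
A minimal Python sketch of this proposed rule, not taken from any existing parser. The name `join_continued_lines` is hypothetical, and the sketch handles only the line-ending backslash: it ignores every other escape sequence, so an escaped backslash at the end of a line would be misread.

```python
import re

# A backslash, optional spaces/tabs, a newline (LF or CRLF), then any run of
# spaces, tabs, and newlines that follows. The whole match is removed.
_LINE_ENDING_BACKSLASH = re.compile(r"\\[ \t]*\r?\n[ \t\r\n]*")


def join_continued_lines(body: str) -> str:
    """Merge lines whose last printing character is a backslash.

    Hypothetical helper: it does not process any other escape sequence,
    so it is not a complete implementation of TOML basic strings.
    """
    return _LINE_ENDING_BACKSLASH.sub("", body)


clean = "Iis igitur est difficilius \\\n  satis facere."
cluttered = "Iis igitur est difficilius \\ \t \n  satis facere."

# Both inputs yield the same text under this rule.
print(join_continued_lines(clean))
print(join_continued_lines(cluttered))
```

Under this rule, whitespace after the backslash can no longer change the result, which is the visual-obviousness property asked for above.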

mk-pmb on 1 Nov 2016

Looks like you're asking for multi-line literal strings:

Multi-line literal strings are surrounded by three single quotes on each side and allow newlines. Like literal strings, there is no escaping whatsoever. A newline immediately following the opening delimiter will be trimmed. All other content between the delimiters is interpreted as-is without modification.

With single quotes you don't need the backslashes.

lines  = '''This is a line
This is another line
This is yet another line. No backslashes'''

TheElectronWill on 1 Nov 2016

One way to solve this would be to merge lines if the backslash is the last printing character in a line, trimming not just the whitespace at start of next line but also in the line with the backslash.

If I'm understanding correctly, then that's precisely what the current specification says should happen:

For writing long strings without introducing extraneous whitespace, end a line with a \. The \ will be trimmed along with all whitespace (including newlines) up to the next non-whitespace character or closing delimiter.

BurntSushi on 1 Nov 2016

With single quotes you don't need the backslashes.

So you mean this example:

noBackslashes = '''Iis igitur est difficilius
  satis facere, qui se Latina scripta
  dicunt contemnere.'''

… should produce the same text as noClutter = …?
That is, the words with punctuation and single spaces, but no multi-space sequence, no backslashes, no newline characters?

If so, I shall instead report a bug for the toml package on npm because it produces two occurrences of "\n  " (newline, space, space) in noBackslashes.

that's precisely what the current specification says should happen:

In this case, the wording "end a line with" is too ambiguous. It's easy for parser authors to think that it is enough to check the character immediately before the newline. This interpretation seems to be used in the toml npm package, making it produce a sequence of " \\\t\n  " (space, backslash, tab, newline, space, space) in the hasClutter example, and a second similar sequence with a space instead of the tab character.
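
The narrow reading can be sketched in Python. This illustrates the interpretation described here and is not the actual code of the toml npm package; `join_naive` is a hypothetical name, and the sketch ignores all other escape sequences.

```python
import re

# Naive reading: the backslash must sit immediately before the newline.
_NAIVE = re.compile(r"\\\r?\n[ \t\r\n]*")


def join_naive(body: str) -> str:
    """Join lines only when the backslash directly precedes the newline."""
    return _NAIVE.sub("", body)


clean = "difficilius \\\n  satis"
cluttered = "difficilius \\\t\n  satis"  # a tab hides after the backslash

print(repr(join_naive(clean)))      # 'difficilius satis'
print(repr(join_naive(cluttered)))  # 'difficilius \\\t\n  satis'
```

With the hidden tab, nothing is joined and the backslash, tab, newline, and indentation all end up in the value.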

If you clarify that "end a line with" means the visible portion of a line, I'll file a toml package bug for that, too.

mk-pmb on 2 Nov 2016

If you'd like to open a PR with better wording, then that sounds fine to me.

Single quotes are literal strings. No escaping. From the spec:

Multi-line literal strings are surrounded by three single quotes on each side and allow newlines. Like literal strings, there is no escaping whatsoever. A newline immediately following the opening delimiter will be trimmed. All other content between the delimiters is interpreted as-is without modification.
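
For illustration, the quoted rule fits in a few lines of Python. `literal_body` is a hypothetical helper that receives the text between the ''' delimiters; it is a sketch of the rule, not code from any TOML library.

```python
def literal_body(raw: str) -> str:
    """Return the value of a multi-line literal string.

    `raw` is everything between the opening and closing ''' delimiters.
    Only a newline directly after the opening delimiter is trimmed;
    everything else is kept as-is.
    """
    if raw.startswith("\r\n"):
        return raw[2:]
    if raw.startswith("\n"):
        return raw[1:]
    return raw


raw = (
    "Iis igitur est difficilius\n"
    "  satis facere, qui se Latina scripta\n"
    "  dicunt contemnere."
)
value = literal_body(raw)
print(value.count("\n  "))                # 2: line breaks and indentation survive
print("\\" in literal_body("a \\\n  b"))  # True: a backslash is kept literally
```

So the two "\n  " sequences in noBackslashes are what this rule produces, not something a literal string would strip.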

BurntSushi on 2 Nov 2016

All other content between the delimiters is interpreted as-is without modification.

Yeah, that was my understanding, too. Then probably @TheElectronWill got something mixed up.

If you'd like to open a PR with better wording, then that sounds fine to me.

Will gladly do so. So do I understand correctly that lines shall be merged when the last _non-whitespace_ character before the newline is a backslash? Can we expect parsers to understand Unicode enough to use Unicode's WSpace character property, or should we limit this rule to low-ASCII whitespace (effectively tab, carriage return, space)?

mk-pmb on 2 Nov 2016

That description sounds right. TOML does define that "whitespace" means ASCII.
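
A small Python sketch of why the distinction matters for parser authors: Python's built-in `str.rstrip()` without arguments follows Unicode, so a parser that wants TOML's narrower definition (tab and space only) has to pass the characters explicitly. `TOML_WS` is a hypothetical constant name.

```python
TOML_WS = " \t"  # whitespace per the TOML spec: space (0x20) and tab (0x09)

line = "dicunt contemnere \\\u00a0"  # backslash followed by a no-break space

# Unicode-aware stripping removes the no-break space as well ...
print(line.rstrip().endswith("\\"))         # True
# ... while stripping only TOML whitespace leaves it in place.
print(line.rstrip(TOML_WS).endswith("\\"))  # False
```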

BurntSushi on 2 Nov 2016

Oh thanks, missed that part. PR pending.

mk-pmb on 2 Nov 2016

@mk-pmb You're right, I misunderstood the topic ^^

TheElectronWill on 2 Nov 2016

Draft preview, what do you think?

mk-pmb on 2 Nov 2016

It seems like about 90% of this issue was people misunderstanding each other. =/

The only question is: In multi-line basic strings, when using the extraneous-whitespace-killer \ to end a line, should whitespace be allowed between that backslash and the actual newline character?

@mk-pmb You think it should be allowed, as it increases obviousness, since it appears to be visually identical to an immediately terminated line. Let's ponder it. As the spec is right now, a correct parser should produce an error since backslash-space or backslash-tab are not allowable escape sequences. I can't imagine any reasonable explanation for adding whitespace there on purpose, so we must consider it a mistake. This then becomes a question of allowing sloppy input.
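
A rough Python sketch of that strict reading, assuming the escape list from the v0.4.0 spec (\b, \t, \n, \f, \r, \", \\, \uXXXX, \UXXXXXXXX) plus a backslash directly before the newline. The name `check_escapes_strict` is hypothetical, and the check does not validate the hex digits after \u or \U.

```python
# Characters that may follow a backslash in a basic string (v0.4.0 escape list).
ALLOWED_AFTER_BACKSLASH = set('btnfr"\\uU')


def check_escapes_strict(body: str) -> None:
    """Raise ValueError on any backslash sequence the strict reading rejects."""
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        # A backslash directly before LF or CRLF is the line-ending backslash.
        if not body.startswith(("\n", "\r\n"), i + 1):
            nxt = body[i + 1 : i + 2]  # empty if the backslash is the last char
            if nxt not in ALLOWED_AFTER_BACKSLASH:
                raise ValueError(
                    "invalid escape " + repr("\\" + nxt) + " at offset " + str(i)
                )
        i += 2  # skip the backslash and the character it escapes


check_escapes_strict("hello \\\n  world")  # accepted: backslash, then newline

try:
    check_escapes_strict("hello \\ \n  world")  # backslash, space, newline
except ValueError as err:
    print("rejected:", err)
```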

Postel's law says "Be conservative in what you send; be liberal in what you accept." I tend to believe that, having authored a great many invalid HTML documents in my youth. =) Since we can be sure that this aberrant whitespace you disdain will only ever be a mistake, and we can discern with certainty what the intended meaning was, I agree that it meets the criteria of obviousness that TOML aims for.

However, I think your proposed wording is a bit overmuch, so I'll propose something more minimal in a moment.

mojombo on 3 Jan 2017

@mojombo I would recommend being careful with Postel's "law". With years of hindsight, we now know that being as strict as possible, in the spirit of "fail early", helps build a more robust and stable ecosystem (see also The Harmful Consequences of Postel's Maxim, which I tend to agree with).