Nim: [RFC] String null terminator should be inaccessible.

Created on 22 Mar 2017  路  14Comments  路  Source: nim-lang/Nim

The purpose of this RFC is to change current Nim behavior allowing to refer to null terminator of a string. E.g. the following code is currently valid:

var s = ""
assert(s[0] == '\0') # We can refer to element at index 0 even though it exceeds string len
s[0] = 'a' # We can even change the null terminator which may potentially have bad consequences

Such logic is justified by the fact that different parsing/scanning algorithms are handier to implement relying on the null-terminator, even though its still possible to implement the same logic through index vs length checks.

The proposal is to raise index errors in cases when null terminator is referred just like it is done for indexes that exceed string len, or like it is done for seq types.

The change itself is trivial, however there are quite a few places where current behavior is relied upon. So all of these places will have to be adjusted before the change.

Pros: consistent and more intuitive behavior.
Cons: breaking change.

Thumbs up?

RFC

Most helpful comment

After some discussion with @Araq, the null-termitator deprecation direction is approved, and we can start moving in this direction. My plan is:

  1. No longer rely on the old behavior when str[str.len] == '\0' in new code.
  2. Introduce charAtIndexOrNull (we might pick a better name though) in strutils module (we might pick a better place though).
  3. Gradually patch Nim and stdlib code relying on the old behavior, with small PRs.
  4. When all of the above is done, switch the default behavior of [] operator to raise outOfBounds, introducing a -d:allowStringNullTerminatorAccess flag for those who need more time to migrate.

Thumbs up?

All 14 comments

Originated from #5592

This behaviour is very nice for parsers so I'm afraid it's a 馃憥 from me.

IMO, the fact that this behavior is nice for parsers doesn't justify it. This behavior can easily be emulated specifically for parsers like so:

proc `{}`(s: string, i: int): char =
  # returns char or `\0` if at end. raises if out of bounds.
  if i == s.len: return '\0'
  return s[i]

proc parse(s: string) =
  var i = 0
  while true:
    let c = s{i} # Special op for using in parsing algorithms.
    # ...

Needless to say that a lot of parsing algorithms can be changed to not use this behavior at all without sacrificing in readability.

This behavior can easily be emulated specifically for parsers like so:

Of course it can but that's slower and produces more code than is necessary.

But I'm torn on this.

The current behaviour makes sense to me. My rationale is that all strings in Nim are delimited by a \0 for easy C interop, and because of this even an empty string is delimited by a \0.

C interop doesn't require null terminator to be accessible from nim side. C interop should not affect Nim's semantics if it is possible not to. Nim has to be a standalone language done right, not dictated by legacy state of affairs.

Null terminator is just an implementation detail that should not be considered and heavily relied upon. It also makes strings potentially unsafe allowing to overwrite it and then passing to C side. Implementing this RFC I would remove the null terminator mention from the manual, leaving a tiny statement that "string conversion to cstring is zero cost for native targets (because somewhere deep under the hood strings are null terminated :)"

Null terminators make indexing behavior inconsistent with seq/array indexing which adds another tiny detail to the language.

FWIW, Facebook are experimenting with removing the null terminator from their C++ string library (and creating it on demand when the string must be passed to a C library):

https://www.reddit.com/r/rust/comments/5732kh/the_strange_details_of_stdstring_at_facebook/

They do experiments like this in order to get tiny performance improvements, but It's an interesting idea nevertheless.

From an implementation design perspective, I agree with @yglukhov - it feels like a leaked implementation detail.

Of course it can but that's slower and produces more code than is necessary.

@Araq, in fact, it may be implemented as efficiently as current indexing:

proc `{}`(s: string, i: int): char {.inline.} =
  # returns char or `\0` if at end. raises if out of bounds.
  if i > s.len: raise IndexError
  return cstring(s)[i] # No check here.

That's it.

Probably the name should be a bit more scary (like charAtIndexOrNull) instead of nice {}.

After some discussion with @Araq, the null-termitator deprecation direction is approved, and we can start moving in this direction. My plan is:

  1. No longer rely on the old behavior when str[str.len] == '\0' in new code.
  2. Introduce charAtIndexOrNull (we might pick a better name though) in strutils module (we might pick a better place though).
  3. Gradually patch Nim and stdlib code relying on the old behavior, with small PRs.
  4. When all of the above is done, switch the default behavior of [] operator to raise outOfBounds, introducing a -d:allowStringNullTerminatorAccess flag for those who need more time to migrate.

Thumbs up?

Can be closed now?

Todo:

[x] Update JS codegen.
[x] Update the VM.
[x] Update the tutorials / manual.

Now complete. Closing.

Was this page helpful?
0 / 5 - 0 ratings