Nim: Add byte-array-to-string conversion to system module

Created on 25 Jun 2020  路  8Comments  路  Source: nim-lang/Nim

Summary

The standard library could use a function to convert an openarray[char] or openarray[byte] to a string.

Description

Converting between byte arrays and strings is pretty common, in my experience. It's possible to cast from a seq[char] or seq[byte] to a string using cast[string](...), and the compiler allows you to do the same with an openarray, but at runtime you get a garbage string object. (It appears that the initial bytes of the array get reinterpreted as the string's length and address.) Here's an example.

This seems rather dangerous: an idiom that works in one case fails in another, but subtly enough that it might be overlooked at first. Results I've observed have been either a several-megabyte string of garbage, or an out-of-memory exception (when Nim tries to allocate a copy of a terabytes-long string.) I'm guessing that mutating the string might lead to memory corruption ... and it might be possible to craft a byte array that overwrites specific memory addresses, making this a potential security exploit.

I realize the cast operator is clearly documented as dangerous! The problem here isn't that using it causes undefined behavior, rather that (a) it's the only simple way to convert an array to a string, and worse, (b) it works for some types of arrays but not others.

Proposal

Add some toString(...) procs to the system module:

proc toString(chars: openarray[char]): string
proc toString(bytes: openarray[byte]): string

($ would be a better name, but that operator is already defined in dollars.nim, and has a different purpose.)

Alternatives

The only way I've found is to create an empty string and copy the bytes:

proc toString(bytes: openarray[byte]): string =
  result = newString(bytes.len)
  copyMem(result[0].addr, bytes[0].unsafeAddr, bytes.len)

It's a fine solution, but we shouldn't make developers use highly unsafe language features just to perform a common operation! Instead, that should be the implementation of the standard library function.

Feature Stdlib

Most helpful comment

I agree with the general idea (common cases like this should not require a cast for reasons you've already mentioned); but I'd do it a bit differently than what you're suggesting.

  • not in system but in some low level module with no imports that can be imported by other modules; it could be either system/dollar.nim (yuck since system) so I'm leaning towards a new std/strings low-level module that other modules (including user code) can import
  • it should be compatible with https://github.com/nim-lang/RFCs/issues/191 in particular would take var result
  • see also https://github.com/nim-lang/Nim/pull/13792 which has some relevance to this
  • compiler/strutils2.nim could be merged with it

All 8 comments

I agree with the general idea (common cases like this should not require a cast for reasons you've already mentioned); but I'd do it a bit differently than what you're suggesting.

  • not in system but in some low level module with no imports that can be imported by other modules; it could be either system/dollar.nim (yuck since system) so I'm leaning towards a new std/strings low-level module that other modules (including user code) can import
  • it should be compatible with https://github.com/nim-lang/RFCs/issues/191 in particular would take var result
  • see also https://github.com/nim-lang/Nim/pull/13792 which has some relevance to this
  • compiler/strutils2.nim could be merged with it

This would be really useful.

Ideally, it should be possible to get an immutable string from a openArray[byte] (and a way to get an immutable openArray[byte] from a string) without having to copyMem the datas.

I propose this solution in std or fusion:

func toByteSeq*(str: string): seq[byte] {.inline.} =
  ## Converts a string to the corresponding byte sequence.
  @(str.toOpenArrayByte(0, str.high))

func toString*(bytes: openArray[byte]): string {.inline.} =
  ## Converts a byte sequence to the corresponding string.
  let length = bytes.len
  if length > 0:
    result = newString(length)
    copyMem(result.cstring, sequence[0].unsafeAddr, length)

Your inline here is cute, it's not like that the calling overhead of X nano seconds makes a difference for a memory allocation plus copy operation... The real problem here is insufficient design, if you used string for binary data just stick with it, don't convert it to seq of bytes back and forth...

Yes, sometimes I need this.
For examples, nimcrypto only provides openArray[byte] interface, I need to convert this to string.

proc randomString*(size = DefaultEntropy): string {.inline.} =
  ## Generates a new system random strings.
  result = size.randomBytesSeq.fromByteSeq

And sometimes I need to debug openArray[byte] or for tests, I want to see the pretty format for openArray[byte] and I need stream interface.

    let 
      serialize = [0'u8, 0, 2, 0, 8, 1, 71, 174, 20, 1, 99, 99, 0]

    var str = fromByteSeq(serialize)
    let 
      strm = newStringStream(str)

    discard strm.readFrameHeaders

IMO, the functionality is needed.

When you use multiple external API, there are cases where you need one string and other where you need seq[byte].

My issue with the proposed solution :

  • It duplicates datas.
  • Having to allocate + copyMem every time is costly.
  • Ideally, there could very well be some APIs that store binary 16bit datas as seq[uint16] (not sure about this since casting seq between type of different sizeof is tricky).

Something like this (or whatever syntax is best) :

var buffer = newBuffer(bufsize)

with buffer as seq[uint16]:
  # ...
with buffer as seq[byte]:
  # ...
with buffer as string:
  # ...

relevant but doesn't close this issue: https://github.com/nim-lang/Nim/pull/15951

Yes, I realised after thinking about it I was a bit too focused on my own point of view.

What I've done instead is create the Nimble package bytesequtils that covers my immediate (and personal) needs. But of course, anybody is welcome to use it if they find it useful.

If it's still relevant, it can be merged in the future once progress have been made on a standard or fusion solution.

For reference : https://github.com/Clonkk/bytesequtils

Was this page helpful?
0 / 5 - 0 ratings