Crystal: Move to_<num_type> from String to Slice

Created on 15 Jul 2018  路  16Comments  路  Source: crystal-lang/crystal

While working on a terminfo parsing library, where I manipulate Slice of bytes (Bytes type), I needed to convert digits to an actual number.
I realized that all the to_<num_type> conversion methods (like to_i64, to_u8, to_f, ..) are in String, and I needed to allocate a String just to convert the bytes to a number.

I think thoses methods should be available in Slice too. The number conversion methods don't need anything from String as they directly work with a pointer and a size. They can almost directly be copy/pasted to Slice!

I have a working branch with this change, I can make a PR when it's accepted!

Note: the only problem I see here is that those methods will now exists for all kind of Slice, not only Slice(UInt8).. But maybe it's ok? or maybe we can add a macro check inside the to_<num_type> methods that raise when T != UInt8 ?
Another possibility would be to make Bytes a new type (not just an alias) that inherits from Slice(UInt8) so it's easier to add methods that works specifically on bytes.

WDYT?

feature discussion stdlib

Most helpful comment

Crystal strings are always UTF-8 encoded, so when it comes to number conversion they work in ASCII so to say. You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

We should rather try to find an allocation free way to turn a Bytes into a String assuming the Bytes is valid ASCII or UTF-8 already. Or we could support ASCII number conversion as part of the IO::ByteFormat interface I guess. As a last resort I would extract the number conversion methods to static methods that work with a Bytes, but remain on String or a new util class whose name makes it clear that it's about string based number conversion.

All 16 comments

Crystal strings are always UTF-8 encoded, so when it comes to number conversion they work in ASCII so to say. You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

We should rather try to find an allocation free way to turn a Bytes into a String assuming the Bytes is valid ASCII or UTF-8 already. Or we could support ASCII number conversion as part of the IO::ByteFormat interface I guess. As a last resort I would extract the number conversion methods to static methods that work with a Bytes, but remain on String or a new util class whose name makes it clear that it's about string based number conversion.

I agree that String.unsafe_from_slice is probably a better solution.

Unfortunately, it can't be implemented because strings are layed out differently from every other class. Strings don't have a pointer to their data, they have their data directly after their class in memory, so it's impossible to turn a Bytes into a String without allocating.

I think that String.to_i(slice, ...) is the best interface thats workable.

I wouldn't mind having to_i, etc., directly in Slice.

Having to_i on Slice is only logic for Slice(UInt8), right @asterite ? What do you think should be done for a Slice(Person) ?

Give an error. We already have methods on slice that only work for byte slices. And byte slices are the most common slice type (in fact, but I would happily change Slice(T) to a non generic Bytes type and remove the Slice type from the std, or maybe keep the two but return Bytes instead of Slice(UInt8) in all cases).

Ok, that's what I thought too. And I was also thinking about a Bytes type which inherits from Slice(UInt8) and would have Bytes-specific methods like to_* methods.

Can't inherit from struct. They would be two different types with a few similar methods.

oh right, then maybe make it as a wrapper like:

struct Bytes
  def initialize(@raw : Slice(UInt8))
  end

  forward_missing_to @raw

  def to_i(...)
    # ...
  end
end

A little copying is better than a little dependency. I won't get tired of saying that.

just put it on string

@asterite by "a little copying" you mean the 550 lines of Slice ?

It's bit less than that, because there's hexdump and others in Slice that should only belong to Bytes.

Maybe adding such methods to Slice but only having them compile for a slide of bytes isn't that bad, though.

I guess this could be a nice use case for extension methods as described in #6438

it's easy enough to throw a {% raise unless T == UInt8 %} in the aforementioned methods and you'll hardly be able to tell the difference

@jhass what do you mean in this paragraph:

You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

By there are many more ways to encode a number to a byte sequence ? What other ways? (I'm probably just not seeing it)

Note that in this case to_<num-type> methods works with raw bytes directly, interpreting each byte as an ASCII char.

@bew A random example would be BCD, the main point is that it would be confusing for to_i to be to_i_assuming_this_is_really_ascii_here.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ArthurZ picture ArthurZ  路  3Comments

relonger picture relonger  路  3Comments

Sija picture Sija  路  3Comments

oprypin picture oprypin  路  3Comments

TechMagister picture TechMagister  路  3Comments