Since Bytes is just an alias for Slice(UInt8), it lacks a lot of methods you'd usually want for byte strings / byte arrays. One of these is concatenation. The following doesn't work:
```crystal
bytes1 = Bytes[0x43, 0x72, 0x79, 0x73]
bytes2 = Bytes[0x74, 0x61, 0x6c, 0x21]
combined = bytes1 + bytes2
```
The same of course goes for repetition with *.
Not being able to concatenate bytes makes handling binary data a nightmare (and even makes some things impossible without using .to_unsafe), so I'd say it's a good feature to implement.
This is by design. Slice represents a pointer and a size. It's very often just a window into memory that you don't own, so it's impossible to resize or append to slices. You could implement + and * because they return new slices, i.e. they copy the data onto the heap, but that really goes against the idea of slices in many ways.
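For illustration, such a copying concatenation could be written as a standalone helper. This is a sketch, not stdlib API; `concat` is a hypothetical name:

```crystal
# Hypothetical helper, not part of the stdlib: concatenating two slices
# has to allocate a fresh heap buffer and copy both operands into it.
def concat(a : Bytes, b : Bytes) : Bytes
  result = Bytes.new(a.size + b.size)
  a.copy_to(result)          # copy a into the start of result
  b.copy_to(result + a.size) # Slice#+(offset) gives a shifted sub-slice
  result
end
```

This is exactly the heap copy described above, which is why it sits uneasily with Slice's "window into memory" design.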
Can you explain your use case more? Where do you find yourself using to_unsafe? I haven't ever found myself reaching for these methods. I'm almost sure that what you actually want, instead of implementing + and * for Slice, is to use IO.
@obskyr See my extended answer on SO for an example of how to use an IO for this.
Sure! I posted a question on SO about it, but my example seems to have been simple enough as to cause confusion.
Basically, I've written a decoder for a binary data format. The data is split up into chunks of different types, and there's no way of knowing in advance how long the data will be. This means I have to read the data progressively, appending to the data I already have as I go along. What I'd like to do is something along the lines of this:
```crystal
def decode(file)
  # No idea how many bytes we're gonna end up with,
  # so I can't set this to a known size.
  data = Bytes.new(0) # ...Or just "Bytes.new" or something.
  chunk_type = file.read_byte.not_nil!
  until chunk_type == 0
    case chunk_type
    when 1
      data += read_chunk_type_1(file)
    when 2
      data += read_chunk_type_2(file)
    end
    chunk_type = file.read_byte.not_nil!
  end
end
```
```crystal
def read_chunk_type_1(file)
  length = file.read_byte.not_nil!
  data = Bytes.new(length)
  file.read(data)
  return data
end

# ...
```
However, since bytes can't be concatenated, what I've instead been doing is using Array(UInt8) and eventually calling Bytes.new(data.to_unsafe, data.size * sizeof(UInt8)) on that.
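That workaround might look like the following sketch (assuming chunks arrive as Bytes; the values here are just placeholders):

```crystal
# Sketch of the Array(UInt8) workaround described above.
data = [] of UInt8
chunk = Bytes[0x43, 0x72] # a chunk as it might come from read_chunk_type_1
chunk.each { |byte| data << byte }
# The resulting slice points into the array's internal buffer,
# so the array must not be resized after this.
bytes = Bytes.new(data.to_unsafe, data.size)
```

It works, but the to_unsafe step is exactly the kind of thing the question is about avoiding.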
Is using IO the correct way to go about this, perhaps?
Yeah, in that case I'd use IO::Memory and IO.copy to build up the data.
```crystal
def decode(io)
  data = IO::Memory.new
  loop do
    chunk_type = io.read_byte
    case chunk_type
    when 1
      read_chunk_type_1(from: io, to: data)
    when 2
      # ...
    when nil
      raise "Unexpected EOF in chunk type header"
    when 0
      break
    end
  end
end
```
```crystal
def read_chunk_type_1(from, to)
  length = from.read_byte
  raise "Unexpected EOF reading chunk type 1 length header" unless length
  copied_bytes = IO.copy(from, to, length)
  raise "Unexpected EOF in chunk type 1" unless copied_bytes == length
end
```
or similar
You can use data.to_slice to get a slice out of an IO::Memory. Also, even in your above example, you probably want to use file.read_fully.
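A minimal sketch of that pattern (illustrative values):

```crystal
# Build up data incrementally in an IO::Memory, then take a slice of it.
io = IO::Memory.new
io.write Bytes[0x43, 0x72, 0x79, 0x73]
io.write Bytes[0x74, 0x61, 0x6c, 0x21]
bytes = io.to_slice # all the accumulated data as a single Bytes
```

IO::Memory handles the growing buffer for you, which is what manual concatenation would otherwise be doing by hand.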
You're right, I should indeed be using read_fully. Based on the name I assumed it'd read the entire file, but I suppose that's not the case.
The only thing I'm missing now is being able to do things like:
```crystal
bytes_to_repeat = Bytes.new(length)
file.read(bytes_to_repeat)
return bytes_to_repeat * times_to_repeat
```
But I suppose I'll just have to do the slightly more verbose:
```crystal
bytes_to_repeat = Bytes.new(length)
file.read(bytes_to_repeat)
times_to_repeat.times do
  IO.copy(bytes_to_repeat, io)
end
```
@obskyr I've never come across wanting to send the same data multiple times, it seems pretty wasteful and an edge case so having it be a little bit more code is fine.
You probably want io.write instead of IO.copy though, since bytes_to_repeat isn't an IO.
> I've never come across wanting to send the same data multiple times
That's the point - it's a file format that uses RLE here and there, so when decoding (not encoding to send/store) it you have to repeat strings every now and then.
Ah, I see. Yeah, your best bet is to read the part which is RLE encoded into a side-buffer and write it back out however many times is needed. It's a bit more code but it's probably pretty rare. Especially considering that RLE is hardly state of the art in compression these days.
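A sketch of that side-buffer approach (hypothetical names; `output` is wherever the decoded data is being accumulated, e.g. an IO::Memory):

```crystal
# Decode one RLE run: read `length` bytes once, write them out `repeat` times.
def decode_rle_run(input : IO, output : IO, length : Int32, repeat : Int32)
  buffer = Bytes.new(length)
  input.read_fully(buffer) # raises IO::EOFError on a short read
  repeat.times { output.write(buffer) }
end
```

This combines the earlier corrections: read_fully instead of read, and output.write instead of IO.copy, since the side buffer is a Bytes rather than an IO.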
Can I close this issue?
You can't close me, I quit!
Yes, my use case for this is solved, at least. I do think the documentation could be a bit more informative about this - perhaps the docs for either IO#read or Bytes could link to IO::Memory?
I think it'd be possible to mention IO::Memory in the docs for IO. However, covering specific use cases and common workflows sounds like something for more long-form tutorials to me.
I think this might be fairly closely tied to reading binary data in general, which goes beyond a specific use case. Maybe, maybe not - the info I needed wasn't in the places I looked, at least. At least this conversation will show up on Google for "crystal read binary" now!
I think the answer is StackOverflow. Now if someone stumbles upon this problem, even using your exact same words ("concatenate bytes") they will find the answer.
I mean, whenever I have a problem or doubt I ask Google and I get an answer, usually on StackOverflow, not in language docs (well, sometimes language docs, like when I search for a specific type or method). What's good about StackOverflow is that it's like a community wiki and it can grow without having to modify the language source code. So I prefer that, and over the years it has proven to be the way to document these things and relations.
That's true. But the API needs improvements as well, obviously. =)