Chapel: Should bytes be a mutable type or not?

Created on 12 Aug 2019 · 28Comments · Source: chapel-lang/chapel

We are adding a new bytes type to the language. The bytes type is very similar to the string type; the major difference is that bytes has no encoding, i.e., it can store arbitrary bytes.

Another potential difference is mutability. The string type is immutable in-place. One school of thought is to make the bytes type immutable as well, to keep it as similar to string as possible. Another option is to make it mutable in-place so that it can serve a wider range of use cases.

Python has both bytes and bytearray types, where the former is immutable and the latter is mutable. As of today, the bytes implementation is in the internal modules (although we are still working on its integration into the language) as an in-place immutable type similar to string.
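
For reference, a minimal Python sketch of the bytes/bytearray split mentioned above. It only illustrates Python's behavior and says nothing about how Chapel's bytes is or will be implemented; the variable names are made up for illustration.

```python
# Python's two byte-sequence types: bytes is immutable, bytearray is mutable.
frozen = b"some bytes"
try:
    frozen[0] = 83  # bytes does not support item assignment
    bytes_assignment_failed = False
except TypeError:
    bytes_assignment_failed = True

buf = bytearray(b"some bytes")
buf[0] = ord("S")  # in-place update of a single byte
buf += b"!"        # in-place append
```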

e-kayrakli

All 28 comments

My 2 cents on the issue: I like bytes type to be as similar to strings as possible but having bytes type encoding free and mutable is not horrifically different IMHO.

That being said, I am not 100% comfortable with this and making it immutable seems "safer" and less confusing at a high level. I am not sure what kind of use cases are out there where bytes mutability (or string's immutability as opposed to bytes' mutability) can throw people off.

A relevant read is Python's PEP 3137.

e-kayrakli on 12 Aug 2019

I find myself wondering if we could select between the two using either:

const b: bytes; vs. var b: bytes;
[const | var] b: bytes(mutable=[true | false]);

I think the downside to the first approach is that we have const and var strings even though all strings are immutable, so the const vs. var essentially says whether or not the variable can point to a different string buffer (I think?). So that probably doesn't fly for bytes.

I think the multi-byte/variable-width characters of UTF-8 are a main reason not to make strings mutable. Since bytes have fixed-width elements, I'm not sure of reasons to make them never be mutable.

We could just use var b: [1..n] uint(8); to get an array of "bytes", but my impression is that in some of our programs (e.g., the CLBG entries?), there have been opinions in the past that a mutable bytes variable would clean up the code significantly (presumably because the operations on the variables are more string-like than array-like?).

bradcray on 12 Aug 2019

I don't think bytes should be mutable other than +=

I think the multi-byte/variable-width characters of UTF-8 are a main reason not to make strings mutable. Since bytes have fixed-width elements, I'm not sure of reasons to make them never be mutable.

Here is my reasoning.

One day, I hope that string and bytes can use reference counting (at least for relatively large strings). This kind of thing is sometimes referred to as a "rope" implementation (e.g. https://en.wikipedia.org/wiki/Rope_(data_structure)), but I mean something more general than what is exactly described there. If we make them mutable, we cannot share the same data buffer between two string or bytes values.

E.g.

var x = computeSomeGiganticBytes();
var y = x; // this could use reference counting to share the data
y += computeSomeOtherBytes(); // this can append by keeping a list of pointers to data
                              // similar to the `list` we are building

So I'd say my concern has to do with the future ability to change algorithms if trying them shows that they significantly improve performance.
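
A rough Python sketch of the sharing idea described above, under the assumption that the data lives in immutable chunks. The class name ChunkedBytes and its methods are hypothetical and are not part of Chapel or of any proposal in this thread; a real rope would also balance and flatten chunks.

```python
class ChunkedBytes:
    """Hypothetical chunk list: copies share chunks, += adds a chunk."""

    def __init__(self, chunks=()):
        # Each chunk is an immutable bytes object, so sharing it is safe.
        self._chunks = tuple(chunks)

    @classmethod
    def from_bytes(cls, data):
        return cls((bytes(data),))

    def copy(self):
        # Only the tuple of references is duplicated, never the chunk data.
        return ChunkedBytes(self._chunks)

    def __iadd__(self, other):
        # Appending builds a new chunk list and reuses the existing chunks.
        return ChunkedBytes(self._chunks + other._chunks)

    def to_bytes(self):
        return b"".join(self._chunks)


x = ChunkedBytes.from_bytes(b"gigantic payload")
y = x.copy()                                 # shares x's data
y += ChunkedBytes.from_bytes(b" plus more")  # appends without copying x's data
shares_first_chunk = y._chunks[0] is x._chunks[0]
```

If in-place element writes were allowed on such a value, a write through y would also be visible through x unless the chunk were copied first, which is the cost the comment is worried about.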

mppf on 12 Aug 2019

@mppf: Couldn't a compile-time distinction between mutable and immutable bytes support the optimization above for the immutable cases without precluding the creation of mutable bytes variables?

bradcray on 12 Aug 2019

@bradcray - I want to be able to append with += to the normal bytes or string type and still have the "rope"-y implementation. I don't think it makes sense for += to do anything to a const variable.

mppf on 12 Aug 2019

What if var b: bytes; was equivalent to var b: bytes(mutable=false); yet we permitted people to declare var b:bytes(mutable=true); and only applied your optimization in the former cases?

bradcray on 12 Aug 2019

@bradcray - I view bytes(mutable=true), when that is not the default, as philosophically the same as adding a mutableBytes type (just with a different name). I have no problem with having a separate type that is mutable.

Are you imagining that one could += on a var b: bytes(mutable=false) (I think you are?)

mppf on 12 Aug 2019

👍1

I view bytes(mutable=true), when that is not the default, as philosophically the same as adding a mutableBytes type (just with a different name).

I agree they're equivalent; I think the benefit is that there's no need to come up with (and potentially reserve?) a new type name. mutableBytes or bytesArray is also a bit more of a mouthful (and if considered built-in, would be the first camelCase built-in type I think?)

Are you imagining that one could += on a var b: bytes(mutable=false) (I think you are?)

Sure. I think the ability to modify bytes using subscripting was the main benefit that mutability would give to CLBG codes (though @e-kayrakli and/or @benharsh may know better).

bradcray on 12 Aug 2019

A separate type has seemed to be the solution to the problem since the beginning. I think I like bytes(mutable=true) slightly better than mutableBytes or byteArray (or probably any other name). A user can define a type alias if they want, anyway.

Sure. I think the ability to modify bytes using subscripting was the main benefit that mutability would give to CLBG codes (though @e-kayrakli and/or @benharsh may know better).

I don't know others off hand but knucleotide is doing in-place case changes IIRC.

e-kayrakli on 12 Aug 2019

Following from my previous comment, one thing that we should consider is the behavior of the toUpper, toLower, etc. family of methods. As of now, they return new bytes/string objects. I (rather strongly) think that if they are called on a mutable bytes object, they should return a new mutable bytes object.

But providing alternatives such as convertToUpper that make the change in-place sounds like a good idea to me, both in terms of elegance and performance. Another interface alternative is to provide overloads with an inPlace=false argument that can be passed to mutable bytes only. I slightly prefer a new set of methods.

p.s. Python's bytearray.upper() method also returns a new bytearray object. I don't see an in-place alternative, though.
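
A small Python sketch of that distinction. bytearray.upper() is real; upper_in_place is a hypothetical helper written here only to show what an in-place counterpart (in the spirit of the convertToUpper idea) might look like for ASCII letters.

```python
buf = bytearray(b"acgt")
upper_copy = buf.upper()  # returns a new bytearray; buf is unchanged
returned_new_object = upper_copy is not buf
original_untouched = bytes(buf) == b"acgt"


def upper_in_place(b: bytearray) -> None:
    """Hypothetical in-place variant: uppercase ASCII letters in b itself."""
    for i, c in enumerate(b):
        if 0x61 <= c <= 0x7A:  # 'a'..'z'
            b[i] = c - 0x20    # same length, so no reallocation is needed


alias = buf           # second reference to the same buffer
upper_in_place(buf)   # the change is visible through alias as well
```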

e-kayrakli on 13 Aug 2019

I agree that toUpper and toLower should behave consistently for bytes and strings whether the bytes type is mutable or not. I remain confused after last week's meeting about the assertion then that it is in-place for strings and follow-up comments that it's not. I.e., do we want it to be in-place and it just didn't happen to get implemented that way the first time around?

bradcray on 13 Aug 2019

@bradcray - the user-facing API is that it's not in place (it returns a new string) but the implementation currently does it in place.

(Edit: FWIW, I think having them not in-place is the right move)

mppf on 13 Aug 2019

Does that mean that the implementation is incorrect, or that it's correct and doing it in-place, simply on a newly copied string rather than the original one? (and in either event, does this mean that the German character problem isn't really a problem we need to worry about—that we just need to allocate additional bytes when we copy the string for such characters?)

bradcray on 13 Aug 2019

The current implementation creates the copy of this and iterates over the copy character by character (or byte by byte for bytes) and changes them in the copy's buffer in-place.

The German character problem is a problem. It wouldn't be if these methods were starting from an empty string and appending character by character.
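
To make the issue concrete, here is a Python illustration, assuming the "German character problem" refers to characters such as the sharp s whose uppercase form has a different length. Python is used only because its str.upper() applies full Unicode case mappings; the variable names are illustrative.

```python
sharp_s = "\u00df"       # German sharp s
upper = sharp_s.upper()  # Python maps it to "SS"
codepoints_before, codepoints_after = len(sharp_s), len(upper)

dotless_i = "\u0131"     # Latin small letter dotless i
bytes_before = len(dotless_i.encode("utf-8"))
bytes_after = len(dotless_i.upper().encode("utf-8"))
```

In the first case the number of codepoints grows from 1 to 2, and in the second the UTF-8 encoding shrinks from 2 bytes to 1, so overwriting characters one by one in a buffer sized for the original string cannot be correct in general.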

e-kayrakli on 13 Aug 2019

Just to clarify I don't think we should change anything about toUpper etc and the mutable bytes objects should also not change this. However, providing some in-place alternatives only for the mutable bytes type seems appealing.

e-kayrakli on 13 Aug 2019

The German character problem is a problem. It wouldn't be if these methods were starting from an empty string and appending character by character.

Right, this is getting a bit off-topic here, but I think that is an implementation bug and should be fixed.

mppf on 13 Aug 2019

While playing around with the bytes implementation and how we can achieve this, I have three following alternatives:

1. Add a set method. Give a compiler error if mutable==false.
   We may need this because bytes.this returns a bytes object. Therefore, supporting mutable bytes is probably not as easy as just adding a ref version of proc this.
2. Add a proc this() ref that returns another mutable bytes object which shares the original object's buffer. And add a proc =(b: bytes(mutable=true), i: uint(8)) that can only be used for 1-length bytes objects.
   This sounds like jumping through a bunch of hoops.
3. Try to see if we can have something like a bytesView object that is returned by proc this() and proc this() ref.
   This may be the most principled approach, but it may be overkill. It can also support idioms like the following (1 can also do this, I think; it is more difficult for 2):

var b = b"some bytes";
b[1..4] = b"no"; //note the length change

I am leaning towards 1 because of its simplicity. Also, one can argue that not supporting b[4] = 65; even for mutable bytes avoids possible confusion.
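
A minimal Python model of alternative 1, as a sketch only. The class Bytes, its mutable flag, and to_bytes are hypothetical stand-ins; in Chapel the check could be a compile-time error, whereas this model can only check at run time, and it uses Python's 0-based indexing.

```python
class Bytes:
    """Hypothetical model of bytes(mutable=...); not Chapel's implementation."""

    def __init__(self, data: bytes, mutable: bool = False):
        self._buf = bytearray(data)
        self.mutable = mutable

    def set(self, i: int, val: int) -> None:
        # Stand-in for the compiler error when mutable == false.
        if not self.mutable:
            raise TypeError("set() requires bytes(mutable=true)")
        self._buf[i] = val

    def __getitem__(self, i: int) -> "Bytes":
        # Indexing yields a one-element Bytes value (a copy), so assigning
        # through it could not modify the original; hence the set method.
        return Bytes(bytes(self._buf[i:i + 1]), self.mutable)

    def to_bytes(self) -> bytes:
        return bytes(self._buf)


m = Bytes(b"some bytes", mutable=True)
m.set(0, ord("S"))

frozen = Bytes(b"some bytes")  # immutable by default
try:
    frozen.set(0, ord("S"))
    rejected = False
except TypeError:
    rejected = True
```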

e-kayrakli on 23 Aug 2019

bytes.this returns a bytes object.

I would've expected bytes.this(i: integral) to return a uint(8) (or ref-to-uint(8)) since it's the closest thing we have to a byte. I think the main reason that this is not the case for strings is (a) because the size of the return type could be variable, (b) to get null-termination (?).

I think bytes.this(r: range) is a different can of worms and would imagine taking the bytesView-style approach.

bradcray on 24 Aug 2019

I think the main reason that this is not the case for strings is (a) because the size of the return type could be variable, (b) to get null-termination (?).

I think it's so that the type is more self contained. string.this(i) could return an int(32) representing a codepoint.

Anyway I think bytes and string should be consistent in this regard (return integer or return same type).

I'd like to have bytesView / stringView objects that "borrow" from the original string/bytes and are checked by the lifetime checker. But I wouldn't use those for this(i: integral).

mppf on 24 Aug 2019

I think it's so that the type [string] is more self contained.

Perhaps, but I don't think that's necessarily a reason to have bytes behave the same way. Indexing into an array of t returns a t, not an array of t with one element.

My intuition is that when I'm indexing into a string, I'd typically like to get a string back that I can compare against another string (e.g., if ("brad"[2] != "r") .... Whereas when indexing into a bytes type, I don't see any particular value in getting a bytes type out rather than something byte-like (like uint(8)). It'd be consistent, but it doesn't seem particularly useful. I think of bytes as being somewhere between a string and an array in terms of capability which is why I don't think I feel the need to be strictly consistent in this regard.

bradcray on 26 Aug 2019

@bradcray

My intuition is that when I'm indexing into a string, I'd typically like to get a string back that I can compare against another string (e.g., if ("brad"[2] != "r") .... Whereas when indexing into a bytes type, I don't see any particular value in getting a bytes type out rather than something byte-like (like uint(8)). It'd be consistent, but it doesn't seem particularly useful. I think of bytes as being somewhere between a string and an array in terms of capability which is why I don't think I feel the need to be strictly consistent in this regard.

It would be nice to check if a path (that is of bytes type) is absolute by doing myPath[0] == b"/". I am personally leaning more towards bytes.this(i) returning another bytes.

@mppf

I'd like to have bytesView / stringView objects that "borrow" from the original string/bytes and are checked by the lifetime checker. But I wouldn't use those for this(i: integral).

Then, how do you imagine myBytes[3] = b" " or myBytes[3] = 32 would work if bytes.this returns a bytes?

e-kayrakli on 26 Aug 2019

I think it's important for bytes and string to support the same API where possible. People will be toggling code between them a lot (at least if Python 3 experience is any indication).

mppf on 26 Aug 2019

👍2

Then, how do you imagine myBytes[3] = b" " or myBytes[3] = 32 would work if bytes.this returns a bytes?

I'd just make it mutable with a set method.

mppf on 26 Aug 2019

I'd just make it mutable with a set method.

If we decide to add a set method, would you be fine with the following overloads?

bytes.set(i: integral, val: uint(8))
bytes.set(i: integral, val: bytes) // val.length is not necessarily 1
bytes.set(r: range, val: bytes) // val.length is not necessarily r.size

In which case, I would see the bytesView/stringView approach as a separate and more general support for zero-copy (potentially) mutable and immutable bytes/string slices
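
As a point of comparison, Python's memoryview behaves roughly like the zero-copy view described here. This is only an analogy for the bytesView/stringView idea, not a proposed design, and it has no counterpart to lifetime checking.

```python
buf = bytearray(b"some bytes")
view = memoryview(buf)[5:10]  # zero-copy, writable view of b"bytes"
view[0] = ord("B")            # writes through to buf

try:
    view[0:2] = b"xyz"        # a view cannot change the length
    resize_rejected = False
except ValueError:
    resize_rejected = True

ro_view = memoryview(b"some bytes")[0:4]  # view over immutable bytes
ro_is_readonly = ro_view.readonly
ro_contents = bytes(ro_view)

view.release()  # drop the borrow so buf can be resized again
```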

e-kayrakli on 26 Aug 2019

FWIW, this discussion is cooling any enthusiasm I had for trying to make bytes mutable before the 1.20 release (so don't feel a need to rush into it due to my original request).

bradcray on 26 Aug 2019

If we decide to add a set method, would you be fine with the following overloads?

I'm not sure exactly what you're asking here... but I'll try to answer.

bytes.set(i: integral, val: uint(8))

This is what I would expect, at a minimum.

bytes.set(i: integral, val: bytes) // val.length is not necessarily 1

I'd leave this out, unless it were to assert that val's length is 1 somehow (halt/throw/compiler error).

bytes.set(r: range, val: bytes) // val.length is not necessarily r.size

I think there's potential confusion about what this does if r.size != val.length. Users might think it inserts into the bytes and extends the length. But I don't think we're expecting that (are we?) So, I'd want it to be some sort of error if the size of the range didn't match the length of the bytes and the range is not unbounded. (Note I think both low-bounded and high-bounded ranges should work).
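
One possible reading of those rules, sketched in Python. The function set_range is hypothetical, and the treatment of half-bounded ranges (anchor the value at the given bound) is an assumption, since the comment does not spell it out. Python slices are half-open, unlike Chapel's inclusive ranges.

```python
def set_range(buf: bytearray, r: slice, val: bytes) -> None:
    """Hypothetical set(r: range, val: bytes) that never changes len(buf)."""
    start, stop, step = r.indices(len(buf))
    if step != 1:
        raise ValueError("only unit-stride ranges are modeled")
    if r.start is not None and r.stop is not None:
        # Fully bounded: the sizes must match exactly.
        if stop - start != len(val):
            raise ValueError("range size does not match value length")
    elif r.stop is None:
        # Low-bounded (or fully unbounded): write val starting at the low end.
        stop = start + len(val)
    else:
        # High-bounded: write val so that it ends at the high end.
        start = stop - len(val)
    if start < 0 or stop > len(buf):
        raise ValueError("value does not fit inside the buffer")
    buf[start:stop] = val  # same size on both sides, so length is unchanged


buf = bytearray(b"some bytes")
set_range(buf, slice(0, 4), b"SOME")   # bounded, sizes match
set_range(buf, slice(5, None), b"BY")  # low-bounded
set_range(buf, slice(None, 4), b"xy")  # high-bounded
try:
    set_range(buf, slice(0, 4), b"A")  # size mismatch is an error
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```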

mppf on 26 Aug 2019

FWIW, this discussion is cooling any enthusiasm I had for trying to make bytes mutable before the 1.20 release (so don't feel a need to rush into it due to my original request).

I must admit that I was expecting a clearer path. :(

For what it's worth, we have the following in Python:

>>> b = bytearray(b"some bytes")
>>> b[0:4] = b'A'
>>> b
bytearray(b'A bytes')
>>> b[2:7] = b'modified bytes'
>>> b
bytearray(b'A modified bytes')
>>> b[0] = b'some'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer

So the equivalent of bytes.set(i: integral, val: bytes) is not supported in Python. But assigning bytes of a different length to a range is supported.

Going back a few comments, b"bytes"[0] returns an int in Python, so maybe I was wrong. Going even further:

>>> b"bytes"[0] == b"b"
False
>>> "string"[0] == "s"
True
>>>

(what bytes.this returns sounds like a separate discussion but would have an impact on the design discussion we are having here.)

e-kayrakli on 26 Aug 2019

I think it's important for bytes and string to support the same API where possible. People will be toggling code between them a lot (at least if Python 3 experience is any indication).

I want to reiterate this point. I imagine that someday the bytes and string type would implement a constraint so that a constrained generic could use either of them for most of the work. There might be a limited set of calls that only make sense on strings (Unicode?) or bytes, but even some straightforward things like .numCodepoints could be implemented on bytes.
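
Python's typing.AnyStr gives a rough feel for this kind of shared interface. The sketch below is only an analogy for a constrained generic in Chapel; is_absolute and strip_extension are illustrative helpers, not library functions.

```python
from typing import AnyStr


def is_absolute(path: AnyStr) -> bool:
    """Works on str or bytes by using only operations both types share."""
    sep = "/" if isinstance(path, str) else b"/"
    return path.startswith(sep)


def strip_extension(name: AnyStr) -> AnyStr:
    """Drop the text after the last dot, for either str or bytes."""
    dot = "." if isinstance(name, str) else b"."
    head, found, _ = name.rpartition(dot)
    return head if found else name
```

The only type-specific part is choosing the literal; everything else goes through methods that str and bytes both provide, which is the kind of overlap the comment argues for.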