Chapel: Bytes objects and uint(8) interactions

Created on 27 Aug 2019 · 31 Comments · Source: chapel-lang/chapel

There was some discussion under #13707 about the return type of bytes.this. It currently returns another bytes object a la string.this. @bradcray was suggesting that we may accept the difference and make bytes.this return a uint(8). It'd make mutable bytes design much clearer.

Here is a list of cases (including the one above) where we may consider making the bytes type easier to use along with uint(8) values:

bytes.this return type
Should it be uint(8) or bytes? (In Python, indexing a bytes yields an int in the range 0-255.)
proc += for bytes and uint(8)s
Currently this is not supported, again similar to strings. (Python doesn't support it either)
What does casting a uint(8) to bytes mean?
Today if you do that, you'll get behavior identical to string:

var val: uint(8) = 65;
writef("%ht\n", val:bytes);

This prints b"65" but maybe it should print b"A"?
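
For comparison, Python offers both readings of "turn this small integer into bytes" under different spellings. A minimal sketch in Python (not Chapel); the two helper names are hypothetical and exist only for illustration:

```python
def byte_as_character(val: int) -> bytes:
    """Treat val as a raw byte value, so 65 becomes b"A"."""
    return bytes([val])


def byte_as_decimal_text(val: int) -> bytes:
    """Format val as decimal digits, so 65 becomes b"65"."""
    return str(val).encode("ascii")


print(byte_as_character(65))     # b'A'
print(byte_as_decimal_text(65))  # b'65'
```

The current Chapel cast corresponds to the second helper; the question above is whether it should behave like the first.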

Language Libraries / Modules Design

e-kayrakli

All 31 comments

I'm happy with the current situation since I want bytes and strings to be easy to swap for each other, assuming the data is all ASCII.

mppf on 27 Aug 2019

👍1

I feel worried about our asymmetry with Python w.r.t. indexing and worry that Python programmers will be confused that indexing into a bytes in Chapel produces a new bytes rather than a single byte. I also keep getting confused by it myself because I keep incorrectly thinking "Oh, for this ASCII string, rather than using myString[i].toByte() I can just use myBytes[i]."

bradcray on 24 Sep 2019

👍1

It also feels unfortunate that iterating over a bytes yields single-byte bytes values rather than single integer / byte values.
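
For reference, this is how the Python precedent behaves: iterating a bytes yields integers, and a one-byte bytes value has to be requested by slicing. A short Python sketch (0-based indexing, unlike the Chapel snippets in this thread):

```python
data = b"hi"

as_ints = list(data)                                   # iteration yields ints
as_slices = [data[i:i + 1] for i in range(len(data))]  # 1-byte bytes objects

print(as_ints)    # [104, 105]
print(as_slices)  # [b'h', b'i']
```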

bradcray on 24 Sep 2019

(I thought I posted this somewhere... but I can't find it anywhere right now?)

First, note that having string/bytes.this return string/bytes enables nicer interactions with I/O; e.g. writeln("hi"[1]); prints h rather than 104 (which we would get if string.this returned an int).

Second, for bytes, we might wish for something like myBytes[1] & 1 to work (i.e. to treat the value as an integer).

One strategy that is appealing (to me anyway) is to add byte and codepoint types (not necessarily with those names) that just contain integers. Then make bytes.this/these return byte and string.this/these return codepoint. We might consider adding some/all of the following coercions:

byte would coerce to int (and presumably uint(8))
byte would coerce to bytes
codepoint would coerce to int
codepoint would coerce to string

However even without these coercions, with specific overloads, things like writeln can do the right thing with byte or codepoint.

Note that this does not help with getting a mutable ref to a byte at a particular offset in a bytes. The problem there is that the corresponding operation is not possible for a UTF-8 string. (If bytes and string are to be consistent, that just means that this shouldn't do it; there could be a different bytes-only method that does).
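
To illustrate why an in-place single-byte write has no safe string counterpart, here is a Python sketch using bytearray as the mutable byte container. The sample character and the replacement byte are arbitrary choices for the example:

```python
buf = bytearray("é".encode("utf-8"))  # one character, two bytes: 0xC3 0xA9
buf[1] = ord("A")                     # in-place write of a single byte

try:
    bytes(buf).decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # 0xC3 must be followed by a continuation byte, and 0x41 is not one.
    still_valid = False

print(len(buf), still_valid)  # 2 False
```

The write is perfectly fine for raw bytes, but the result is no longer valid UTF-8, which is why such a method would have to be bytes-only.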

mppf on 25 Oct 2019

My latest thought on this topic was to have string and bytes behave more like Python w.r.t. default indexing and iteration (due to precedent / to avoid surprises for programmers coming from Python), but to have methods that would support printable vs. numeric accesses and iterations in order to support users to write generic code that treated the types symmetrically by using the specific iterators/accessors rather than the default ones.

bradcray on 25 Oct 2019

@bradcray was suggesting that we may accept the difference and make bytes.this return a uint(8). It'd make mutable bytes design much clearer.

Just to make sure the two statements above aren't conflated more than intended: I'd prefer for bytes indexing/iteration to yield integers even if we decided never to support mutable bytes values.

bradcray on 5 Nov 2019

I quickly asked for opinions in different contexts and wanted to add responses here:

@benharsh : leans towards this and these returning/yielding uint but can live with having a .uint or .uints type of method (which we do in bytes.byte and bytes.bytes)
@ronawho : initial response was to follow Chapel strings and return/yield bytes, but he said that he's happy to follow Python's precedent after I mentioned Python's behavior.

Please chime in if I mischaracterized your opinions.

Going back to the specific discussion here:

First, note that having string/bytes.this return string/bytes enables nicer interactions with I/O; e.g. writeln("hi"[1]); prints h rather than 104 (which we would get if string.this returned an int).

The example here uses a string, but for bytes this behavior is identical to what Python does, at the expense of having its string and bytes behave differently:

>>> print(b"hi"[1])
105
>>> print("hi"[1])
i

My latest thought on this topic was to have string and bytes behave more like Python w.r.t. default indexing and iteration (due to precedent / to avoid surprises for programmers coming from Python), but to have methods that would support printable vs. numeric accesses and iterations in order to support users to write generic code that treated the types symmetrically by using the specific iterators/accessors rather than the default ones.

I feel closest to this approach. But I don't yet have a good proposal for naming those accessors without making things confusing. Arguably they are a bit confusing already. For bytes we currently have the following:

bytes.this: returns bytes
bytes.byte: returns uint(8)
bytes.these: yields bytes
bytes.bytes: yields uint(8)

The same methods/iterators exist for string as well. So, we should probably keep bytes.byte and bytes.bytes as-is, make the default bytes.this and bytes.these synonyms for those, and add new accessors/iterators that return/yield bytes.

Inspired by Java's charAt and codePointAt for String, maybe we can have bytes.bytesAt return a bytes, but I am not sure what its iterator counterpart should be, or even whether we should have such an iterator at all.

One strategy that is appealing (to me anyway) is to add byte and codepoint types (not necessarily with those names) that just contain integers. Then make bytes.this/these return byte and string.this/these return codepoint. We might consider adding some/all of the following coercions:

byte would coerce to int (and presumably uint(8))

byte would coerce to bytes

codepoint would coerce to int

codepoint would coerce to string

I am not against this either. But I have to say that "keeping bytes and string as similar as possible" is losing its appeal to me, so I am not seeing the motivation for this personally.

I think being closer to Python's bytes has more appeal to me.
In general, I am not seeing too strong of a use case for having bytes as a drop-in replacement for string. I think the cases for both of these should be much more clear-cut than that, so that we shouldn't need almost-identical interfaces. One argument for having that is the initial transition to bytes for when we have UTF-8 validation on master. And I don't think we should be making our decisions based on that, anyway.

But then again, this could be a good compromise and I wouldn't object to it any further.

e-kayrakli on 2 Dec 2019

👍1

Reading comments on other issues, I came across one of my past comments that contradicts what I am supporting today (I guess my opinions change a lot :( ):

It would be nice to check if a path (that is of bytes type) is absolute by doing myPath[0] == b"/". I am personally leaning more towards bytes.this(i) returning another bytes.

Joking aside, I still think that it is nice to be able to do myPath[0] == b"/". But it is something that I can sacrifice (and do myPath[0] == b"/".byte(), or even better myPath.startsWith(b"/"))
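
The same trade-off already exists in Python 3, where indexing yields an integer. A small Python sketch (the path value is made up for the example):

```python
path = b"/usr/lib"

print(path[0] == b"/")        # False: an int never equals a bytes object
print(path[0] == ord("/"))    # True: compare against the byte value 47
print(path[:1] == b"/")       # True: a slice is still a bytes object
print(path.startswith(b"/"))  # True: states the intent directly
```

The first comparison silently evaluates to False rather than failing, which is the kind of surprise the indexing choice has to weigh.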

e-kayrakli on 2 Dec 2019

@e-kayrakli: W.r.t. your absolute path test pattern, it seemed to me that the consensus view was converging on having paths represented as strings (with special escapes for non-UTF8 characters?) rather than bytes, which means that for most paths (i.e., ones that you haven't gone out of your way to represent using bytes), you could still do that quick check, right?

bradcray on 2 Dec 2019

@bradcray -- that's absolutely true.

It wasn't a very deep example, I didn't mean to bring up anything particular about bytes/strings in paths discussion.

e-kayrakli on 2 Dec 2019

From https://github.com/chapel-lang/chapel/issues/14291#issuecomment-562721845, I do care that string and bytes have a defined interface such that I can interchange between the two types generically (e.g., constrained generics). Python2 vs. Python3 is a good place to start — it sounds like that's already being looked at — because Python2 didn't have native UTF-8 support and it was still used for string processing. Arguably, Python3 made string processing more painful with the changes to their string interfaces.

(I don't know that much about Python, so this is all what I think I know.)

BryantLam on 10 Dec 2019

@BryantLam - in Python 3, element access returns a different kind of thing for string and bytes:

>>> 'hi'[1]    # string element access returns string
'i'
>>> b'hi'[1]   # bytes element access returns int
105

mppf on 10 Dec 2019

For interchangeability of strings and bytes:

Based on my experience trying to port some string-based code to bytes-based, a
(more?) important discussion about that is coercion from string to
bytes.
I do still believe that interfaces should be as similar as possible and
bytes.this is one of the most significant parts of the interface. But adding
some generic methods that behave identically and then letting string.this
and bytes.this diverge doesn't seem too bad to me. (Note that they already
return different types, string vs. bytes.)

What we have

String

Accessors

proc byte(i): uint(8) at index i
proc codepoint(i): codepoint at index i
proc this(i): like codepoint(i) but returns string

Iterators

iter bytes(): yields uint(8)s
iter codepoints(): yields int(32)s
iter these(): like codepoints() but yields strings

Bytes

Accessors

proc byte(i): uint(8) at index i
proc this(i): like byte(i) but returns bytes

Iterators

iter bytes(): yields uint(8)
iter these(): like bytes() but yields bytes

As you can see, both types' this methods are unique and already differ in that
they return different types. Maybe we can add a char method to both to replace
this, and make bytes.this an alias of bytes.byte and string.this an alias
of string.char. (And similarly for these.) So we'd end up with:

How we can change it

String

Accessors

proc byte(i): uint(8) at index i
proc codepoint(i): codepoint at index i
proc char(i): Unicode character at index i as string
proc this(i): alias for char

Iterators

iter bytes(): yields uint(8)
iter codepoints(): yields int(32)s
iter chars(): yields Unicode characters as strings
iter these(): alias for chars

Bytes

Accessors

proc byte(i): uint(8) at index i
proc char(i): ASCII character at index i as bytes
proc this(i): alias for byte

Iterators

iter bytes(): yields uint(8)
iter chars(): yields ASCII characters as bytes
iter these(): alias for bytes
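
As a rough illustration of the proposed shape for bytes, here is a small Python model (not Chapel, and not an implementation of the proposal). The class name BytesModel is hypothetical, byte_values stands in for iter bytes() to avoid shadowing Python's builtin, and indexing is 0-based as in Python:

```python
from typing import Iterator


class BytesModel:
    """Hypothetical stand-in for the proposed bytes accessors and iterators."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def byte(self, i: int) -> int:
        """Numeric access: the byte value at index i."""
        return self._data[i]

    def char(self, i: int) -> bytes:
        """Printable access: a one-byte bytes value at index i."""
        return self._data[i:i + 1]

    def __getitem__(self, i: int) -> int:
        """Default indexing ("this") is an alias for byte."""
        return self.byte(i)

    def byte_values(self) -> Iterator[int]:
        """Numeric iteration: yields each byte value."""
        return iter(self._data)

    def chars(self) -> Iterator[bytes]:
        """Printable iteration: yields one-byte bytes values."""
        return (self._data[i:i + 1] for i in range(len(self._data)))

    def __iter__(self) -> Iterator[int]:
        """Default iteration ("these") is an alias for byte_values."""
        return self.byte_values()


b = BytesModel(b"ABC")
print(b[0], b.char(0))  # 65 b'A'
print(list(b))          # [65, 66, 67]
print(list(b.chars()))  # [b'A', b'B', b'C']
```

Generic code that wants the same behavior from both types would call char/chars explicitly instead of relying on the default indexing and iteration.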

e-kayrakli on 11 Dec 2019

Overall I think we have the following proposals/options:

1. Leave as-is

Pro: Symmetry with Chapel string. No implementation effort.
Con: Asymmetry with Python bytes -- bytes.this not returning a byte may
be unexpected and confusing, and bytes offers less of a performance improvement
```
var b = b"ABC";
writeln(b[1]);
```
This would print A rather than the byte value 65.

2. Add new types with potential coercions (see comment)

Pro: Support both worlds depending on the context
Con: Adding coercions to (potentially?) non-user facing types can be
confusing. Implementation may not be very straightforward

3. Just make bytes accessor and default iterator return/yield uint(8)

Pro: Symmetry with Python bytes, trivial implementation.
Con: Asymmetry with Chapel string

4. Do 3, but add symmetric accessors and iterators (see comment)

Pro: Symmetry with Python bytes, generic code can use char() for
symmetric behavior.
Con: Asymmetry with Chapel string.this and string.these

I proposed 4 and still support it. If I were to sort them, I'd say 4>1>3>2.

e-kayrakli on 7 Jan 2020

I like 4 best though I haven't thought about the names we'd use for symmetric routines much (in part because I didn't want to waste the effort if we didn't go with option 4).

bradcray on 10 Jan 2020

I'm OK with 4 and have gradually been becoming less convinced of anything at all to do with this issue... so don't currently have a strong opinion.

mppf on 10 Jan 2020

Assuming we are going with 4, some alternatives to the char accessor are item, elem or unit, but I don't like them as much as char. All of them are too general, and unit has the extra disadvantage of being too close to uint.

I'll put together a branch with 4 implemented to see if there are any unforeseen issues with that.

e-kayrakli on 14 Jan 2020

PR #14878 adds the symmetric accessor and iterator pair named char/chars. But they may not be what we want in the end. As a quick summary these are gonna return/yield (1) one-ASCII-character bytes or (2) one-Unicode-codepoint strings.

Possible alternatives I can think of:

item/items -- somewhat symmetric with map.items
elem/elems
unit/units
character/characters -- to avoid confusion with C's char

e-kayrakli on 13 Feb 2020

I think I prefer char(s) or character(s) to item(s), elem(s), or unit(s). Other things I considered, but didn't feel excited about in the end were:

symbol(s): maybe OK, but as a compiler person, takes my mind to identifiers rather than characters; also seems very general
glyph(s): has a reasonable English meaning, but since it already means something specific and different to the Unicode community, it's probably not a good idea
codepoint(s): seems too technical and specialized to me; wouldn't suggest a byte to me

bradcray on 15 Feb 2020

I'm a little bit concerned that char / chars will be confusing to people coming with a C background.

I don't think codepoints would be possible because strings already have a method of that name.

I think I like char and item the best of these ideas.

mppf on 18 Feb 2020

One other thought: value / values?

bradcray on 18 Feb 2020

Or element / elements (I know, it's the same as elem, but it is somehow more appealing to me when spelled out)

mppf on 18 Feb 2020

value/values: Not a big fan. As a string user, it makes me think about integral "values" of codepoints or characters more than the characters themselves.
element/elements: I don't think it is better than char/chars even with potential confusion with C's char type. But it is more of a personal choice rather than having specific concerns. I can live with this, if there is a general agreement.

e-kayrakli on 18 Feb 2020

I don't have a strong opinion on the naming. But...

unit has the extra disadvantage of being too close to uint.

I completely agree. I'm actively against this one.

I'm a little bit concerned that char / chars will be confusing to people coming with a C background.

Agreed. Especially when unfortunately developing a custom strings library in Chapel with a C FFI.

Possible alternatives I can think of:

item/items -- somewhat symmetric with map.items

elem/elems

My preference would be item or elem. I like consistency since these are already used by other standard modules. item is used by map.items and Chapel has historically used elem.

That said, char is likely fine too. I'd be interested in a thought experiment as to whether it makes sense to have a single constrained generic with a generic method like item or elem versus a string-specific constrained generic with only char/chars.

interface Item {
  proc this.item { ... }
}
interface StringItem {
  proc this.item { ... }
}

// Of course string would implement StringItem
class string implements StringItem { ... }

proc print(obj: Item + StringItem) {
//             ^^^^^^^^^^^^^^^^^
//             oops this doesn't work due to ambiguity
  writeln(obj.item);
}

but this is also a crude example because StringItem doesn't make sense; it would just be class string implements Item and then it works.

BryantLam on 20 Feb 2020

Side question. With proposal 4, what are the available ways a user would typically ASCII-print a bytes object?

var b = b"ABC";
writeln(b); // 65 66 67

writeln(b.chars()); // ABC
writeln("%s".format(b)); // Is this legal?

I assume the last line would be able to make use of this chars/items/elems method.

BryantLam on 20 Feb 2020

Side question. With proposal 4, what are the available ways a user would typically ASCII-print a bytes object?
var b = b"ABC";
writeln(b); // 65 66 67

writeln(b.chars()); // ABC
writeln("%s".format(b)); // Is this legal?
I assume the last line would be able to make use of this chars/items/elems method.

There might be some confusion here. A "normal" printout of a bytes object is in ASCII:

var b = b"ABC";
writeln(b);   // ABC

And you are right to suggest that %s is not a valid formatter for bytes. You should use the binary formatter instead.

var b = b"ABC";
writeln("%|*s".format(b.length, b));  // ABC

Neither of these will change with proposal 4. However,

for x in b do
  writeln(x);

prints A, B, C today, and it'll print 65, 66, 67 after this change. And we'll use chars (or whatever we choose) to get the same behavior:

for x in b.chars() do
  writeln(x);

will print A, B, C for both strings and bytes.

Does that change your name preferences?

I'd be interested in a thought experiment as to whether it makes sense to have a single constrained generic with a generic method like item or elem versus a string-specific constrained generic with only char/chars.

That'd be interesting, but I think this is already almost covered by having an iter these for types where it makes sense. Constrained generics could help in your snippet by allowing us to define an interface Iterable that has iter these in it, I think.

e-kayrakli on 21 Feb 2020

I think overall we got

char/chars
item/items

that have the most support.

@bradcray -- how strongly are you against item? Thinking about C interop more made me a bit concerned with char as well. If I am handling strings in both Chapel and C in the same application, I'd be confused by two separate meanings of char.

e-kayrakli on 21 Feb 2020

If char(s) is off the table, character(s) is probably what I'd use next, left to my own devices (re-reading my earlier response, I'm seeing that I didn't indicate that I preferred it to the other alternatives we were discussing at the time). But maybe it's too long for anyone else to take seriously.

I'm also open to item(s) even though it's not what I'd choose. But I also don't expect to use this interface much, so probably don't care enough about what it's called as long as I can find it in the docs.

bradcray on 21 Feb 2020

character(s)
But maybe it's too long for anyone else to take seriously.

I don't think character(s) is too long. I don't have a real strong opinion here and am OK with it or even with item(s) and even char(s).

I think my biggest hesitation with character(s) is that it might already have meaning to some people - they might be thinking they're getting ASCII characters or UTF-8 codepoints but it doesn't really say anything about the int vs string/bytes difference which is all that it's really doing differently here.

I see it as an advantage that item(s) has less connection to words people might think they know the meaning of w.r.t strings and bytes.

mppf on 21 Feb 2020

Ah, I buy that rationale. OK, I'm good with items.

bradcray on 21 Feb 2020

item/items is now on master. Closing.

e-kayrakli on 25 Feb 2020
