There was some discussion under #13707 about the return type of bytes.this. It currently returns another bytes object a la string.this. @bradcray was suggesting that we may accept the difference and make bytes.this return a uint(8). It'd make mutable bytes design much clearer.
Here is a list of cases (including the one above) where we may consider making the bytes type easier to use along with uint(8) values:
bytes.this return type
Should it be uint(8) or bytes? (In Python, indexing a bytes returns an integer.)
proc += for bytes and uint(8)s
Currently this is not supported, again similar to strings. (Python doesn't support it either)
What does casting a uint(8) to bytes mean?
Today if you do that, you'll get behavior identical to string:
var val: uint(8) = 65;
writef("%ht\n", val:bytes);
This prints b"65" but maybe it should print b"A"?
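For comparison, here is how the three cases above behave in Python 3. This is only a sketch of Python's semantics (the precedent cited in this thread), not a proposal for Chapel's; the variable names are made up for illustration.

```python
# Python 3 behavior for the three cases listed above (illustration only).
data = b"ABC"

# 1. Indexing a bytes yields an int in range(256), not a length-1 bytes.
first = data[0]
print(first)             # 65

# 2. Appending an int to a bytes is not supported and raises TypeError.
try:
    data += 65
except TypeError:
    append_supported = False
else:
    append_supported = True
print(append_supported)  # False

# 3. "Casting" an integer to bytes has two plausible readings:
as_digits = str(65).encode("ascii")  # the decimal digits, as in the b"65" result above
as_byte = bytes([65])                # a single byte whose value is 65
print(as_digits, as_byte)            # b'65' b'A'
```

Note that Python spells the two readings of the cast differently, so neither one is the implicit default.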
I'm happy with the current situation since I want bytes and strings to be easy to swap for each other, assuming the data is all ASCII.
I'm worried about our asymmetry with Python w.r.t. indexing, and that Python programmers will be confused that indexing into a bytes in Chapel produces a new bytes rather than a single byte. I also keep getting confused by it myself, because I keep incorrectly thinking "Oh, for this ASCII string, rather than using myString[i].toByte() I can just use myBytes[i]."
It also feels unfortunate that iterating over a bytes yields single-byte bytes values rather than single integer / byte values.
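For reference, this is the Python 3 behavior being compared against: both indexing and iteration over a bytes give integers, while a str gives one-character strings. A small sketch (variable names are illustrative):

```python
text = "hi"
raw = b"hi"

# Default iteration: str yields 1-character strings, bytes yields ints.
print(list(text))   # ['h', 'i']
print(list(raw))    # [104, 105]

# Getting the numeric value of an ASCII character from each type.
from_str = ord(text[0])   # roughly what myString[i].toByte() does in the thread
from_bytes = raw[0]       # bytes indexing already gives the integer
print(from_str == from_bytes)  # True
```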
(I thought I posted this somewhere... but I can't find it anywhere right now?)
First, note that having string/bytes.this return string/bytes enables nicer interactions with I/O; e.g. writeln("hi"[1]); prints h rather than 104 (which we would get if string.this returned an int).
Second, for bytes, we might wish for something like myBytes[1] & 1 to work (i.e. to treat the value as an integer).
One strategy that is appealing (to me anyway) is to add byte and codepoint types (not necessarily with those names) that just contain integers. Then make bytes.this/these return byte and string.this/these return codepoint. We might consider adding some/all of the following coercions:
- byte would coerce to int (and presumably uint(8))
- byte would coerce to bytes
- codepoint would coerce to int
- codepoint would coerce to string
However, even without these coercions, with specific overloads, things like writeln can do the right thing with byte or codepoint.
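To make the idea of an integer-backed byte type concrete, here is a rough Python model. The `Byte` class is hypothetical and exists only for illustration: it behaves as an integer in arithmetic, while printing is overridden to show the character, which is the role the writeln overloads would play.

```python
class Byte(int):
    """Hypothetical 'byte' value: an integer that prints as a character."""

    def __new__(cls, value: int) -> "Byte":
        if not 0 <= value <= 255:
            raise ValueError("byte must be in range(256)")
        return super().__new__(cls, value)

    def __str__(self) -> str:
        # Printing shows the character with this code instead of the number.
        return chr(self)


b = Byte(104)
print(b)        # h
print(b & 1)    # 0
print(b + 0)    # 104
```

In this model the result of `b & 1` is a plain integer, which mirrors the "byte would coerce to int" direction; the coercion to bytes is not modeled.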
Note that this does not help with getting a mutable ref to a byte at a particular offset in a bytes. The problem there is that the corresponding operation is not possible for a UTF-8 string. (If bytes and string are to be consistent, that just means that this shouldn't do it; there could be a different bytes-only method that does).
My latest thought on this topic was to have string and bytes behave more like Python w.r.t. default indexing and iteration (due to precedent / to avoid surprises for programmers coming from Python), but to have methods that would support printable vs. numeric accesses and iterations in order to support users to write generic code that treated the types symmetrically by using the specific iterators/accessors rather than the default ones.
@bradcray was suggesting that we may accept the difference and make bytes.this return a uint(8). It'd make mutable bytes design much clearer.
Just to make sure the two statements above aren't conflated more than intended: I'd prefer for bytes indexing/iteration to yield integers even if we decided never to support mutable bytes values.
I quickly asked for opinions in different contexts and wanted to add responses here:
- […] this and these returning/yielding uint but can live with having a .uint or .uints type of method (which we do in bytes.byte and bytes.bytes)
- […] strings and return/yield bytes, but said that he's happy to follow Python's precedent after I mentioned Python's behavior.
Please chime in if I mischaracterized your opinions.
Going back to the specific discussion here:
First, note that having string/bytes.this return string/bytes enables nicer interactions with I/O; e.g. writeln("hi"[1]); prints h rather than 104 (which we would get if string.this returned an int).
The example here uses a string, but for bytes, returning an integer is exactly what Python does, at the expense of having its string and bytes behave differently:
>>> print(b"hi"[1])
105
>>> print("hi"[1])
i
My latest thought on this topic was to have string and bytes behave more like Python w.r.t. default indexing and iteration (due to precedent / to avoid surprises for programmers coming from Python), but to have methods that would support printable vs. numeric accesses and iterations in order to support users to write generic code that treated the types symmetrically by using the specific iterators/accessors rather than the default ones.
I feel closest to this approach. But I don't yet have a good proposal for naming those accessors without making things confusing. Arguably they are a bit confusing already. For bytes we currently have the following:
- bytes.this: returns bytes
- bytes.byte: returns uint(8)
- bytes.these: yields bytes
- bytes.bytes: yields uint(8)
Same methods/iterators exist for string as well. So, we should probably keep bytes.byte and bytes.bytes as-is, make the default bytes.this and bytes.these synonyms for those, and add new accessors/iterators that return/yield bytes.
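The last step, an accessor that returns a value of the same type as the container, can be modeled in Python with slicing, which already behaves symmetrically for str and bytes. The `item` function below is a hypothetical stand-in for whatever name gets chosen; it sketches the idea and is not Chapel's implementation.

```python
from typing import TypeVar

S = TypeVar("S", str, bytes)


def item(seq: S, i: int) -> S:
    """Hypothetical symmetric accessor: a length-1 value of the same type as seq."""
    if not 0 <= i < len(seq):
        raise IndexError("index out of range")
    # A slice keeps the container type, unlike plain indexing on bytes.
    return seq[i:i + 1]


print(item("hi", 1))    # i
print(item(b"hi", 1))   # b'i'
print(b"hi"[1])         # 105
```

Generic code would call `item` on either type, while plain indexing stays free to return an integer for bytes.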
Inspired by Java's charAt and codePointAt for String, maybe we can have bytes.bytesAt return a bytes, but I am not sure what its iterator counterpart should be, or even whether we should have such an iterator at all.
One strategy that is appealing (to me anyway) is to add byte and codepoint types (not necessarily with those names) that just contain integers. Then make bytes.this/these return byte and string.this/these return codepoint. We might consider adding some/all of the following coercions:
- byte would coerce to int (and presumably uint(8))
- byte would coerce to bytes
- codepoint would coerce to int
- codepoint would coerce to string
I am not against this either. But I have to say that "keeping bytes and string as similar as possible" is losing its appeal to me, so I am not seeing the motivation for this personally.
[…] bytes as a drop-in replacement for string. I think the cases for both of these should be much more clear-cut than that so that we shouldn't need almost-identical interfaces. One argument for having that is the initial transition to bytes for when we have UTF8 validation on master. And I don't think we should be making our decisions based on that, anyways.
But then again, this could be a good compromise and I wouldn't object to it any further.
Reading comments on other issues, I came across one of my past comments that contradicts what I am supporting today (I guess my opinions change a lot :( ):
It would be nice to check if a path (that is of bytes type) is absolute by doing myPath[0] == b"/". I am personally leaning more towards bytes.this(i) returning another bytes.
Joking aside, I still think that it is nice to be able to do myPath[0] == b"/". But it is something that I can sacrifice (and do myPath[0] == b"/".byte(), or even better myPath.startsWith(b"/"))
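In Python 3, where indexing a bytes already yields an integer, the same trade-off looks like the sketch below (`my_path` is just an example value). The direct comparison silently evaluates to false, which is the kind of surprise the alternatives avoid.

```python
my_path = b"/usr/lib"

naive = my_path[0] == b"/"            # an int never equals a bytes
by_value = my_path[0] == ord("/")     # compare against the byte value
by_slice = my_path[:1] == b"/"        # a slice keeps the bytes type
by_method = my_path.startswith(b"/")  # the clearest spelling

print(naive, by_value, by_slice, by_method)  # False True True True
```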
@e-kayrakli: W.r.t. your absolute path test pattern, it seemed to me that the consensus view was converging on having paths represented as strings (with special escapes for non-UTF8 characters?) rather than bytes, which means that for most paths (i.e., ones that you haven't gone out of your way to represent using bytes), you could still do that quick check, right?
@bradcray -- that's absolutely true.
It wasn't a very deep example, I didn't mean to bring up anything particular about bytes/strings in paths discussion.
From https://github.com/chapel-lang/chapel/issues/14291#issuecomment-562721845, I do care that string and bytes have a defined interface such that I can interchange between the two types generically (e.g., constrained generics). Python2 vs. Python3 is a good place to start — it sounds like that's already being looked at — because Python2 didn't have native UTF-8 support and it was still used for string processing. Arguably, Python3 made string processing more painful with the changes to their string interfaces.
(I don't know that much about Python, so this is all what I think I know.)
@BryantLam - in Python 3, element access returns a different kind of thing for string and bytes:
>>> 'hi'[1] # string element access returns string
'i'
>>> b'hi'[1] # bytes element access returns int
105
For interchangeability of strings and bytes:
Based on my experience trying to port some string-based code to bytes-based, a
(more?) important discussion about that is coercion from string to
bytes.
I do still believe that interfaces should be as similar as possible and
bytes.this is one of the most significant parts of the interface. But adding
some generic methods that behave identically and then letting string.this
and bytes.this diverge doesn't seem too bad to me. (Note that they already
return different types: string vs. bytes.)
As you can see, both types' this methods are unique and already differ in that
they return different types. Maybe we can add a char method to both to replace
this and make bytes.this an alias of bytes.byte and string.this an alias
of string.char. (And similarly for these) So we'd end up with:
Overall I think we have the following proposals/options:
1. Leave as-is
[…] string. No implementation effort.
Con: Asymmetry with Python bytes -- bytes.this not returning a byte may
be unexpected and confusing, bytes are less performance-improving
var b = b"ABC";
writeln(b[1]);
This'd print "A" instead of the byte value "65".
2. Add new types with potential coercions (see comment)
3. Just make bytes accessor and default iterator return/yield uint(8)
[…] bytes, trivial implementation.
[…] string
4. Do 3, but add symmetric accessors and iterators (see comment)
[…] bytes, generic code can use char() for […] string.this and string.these
I proposed 4 and still support it. If I were to sort them, I'd say 4>1>3>2.
I like 4 best though I haven't thought about the names we'd use for symmetric routines much (in part because I didn't want to waste the effort if we didn't go with option 4).
I'm OK with 4 and have gradually been becoming less convinced of anything at all to do with this issue... so don't currently have a strong opinion.
Assuming we are going with 4, some alternatives to the char accessor are item, elem or unit, but I don't like them as much as char. All of them are too general, and unit has the extra disadvantage of being too close to uint.
I'll put together a branch with 4 implemented to see if there's any unforeseen issues with that.
PR #14878 adds the symmetric accessor and iterator pair named char/chars. But they may not be what we want in the end. As a quick summary these are gonna return/yield (1) one-ASCII-character bytes or (2) one-Unicode-codepoint strings.
Possible alternatives I can think of:
- item/items -- somewhat symmetry with map.items
- elem/elems
- unit/units
- character/characters -- to avoid confusion with C's char
I think I prefer char(s) or character(s) to item(s), elem(s), or unit(s). Other things I considered, but didn't feel excited about in the end were:
I'm a little bit concerned that char / chars will be confusing to people coming with a C background.
I don't think codepoints would be possible because strings already have a method of that name.
I think I like char and item the best of these ideas.
One other thought: value / values?
Or element / elements (I know, it's the same as elem, but it is somehow more appealing to me when spelled out)
value/values: Not a big fan. As a string user, it makes me think about integral "values" of codepoints or characters more than the characters themselves.
element/elements: I don't think it is better than char/chars even with potential confusion with C's char type. But it is more of a personal choice rather than having specific concerns. I can live with this, if there is a general agreement.
I don't have a strong opinion on the naming. But...
unit has extra disadvantage of being too close to uint.
I completely agree. I'm actively against this one.
I'm a little bit concerned that char/chars will be confusing to people coming with a C background.
Agreed. Especially when unfortunately developing a custom strings library in Chapel with a C FFI.
Possible alternatives I can think of:
- item/items -- somewhat symmetry with map.items
- elem/elems
My preference would be item or elem. I like consistency since these are already used by other standard modules. item is used by map.items and Chapel has historically used elem.
That said, char is likely fine too. I'd be interested in a thought experiment as to whether it makes sense to have a single constrained generic with a generic method like item or elem versus a string-specific constrained generic with only char/chars.
interface Item {
proc this.item { ... }
}
interface StringItem {
proc this.item { ... }
}
// Of course string would implement StringItem
class string implements StringItem { ... }
proc print(obj: Item + StringItem) {
// ^^^^^^^^^^^^^^^^^
// oops this doesn't work due to ambiguity
writeln(obj.item);
}
but this is also a crude example because StringItem doesn't make sense; it would just be class string implements Item and then it works.
Side question. With proposal 4, what are the available ways a user would typically ASCII-print a bytes object?
var b = b"ABC";
writeln(b); // 65 66 67
writeln(b.chars()); // ABC
writeln("%s".format(b)); // Is this legal?
I assume the last line would be able to make use of this chars/items/elems method.
There might be some confusion here. A "normal" printout of a bytes object is in ASCII:
var b = b"ABC";
writeln(b); // ABC
And you are right to suggest that %s is not a valid formatter for bytes. You should use the binary formatter instead.
var b = b"ABC";
writeln("%|*s".format(b.length, b)); // ABC
Neither of these will change with proposal 4. However,
for x in b do
writeln(x);
prints A, B, C today, and it'll print 65, 66, 67 after this change. And we'll use chars (or whatever we choose) to get the same behavior:
for x in b.chars() do
writeln(x);
will print A, B, C for both strings and bytes.
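The behavior after the change matches what Python 3 does today, so it can be sketched there. The `items` generator below is a hypothetical stand-in for the proposed chars/items iterator and is not the Chapel implementation; it relies on slicing, which keeps the container type.

```python
def items(seq):
    """Hypothetical stand-in for the proposed symmetric iterator."""
    for i in range(len(seq)):
        yield seq[i:i + 1]  # str stays str, bytes stays bytes


b = b"ABC"
default_iter = list(b)          # default iteration yields integers
bytes_items = list(items(b))    # symmetric iteration yields length-1 bytes
str_items = list(items("ABC"))  # and single-character strings for str

print(default_iter)  # [65, 66, 67]
print(bytes_items)   # [b'A', b'B', b'C']
print(str_items)     # ['A', 'B', 'C']
```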
Does that change your name preferences?
I'd be interested in a thought experiment as to whether it makes sense to have a single constrained generic with a generic method like item or elem versus a string-specific constrained generic with only char/chars.
That'd be interesting, but I think this is already almost covered by having an iter these for types where it makes sense. Constrained generics could help in your snippet by allowing us to define an interface Iterable that has iter these in it, I think.
I think overall we got
- char/chars
- item/items
that have the most support.
@bradcray -- how strongly are you against item? Thinking about C interop more made me a bit concerned about char as well. If I am handling strings in both Chapel and C in the same application, I'd be confused by two separate meanings of char.
If char(s) is off the table, character(s) is probably what I'd use next, left to my own devices (re-reading my earlier response, I'm seeing that I didn't indicate that I preferred it to the other alternatives we were discussing at the time). But maybe it's too long for anyone else to take seriously.
I'm also open to item(s) even though it's not what I'd choose. But I also don't expect to use this interface much, so probably don't care enough about what it's called as long as I can find it in the docs.
character(s)
But maybe it's too long for anyone else to take seriously.
I don't think character(s) is too long. I don't have a real strong opinion here and am OK with it or even with item(s) and even char(s).
I think my biggest hesitation with character(s) is that it might already have meaning to some people - they might be thinking they're getting ASCII characters or UTF-8 codepoints but it doesn't really say anything about the int vs string/bytes difference which is all that it's really doing differently here.
I see it as an advantage that item(s) has less connection to words people might think they know the meaning of w.r.t strings and bytes.
Ah, I buy that rationale. OK, I'm good with items.
item/items is now on master. Closing.