The ascii() function used to provide the numeric value of the first (or only) byte of a string. ascii() is now deprecated and has been replaced with string.byte(1). We can do the same thing for codepoints with string.codepoint(1).
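To make the byte/codepoint distinction concrete, here is a minimal sketch using a character whose UTF-8 encoding is two bytes long. It assumes a recent Chapel release, where string indexing is 0-based, so the `byte(1)` and `codepoint(1)` spellings used in this thread correspond to `byte(0)` and `codepoint(0)` here. It also assumes the source file is saved as UTF-8.

```chapel
// Sketch only: assumes a recent Chapel release with 0-based string indexing.
// The thread's byte(1)/codepoint(1) spell the same operations under the
// 1-based indexing in use when it was written.
const s = "é";              // U+00E9, encoded in UTF-8 as the bytes 0xC3 0xA9

writeln(s.numBytes);        // 2
writeln(s.numCodepoints);   // 1
writeln(s.byte(0));         // 195 (0xC3), the first byte of the encoding
writeln(s.codepoint(0));    // 233 (0xE9), the first codepoint
```

For a pure ASCII string such as "A", both calls return the same value, which is why the difference only shows up for multi-byte characters.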
Recently, some people have mentioned that they might be interested in an additional way to say this, possibly something shorter, possibly one for bytes and one for codepoints. Should we have an additional way to say this? If so, what should it be?
I can't remember - did we already rule out string.byte() doing it?
Does it need to be shorter or conceptually simpler? E.g. string.firstByte()?
I initially proposed string.byte() and string.codepoint() each with a default argument of 1, but the group preferred not to have a default argument for those functions.
I don't think being shorter was a requirement, just one of the possibilities.
There was a comment in the String code that someone was thinking of ordinal(), but we would need to distinguish between bytes and codepoints.
As a motivating example of existing code that has changed due to this deprecation, fasta.chpl now reads:
enum nucleotide {
A = "A".byte(1), C = "C".byte(1), G = "G".byte(1), T = "T".byte(1),
a = "a".byte(1), c = "c".byte(1), g = "g".byte(1), t = "t".byte(1),
B = "B".byte(1), D = "D".byte(1), H = "H".byte(1), K = "K".byte(1),
M = "M".byte(1), N = "N".byte(1), R = "R".byte(1), S = "S".byte(1),
V = "V".byte(1), W = "W".byte(1), Y = "Y".byte(1)
}
where before it said:
enum nucleotide {
A = ascii("A"), C = ascii("C"), G = ascii("G"), T = ascii("T"),
a = ascii("a"), c = ascii("c"), g = ascii("g"), t = ascii("t"),
B = ascii("B"), D = ascii("D"), H = ascii("H"), K = ascii("K"),
M = ascii("M"), N = ascii("N"), R = ascii("R"), S = ascii("S"),
V = ascii("V"), W = ascii("W"), Y = ascii("Y")
}
My head goes to toByte("A") or toCodepoint("A"), which would only work for strings of length 1 (in the corresponding units) and, in the param string case, would result in a compile-time error otherwise.
Or we introduce a character literal somehow, e.g. c"A" = 0x41.
I like toByte(), toCodepoint(), and character literals for their brevity, even though that probably wasn't a hard requirement.
I feel skittish about character literals for a few reasons:
c_string literal syntax (which I'd also like to retire, ideally)
That said, the brevity is attractive...
I sense that we seem to have no problems with toByte() and toCodepoint(), in contrast to other options. Any objections if I close this design issue and open an implementation issue to create those two functions?
Is there a reason to make toByte() a free function (e.g. toByte("a")) rather than a method (e.g. "a".toByte())? I don't have a strong opinion here, but I generally prefer methods if possible.
I'm fine with either way.
Methods reduce namespace pollution. I'd be happy to go that direction if we don't have any objections. I'm curious to hear from @bradcray, though, since he proposed the functions.
I'm fine with a method as well. I probably just proposed a free function because I was looking at ascii() at the time. I assume it would do some sort of assert or throw if the string was > 1 byte?
I'm fine with proceeding with the implementation, though I don't recall whether the team has been pointed to this issue / proposal for comment, or whether the comments so far have just come from those of us who proactively follow the issues and happened to see it.
I announced it by e-mail, and then later, I reminded people by e-mail that this issue seemed to be converging and people should comment if they have an opinion. I think we're good to go with methods that give an error if the string is > 1 unit of the appropriate kind.
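As a sketch of where this lands, the fasta.chpl enum from earlier in the thread might read as follows with the method spelling. This assumes a Chapel release in which `string.toByte()` exists and has a param overload, since enum initializers must be compile-time values. At the time of this discussion the method was only a proposal.

```chapel
// Sketch only: assumes a Chapel release that provides a param
// string.toByte(). When this thread was written, the method was still
// a proposal.
enum nucleotide {
  A = "A".toByte(), C = "C".toByte(), G = "G".toByte(), T = "T".toByte(),
  a = "a".toByte(), c = "c".toByte(), g = "g".toByte(), t = "t".toByte(),
  B = "B".toByte(), D = "D".toByte(), H = "H".toByte(), K = "K".toByte(),
  M = "M".toByte(), N = "N".toByte(), R = "R".toByte(), S = "S".toByte(),
  V = "V".toByte(), W = "W".toByte(), Y = "Y".toByte()
}

writeln(nucleotide.A: int);  // 65, the ASCII value of "A"
```

Compared with `"A".byte(1)`, the method carries no index argument, so the single-unit requirement is part of the call itself rather than something the reader has to infer.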
I think we're good to go with methods that give an error if the string is > 1 unit of the appropriate kind.
What is the nature of the error? The methods throw, or is it something we consider to be a bounds checking halt? (Or undefined behavior?)
As Brad indicated, the param version will need to be a compiler error. This will make the codepoint version quite complicated and possibly require a new primitive, with UTF-8 decoding being done in the compiler itself. It's a big project because of that.
For the non-param version, I had in mind the same kind of error as an out-of-bounds condition has now.
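The non-param behavior described here can be sketched with a small helper. `toByteSketch` is a hypothetical name used only for illustration and is not the library implementation. The sketch assumes a recent Chapel release (`string.numBytes` and 0-based `string.byte()`).

```chapel
// Hypothetical helper sketching the proposed non-param behavior.
// Not the library implementation. Assumes a recent Chapel release
// (string.numBytes, 0-based string.byte()).
proc toByteSketch(s: string): uint(8) {
  // Reject anything that is not exactly one byte long, halting in the
  // same spirit as a failed bounds check.
  if s.numBytes != 1 then
    halt("toByteSketch: expected a 1-byte string, got ", s.numBytes, " bytes");
  return s.byte(0);
}

writeln(toByteSketch("A"));  // 65
```

Calling `toByteSketch("AB")` or `toByteSketch("é")` would halt at execution time, since both strings are two bytes long.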
In the interest of making progress, I'm fine if the param codepoint version is an execution-time error for the time being. Let's just be sure to file an issue to track the desire to make it compile-time eventually.
Actually, it's the requirement of having a separate param codepoint version that requires the new primitive and compiler UTF-8 decoding. After that, the error message isn't much of an add-on. Note that we currently have a param string.byte() but not a param string.codepoint() for exactly that reason.
It sounds like you might be willing to defer the param version of string.toCodepoint() because of that complexity. Is that right?
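For readers wondering what the decoding step involves, the following is a rough sketch of extracting the first codepoint from a string's UTF-8 bytes. It is written in Chapel purely for illustration. The compiler work discussed here would live in the compiler's own code, and `firstCodepointSketch` is a hypothetical name that has nothing to do with `qio_utf8_decode`. The sketch assumes a recent Chapel release with 0-based `string.byte()`, and it deliberately skips checks a production decoder needs, such as rejecting overlong encodings and surrogate values.

```chapel
// Hypothetical helper for illustration only. It is not qio_utf8_decode and
// not compiler code. Assumes a recent Chapel release (0-based string.byte()).
// Omits overlong-encoding and surrogate checks.
proc firstCodepointSketch(s: string): int {
  if s.numBytes == 0 then halt("empty string");
  const b0 = s.byte(0): int;
  var cp: int;     // codepoint bits accumulated so far
  var nCont: int;  // continuation bytes promised by the lead byte

  if b0 < 0x80 {                   // 0xxxxxxx: single-byte (ASCII)
    cp = b0; nCont = 0;
  } else if (b0 & 0xE0) == 0xC0 {  // 110xxxxx: two-byte sequence
    cp = b0 & 0x1F; nCont = 1;
  } else if (b0 & 0xF0) == 0xE0 {  // 1110xxxx: three-byte sequence
    cp = b0 & 0x0F; nCont = 2;
  } else if (b0 & 0xF8) == 0xF0 {  // 11110xxx: four-byte sequence
    cp = b0 & 0x07; nCont = 3;
  } else {
    halt("invalid UTF-8 lead byte");
  }

  if s.numBytes < nCont + 1 then halt("truncated UTF-8 sequence");
  for i in 1..nCont {
    const b = s.byte(i): int;
    // Every continuation byte must look like 10xxxxxx.
    if (b & 0xC0) != 0x80 then halt("invalid UTF-8 continuation byte");
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}

writeln(firstCodepointSketch("A"));  // 65
writeln(firstCodepointSketch("é"));  // 233
```

A param version would additionally need to compare the decoded length against the string's total byte count to confirm the string holds exactly one codepoint.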
@dmk42 - I wouldn't consider the param codepoint version a big project. You can just use qio_utf8_decode from third-party/utf8-decoder. You can see examples of this being used in the runtime. (Yes, it will require a new primitive in the compiler).
Either way I think it's reasonable to implement everything else first. I'd be happy to help with the param version. I expect it'd take me an hour or two to do the compiler changes.
@mppf - I think you're underestimating the testing, which will be harder because we won't be able to just feed it a text stream like the I/O runtime, but more crucial because it is within the compiler. qio_utf8_decode is just the single-byte part of the answer, and one option is to recompile part of the I/O runtime to link with the compiler, to take advantage of the testing that has already taken place. In any case, it's quite a messy undertaking.
It sounds like you might be willing to defer the param version of string.toCodepoint() because of that complexity. Is that right?
That's right, sorry to be unclear.
[edit: specifically, I'm much more concerned about the string-length-in-codepoints issue (#13087) than about a compile-time version of codepoint parsing].
I think you're underestimating the testing
In fact I didn't estimate testing at all.
qio_utf8_decode is just the single-byte part of the answer,
I have no idea what you mean.
to take advantage of the testing that has already taken place
It seems we will need some param-specific tests, and I don't see adding these as particularly prohibitive. Of course we could add an arbitrary amount of testing, but that's true for most of our efforts.
Either way, the next step is the same - start without the param version of codepoint functions.