Chapel: Should there be an additional way to access the first (or only) byte or codepoint of a string?

Created on 3 Jun 2019 · 21 Comments · Source: chapel-lang/chapel

The ascii() function used to provide the numeric value of the first (or only) byte of a string. ascii() is now deprecated and has been replaced with string.byte(1). We can do the same thing for codepoints with string.codepoint(1).

Recently, some people have mentioned that they might be interested in an additional way to say this, possibly something shorter, possibly one for bytes and one for codepoints. Should we have an additional way to say this? If so, what should it be?
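To make the byte/codepoint distinction concrete, here is a minimal sketch in Python rather than Chapel. The helper names first_byte and first_codepoint are hypothetical stand-ins for string.byte(1) and string.codepoint(1), and the sketch assumes the string is stored as UTF-8:

```python
def first_byte(s: str) -> int:
    """Numeric value of the first byte of s, assuming UTF-8 storage."""
    return s.encode("utf-8")[0]


def first_codepoint(s: str) -> int:
    """Numeric value of the first Unicode codepoint of s."""
    return ord(s[0])


if __name__ == "__main__":
    # For ASCII text the two values coincide.
    print(first_byte("A"), first_codepoint("A"))    # 65 65
    # For non-ASCII text they differ: U+00E9 is encoded as 0xC3 0xA9.
    print(first_byte("é"), first_codepoint("é"))    # 195 233
```

The two values only agree for ASCII input, which is why a single ordinal()-style function would not be enough.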

dmk42

All 21 comments

I can't remember - did we already rule out string.byte() doing it?

Does it need to be shorter or conceptually simpler? E.g. string.firstByte()?

mppf on 3 Jun 2019

I initially proposed string.byte() and string.codepoint() each with a default argument of 1, but the group preferred not to have a default argument for those functions.

I don't think being shorter was a requirement, just one of the possibilities.

dmk42 on 3 Jun 2019

There was a comment in the String code that someone was thinking of ordinal(), but we would need to distinguish between bytes and codepoints.

dmk42 on 3 Jun 2019

As a motivating example of existing code that has changed due to this deprecation, fasta.chpl now reads:

enum nucleotide {
  A = "A".byte(1), C = "C".byte(1), G = "G".byte(1), T = "T".byte(1),
  a = "a".byte(1), c = "c".byte(1), g = "g".byte(1), t = "t".byte(1),
  B = "B".byte(1), D = "D".byte(1), H = "H".byte(1), K = "K".byte(1),
  M = "M".byte(1), N = "N".byte(1), R = "R".byte(1), S = "S".byte(1),
  V = "V".byte(1), W = "W".byte(1), Y = "Y".byte(1)
}

where before it said:

enum nucleotide {
  A = ascii("A"), C = ascii("C"), G = ascii("G"), T = ascii("T"),
  a = ascii("a"), c = ascii("c"), g = ascii("g"), t = ascii("t"),
  B = ascii("B"), D = ascii("D"), H = ascii("H"), K = ascii("K"),
  M = ascii("M"), N = ascii("N"), R = ascii("R"), S = ascii("S"),
  V = ascii("V"), W = ascii("W"), Y = ascii("Y")
}

My head goes to toByte("A") or toCodepoint("A"), where it will only work for strings of length 1 (in the corresponding units) and, in the param string case, would result in a compile-time error if it was not.

bradcray on 3 Jun 2019
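The proposed semantics (exactly one unit, error otherwise) can be sketched as follows. This is Python, not Chapel: to_byte and to_codepoint are hypothetical analogues of the proposed toByte() and toCodepoint(), a raised ValueError stands in for whatever error Chapel would produce, and treating the empty string as an error is an assumption of this sketch:

```python
def to_byte(s: str) -> int:
    """Return the single byte of s; error unless s is exactly one byte in UTF-8."""
    data = s.encode("utf-8")
    if len(data) != 1:
        raise ValueError(f"expected exactly 1 byte, got {len(data)}")
    return data[0]


def to_codepoint(s: str) -> int:
    """Return the single codepoint of s; error unless s is exactly one codepoint."""
    # Python's len() on str counts codepoints, which is the unit wanted here.
    if len(s) != 1:
        raise ValueError(f"expected exactly 1 codepoint, got {len(s)}")
    return ord(s)


if __name__ == "__main__":
    print(to_byte("A"))         # 65
    print(to_codepoint("é"))    # 233
    try:
        to_byte("é")            # one codepoint, but two bytes in UTF-8
    except ValueError as err:
        print("error:", err)
```

Note that "é" is valid input for the codepoint version but not for the byte version, since the length check is done in the corresponding unit.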

My head goes to toByte("A") or toCodepoint("A"), where it will only work for strings of length 1 (in the corresponding units) and, in the param string case, would result in a compile-time error if it was not.

Or we introduce a character literal somehow, e.g. c"A" = 0x41.

mppf on 3 Jun 2019

I like toByte(), toCodepoint(), and character literals for their brevity, even though that probably wasn't a hard requirement.

dmk42 on 4 Jun 2019

I feel skittish about character literals for a few reasons:

- It feels weird to have a literal kind for something that doesn't have a matching type (and I'm not yet convinced it's worth our while to add a character type to Chapel).
- Syntactically, it duplicates our current c_string literal syntax (which I'd also like to retire, ideally).

That said, the brevity is attractive...

bradcray on 4 Jun 2019

I sense that we seem to have no problems with toByte() and toCodepoint(), in contrast to other options. Any objections if I close this design issue and open an implementation issue to create those two functions?

dmk42 on 10 Jun 2019

Is there a reason to make toByte() a free function (e.g. toByte("a")) rather than a method (e.g. "a".toByte())? I don't have a strong opinion here, but I generally prefer methods if possible.

mppf on 10 Jun 2019

I'm fine with either way.

dmk42 on 10 Jun 2019

Methods reduce namespace pollution. I'd be happy to go that direction if we don't have any objections. I'm curious to hear from @bradcray, though, since he proposed the functions.

dmk42 on 10 Jun 2019

I'm fine with a method as well. I probably just proposed a free function because I was looking at ascii() at the time. I assume it would do some sort of assert or throw if the string was > 1 byte?

I'm fine with proceeding with the implementation, though I don't recall whether the team has been pointed to this issue / proposal for comment, or whether it's just been those of us who spend a lot of time in the issues commenting because we happened to see it.

bradcray on 10 Jun 2019

I announced it by e-mail, and then later, I reminded people by e-mail that this issue seemed to be converging and people should comment if they have an opinion. I think we're good to go with methods that give an error if the string is > 1 unit of the appropriate kind.

dmk42 on 11 Jun 2019

I think we're good to go with methods that give an error if the string is > 1 unit of the appropriate kind.

What is the nature of the error? The methods throw, or is it something we consider to be a bounds checking halt? (Or undefined behavior?)

mppf on 11 Jun 2019

As Brad indicated, the param version will need to be a compiler error. This will make the codepoint version quite complicated and possibly require a new primitive, with UTF-8 decoding being done in the compiler itself. It's a big project because of that.

For the non-param version, I had in mind the same kind of error as an out-of-bounds condition has now.

dmk42 on 11 Jun 2019

In the interest of making progress, I'm fine if the param codepoint version is an execution time error for the time being. Let's just be sure to file an issue against the desire to make it compile-time eventually.

bradcray on 11 Jun 2019

Actually, it's the requirement of having a separate param codepoint version that requires the new primitive, and compiler UTF-8 decoding. After that, the error message isn't much add-on. Note that we currently have a param string.byte() but not a param string.codepoint() for exactly that reason.

It sounds like you might be willing to defer the param version of string.toCodepoint() because of that complexity. Is that right?

dmk42 on 11 Jun 2019
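For a sense of what decoding one codepoint from UTF-8 involves, here is an illustrative Python sketch. It is not the Chapel compiler's code and not qio_utf8_decode; decode_first_codepoint is a hypothetical name, and the validity checks shown (overlong forms, surrogates, range) are the standard UTF-8 rules rather than anything stated in this thread:

```python
from typing import Tuple


def decode_first_codepoint(data: bytes) -> Tuple[int, int]:
    """Return (codepoint, byte_length) for the first UTF-8 sequence in data."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 < 0x80:                      # 1-byte (ASCII) sequence
        return b0, 1
    elif 0xC2 <= b0 <= 0xDF:           # 2-byte sequence
        n, cp, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:           # 3-byte sequence
        n, cp, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:           # 4-byte sequence
        n, cp, lowest = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid leading byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    for b in data[1:n]:
        if b & 0xC0 != 0x80:           # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong encodings, surrogates, and values past U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("invalid codepoint encoding")
    return cp, n


if __name__ == "__main__":
    print(decode_first_codepoint("é".encode("utf-8")))   # (233, 2)
```

A toCodepoint()-style check would then amount to confirming that the returned byte length equals the total length of the string.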

@dmk42 - I wouldn't consider the param codepoint version a big project. You can just use qio_utf8_decode from third-party/utf8-decoder. You can see examples of this being used in the runtime. (Yes, it will require a new primitive in the compiler).

Either way I think it's reasonable to implement everything else first. I'd be happy to help with the param version. I expect it'd take me an hour or two to do the compiler changes.

mppf on 11 Jun 2019

@mppf - I think you're underestimating the testing, which will be harder because we won't be able to just feed it a text stream like the I/O runtime, but more crucial because it is within the compiler. qio_utf8_decode is just the single-byte part of the answer, and one option is to recompile part of the I/O runtime to link with the compiler, to take advantage of the testing that has already taken place. In any case, it's quite a messy undertaking.

dmk42 on 11 Jun 2019

It sounds like you might be willing to defer the param version of string.toCodepoint() because of that complexity. Is that right?

That's right, sorry to be unclear.

[edit: specifically, I'm much more concerned about the issue of string length being in codepoints (#13087) than about a compile-time version of codepoint parsing].

bradcray on 11 Jun 2019

I think you're underestimating the testing

In fact I didn't estimate testing at all.

qio_utf8_decode is just the single-byte part of the answer,

I have no idea what you mean.

to take advantage of the testing that has already taken place

It seems we will need some param-specific tests, and I don't see adding these as particularly prohibitive. Of course we could add an arbitrary amount of testing, but that's true for most of our efforts.

Either way, the next step is the same - start without the param version of codepoint functions.

mppf on 11 Jun 2019
