Chapel: introduce string.size

Created on 7 Feb 2018 · 18 Comments · Source: chapel-lang/chapel

Following a discussion on an internal mailing list, we think we want to introduce the method string.size, by analogy to array.size, range.size, tuple.size, and so on.

There is some ambiguity in what 'size' could mean for different string encodings, though.

Libraries / Modules Design Feature Request

vasslitvinov

All 18 comments

My proposal is to make string.size and string.length be aliases, by analogy to range.size and range.length etc. As we clarify the meaning of "length" of the string, it will automatically be reflected in string.size as well.

vasslitvinov on 7 Feb 2018

Yeah! C++ also has both functions, i.e. string.size and string.length, so I also think that string.size should be added to String.chpl.

victor-ludorum on 7 Feb 2018

Yes, I think so too

nimitbhardwaj on 7 Feb 2018

As we clarify the meaning of "length" of the string, it will automatically be reflected in string.size as well.

One downside of doing this is that it sets us up for a language-breaking change in the future, whereas leaving string.size undefined until we settle the question likely will not.

Do I understand correctly that the fundamental question we need to answer is whether string.size is the number of characters in the string (encoding-dependent), the number of bytes required to store the string (encoding-independent), or something else?

ben-albrecht on 8 Feb 2018

Do I understand correctly that the fundamental question we need to answer is whether string.size is the number of characters in the string (encoding-dependent), the number of bytes required to store the string (encoding-independent), or something else?

None of our current .size() queries are in terms of bytes, so I think that's definitely the wrong choice here. It's hard for me to imagine it meaning anything other than number of logical characters (where I suspect that tacking accents onto glyphs is the kind of thing that introduces implementation challenges in Unicode?)

bradcray on 8 Feb 2018
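
The concern about accents can be made concrete with a short Python 3 sketch. Python is used here only because the thread later compares against it; nothing below is Chapel API. It assumes Python 3, where len counts code points, and shows the same visible character "é" stored either as one precomposed code point or as a base letter plus a combining accent.

```python
import unicodedata

# Two spellings of the same visible character "é".
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

print(len(precomposed))   # 1 code point
print(len(combining))     # 2 code points

# NFC normalization composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", combining)
print(normalized == precomposed)  # True
```

A count of "logical characters" therefore has to decide whether it counts code points or user-perceived characters, since the two differ for the second spelling.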

This makes me think that our implementation should be helpful when the user writes a size query, mistakenly followed by parens. For example, it could either:

execute the size query anyway, or
say "please do not use parens for size"

Compare the current behavior, which is: error: unresolved access of 'int(64)' by '()'.

This helpfulness could apply, for example, to all paren-full invocations of paren-less functions that return a primitive type.

vasslitvinov on 8 Feb 2018

This makes me think that our implementation should be helpful when the user writes a size query, mistakenly followed by parens.

I agree, but this is a bit of a tangent from the topic being tackled here :)

ben-albrecht on 9 Feb 2018

Hey, I had thought that size would give the number of bytes in a string rather than its logical length when a Unicode encoding is used, so I think length should give the logical length.
As an example, take the Hindi word "नमस्ते" (namaste), which means "hello". It has 5 characters: 'न' (n), 'म' (m), 'स्' (a half s), 'त' (t), and 'ो' (the matra for a).
So shouldn't it be like

nimitbhardwaj on 9 Feb 2018

It's hard for me to imagine it meaning anything other than number of logical characters

I see, so the challenge is in the implementation. Could the current design of strings support this yet? The string docs currently state:

While string is intended to be a Unicode string, there is much left to do.

I imagine supporting a proper encoding-dependent length depends on finishing this remaining work to support Unicode strings (by the way, are these next steps documented anywhere?).

Today, length assumes the string is ASCII, which is equivalent to giving the number of bytes:

var s = 'йцы';
writeln(s.length);
// prints 6

ben-albrecht on 9 Feb 2018
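
A minimal Python 3 sketch of the distinction in this example, offered as an illustration rather than as Chapel behavior. It assumes the string is the three precomposed Cyrillic code points written out as escapes below; each takes two bytes in UTF-8, which is where the 6 comes from.

```python
# The string from the example above, spelled as explicit code points
# (assumption: й is the single precomposed code point U+0439).
s = "\u0439\u0446\u044b"

code_points = len(s)                  # Python 3 str counts code points
utf8_bytes = len(s.encode("utf-8"))   # bytes needed to store it as UTF-8

print(code_points)  # 3
print(utf8_bytes)   # 6
```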

Just like Python, string.size for नमस्ते is 18, not 5, i.e. the number of bytes. I saw this in the hexadecimal output in Python.

nimitbhardwaj on 9 Feb 2018

Hey @ben-albrecht, what should the actual length of 'йцы' be: 3, 4, or 6?

nimitbhardwaj on 9 Feb 2018

Just like Python,

Depends on which Python. See this nice SO answer describing Python 2 vs. Python 3 string lengths.

what should the actual length of 'йцы' be: 3, 4, or 6?

I believe the character count should be 3 if the string is encoded as UTF-8.

ben-albrecht on 9 Feb 2018

Yes, in Python 2 it's 18, but in Python 3 it's 6. Maybe I missed 'स्', the half s: perhaps Unicode treats it as 2 characters, and one character takes 16 bits in the Unicode encoding of Hindi.

nimitbhardwaj on 9 Feb 2018
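
The 18 versus 6 figures can be reproduced with a short Python 3 sketch. It assumes the word is stored as the six code points written out below (the usual spelling of the word); the 18 reported for Python 2 then corresponds to the UTF-8 byte count. Listing the code points also shows why a count of 5 falls short: the half s is a letter followed by a separate virama sign.

```python
import unicodedata

# "namaste" in Devanagari, spelled as escapes so the code points are explicit
# (assumption: this is the spelling used in the comment above).
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(word))                  # 6 code points
print(len(word.encode("utf-8")))  # 18 bytes, 3 per code point

# Name each code point; the virama and the vowel sign are separate entries.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Code points that are combining marks (general category starting with "M").
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
print(len(marks))  # 2
```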

I think extending the current numBytes() / numBits() support would be the best way to get the size of a (string) value in terms of its memory utilization. Again, since there's no precedent within Chapel for having size return something w.r.t. memory utilization, I think it would be inconsistent if it meant "number of indices" for ranges and domains, "number of elements" for arrays, and yet "number of bytes" for strings (regardless of what Python does).

bradcray on 9 Feb 2018

Another inconsistency would come up if string.size and string.length meant different things. Compare: size and length are aliases for ranges and lists, and size is an alias for more meaningful names such as array.numElements and domain.numIndices.

vasslitvinov on 9 Feb 2018
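
The aliasing idea can be sketched in Python as follows. This is only an illustration of the design under discussion, not Chapel's string implementation: the class name Text and the num_bytes query are hypothetical, and "number of characters" is taken to mean code points here.

```python
class Text:
    """Hypothetical string wrapper; not Chapel's string type."""

    def __init__(self, value: str):
        self._value = value

    @property
    def length(self) -> int:
        # Number of characters (code points in Python 3).
        return len(self._value)

    # size is the same property object, so it cannot drift from length.
    size = length

    @property
    def num_bytes(self) -> int:
        # Storage size is a separate query, here for UTF-8.
        return len(self._value.encode("utf-8"))


t = Text("\u0439\u0446\u044b")
print(t.size, t.length, t.num_bytes)  # 3 3 6
```

Because size is bound to the same definition as length, any later change to what "length" means is picked up by both names, while the memory-oriented query stays separate.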

@bradcray - Agreed.

So today, string.length means "number of characters", assuming ascii encoding. If we assert that string.length and string.size will always have the same meaning (number of characters), then I think it's fine to introduce string.size as an alias to string.length.

My earlier opposition was specifically to the proposal of string.size and string.length having different meanings, where one would suffer from breaking changes as Unicode support improves while the other would not.

ben-albrecht on 9 Feb 2018

To recap, if string.length and string.size will always mean the same thing, then I am in favor of adding string.size as an alias to string.length.

ben-albrecht on 16 Feb 2018

FYI to those subscribed - We are going to move forward with this proposal.

ben-albrecht on 26 Feb 2018
