In a discussion on an internal mailing list, we concluded that we want to introduce the method string.size, by analogy to array.size, range.size, tuple.size, and so on.
There is some ambiguity in what 'size' could mean for different string encodings, though.
My proposal is to make string.size and string.length be aliases, by analogy to range.size and range.length etc. As we clarify the meaning of "length" of the string, it will automatically be reflected in string.size as well.
Yeah! C++ also has both functions, i.e. string.size and string.length, so I also think that string.size should be added to String.chpl.
Yes, I think so too.
> As we clarify the meaning of "length" of the string, it will automatically be reflected in string.size as well.
One downside of doing this is that it sets us up for a language-breaking change in the future, whereas leaving string.size undefined until we settle the question likely will not.
Do I understand correctly that the fundamental question we need to answer is whether string.size is the number of characters in the string (encoding-dependent), the number of bytes required to store the string (encoding-independent), or something else?
> Do I understand correctly that the fundamental question we need to answer is whether string.size is the number of characters in the string (encoding-dependent), the number of bytes required to store the string (encoding-independent), or something else?
None of our current .size() queries are in terms of bytes, so I think that's definitely the wrong choice here. It's hard for me to imagine it meaning anything other than the number of logical characters (where I suspect that tacking accents onto glyphs is the kind of thing that introduces implementation challenges in Unicode?).
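The accent concern can be sketched outside Chapel. The Python 3 snippet below is only an illustration (Python is the comparison point used elsewhere in this thread, and the helper names are made up for this sketch); it says nothing about how Chapel's string is or will be implemented. It shows that the same visible character can be stored as one code point or as two, so "number of logical characters" depends on whether combining marks are counted separately.

```python
import unicodedata

# The letter e-acute written two ways.
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def code_points(s):
    """Number of Unicode code points (what Python 3's len() reports)."""
    return len(s)

def utf8_bytes(s):
    """Number of bytes needed to store s when encoded as UTF-8."""
    return len(s.encode("utf-8"))

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, code_points(s), utf8_bytes(s))

# NFC normalization folds the two-code-point form into the precomposed one.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)
```

Both strings render identically, yet they differ in code-point count and in byte count, which is the kind of case a definition of "length" would have to settle.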
This makes me think that our implementation should be helpful when the user writes a size query, mistakenly followed by parens. For example:
Cf. the current behavior, which is: error: unresolved access of 'int(64)' by '()'.
This helpfulness could apply, for example, to all paren-full invocations of paren-less functions that return a primitive type.
> This makes me think that our implementation should be helpful when the user writes a size query, mistakenly followed by parens.
I agree, but this is a bit of a tangent from the topic being tackled here :)
Hey, I had thought that size would give the number of bytes in a string, not the logical length, when a Unicode encoding is used, so I think length should give the logical length.
As an example, here is the Hindi word "नमस्ते" (namaste), which means "hello". It has 5 characters: 'न' (n), 'म' (m), 'स्' (a half s), 'त' (t), and 'े' (the matra for e).
So shouldn't it be like
> It's hard for me to imagine it meaning anything other than number of logical characters
I see, so the challenge is in the implementation. Could the current design of strings support this yet? The string docs currently state:
> While string is intended to be a Unicode string, there is much left to do.
I imagine supporting a proper encoding-dependent length depends on finishing this remaining work to support Unicode strings (btw, are these next steps documented anywhere?).
Today, length assumes the string is ASCII, which is equivalent to giving the number of bytes:
var s = 'йцы';
writeln(s.length);
// prints 6
Just like Python, string.size for नमस्ते is 18, not 5, i.e. the number of bytes; I checked the hexadecimal output in Python.
Hey @ben-albrecht, what should the actual length of 'йцы' be: 3, 4, or 6?
> Just like python,
Depends on which Python. See this nice SO answer describing Python 2 vs. Python 3 string lengths.
> what should be actual length of 'йцы', 3 or 4 or 6
I believe the character count should be 3, if encoding to utf-8.
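As a quick cross-check of the two candidate answers, here is a minimal Python 3 sketch (not Chapel, and the variable names are chosen for this example). It assumes "character" means Unicode code point and that the bytes are the UTF-8 encoding.

```python
s = "йцы"

# Python 3 strings are sequences of code points, so len() counts code points.
num_code_points = len(s)

# Encoding to UTF-8 gives the stored byte sequence; each of these
# Cyrillic letters occupies two bytes in UTF-8.
num_utf8_bytes = len(s.encode("utf-8"))

print(num_code_points, num_utf8_bytes)
```

The byte count agrees with the 6 that length reports today, while the code-point count is the 3 suggested above.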
Yes, in Python 2 it's 18, but in Python 3 it's 6. Maybe I missed that 'स्', the half s, is perhaps considered 2 characters in Unicode, and that one Hindi character takes 16 bits in Unicode.
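The counts for the Hindi word can be broken down with Python 3's standard unicodedata module. This is a sketch for illustration only; the word is spelled with escapes so the exact code points are unambiguous, and the variable names are assumptions of this example.

```python
import unicodedata

# The word "namaste" in Devanagari, spelled out code point by code point.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# Official Unicode name of every code point in the word.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

num_code_points = len(word)
num_utf8_bytes = len(word.encode("utf-8"))       # 3 bytes per code point here
num_utf16_bytes = len(word.encode("utf-16-le"))  # 2 bytes (16 bits) per code point here

print(num_code_points, num_utf8_bytes, num_utf16_bytes)
```

The half s is stored as the letter SA followed by a separate VIRAMA sign, which is why the code-point count is one higher than the 5 characters a reader would list.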
I think extending the current numBytes() / numBits() support would be the best way to get the size of a (string) value in terms of its memory utilization. Again, since there's no precedent within Chapel for having size return something w.r.t. memory utilization, I think it would be inconsistent if it meant "number of indices" for ranges and domains, "number of elements" for arrays, and yet "number of bytes" for strings (regardless of what Python does).
Another inconsistency would come up if string.size and string.length meant different things. Cf. size and length are aliases for ranges and lists, and size is an alias for more meaningful names such as array.numElements and domain.numIndices.
@bradcray - Agreed.
So today, string.length means "number of characters", assuming ASCII encoding. If we assert that string.length and string.size will always have the same meaning (number of characters), then I think it's fine to introduce string.size as an alias to string.length.
My earlier opposition was specifically to the proposal of string.size and string.length having different meanings, where one would suffer from breaking changes as Unicode support improves, while the other would not.
To recap, if string.length and string.size will always mean the same thing, then I am in favor of adding string.size as an alias to string.length.
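The "always mean the same thing" property can be sketched as follows. This is a hypothetical Python class written for illustration, not Chapel's string type or its implementation: size is bound to the very same property as length, so any later change to what length means is picked up by size automatically.

```python
class Text:
    """Hypothetical string wrapper used only to illustrate the alias idea."""

    def __init__(self, value):
        self._value = value

    @property
    def length(self):
        # The single place where "length" is defined.
        # Assumption for this sketch: it counts code points.
        return len(self._value)

    # size refers to the same property object as length, so the two
    # names cannot drift apart.
    size = length


t = Text("йцы")
print(t.length, t.size)
```

Because there is only one definition, refining it later (for example, as Unicode support improves) changes both names together rather than breaking one of them.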
FYI to those subscribed - We are going to move forward with this proposal.