codepointToString should not accept illegal UTF-8 values as valid arguments, but it does not throw an error when given one. For example:
codepointToString(142);
This also points to the need for a function like codepointToString for bytes.
chpl version 1.23.0 pre-release (22dc508e76)
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none
We definitely need to do something about codepointToString.
Building strings/bytes from integers is a useful convenience compared with building a C-based buffer first and then creating the appropriate type from it. What I think we should do is:
codepointToString should throw when a non-Unicode codepoint is passed.
And, just for symmetry, add the convenience function:
inline proc byteToBytes(b: uint(8)) {
  return createBytesWithNewBuffer(c_ptrTo(b), 1);
}
Python3 seems to do something special for non-unicode values. But we don't have a good 1-to-1 match with Python in this part of the interface:
>>> str("\u008e") # not valid unicode
'\x8e'
>>> str("\u0394") # valid unicode
'Δ'
>>>
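For comparison, here is a minimal Python sketch of the two single-value conversions discussed above. The helper names codepoint_to_str and byte_to_bytes are hypothetical and only illustrate range checking; they are not Chapel's API. Note that Python's chr accepts 0x8E (as in the session above) and even lone surrogates, so the surrogate check is an extra assumption about what "throw on a bad codepoint" could mean.

```python
def codepoint_to_str(cp: int) -> str:
    """Hypothetical helper: one-character str from a codepoint, with explicit checks."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp:#x} is outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # chr() would accept this, but the result cannot be encoded as UTF-8.
        raise ValueError(f"{cp:#x} is a surrogate and has no UTF-8 encoding")
    return chr(cp)


def byte_to_bytes(b: int) -> bytes:
    """Hypothetical helper: one-byte bytes object from an integer."""
    # bytes() itself raises ValueError unless 0 <= b <= 255.
    return bytes([b])
```

With these, codepoint_to_str(0x394) gives 'Δ' and byte_to_bytes(142) gives b'\x8e', while codepoint_to_str(0xD800) and byte_to_bytes(256) both raise ValueError.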
string? Are they codepoints? Are they UTF-8 encoded bytes? Can/should we support both somehow?
Do we need additional overloads of factory functions that can create string/bytes from integer arrays and/or iterators?
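The ambiguity is easy to see with a small Python sketch: the same integers produce different strings depending on whether they are read as codepoints or as UTF-8 encoded bytes. The function names from_codepoints and from_utf8_bytes are hypothetical, chosen only for this illustration.

```python
from typing import Iterable


def from_codepoints(ints: Iterable[int]) -> str:
    """Treat every integer as one Unicode codepoint."""
    return "".join(chr(i) for i in ints)


def from_utf8_bytes(ints: Iterable[int]) -> str:
    """Treat the integers as a UTF-8 byte sequence and decode it."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return bytes(ints).decode("utf-8")


ints = [0xCE, 0x94]
as_codepoints = from_codepoints(ints)  # two characters: U+00CE, U+0094
as_utf8 = from_utf8_bytes(ints)        # one character: U+0394
```

Here as_utf8 is 'Δ' while as_codepoints is a two-character string, and from_utf8_bytes([0x8E]) raises UnicodeDecodeError because a lone 0x8E byte is not valid UTF-8. A factory function that takes integers would have to pick one interpretation or expose both explicitly.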
@e-kayrakli I don't think we should add overloads of this kind. The need to convert an array of integers into a string/bytes varies from user to user. As you mentioned, there will always be ambiguity about what the integers represent. In that case, users can implement such a function themselves as required. Thoughts?
@priyank23 -- You have a point. But I really dislike the fact that if/when they'd need to do that they'd have to either use
string and bytes may be a bit different in this respect, too. For string, one example I can think of is parsing a config or CSV-like file that can contain comments. You may want to read it character by character and ignore the comments. In that scenario, I could imagine writing an iterator that reads the file character by character and yields each character only if it is not inside a comment. In that case, wouldn't it be attractive to do the following?
var fullContents = createStringWithNewBuffer(myFileReader());
The alternative involves reading multiple strings and concatenating them, which would have performance overheads as I mention above.
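To make the scenario concrete, here is a rough Python sketch of the same pattern: an iterator filters out comment characters, and the string is built from it in one step. The name strip_comments and the "#" comment marker are assumptions for illustration; the Chapel iterator-accepting factory above is still only a proposal.

```python
from typing import Iterable, Iterator


def strip_comments(chars: Iterable[str], marker: str = "#") -> Iterator[str]:
    """Yield characters one by one, skipping everything from marker to end of line."""
    in_comment = False
    for ch in chars:
        if ch == "\n":
            # A newline always ends a comment and is kept.
            in_comment = False
            yield ch
        elif ch == marker:
            in_comment = True
        elif not in_comment:
            yield ch


text = "a=1 # first\nb=2\n"
# Build the whole string from the iterator in a single call.
full_contents = "".join(strip_comments(text))
```

The result keeps the newline structure but drops the comment text, without creating and concatenating intermediate strings by hand.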
I think we need this more for bytes because I personally see bytes as somewhere between a string and a pure byte array. For example, you can send bytes-based ZMQ messages that involve arbitrarily serialized data. In that case it may be appealing to create a bytes from an iterator that is basically a bit stream.
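As a rough illustration of the bytes case, Python's bytes constructor already accepts any iterable of integers in the range 0 to 255. The serialize generator and its little-endian 16-bit format below are hypothetical, standing in for whatever serialization a message would actually use; nothing here involves ZMQ itself.

```python
import struct
from typing import Iterable, Iterator


def serialize(values: Iterable[int]) -> Iterator[int]:
    """Yield raw byte values: each input as a little-endian unsigned 16-bit integer."""
    for v in values:
        # Iterating over a bytes object yields its integer byte values.
        yield from struct.pack("<H", v)


# Build the payload directly from the byte stream.
payload = bytes(serialize([1, 258]))
```

Here payload is b'\x01\x00\x02\x01', built without an intermediate list or buffer in user code.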