Chapel: Implementation of function like `codepointToString` for bytes

Created on 14 Apr 2020 · 3 Comments · Source: chapel-lang/chapel

Summary of Problem

codepointToString should not accept illegal UTF-8 values as arguments, but it does not throw an error when given one. For example:

codepointToString(142);

This also points to the need for a function like codepointToString for bytes.

Configuration Information

chpl version 1.23.0 pre-release (22dc508e76)

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none

Libraries / Modules · Bug · Feature Request

Source

Aniket21mathur

All 3 comments

We definitely need to do something about codepointToString.

Building strings/bytes from integers can be desired as a convenience over building a C-based buffer first and then creating the appropriate type out of it. What I think we should do is:

In the short term: fix our encoder and/or the implementation of codepointToString to throw when a non-Unicode codepoint is passed. And, just for symmetry, add the convenience function:

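inline proc byteToBytes(b: uint(8)) {
   // c_ptrTo takes its argument by ref, so copy the const formal into a local first
   var tmp = b;
   // the factory copies the one-byte buffer, so pointing at a local is safe
   return createBytesWithNewBuffer(c_ptrTo(tmp), 1);
}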
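
A rough Python analogue of the short-term proposal, as a sketch only. The names `codepoint_to_string` and `byte_to_bytes` are hypothetical, and treating "non-Unicode" as "outside 0..0x10FFFF or inside the surrogate range" is an assumption about what the check should reject, not something decided in this thread.

```python
def codepoint_to_string(cp: int) -> str:
    """Hypothetical checked conversion: reject values that are not Unicode scalar values."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"codepoint out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Python's chr() accepts surrogates, so this check has to be explicit.
        raise ValueError(f"surrogate codepoint: {cp:#x}")
    return chr(cp)


def byte_to_bytes(b: int) -> bytes:
    """Hypothetical bytes counterpart: any value 0..255 is a valid single byte."""
    if not 0 <= b <= 0xFF:
        raise ValueError(f"not a byte value: {b}")
    return bytes([b])
```

With these, `byte_to_bytes(142)` gives a one-byte value with no encoding step, while `codepoint_to_string` either returns a properly encodable string or raises.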

Python3 seems to do something special for non-unicode values. But we don't have a good 1-to-1 match with Python in this part of the interface:

>>> str("\u008e")  # not valid unicode
'\x8e'
>>> str("\u0394")  # valid unicode
'Δ'
>>>
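
For reference, a small Python sketch of how the integer 142 behaves when treated as a code point versus as a raw byte. It only illustrates Python's behavior, not Chapel's, and the variable names are made up for the example.

```python
cp = 0x8E  # 142

# Treated as a code point: chr() accepts it, and UTF-8 encoding yields two bytes.
as_string = chr(cp)
encoded = as_string.encode("utf-8")

# Treated as a raw byte: a single byte, no encoding involved.
raw = bytes([cp])

# The lone byte 0x8E is not a valid UTF-8 sequence, so decoding it fails.
try:
    raw.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(encoded, raw, lone_byte_is_valid_utf8)
```

The difference between `encoded` and `raw` is the gap a bytes counterpart to codepointToString would cover.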

In the long term: Do we need additional overloads of factory functions that can create string/bytes from integer arrays and/or iterators?
- If so, what do those integers mean for string? Are they codepoints? Are they UTF-8 encoded bytes? Can/should we support both somehow?
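
To make the ambiguity concrete, here is a Python sketch where the same list of integers produces two different strings depending on the interpretation. The variable names are illustrative only.

```python
values = [206, 148]

# Interpretation 1: the integers are UTF-8 encoded bytes.
from_utf8_bytes = bytes(values).decode("utf-8")

# Interpretation 2: the integers are codepoints, one character each.
from_codepoints = "".join(chr(v) for v in values)

print(repr(from_utf8_bytes), repr(from_codepoints))
```

Under the first reading the result is a single character (Δ); under the second it is two characters, so a factory function would have to pick one meaning or expose both explicitly.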

e-kayrakli on 14 Apr 2020

Do we need additional overloads of factory functions that can create string/bytes from integer arrays and/or iterators?

@e-kayrakli I don't think we should have additional overloads of this kind. The need to convert an array of integers into a string/bytes may vary from user to user. As you mentioned, there will always be ambiguity about what the integers represent. In that case, users can implement that kind of function themselves as their requirements dictate. Thoughts?

priyank23 on 15 Apr 2020

@priyank23 -- You have a point. But I really dislike the fact that if/when they need to do that, they'd have to either:

- Concatenate multiple smaller strings/bytes, which has performance overheads
- Use C pointers, which are too low-level a feature for creating a value of a primitive type

string and bytes may be a bit different in this respect, too. For string, one example I can think of is parsing some config or a csv-like file where there can be some comments. You may wanna read that character by character and ignore comments. In that scenario, I could imagine writing an iterator that reads the file character-by-character and yields characters only if you are not in a comment. In that case, wouldn't it be attractive to do the following?

var fullContents = createStringWithNewBuffer(myFileReader());

The alternative involves reading multiple strings and concatenating them, which would have performance overheads as I mentioned above.
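
The same idea sketched in Python, where a string can be built directly from an iterator of characters. `non_comment_chars` is a hypothetical helper, and using `#` as the comment marker is an assumption for the example.

```python
from typing import Iterator


def non_comment_chars(text: str, marker: str = "#") -> Iterator[str]:
    """Yield characters one at a time, skipping everything from `marker` to end of line."""
    in_comment = False
    for ch in text:
        if ch == marker:
            in_comment = True
        elif ch == "\n":
            in_comment = False  # a newline always ends the comment and is kept
        if not in_comment:
            yield ch


config = "a=1 # first setting\nb=2\n"

# One pass over the iterator, no intermediate per-line strings to concatenate.
full_contents = "".join(non_comment_chars(config))
print(repr(full_contents))
```

An iterator-accepting factory in Chapel would play the role that `"".join(...)` plays here.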

I think we need this more for bytes because I personally see bytes as somewhere between a string and a pure byte array. For example, you can send bytes-based ZMQ messages that involve arbitrarily-serialized data. In that case it may be appealing to create a bytes from an iterator that is basically a bit stream.
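
A Python sketch of the bytes case: an iterator yields integer byte values and the bytes object is built from it in one step. The `frame` helper and its 4-byte length prefix are invented for illustration and are not a ZMQ format.

```python
from typing import Iterator


def frame(payload: bytes) -> Iterator[int]:
    """Yield the bytes of a length-prefixed message, one integer (0..255) at a time."""
    yield from len(payload).to_bytes(4, "big")  # 4-byte big-endian length prefix
    yield from payload                          # iterating bytes yields ints


# Arbitrary binary data, including a byte (0x8e) that is not valid UTF-8 on its own.
message = bytes(frame(b"\x8e\x00hi"))
print(message)
```

Nothing here needs to be valid text, which is why this use case fits bytes rather than string.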

e-kayrakli on 15 Apr 2020
