Describe the bug
nvtext.tokenize with the default delimiter gives incorrect results for strings containing low ASCII control characters. See below:
Steps/Code to reproduce bug
import nvtext
import nvstrings
text_str = nvstrings.to_device(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"])
print(nvtext.tokenize(text_str))
The current output is below:
['test_str', 'test_str', 'test_str', 'test_str', 'test_str', 'test_str']
Expected behavior
The expected output should be the input string unsplit, since none of these characters is a literal space:
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']
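For comparison, Python's own str.split with no arguments does not treat \x01 through \x06 as whitespace, which matches the expectation above:

>>> "test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06".split()
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']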
The default delimiter is intended to split on 'whitespace'. The tokenize method considers any character with a code point of 32 (space) and below to be a control character (tab, newline, form-feed, etc.) and therefore treats it as whitespace as an optimization. See ascii.com for a list of these character definitions.
If you wish to split on the space character only, you can pass that as the delimiter parameter instead of using the default.
>>> print( nvtext.tokenize(text_str, delimiter=' ') )
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']
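For illustration, here is a minimal pure-Python sketch of the code-point rule described above (this is just a reference for the splitting behavior, not the actual CUDA implementation):

import re

def whitespace_tokenize(s):
    # treat every character with code point 32 (space) and below as a
    # delimiter, mirroring the rule tokenize applies by default
    return [t for t in re.split("[\x00-\x20]+", s) if t]

print(whitespace_tokenize("test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"))
# ['test_str', 'test_str', 'test_str', 'test_str', 'test_str', 'test_str']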
The nvtext.tokenize documentation states that the default delimiter is whitespace:
https://github.com/rapidsai/cudf/blob/a84d3b5edf57adedd132f6f27b768ee7e093f674/python/nvstrings/nvtext.py#L19
https://rapidsai.github.io/projects/nvstrings/en/0.9.0/api.html#nvtext.tokenize
Got it, thanks for clarifying.
I agree with treating any character with a code point of 32 (space) and below as whitespace, but this makes tokenize inconsistent with the default behavior of nvtext.token_count, which appears to treat only the space character as the delimiter.
I think we should change the token_count default to treat the same characters (code point 32 and below) as delimiters, because in workflows we may want to use token_count to gather labels for the tokenized strings.
text_str = nvstrings.to_device(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"])
print(nvtext.token_count(text_str))
[1]
https://rapidsai.github.io/projects/nvstrings/en/0.10.0/api.html#nvtext.token_count
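For reference, a pure-Python sketch of the proposed token_count default, using the same code-point-32 rule as tokenize (an illustration of the proposal, not the actual implementation):

import re

def token_count_whitespace(strings):
    # count tokens per string, treating code points 32 and below as delimiters
    return [len([t for t in re.split("[\x00-\x20]+", s) if t]) for s in strings]

print(token_count_whitespace(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"]))
# [6], consistent with the number of tokens nvtext.tokenize produces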
I agree. I think we can change this bug to "Change the default delimiter for nvtext.token_count" or something like that. The underlying C++ code for nvtext.token_count uses whitespace when the delimiter parameter is null. So changes are only required in pytext.cpp, as you mentioned: https://github.com/rapidsai/cudf/blob/a80e9a1e173abf60f5703486933e80d87982212a/python/nvstrings/cpp/pytext.cpp#L142
and in the Python code: https://github.com/rapidsai/cudf/blob/bdb5291968491c4f216e8d41a4d3b21c1bf32421/python/nvstrings/nvtext.py#L76
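At the Python level the change could be as small as defaulting the delimiter to None so the C++ whitespace path is taken. A sketch of what that might look like in nvtext.py (the exact existing signature is an assumption based on the links above, where the default is currently the space character):

def token_count(strs, delimiter=None, devptr=0):
    # hypothetical signature change: passing delimiter=None through to the
    # C++ layer makes it use its whitespace default (code point <= 32),
    # matching the nvtext.tokenize default
    ...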
Got it, thanks for clarifying. I have updated the bug heading.
I will put up a PR to fix this bug soon; can you please assign the issue to me?
Thanks.
In the same spirit, I think we might also want to look into ensuring a common default delimiter behavior for all the functions that take a delimiter argument; see the sketch below.
Please let me know if that is something we would like; I can work on fixing it (at least the parts that can be done at the Python level) and on creating issues for the parts that need changes on the C++ side.
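One low-effort way to audit this from the Python side would be to inspect the signatures of everything in nvtext that takes a delimiter (a sketch; it assumes the defaults are visible in the Python-level signatures rather than buried in the C++ layer):

import inspect
import nvtext

for name, fn in inspect.getmembers(nvtext, inspect.isfunction):
    params = inspect.signature(fn).parameters
    if "delimiter" in params:
        print(name, "-> delimiter default:", repr(params["delimiter"].default))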
I think we have some flexibility on nvtext delimiter defaults but I tried to make the nvstrings delimiter defaults match Pandas. You are welcome to check these and create issues as necessary.
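For reference, Pandas' Series.str.split splits on whitespace when no pattern is given, which is the behavior the nvstrings defaults aim to match. Note that Python's notion of whitespace does not include \x01 through \x06, so nvtext's code-point-32 rule is intentionally broader:

>>> import pandas as pd
>>> pd.Series(["a b\tc"]).str.split()[0]
['a', 'b', 'c']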
Hi @VibhuJawa, if you are busy working on something else then I can take this up.
It would be amazing if you could pick this up, @AK-ayush; this issue slipped through the cracks for me. Thanks!