Describe the bug
nvtext.tokenize with the default delimiter gives incorrect results for strings containing low ASCII control characters. See below:
Steps/Code to reproduce bug
import nvtext
import nvstrings
text_str = nvstrings.to_device(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"])
print(nvtext.tokenize(text_str))
The current output is below:
['test_str', 'test_str', 'test_str', 'test_str', 'test_str', 'test_str']
Expected behavior
The expected output should be the input string unsplit, since none of these characters is a literal space:
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']
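For comparison, Python's own str.split with no arguments does not treat \x01 through \x06 as whitespace, which matches the expectation above:

>>> "test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06".split()
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']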
The default delimiter is intended to split on 'whitespace'. The tokenize method considers any character with a code point of 32 (space) and below to be a control character (tab, newline, form-feed, etc.) and therefore treats it as whitespace as an optimization. See ascii.com for a list of these character definitions.
If you wish to split on the space character only, you can pass that as the delimiter parameter instead of using the default.
>>> print( nvtext.tokenize(text_str, delimiter=' ') )
['test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06']
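For illustration, here is a minimal pure-Python sketch of the code-point rule described above (this is just a reference for the splitting behavior, not the actual CUDA implementation):

import re

def whitespace_tokenize(s):
    # treat every character with code point 32 (space) and below as a
    # delimiter, mirroring the rule tokenize applies by default
    return [t for t in re.split("[\x00-\x20]+", s) if t]

print(whitespace_tokenize("test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"))
# ['test_str', 'test_str', 'test_str', 'test_str', 'test_str', 'test_str']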
The nvtext.tokenize documentation states that the default delimiter is whitespace:
https://github.com/rapidsai/cudf/blob/a84d3b5edf57adedd132f6f27b768ee7e093f674/python/nvstrings/nvtext.py#L19
https://rapidsai.github.io/projects/nvstrings/en/0.9.0/api.html#nvtext.tokenize
Got it, thanks for clarifying.
I agree with treating any character with a code point of 32 (space) and below as whitespace, but this makes tokenize inconsistent with the default behavior of nvtext.token_count, which appears to treat only the space character as the delimiter.
I think we should change the token_count default to treat the same characters (code point 32 and below) as delimiters, because in workflows we may want to use token_count to gather labels for the tokenized strings.
text_str = nvstrings.to_device(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"])
print(nvtext.token_count(text_str))
[1]
https://rapidsai.github.io/projects/nvstrings/en/0.10.0/api.html#nvtext.token_count
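For reference, a pure-Python sketch of the proposed token_count default, using the same code-point-32 rule as tokenize (an illustration of the proposal, not the actual implementation):

import re

def token_count_whitespace(strings):
    # count tokens per string, treating code points 32 and below as delimiters
    return [len([t for t in re.split("[\x00-\x20]+", s) if t]) for s in strings]

print(token_count_whitespace(["test_str\x01test_str\x02test_str\x03test_str\x04test_str\x05test_str\x06"]))
# [6], consistent with the number of tokens nvtext.tokenize produces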
I agree. I think we can change this bug to "Change the default delimiter for nvtext.token_count" or something like that. The underlying C++ code for nvtext.token_count uses whitespace when the delimiter parameter is null. So changes are only required in pytext.cpp, as you mentioned: https://github.com/rapidsai/cudf/blob/a80e9a1e173abf60f5703486933e80d87982212a/python/nvstrings/cpp/pytext.cpp#L142
and in the Python code: https://github.com/rapidsai/cudf/blob/bdb5291968491c4f216e8d41a4d3b21c1bf32421/python/nvstrings/nvtext.py#L76
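At the Python level the change could be as small as defaulting the delimiter to None so the C++ whitespace path is taken. A sketch of what that might look like in nvtext.py (the exact existing signature is an assumption based on the links above, where the default is currently the space character):

def token_count(strs, delimiter=None, devptr=0):
    # hypothetical signature change: passing delimiter=None through to the
    # C++ layer makes it use its whitespace default (code point <= 32),
    # matching the nvtext.tokenize default
    ...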
Got it, thanks for clarifying. I have updated the bug heading.
I will put up a PR to fix this bug soon; can you please assign the issue to me?
Thanks.
In the same spirit, I think we might also want to look into ensuring a common default delimiter behavior for all the functions that take a delimiter argument; see the sketch below.
Please let me know if that is something we would like; I can work on fixing it (at least the parts that can be done at the Python level) and on creating issues for the parts that need changes on the C++ side.
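One low-effort way to audit this from the Python side would be to inspect the signatures of everything in nvtext that takes a delimiter (a sketch; it assumes the defaults are visible in the Python-level signatures rather than buried in the C++ layer):

import inspect
import nvtext

for name, fn in inspect.getmembers(nvtext, inspect.isfunction):
    params = inspect.signature(fn).parameters
    if "delimiter" in params:
        print(name, "-> delimiter default:", repr(params["delimiter"].default))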
I think we have some flexibility on nvtext delimiter defaults but I tried to make the nvstrings delimiter defaults match Pandas. You are welcome to check these and create issues as necessary.
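For reference, Pandas' Series.str.split splits on whitespace when no pattern is given, which is the behavior the nvstrings defaults aim to match. Note that Python's notion of whitespace does not include \x01 through \x06, so nvtext's code-point-32 rule is intentionally broader:

>>> import pandas as pd
>>> pd.Series(["a b\tc"]).str.split()[0]
['a', 'b', 'c']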
Hi @VibhuJawa, if you are busy working on something else then I can take this up.
It would be amazing if you could pick this up, @AK-ayush; this issue slipped through the cracks for me. Thanks!