Cudf: [FEA] strings libcudf split & rsplit api needs to align with pandas behavior for `None` and specific cases

Created on 6 Mar 2020 · 9 comments · Source: rapidsai/cudf

Describe the bug
The libcudf strings split & rsplit APIs differ from pandas behavior in the set of cases listed below.

Steps/Code to reproduce bug

When splitting with no separator character (whitespace split), the None for the empty string seems to be missing, unlike in pandas.

>>> import cudf
>>> s = cudf.Series(['23', '³', '⅕', ''])
>>> s.to_pandas().str.split(expand=True)
      0
0    23
1     ³
2     ⅕
3  None
>>> s.str.split(expand=True)
    0
0  23
1   ³
2   ⅕
3   



>>> s.to_pandas().str.rsplit(expand=True)
      0
0    23
1     ³
2     ⅕
3  None
>>> s.str.rsplit(expand=True)
    0
0  23
1   ³
2   ⅕
3     

In this specific case, where I have a space, a tab/carriage-return/newline string, and an empty string, the expected pandas output is an empty DataFrame when no pat is given.

>>> s = cudf.Series([' ', '\t\r\n ', ''])
>>> s.str.split()
        0     1
0              
1  \t\r\n      
2          None
>>> s.str.split(expand=True)
        0     1
0              
1  \t\r\n      
2          None
>>> s.to_pandas().str.split(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>> 

>>> s.str.rsplit(expand=True)
        0     1
0              
1  \t\r\n      
2          None
>>> s.to_pandas().str.rsplit(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]

But the output matches pandas when pat is given explicitly:

>>> s.str.split(pat=" ",expand=True)
        0     1
0              
1  \t\r\n      
2          None
>>> s.to_pandas().str.split(pat=" ",expand=True)
        0     1
0              
1  \t\r\n      
2          None




Expected behavior
Similar to that of pandas.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source

Environment details
Output of the cudf/print_env.sh script here: env.txt

Additional context
This can be tested using PR: #4311

bug libcudf strings

All 9 comments

The cudf::strings::split() was purposely implemented differently from Pandas in general. First, there is no expand parameter. Essentially the code executes similarly to expand=True in Pandas, and the number of columns returned is based on the string with the most tokens. It may be possible to get something like expand=False behavior using cudf::strings::contiguous_split_record(), which was ported from nvstrings::split_record() and implemented similarly to cudf::contiguous_split() in that a single piece of memory represents multiple columns.

So just checking you are calling cudf::strings::split() here and that the expand parameter is being ignored.

I've not been able to verify the output you are seeing using cudf::strings::split(),
so please verify how you are calling libcudf for the cases in the description.
For the first one:

    cudf::test::strings_column_wrapper strings( {"23", "³", "⅕", ""} );
    auto results = cudf::strings::split( cudf::strings_column_view(strings) );
    cudf::strings::print( cudf::strings_column_view(results->view().column(idx)) );

Output is:

0:[23]
1:[³]
2:[⅕]
3:<null>

which matches Pandas in the description.

Likewise, the second one:

    cudf::test::strings_column_wrapper strings( {" ", "\t\r\n ", ""} );
    auto results = cudf::strings::split( cudf::strings_column_view(strings) );
    cudf::strings::print( cudf::strings_column_view(results->view().column(idx)) );

This produces a column of size=3 with all nulls, which differs from what the description claims.

So just checking you are calling cudf::strings::split() here and that the expand parameter is being ignored.

Yes, currently the call only goes to cudf::strings::split(). The reason is that if we use cudf::strings::contiguous_split_record(), we currently cannot store list-type objects in a Series. Hence .str.split defaults to expand=True behaviour.

The default separator should be None or '' and not a space.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
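As a point of reference, Python's own str.split shows the same distinction between the default (None) separator and an explicit space; a small sketch:

```python
# With no separator (None), runs of whitespace are collapsed and
# leading/trailing whitespace is ignored, so no empty tokens appear.
# With an explicit " ", each space is a hard boundary and can
# produce empty tokens.
assert "\t\r\n ".split() == []
assert "\t\r\n ".split(" ") == ["\t\r\n", ""]
```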

Using '' instead of " " fixed the issue. But one observation in the second case:

>>> s = cudf.Series([' ', '\t\r\n ', ''])
>>> s.str.split()
      0
0  None
1  None
2  None
>>> s.to_pandas().str.split(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>> cudf.DataFrame()
Empty DataFrame
Columns: []
Index: []

It seems we return a column with None values, so should cudf::strings::split() be returning an empty column? Or do you recommend handling it on the Python side during construction of the DataFrame from the resulting table?

It seems we return a column with None values, so should cudf::strings::split() be returning an empty column? Or do you recommend handling it on the Python side during construction of the DataFrame from the resulting table?

I recommend handling this on the python side since the DataFrame still seems to have indexes.
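Such Python-side handling might look like the sketch below. The function name and list-based representation are illustrative only, not cudf's actual internals: it mimics the whitespace split with expand=True, padding rows to the widest token count and returning zero columns when no row produces any tokens, so the caller can build an empty DataFrame that still carries the original index.

```python
def split_expand(rows):
    # Hypothetical sketch, using plain lists in place of cudf columns.
    # Split each string on whitespace, pad rows to the widest token
    # count with None; when no row produced any tokens, return zero
    # columns (the row count / index is preserved by the caller).
    tokens = [r.split() for r in rows]
    width = max((len(t) for t in tokens), default=0)
    if width == 0:
        return []  # zero columns -> empty DataFrame with original index
    return [t + [None] * (width - len(t)) for t in tokens]
```

For the cases in the description, `split_expand([" ", "\t\r\n ", ""])` returns zero columns, while `split_expand(["23", "⅕", ""])` pads the empty string's row with None, matching pandas.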

You may also want to make the same change to partition()/rpartition() which also appears to use space as the default separator.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.partition.html#pandas.Series.str.partition

Never mind; it looks like partition()'s default space separator is correct.

I recommend handling this on the python side since the DataFrame still seems to have indexes.

Yes, I will be going ahead with this approach of handling it on the Python side. I will close this issue with a PR, along with enabling some disabled code chunks.

Closing as not a libcudf issue.
