Describe the bug
Our string libcudf split & rsplit api seems to differ in behavior in some set of cases as listed below.
Steps/Code to reproduce bug
For empty character split, None seems to be missing unlike in Pandas.
>>> import cudf
>>> s = cudf.Series(['23', '³', '⅕', ''])
>>> s.to_pandas().str.split(expand=True)
0
0 23
1 ³
2 â…•
3 None
>>> s.str.split(expand=True)
0
0 23
1 ³
2 â…•
3
>>> s.to_pandas().str.rsplit(expand=True)
0
0 23
1 ³
2 â…•
3 None
>>> s.str.rsplit(expand=True)
0
0 23
1 ³
2 â…•
3
In this specific case where I have a space, tabs-carriage return-newline, empty string. Expectation from pandas output is empty dataframe when no pat is given.
>>> s = cudf.Series([' ', '\t\r\n ', ''])
>>> s.str.split()
0 1
0
1 \t\r\n
2 None
>>> s.str.split(expand=True)
0 1
0
1 \t\r\n
2 None
>>> s.to_pandas().str.split(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>>
>>> s.str.rsplit(expand=True)
0 1
0
1 \t\r\n
2 None
>>> s.to_pandas().str.rsplit(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]
But works fine when we have pat defined as:
>>> s.str.split(pat=" ",expand=True)
0 1
0
1 \t\r\n
2 None
>>> s.to_pandas().str.split(pat=" ",expand=True)
0 1
0
1 \t\r\n
2 None
Note: Please ignore reg coloring in code-block, github seems to be highlighting it automatically.
Expected behavior
Similar to that of pandas.
Environment overview (please complete the following information)
Environment details
Output of the cudf/print_env.sh script here: env.txt
Additional context
This can be tested using PR: #4311
The cudf::strings::split() was purposely implemented different than Pandas in general. First, there is no expand parameter. Essentially the code executes similar to expand=True in Pandas and columns are returned based on the string with the most tokens in the column. It may be possible to get something like expand=False behavior using the cudf::strings::continugous_split_record() which was ported based on nvstrings::split_record() and implemented similar to cudf::contigous_split() in that a single piece of memory represents multiple columns.
So just checking you are calling cudf::strings::split() here and that the expand parameter is being ignored.
I've not been able to verify the output your are seeing using cudf::strings::split()
So please verify how you are calling libcudf for the cases in the description.
For the first one:
cudf::test::strings_column_wrapper strings( {"23", "³", "⅕", ""} );
auto results = cudf::strings::split( cudf::strings_column_view(strings) );
cudf::strings::print( cudf::strings_column_view(reults->view().column(idx)) );
Output is:
0:[23]
1:[³]
2:[â…•]
3:<null>
which matches Pandas in the description.
Likewise, the second one:
cudf::test::strings_column_wrapper strings( {" ", "\t\r\n ", ""} );
auto results = cudf::strings::split( cudf::strings_column_view(strings) );
cudf::strings::print( cudf::strings_column_view(reults->view().column(idx)) );
This produces as column of size=3 with all nulls which is different than what the description claims.
Looks like the error may be in the python code here:
https://github.com/rapidsai/cudf/blob/8323de3f6a5f27d9135bc522d28d4aab620764af/python/cudf/cudf/core/column/string.py#L982
if pat is None:
pat = " "
The default should be None or '' and not space.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
So just checking you are calling
cudf::strings::split()here and that theexpandparameter is being ignored.
Yes, currently only the call goes to cudf::strings::split(). Reason is if we usecudf::strings::continugous_split_record() we currently cannot store list type objects in Series. Hence .str.split defaults to expand=True behaviour.
The default should be None or '' and not space.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
Using '' instead of " " fixed the issue. But one observation in the second case:
>>> s = cudf.Series([' ', '\t\r\n ', ''])
>>> s.str.split()
0
0 None
1 None
2 None
>>> s.to_pandas().str.split(expand=True)
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>> cudf.DataFrame()
Empty DataFrame
Columns: []
Index: []
Seems like we return column with None values, so should cudf::strings::split() be returning an empty column? or do you recommend handling it on the python side while construction of Datagra,e from the resulting table?
Seems like we return column with None values, so should cudf::strings::split() be returning an empty column? or do you recommend handling it on the python side while construction of Datagra,e from the resulting table?
I recommend handling this on the python side since the DataFrame still seems to have indexes.
You may also want to make the same change to partition()/rpartition() which also appears to use space as the default separator.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.partition.html#pandas.Series.str.partition
Nevermind, looks like partition() default space separator is correct.
I recommend handling this on the python side since the DataFrame still seems to have indexes.
Yes, will be going ahead with this approach of handling it on python side. I will close this issue with a PR along with enabling some disabled code chunks.
Closing as not a libcudf issue.