Not too pressing, but cuDF should either allow users to pass a .str StringMethods object as an argument to .str methods (for example, str.cat), or raise an informative error. pandas accepts this. cuDF fails silently and returns an empty Series.
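import cudf
import pandas as pd

df = cudf.DataFrame({'a': ['s', 't', 'r'], 'b': ['a', 'b', 'c']})
pdf = df.to_pandas()

# cuDF: passing a Series works; passing the .str accessor silently returns an empty Series
print(df.a.str.cat(df.b))
print(df.a.str.cat(df.b.str))

# pandas: both calls return the concatenated result
print(pdf.a.str.cat(pdf.b))
print(pdf.a.str.cat(pdf.b.str))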
0 sa
1 tb
2 rc
dtype: object
<empty Series of dtype=float64>
0 sa
1 tb
2 rc
Name: a, dtype: object
0 sa
1 tb
2 rc
Name: a, dtype: object
I was looking into this issue and observed that pandas concatenates only the largest string in the Series, whereas cuDF, after I made the required changes, concatenates every string in the Series.
Which is the more appropriate behaviour?
import pandas as pd
import cudf

arr = ["AbC", "de", "FGHI", "j", "kLLLm"]

# pandas: pass the .str accessor as `others`
ps = pd.Series(arr)
expect = ps.str.cat(others=ps.str)
print(expect)

# cuDF, with my local changes applied
gs = cudf.Series(arr)
got = gs.str.cat(others=gs.str)
print(got)
Out[16]:
0 NaN
1 NaN
2 NaN
3 NaN
4 kLLLmkLLLm
dtype: object
Out[18]:
0 AbCAbC
1 dede
2 FGHIFGHI
3 jj
4 kLLLmkLLLm
dtype: object
Hmm, that behavior discrepancy feels like a bug in pandas to me. If we decide to support passing StringMethods objects, we should probably not silently concatenate only the largest string.
I've submitted an issue in the pandas-dev GitHub repo:
https://github.com/pandas-dev/pandas/issues/28277
Thanks @AK-ayush. Given @TomAugspurger's comment in the pandas thread linked above, I'm inclined to change this issue later today to be about raising an exception here, unless there are other opinions. @kkraus14, let me know if you think we should still support this usage pattern.
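For anyone picking this up, here is a rough sketch of what the "raise an exception" option could look like. It is not cuDF's or pandas' actual code: the `StringMethods` class below is a hypothetical stand-in written in plain Python, and the error message wording is only a suggestion. The idea is simply to type-check `others` up front so the accessor is rejected instead of producing an empty result.

```python
# Hypothetical sketch: this StringMethods class is a plain-Python stand-in,
# not the cuDF or pandas accessor, and the messages are only suggestions.


class StringMethods:
    """Stand-in for the `.str` accessor; holds the parent's string values."""

    def __init__(self, values):
        self._values = list(values)

    def cat(self, others):
        # Reject accessor objects up front instead of failing silently.
        if isinstance(others, StringMethods):
            raise TypeError(
                "others must be a Series, Index, or list-like of strings, "
                "not a StringMethods accessor; pass the Series itself "
                "(a.str.cat(b) rather than a.str.cat(b.str))"
            )
        others = list(others)
        if len(others) != len(self._values):
            raise ValueError("others must have the same length as the caller")
        # Element-wise concatenation, row by row.
        return [left + right for left, right in zip(self._values, others)]


a = StringMethods(["s", "t", "r"])
b = StringMethods(["a", "b", "c"])

# Passing the values themselves concatenates row by row.
print(a.cat(["a", "b", "c"]))

# Passing the accessor now fails loudly.
try:
    a.cat(b)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

In the real accessor the check would presumably sit at the top of `cat`, before any column conversion, so the user sees the message rather than an empty Series.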