Describe the bug
Appending two cudf.Series of string type (stacking them vertically, not concatenating them horizontally) raises an exception: AttributeError: 'nvstrings' object has no attribute 'dtype'
Steps/Code to reproduce bug
Pandas:
import pandas as pd
x = pd.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0 abc
1 def
0 xyz
1 pqr
dtype: object
In the case of cudf:
import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-201-76914250ef80> in <module>
1 import cudf
2 x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
----> 3 print(x.one.append(x.two))
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0+37.g2392d5a7.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/series.py in append(self, arbitrary)
629 other_col = other._column
630 # return new series
--> 631 return Series(self._column.append(other_col))
632
633 @property
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0+37.g2392d5a7.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/column.py in append(self, other)
538 newsize = len(self) + len(other)
539 # allocate memory
--> 540 data_dtype = np.result_type(self.data.dtype, other.data.dtype)
541 mem = rmm.device_array(shape=newsize, dtype=data_dtype)
542 newbuf = Buffer.from_empty(mem)
AttributeError: 'nvstrings' object has no attribute 'dtype'
Expected behavior
An appended Series object should be returned, as with a pandas Series.
Environment details (please complete the following information):
cudf/print_env.sh script output:
Additional context
As a workaround, I'm currently calling add_strings on the .data of the two columns:
import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.data.add_strings(x.two.data))
output:
['abc', 'def', 'xyz', 'pqr']
This should be fixed in branch 0.8; looking at env.txt it looks like you are running 0.7
In [1]: import cudf
...: x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
...: print(x.one.append(x.two))
0 abc
1 def
2 xyz
3 pqr
dtype: object
You mention that the method used to install cuDF was conda-nightly; could you please tell us what commands you used to install cuDF?
@shwina I just pulled the latest 0.8 nightly and, as you said, it works now. One observation: cudf Series.append seems to reset the index by default, as if .reset_index(drop=True) had been applied. Is there a reason for this? I ask because pandas Series doesn't do so.
Cudf:
import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0 abc
1 def
2 xyz
3 pqr
dtype: object
Pandas:
import pandas as pd
x = pd.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0 abc
1 def
0 xyz
1 pqr
dtype: object
print(x.one.append(x.two).reset_index(drop=True))
0 abc
1 def
2 xyz
3 pqr
dtype: object
You're right - by default we use ignore_index=True in the append() method, while Pandas uses ignore_index=False (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.append.html).
We'll change this to match Pandas' behaviour
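The difference between the two defaults can be reproduced in pandas alone. This is a minimal sketch, not the cuDF implementation: it assumes a recent pandas and uses pd.concat, which also takes an ignore_index parameter, because Series.append was removed in pandas 2.0. The variable names are illustrative only.

```python
import pandas as pd

one = pd.Series(["abc", "def"])
two = pd.Series(["xyz", "pqr"])

# ignore_index=False (the pandas default): each input keeps its own labels,
# so the combined index contains duplicates.
kept = pd.concat([one, two], ignore_index=False)

# ignore_index=True: the original labels are discarded and replaced
# by a fresh 0..n-1 index, which is what cuDF 0.8 did by default.
renumbered = pd.concat([one, two], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

With ignore_index=False, a label lookup such as kept[0] matches two rows, so code that relies on unique labels needs the explicit reset_index(drop=True) shown earlier in the thread.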
Cool 👍
Closing this issue as #1941 is fixed.