Cudf: [BUG] cudf Series append for string column throws Exception

Created on 6 Jun 2019 · 5 Comments · Source: rapidsai/cudf

Describe the bug
Appending two cudf.Series of string type (appending vertically, not concatenating horizontally) raises an exception (AttributeError: 'nvstrings' object has no attribute 'dtype').

Steps/Code to reproduce bug
Pandas:

import pandas as pd
x = pd.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0    abc
1    def
0    xyz
1    pqr
dtype: object

In the case of cudf:


import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-201-76914250ef80> in <module>
      1 import cudf
      2 x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
----> 3 print(x.one.append(x.two))

/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0+37.g2392d5a7.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/series.py in append(self, arbitrary)
    629         other_col = other._column
    630         # return new series
--> 631         return Series(self._column.append(other_col))
    632 
    633     @property

/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0+37.g2392d5a7.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/column.py in append(self, other)
    538         newsize = len(self) + len(other)
    539         # allocate memory
--> 540         data_dtype = np.result_type(self.data.dtype, other.data.dtype)
    541         mem = rmm.device_array(shape=newsize, dtype=data_dtype)
    542         newbuf = Buffer.from_empty(mem)

AttributeError: 'nvstrings' object has no attribute 'dtype'
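
The traceback shows column.py reading .dtype from each column's data, which for a string column is an nvstrings object. The sketch below imitates that failure without a GPU. FakeStrings and result_dtype are hypothetical names used only for illustration and are not part of cudf or nvstrings; only np.result_type is taken from the traceback.

```python
import numpy as np


class FakeStrings:
    """Hypothetical stand-in for an nvstrings object: holds strings, exposes no dtype."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)


def result_dtype(left, right):
    # Same shape as the failing line in the traceback: both operands must expose .dtype.
    return np.result_type(left.dtype, right.dtype)


# Numeric buffers work because NumPy arrays carry a dtype.
numeric = result_dtype(np.zeros(2, dtype="int32"), np.zeros(2, dtype="float64"))

# The string stand-in fails with the same kind of AttributeError as in the report.
try:
    result_dtype(FakeStrings(["abc", "def"]), FakeStrings(["xyz", "pqr"]))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(numeric)
print(error_message)
```

This suggests the numeric append path assumed every column's data has a dtype, so string columns presumably needed a separate code path.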

Expected behavior
The expected behavior is that append returns the appended Series, as pandas Series.append does.

Environment details (please complete the following information):

  • Environment location: Docker

    • Method of cuDF install: conda-nightly

    • cudf/print_env.sh script output:

env.txt

Additional context
As a workaround, I'm currently calling add_strings on the .data of the two columns:

import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.data.add_strings(x.two.data))

output:

['abc', 'def', 'xyz', 'pqr']
bug cuDF (Python)

All 5 comments

This should be fixed in branch 0.8; looking at env.txt it looks like you are running 0.7

In [1]: import cudf 
   ...: x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]}) 
   ...: print(x.one.append(x.two))                  
0    abc
1    def
2    xyz
3    pqr
dtype: object

You mention that the method used to install cuDF was conda-nightly; could you please tell us what commands you used to install cuDF?

@shwina I just pulled the latest 0.8 nightly and, as you said, it works now. One observation: cudf Series.append seems to reset the index by default (as if .reset_index(drop=True) were applied). Is there a reason for this? I'm asking because pandas Series.append doesn't do so.

Cudf:

import cudf
x = cudf.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0    abc
1    def
2    xyz
3    pqr
dtype: object

Pandas:


import pandas as pd
x = pd.DataFrame({'one':["abc","def"],'two':["xyz","pqr"]})
print(x.one.append(x.two))
0    abc
1    def
0    xyz
1    pqr
dtype: object



print(x.one.append(x.two).reset_index(drop=True))
0    abc
1    def
2    xyz
3    pqr
dtype: object

You're right - by default we use ignore_index=True in the append() method, while Pandas uses ignore_index=False (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.append.html).

We'll change this to match Pandas' behaviour
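
A minimal pandas-only sketch of the two index behaviours discussed in this thread. It uses pd.concat instead of Series.append, because Series.append was removed in pandas 2.0; the ignore_index parameter has the same meaning in both. The variable names are illustrative only.

```python
import pandas as pd

one = pd.Series(["abc", "def"])
two = pd.Series(["xyz", "pqr"])

# ignore_index=False (the pandas default): each input keeps its own labels.
kept = pd.concat([one, two], ignore_index=False)

# ignore_index=True: the result is renumbered from 0.
renumbered = pd.concat([one, two], ignore_index=True)

# Resetting the index afterwards gives the same renumbered result.
reset = kept.reset_index(drop=True)

print(list(kept.index))
print(list(renumbered.index))
print(list(reset.index))
```

With the default, duplicate labels such as 0 and 1 appear twice in the result, which is why label-based lookups on the appended Series can return more than one row.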

Cool 👍

Closing this issue as #1941 is fixed.
