Cudf: [BUG] Float column being stored as Object and throws error upon conversion

Created on 13 May 2020 · 8 Comments · Source: rapidsai/cudf

Reproducer Code

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = range(10)
>>> df['b'] = 1234324233.13
>>> df['c'] = None
>>> df
   a             b     c
0  0  1.234324e+09  None
1  1  1.234324e+09  None
2  2  1.234324e+09  None
3  3  1.234324e+09  None
4  4  1.234324e+09  None
5  5  1.234324e+09  None
6  6  1.234324e+09  None
7  7  1.234324e+09  None
8  8  1.234324e+09  None
9  9  1.234324e+09  None
>>> df['c'][df.a < 10] = df.b[df.a < 10]/10
>>> df
   a             b            c
0  0  1.234324e+09  123432423.3
1  1  1.234324e+09  123432423.3
2  2  1.234324e+09  123432423.3
3  3  1.234324e+09  123432423.3
4  4  1.234324e+09  123432423.3
5  5  1.234324e+09  123432423.3
6  6  1.234324e+09  123432423.3
7  7  1.234324e+09  123432423.3
8  8  1.234324e+09  123432423.3
9  9  1.234324e+09  123432423.3
>>> df['c'] = df['c'].astype('int64')

Exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py", line 1445, in astype
    raise e
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py", line 1441, in astype
    data=self._column.astype(dtype, **kwargs)
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/column.py", line 840, in astype
    return self.as_numerical_column(dtype, **kwargs)
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py", line 2073, in as_numerical_column
    type due to presence of non-integer values."
ValueError: Could not convert strings to integer                         type due to presence of non-integer values.

Explanation

After the line df['c'][df.a < 10] = df.b[df.a < 10]/10, the dtype of column 'c' is still 'object' rather than 'float64'.
This only happens when the assignment goes through a boolean mask.
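
One way to confirm this diagnosis is to compare the column dtype before and after the masked assignment. The sketch below is a minimal check that uses pandas on the CPU as a stand-in for cudf (an assumption, based on the comments in this thread reporting that pandas keeps the column as object too). It also uses `.loc` instead of the chained indexing from the reproducer, which is a substitution and not what the original code does.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})
df["b"] = 1234324233.13
df["c"] = None  # an all-None column is stored with dtype object

mask = df["a"] < 10
dtype_before = df["c"].dtype

# Masked assignment writes the float values into the existing object column
# rather than replacing the column, so the dtype is not re-inferred.
df.loc[mask, "c"] = df.loc[mask, "b"] / 10
dtype_after = df["c"].dtype

print(dtype_before, dtype_after)
```

If both dtypes print as object, a later astype('int64') is converting from an object column rather than from a float column.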

? - Needs Triage bug

All 8 comments

This only happens on the latest branch-0.14 nightly.

@mlahir1 this change was introduced recently with PR #5054. You can observe that this would fail in pandas as well, since 123432423.3 is not an integer:

In [13]: import pandas as pd                                                                                                                                                                                                     

In [14]: df = pd.DataFrame({"a":["1.1", "2.2", "3.3"]})                                                                                                                                                                          

In [15]: df['a'] = df['a'].astype('int64') 

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '1.1'

You can try something like this:

In [3]: df['c'] = df['c'].astype('float64').astype('int64')                                                                                                                                                                      

In [4]: df                                                                                                                                                                                                                       
Out[4]: 
   a             b          c
0  0  1.234324e+09  123432423
1  1  1.234324e+09  123432423
2  2  1.234324e+09  123432423
3  3  1.234324e+09  123432423
4  4  1.234324e+09  123432423
5  5  1.234324e+09  123432423
6  6  1.234324e+09  123432423
7  7  1.234324e+09  123432423
8  8  1.234324e+09  123432423
9  9  1.234324e+09  123432423
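
The two-step cast above can be wrapped in a small helper. This is a sketch only: `object_to_int64` is a hypothetical name, not part of cudf or pandas, and the example runs against pandas as a CPU stand-in, on the assumption that cudf behaves the same way for this cast.

```python
import pandas as pd


def object_to_int64(series):
    """Hypothetical helper: cast to float64 first, then to int64.

    The intermediate float64 step parses the values as numbers; the final
    cast drops the fractional part.
    """
    return series.astype("float64").astype("int64")


# An object-dtype column holding float values, as in the reproducer.
c = pd.Series([123432423.3] * 3, dtype="object")
result = object_to_int64(c)
print(result.dtype, result.tolist())
```

In pandas the final cast raises if any missing values remain in the column, so the helper assumes every row has been filled first.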

Thanks, I am using the same workaround.

Sorry, closed by mistake.

I think there is no solution other than the suggested workaround, since the behavior follows pandas.

@mlahir1 is there any other expectation?

I'm confused as to why df['c'] is a string when trying to typecast here.

I'm guessing df['c'][df.a < 10] = df.b[df.a < 10]/10 isn't changing the type to float as expected. Does Pandas keep this as an object type?

df['c'] remains object in pandas as well. Since the assignment only updates a slice/mask of elements in the column, it makes sense to keep the dtype the same.

In [40]: df['c']                                                                                                                                                                                                                 
Out[40]: 
0    1.23432e+08
1    1.23432e+08
2    1.23432e+08
3    1.23432e+08
4    1.23432e+08
5    1.23432e+08
6    1.23432e+08
7    1.23432e+08
8    1.23432e+08
9    1.23432e+08
Name: c, dtype: object

In [41]: type(df)                                                                                                                                                                                                                
Out[41]: pandas.core.frame.DataFrame
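
Since a masked assignment keeps the column's existing dtype, one possible alternative is to create the placeholder column with a float dtype in the first place. The sketch below shows this in pandas. Using `np.nan` as the placeholder is an assumption for illustration and is not something proposed in this thread; whether cudf treats a scalar `np.nan` assignment the same way has not been verified here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(10)})
df["b"] = 1234324233.13
df["c"] = np.nan  # float64 placeholder instead of None (object)

mask = df["a"] < 10
df.loc[mask, "c"] = df.loc[mask, "b"] / 10  # column stays float64

# A direct cast to int64 now works because the source column is numeric.
as_int = df["c"].astype("int64")
print(df["c"].dtype, as_int.iloc[0])
```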

@kkraus14
df['c'][df.a < 10] = df.b[df.a < 10]/10 isn't changing it to float. I just checked, and pandas doesn't change it to float either.

@rgsl888prabhu
I reported it because something that was working yesterday wasn't working today. If you think it is in line with pandas, you can close the issue. I will put the workaround in my code.
