Cudf: [BUG] Float column being stored as Object and throws error upon conversion

Created on 13 May 2020 · 8 Comments · Source: rapidsai/cudf

Reproducer Code

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = range(10)
>>> df['b'] = 1234324233.13
>>> df['c'] = None
>>> df
   a             b     c
0  0  1.234324e+09  None
1  1  1.234324e+09  None
2  2  1.234324e+09  None
3  3  1.234324e+09  None
4  4  1.234324e+09  None
5  5  1.234324e+09  None
6  6  1.234324e+09  None
7  7  1.234324e+09  None
8  8  1.234324e+09  None
9  9  1.234324e+09  None
>>> df['c'][df.a < 10] = df.b[df.a < 10]/10
>>> df
   a             b            c
0  0  1.234324e+09  123432423.3
1  1  1.234324e+09  123432423.3
2  2  1.234324e+09  123432423.3
3  3  1.234324e+09  123432423.3
4  4  1.234324e+09  123432423.3
5  5  1.234324e+09  123432423.3
6  6  1.234324e+09  123432423.3
7  7  1.234324e+09  123432423.3
8  8  1.234324e+09  123432423.3
9  9  1.234324e+09  123432423.3
>>> df['c'] = df['c'].astype('int64')

Exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py", line 1445, in astype
    raise e
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/series.py", line 1441, in astype
    data=self._column.astype(dtype, **kwargs)
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/column.py", line 840, in astype
    return self.as_numerical_column(dtype, **kwargs)
  File "/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py", line 2073, in as_numerical_column
    type due to presence of non-integer values."
ValueError: Could not convert strings to integer                         type due to presence of non-integer values.

Explanation

After the line df['c'][df.a < 10] = df.b[df.a < 10]/10, the dtype of column 'c' is still 'object', not 'float64'.
This only happens with a boolean mask.
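
One way to sidestep the object dtype is to create the column with a float placeholder instead of None, so the masked assignment writes into a column that is already float64. The following is a minimal sketch in pandas, not cuDF; that cuDF behaves the same way is an assumption this thread does not verify, and the column names c_obj and c_float are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(10)})
df["b"] = 1234324233.13

# A column filled with None has no numeric type to infer, so it is object.
df["c_obj"] = None
# A column filled with NaN starts out as float64 (hypothetical alternative).
df["c_float"] = np.nan

mask = df["a"] < 10
# Masked assignment writes into the existing float64 column.
df.loc[mask, "c_float"] = df.loc[mask, "b"] / 10

print(df.dtypes)

# The integer cast now goes float -> int and truncates the fraction.
as_int = df["c_float"].astype("int64")
print(as_int.head())
```

Whether this fits depends on the real code: it only helps when the column is known to hold floats at the time it is created.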

? - Needs Triage bug

mlahir1

All 8 comments

This only happens on the latest branch-0.14 nightly.

mlahir1 on 13 May 2020

@mlahir1 this change was introduced recently by PR #5054. You can observe that the same cast fails in pandas as well, because 123432423.3 is not an integer:

In [13]: import pandas as pd

In [14]: df = pd.DataFrame({"a":["1.1", "2.2", "3.3"]})

In [15]: df['a'] = df['a'].astype('int64')

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: '1.1'

You can try something like this:

In [3]: df['c'] = df['c'].astype('float64').astype('int64')

In [4]: df
Out[4]: 
   a             b          c
0  0  1.234324e+09  123432423
1  1  1.234324e+09  123432423
2  2  1.234324e+09  123432423
3  3  1.234324e+09  123432423
4  4  1.234324e+09  123432423
5  5  1.234324e+09  123432423
6  6  1.234324e+09  123432423
7  7  1.234324e+09  123432423
8  8  1.234324e+09  123432423
9  9  1.234324e+09  123432423
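
The two-step cast can be checked on a small pandas example. This is a sketch under the assumption that an object column of numeric strings is a fair stand-in for the cuDF column above. It shows the direct cast failing and the cast through float64 succeeding; note that the float-to-int step truncates the fractional part.

```python
import pandas as pd

# Object-dtype column holding numeric strings, as in the pandas example above.
s = pd.Series(["1.1", "2.2", "3.3"], dtype=object)

# Casting straight to int64 fails because "1.1" is not an integer literal.
try:
    s.astype("int64")
    direct_cast_failed = False
except ValueError as err:
    direct_cast_failed = True
    print("direct cast failed:", err)

# Going through float64 first parses the strings, then truncates to int64.
via_float = s.astype("float64").astype("int64")
print(via_float.tolist())
```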

rgsl888prabhu on 13 May 2020

Thanks, I am using the same workaround.

mlahir1 on 13 May 2020

Sorry, closed by mistake.

mlahir1 on 13 May 2020

I think there is no solution other than the suggested workaround, since the behavior follows pandas.

@mlahir1 is there any other expectation?

rgsl888prabhu on 13 May 2020

I'm confused as to why df['c'] is a string when trying to typecast here.

I'm guessing df['c'][df.a < 10] = df.b[df.a < 10]/10 isn't changing the type to float as expected. Does Pandas keep this as an object type?

kkraus14 on 13 May 2020

df['c'] remains object in pandas as well. Since the assignment only updates a slice/mask of the elements in a column, it makes sense to keep the dtype unchanged.

In [40]: df['c']
Out[40]: 
0    1.23432e+08
1    1.23432e+08
2    1.23432e+08
3    1.23432e+08
4    1.23432e+08
5    1.23432e+08
6    1.23432e+08
7    1.23432e+08
8    1.23432e+08
9    1.23432e+08
Name: c, dtype: object

In [41]: type(df)
Out[41]: pandas.core.frame.DataFrame

rgsl888prabhu on 13 May 2020

@kkraus14
df['c'][df.a < 10] = df.b[df.a < 10]/10 isn't changing it to float. I just checked, and pandas doesn't change it to float either.

@rgsl888prabhu
I reported it because something that was working yesterday wasn't working today. If you think it is in line with pandas, you can close the issue. I will put the workaround in my code.

mlahir1 on 13 May 2020
