Describe the bug
When initializing from a list of tuples, a cuDF DataFrame treats the first element of each tuple as a column name and the second element as the column data, while pandas treats each tuple as a row.
The cuDF DataFrame behavior instead matches the pandas.DataFrame.from_items API, which is deprecated.
Steps/Code to reproduce bug
In [1]: import cudf
In [2]: df = cudf.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
In [3]: df
Out[3]: <cudf.DataFrame ncols=3 nrows=20 >
In [4]: print(df)
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
[10 more rows]
In [5]: import pandas as pd
In [6]: pdf = pd.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
In [7]: print(pdf)
0 1
0 a [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1 b [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2 c [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
In [8]: pd.DataFrame.from_items([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/ipython:1: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
#!/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/python
Out[8]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
12 12 12 12
13 13 13 13
14 14 14 14
15 15 15 15
16 16 16 16
17 17 17 17
18 18 18 18
19 19 19 19
Expected behavior
cuDF DataFrame initialization should match pandas DataFrame initialization behavior.
Environment details:
Additional context
Also fix the documentation in dataframe.iloc and related tests.
There are a few ways to go about this:
Longest Dev Time + Better Performance
In terms of performance: does libcudf have a way to transform row-major data into columnar format? If so, we could use that rather than doing the transformation at the Python level.
Shorter Dev Time + "No Pandas"
We can accomplish the transformation by factoring out the relevant portions of from_pandas and adding features like automatic column name generation.
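A minimal sketch of this option, assuming pure-Python handling (the function name `rows_to_columns` and the dict-of-columns return shape are illustrative, not actual cudf internals). It transposes row tuples into columns and auto-generates integer column names, mirroring pandas' default RangeIndex columns:

```python
# Hypothetical sketch: convert a list of row tuples into named columns
# without going through pandas. Column names are auto-generated integers
# (0, 1, 2, ...), matching pandas' default column labels.
def rows_to_columns(rows):
    if not rows:
        return {}
    ncols = len(rows[0])
    # Ragged input has no well-defined columnar form, so reject it early.
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same number of fields")
    # zip(*rows) transposes row tuples into column tuples.
    return {i: list(col) for i, col in enumerate(zip(*rows))}

columns = rows_to_columns([(1, 2.0), (3, 4.0), (5, 6.0)])
# columns == {0: [1, 3, 5], 1: [2.0, 4.0, 6.0]}
```

Each resulting list could then be handed to the existing column-construction path.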
Shortest Dev Time + 100% Compatibility
In terms of error handling, we have no guarantee that each tuple contains the same data type at each position, nor even that the tuples are all the same size. It's possible to leverage pandas to convert the rows to a DataFrame and then use from_pandas; this way we could be sure the incoming data is formatted correctly.
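A sketch of this option (the wrapper name `dataframe_from_rows` is illustrative; the final cudf hand-off is commented out so the sketch runs without a GPU):

```python
import pandas as pd

# Hypothetical sketch of the "shortest dev time" option: let pandas do the
# row-to-column normalization and error handling, then hand the result to
# cudf.DataFrame.from_pandas.
def dataframe_from_rows(rows):
    pdf = pd.DataFrame(rows)  # pandas treats each tuple as a row
    # return cudf.DataFrame.from_pandas(pdf)  # final step in cudf
    return pdf

pdf = dataframe_from_rows([(1, 'a'), (2, 'b'), (3, 'c')])
# pdf has integer columns 0 and 1, with 0 -> [1, 2, 3] and 1 -> ['a', 'b', 'c']
```

pandas also handles ragged tuples (padding short rows with NaN), so we inherit its compatibility behavior for free.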
I'd go with the shortest dev time, assuming we just want matching functionality and don't care about "code purity" or performance. That brings up an interesting question about the current implementation of __init__: Would it be reasonable to replace it entirely with a Pandas DataFrame constructor followed by from_pandas... ?
@harrism does libcudf happen to have rows => table conversion already?
I don't think we have that functionality.
@kkraus14 @shwina
of the options listed above, is there any that stands out as particularly apt?
@cwharris what exactly do you mean by "row to column" format? I assume you don't mean a transpose. There is no concept of a "row" in the traditional database sense in cuDF -- everything is stored in columns. Tables are always made up of columns.
In any case I think the shortest dev time approach is the right first step, since the request here isn't about performance, it's about compatibility. (Also, this bug originates internally, not from an end user.)
Personally I think that no promise of performance can/should be made for the case of tuple inputs, especially given that the data is assumed to be rows in this case. I think going through Pandas in this case might be the best approach.
While we can update cudf.DataFrame.__init__ to match Pandas' behavior, the specific example given above won't work in cudf because we do not yet have nested (and therefore list-like) types #2857.