Describe the bug
When initializing from a list of tuples, a cuDF DataFrame treats the first element of each tuple as a column name and the second element as the column data, while pandas treats each tuple as a row.
The cuDF DataFrame behavior instead matches the pandas.DataFrame.from_items API, which is deprecated.
Steps/Code to reproduce bug
In [1]: import cudf
In [2]: df = cudf.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
In [3]: df
Out[3]: <cudf.DataFrame ncols=3 nrows=20 >
In [4]: print(df)
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
[10 more rows]
In [5]: import pandas as pd
In [6]: pdf = pd.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
In [7]: print(pdf)
0 1
0 a [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1 b [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2 c [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
In [8]: pd.DataFrame.from_items([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/ipython:1: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
#!/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/python
Out[8]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
12 12 12 12
13 13 13 13
14 14 14 14
15 15 15 15
16 16 16 16
17 17 17 17
18 18 18 18
19 19 19 19
Expected behavior
cuDF DataFrame initialization should match pandas DataFrame initialization behavior.
Environment details:
Additional context
Also fix the documentation in dataframe.iloc and related tests.
There are a few ways to go about this:
Longest Dev Time + Better Performance
In terms of performance: does libcudf have a way to transform row-major data into columnar format? If so, we could use that rather than doing the transformation at the Python level.
Shorter Dev Time + "No Pandas"
We can accomplish the transformation by factoring out the relevant portions of from_pandas and adding features like automatic column name generation.
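A minimal sketch of this option, assuming pure-Python handling (the function name `rows_to_columns` and the dict-of-columns return shape are illustrative, not actual cudf internals). It transposes row tuples into columns and auto-generates integer column names, mirroring pandas' default RangeIndex columns:

```python
# Hypothetical sketch: convert a list of row tuples into named columns
# without going through pandas. Column names are auto-generated integers
# (0, 1, 2, ...), matching pandas' default column labels.
def rows_to_columns(rows):
    if not rows:
        return {}
    ncols = len(rows[0])
    # Ragged input has no well-defined columnar form, so reject it early.
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same number of fields")
    # zip(*rows) transposes row tuples into column tuples.
    return {i: list(col) for i, col in enumerate(zip(*rows))}

columns = rows_to_columns([(1, 2.0), (3, 4.0), (5, 6.0)])
# columns == {0: [1, 3, 5], 1: [2.0, 4.0, 6.0]}
```

Each resulting list could then be handed to the existing column-construction path.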
Shortest Dev Time + 100% Compatibility
In terms of error handling, we have no guarantee that each tuple contains the same data type at each position, nor even that the tuples are all the same size. It's possible to leverage pandas to convert the rows to a DataFrame and then use from_pandas; this way we could be sure the incoming data is formatted correctly.
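A sketch of this option (the wrapper name `dataframe_from_rows` is illustrative; the final cudf hand-off is commented out so the sketch runs without a GPU):

```python
import pandas as pd

# Hypothetical sketch of the "shortest dev time" option: let pandas do the
# row-to-column normalization and error handling, then hand the result to
# cudf.DataFrame.from_pandas.
def dataframe_from_rows(rows):
    pdf = pd.DataFrame(rows)  # pandas treats each tuple as a row
    # return cudf.DataFrame.from_pandas(pdf)  # final step in cudf
    return pdf

pdf = dataframe_from_rows([(1, 'a'), (2, 'b'), (3, 'c')])
# pdf has integer columns 0 and 1, with 0 -> [1, 2, 3] and 1 -> ['a', 'b', 'c']
```

pandas also handles ragged tuples (padding short rows with NaN), so we inherit its compatibility behavior for free.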
I'd go with the shortest dev time, assuming we just want matching functionality and don't care about "code purity" or performance. That brings up an interesting question about the current implementation of __init__: Would it be reasonable to replace it entirely with a Pandas DataFrame constructor followed by from_pandas... ?
@harrism does libcudf happen to have rows => table conversion already?
I don't think we have that functionality.
@kkraus14 @shwina
of the options listed above, is there any that stands out as particularly apt?
@cwharris what exactly do you mean by "row to column" format? I assume you don't mean a transpose. There is no concept of a "row" in the traditional database sense in cuDF -- everything is stored in columns. Tables are always made up of columns.
In any case I think the shortest dev time approach is the right first step, since the request here isn't about performance, it's about compatibility. (Also, this bug originates internally, not from an end user.)
Personally I think that no promise of performance can/should be made for the case of tuple inputs, especially given that the data is assumed to be rows in this case. I think going through Pandas in this case might be the best approach.
While we can update cudf.DataFrame.__init__ to match Pandas' behavior, the specific example given above won't work in cudf because we do not yet have nested (and therefore list-like) types #2857.