Describe the bug
from_dlpack does not appear to create the correct dataframe from 2D CuPy arrays, though I may be doing something wrong. I expect that calling from_dlpack on a CuPy array created with column-major order will return the correct dataframe.
It looks like we skip the from_dlpack tests in CI because they require CuPy.
import cupy
import cudf
arr = cupy.array([
[0,1,2.],
[4,5,6,],
[7,8,9]
])
print(arr)
print(cudf.from_dlpack(arr.toDlpack()))
print(cudf.from_dlpack(arr.T.toDlpack()))
print(cupy.fromDlpack(arr.toDlpack()))
[[0. 1. 2.]
[4. 5. 6.]
[7. 8. 9.]]
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 4.0 4.0 4.0
0 1 2
0 4.0 4.0 4.0
1 5.0 5.0 5.0
2 6.0 6.0 6.0
[[0. 1. 2.]
[4. 5. 6.]
[7. 8. 9.]]
import cupy
import cudf
arr = cupy.array(
[
[0,1,2.],
[4,5,6,],
[7,8,9]
],
order='F' # column major
)
print(arr)
print(cudf.from_dlpack(arr.toDlpack()))
print(cudf.from_dlpack(arr.T.toDlpack()))
print(cupy.fromDlpack(arr.toDlpack()))
[[0. 1. 2.]
[4. 5. 6.]
[7. 8. 9.]]
0 1 2
0 1.0 1.0 1.0
1 5.0 5.0 5.0
2 8.0 8.0 8.0
0 1 2
0 4.0 4.0 4.0
1 7.0 7.0 7.0
2 1.0 1.0 1.0
[[0. 1. 2.]
[4. 5. 6.]
[7. 8. 9.]]
cuDF commit
commit 24ab9736d53c7859e6364a9d33861c2858d7f752 (HEAD -> branch-0.8, origin/branch-0.8, origin/HEAD)
Merge: feec0c5e 1357a57a
Author: Jake Hemstad jhemstad@nvidia.com
Date: Fri May 17 08:12:00 2019 -0500
Merge pull request #1746 from jrhemstad/fea-ext-removed-dead-code
[REVIEW] Removed unused, untested, and dead code
CuPy installed via pip for cuda 9.2
Investigating the list of devicearrays (res) shows that every devicearray is the same, which makes me think this may be caused by something in the C++/Cython code.
Are the 2d cupy arrays column or row major? The from/to_dlpack in libcudf only supports column major.
CC @harrism
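For anyone following along, here is a minimal pure-Python sketch of what "only supports column major" means for the 3x3 array in the report. It uses plain lists in place of device memory, and the helper names `flatten` and `read_as_column_major` are hypothetical, not part of cuDF or libcudf. It only shows what a reader that assumes column-major layout would be expected to return for each memory order; it makes no claim about what libcudf actually does internally.

```python
def flatten(matrix, order):
    """Return the matrix elements in the order they would sit in memory."""
    nrows, ncols = len(matrix), len(matrix[0])
    if order == "C":  # row-major: walk each row in turn
        return [matrix[i][j] for i in range(nrows) for j in range(ncols)]
    if order == "F":  # column-major: walk each column in turn
        return [matrix[i][j] for j in range(ncols) for i in range(nrows)]
    raise ValueError("order must be 'C' or 'F'")


def read_as_column_major(flat, nrows, ncols):
    """Split a flat buffer into columns, assuming column-major layout."""
    return [flat[j * nrows:(j + 1) * nrows] for j in range(ncols)]


# Same values as the array in the report.
matrix = [[0.0, 1.0, 2.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Column-major buffer read as column-major: the true columns.
cols_from_f = read_as_column_major(flatten(matrix, "F"), 3, 3)

# Row-major buffer read as column-major: the rows come back as columns,
# i.e. the result is the transpose of the intended dataframe.
cols_from_c = read_as_column_major(flatten(matrix, "C"), 3, 3)

print(cols_from_f)
print(cols_from_c)
```

Under these assumptions, an order='F' array should come back with columns [0, 4, 7], [1, 5, 8], [2, 6, 9], and a row-major array would at worst come back transposed. Neither case produces the same column repeated three times, which is what the outputs in the report show.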
@jrhemstad I tried both. Passing order='F' during array construction creates a column-major CuPy array.
import cupy
import cudf
arr = cupy.array([
[0,1,2.],
[4,5,6,],
[7,8,9]
])
print(cupy.isfortran(arr))
arr = cupy.array([
[0,1,2.],
[4,5,6,],
[7,8,9]
],
order='F')
print(cupy.isfortran(arr))
False
True
Here's an example without using CuPy:
import cudf
df = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df_new = cudf.from_dlpack(df.to_dlpack())  # round trip through DLPack
print(df)
print(df_new)
a b c
0 1 4 7
1 2 5 8
2 3 6 9
0 1 2
0 4 4 4
1 5 5 5
2 6 6 6
It would be great if someone can triage whether this is a cuDF or libcudf issue.
@harrism In the Cython code, the numba devicearray created by this section of the code:
always has the same values, despite the idx and data_ptr being different as the loop progresses for each column. This makes me think this is a libcudf issue, since result_cols is created from the libcudf function.
A specific example:
Given this input dataframe: df = cudf.DataFrame({'a': [0, 4, 2.0, 39], 'b':[1, 1, 3.0, 50]})
idx = 0
data_ptr = 140305823171072
[ 1. 1. 3. 50.] # the created devicearray copied to host
idx = 1
data_ptr = 140305823172608
[ 1. 1. 3. 50.] # the created devicearray copied to host
0 1
0 1.0 1.0
1 1.0 1.0
2 3.0 3.0
3 50.0 50.0
Thanks. Just need help setting priority. Can this wait until 0.9?
I think that's fine. CuPy users can still convert via Numba's array interface.
Something like: cudf.DataFrame.from_gpu_matrix(numba.cuda.as_cuda_array(cupy_arr))