What is the canonical way to convert a cudf dataframe into an array-like object?
There are the `as_matrix` and `as_gpu_matrix` methods. First, these should probably use the name array rather than matrix: NumPy matrix objects are a different thing and somewhat unpleasant. Also, note this:
```python
In [19]: df.to_pandas().as_matrix()
/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/ipython:1: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  #!/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/python
Out[19]:
array([[1, 4],
       [2, 5],
       [3, 6]])
```
I most often see people use `.values`, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or a numba device array, if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.
This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array
One question quickly jumps out: what will happen with dataframes that include non-numeric data? Should this fail? Should it return only the numeric values? Currently, `as_gpu_matrix` requires the data to be numeric (and all of the same type), as it returns a numba device ndarray. We can fairly easily solve the type-alignment issue (`.values` in pandas resolves this by upcasting the numeric types, as @mrocklin mentioned), but the other question is quite unclear.
In pandas you would use the `to_records` method in that case. I believe that in pandas both `.as_matrix()` and `.values` are supposed to return a homogeneously typed array. If you give them very heterogeneously typed data then you get back an object-dtype array.
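To illustrate the pandas behavior described above (shown with pandas itself, purely as a point of comparison for what cudf might mimic):

```python
import numpy as np
import pandas as pd

# Mixed numeric dtypes are upcast to a common dtype
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.values.dtype)  # float64: int64 and float64 upcast together

# Truly heterogeneous columns fall back to an object-dtype array
df2 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(df2.values.dtype)  # object

# to_records preserves per-column dtypes instead of upcasting
print(df2.to_records(index=False).dtype)
```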
Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like .values.
> We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays.
Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.
> My question is how to best handle that when thinking about something like `.values`.
Well, for the `.values` API in particular this concern doesn't come up. The contract is to promise a homogeneously typed array. We would probably either upcast or raise an error?
I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the .values contract may be a bit more specific: a homogeneously typed array that contains all of your data.
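A minimal sketch of that contract, using NumPy only (`values_like` is a hypothetical helper for illustration, not cudf's actual implementation):

```python
import numpy as np

def values_like(columns):
    """Hypothetical sketch: stack columns into one homogeneously typed
    array, upcasting numeric dtypes and raising an informative error
    when a column is non-numeric."""
    arrays = [np.asarray(c) for c in columns]
    if not all(np.issubdtype(a.dtype, np.number) for a in arrays):
        raise TypeError(
            "Cannot build a homogeneously typed array from non-numeric "
            f"columns (dtypes: {[a.dtype for a in arrays]}); "
            "consider to_records() instead"
        )
    # e.g. int64 + float64 -> float64
    common = np.result_type(*(a.dtype for a in arrays))
    return np.column_stack([a.astype(common) for a in arrays])

print(values_like([[1, 2, 3], [4.0, 5.0, 6.0]]).dtype)  # float64
```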
@mrocklin @beckernick typically, calling `.values` on a pandas Series/DataFrame gives its underlying data by reference. How important would it be to return by reference versus by copy? And given that the user expects a dense buffer, what should we do with nulls?
@thomcom , @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.)
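The reference-vs-copy distinction matters because mutations through a view are visible to the original owner of the buffer (illustrated here with NumPy, which has the same view/copy semantics in question):

```python
import numpy as np

buf = np.array([1.0, 2.0, 3.0])

view = buf[:]      # zero-copy view over the same memory
view[0] = 99.0
print(buf[0])      # 99.0: writes through the view reach the original

copy = buf.copy()  # independent allocation
copy[1] = -1.0
print(buf[1])      # 2.0: the original is untouched
```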
For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):
```python
s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0    1.0
1
2    3.0
3    4.0
dtype: float64

print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]

print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]

print(cudf.Series(s.data.mem))
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

print(cudf.Series(s.to_gpu_array()))  # the dense buffer without the null
0    1.0
1    3.0
2    4.0
dtype: float64
```
This could lead to a bunch of unexpected downstream behavior for users that is silently incorrect. I think users expect to pass around data with nulls (or NaNs) that don't get left out or filled in as 0.
```python
print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0    1.2
1
2    3.0
dtype: float64
```
If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate the growing ecosystem is worth a copy in the short term.
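A sketch of what such a copy could look like, assuming a boolean validity array stands in for cudf's null bitmask (`dense_with_nan` is a hypothetical helper name, not cudf API):

```python
import numpy as np

def dense_with_nan(data, valid):
    """Copy data into a float64 array, writing NaN wherever the
    validity mask is False, so nulls are not silently read back as 0."""
    out = np.array(data, dtype=np.float64)  # always a fresh copy
    out[~np.asarray(valid, dtype=bool)] = np.nan
    return out

result = dense_with_nan([1.0, 0.0, 3.0, 4.0], [True, False, True, True])
print(result)  # the second element comes back as NaN, not 0
```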
@jakirkham as well for visibility
Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit.
To give a first test for `.values`, I think that the following would be helpful:

```python
import numpy
import pytest

import cudf


def test_cupy_values():
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])
    assert isinstance(s.values, cupy.ndarray)
    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values,
    )
```
If one wanted to extend this we might try

```python
@pytest.mark.parametrize("dtype", [float, int, "float32"])
```

and check that `.values` on an array with missing values or strings or categoricals raises an informative `TypeError` or `NotImplementedError`.

Just noticed this after someone pointed it out to me: in the Pandas docs for `.values` it says the following.
> **Warning:** We recommend using `DataFrame.to_numpy()` instead.
Thanks Brandon! 😄