Cudf: [FEA] Add a .values property to convert to a GPU array

Created on 22 May 2019  ·  13 Comments  ·  Source: rapidsai/cudf

What is the canonical way to convert a cudf dataframe into an array-like object?

There is

  • as_gpu_matrix, which seems to be the best choice today? This returns a Numba device array
  • to_gpu_matrix, which seems to not do anything
  • as_matrix, which seems to return a Numpy array

First, these should maybe replace the name matrix with array. Numpy matrix objects are different and somewhat unpleasant. Also, note this:

In [19]: df.to_pandas().as_matrix()
/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/ipython:1: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  #!/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/python
Out[19]:
array([[1, 4],
       [2, 5],
       [3, 6]])

I most often see people use .values, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or numba device array if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.

See also https://github.com/rapidsai/dask-cudf/issues/259

cuDF (Python) feature request

All 13 comments

This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array

One question quickly jumps out: What will happen with dataframes including non-numeric data? Should this fail? Should it only return the numeric values? Currently, as_gpu_matrix requires the data to be numeric (and of the same type) as it returns numba device ndarray. We can fairly easily solve the type alignment issue (.values in pandas resolves this by upcasting the numeric types, as @mrocklin mentioned), but the other is quite unclear.

In pandas you would use the to_records method in that case. I believe that in Pandas both .as_matrix() and .values are supposed to return a homogeneously typed array. If you give it very heterogeneously typed columns then you get back an object-dtype array.
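The pandas behaviour described in this comment can be checked with a short sketch. The column names and values below are made up for illustration, and only pandas is involved here, not cuDF.

```python
import numpy as np
import pandas as pd

# Mixed numeric columns: .values upcasts to a common dtype (float64 here).
numeric = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
numeric_dtype = numeric.values.dtype

# Numeric and string columns: no common numeric dtype exists,
# so pandas falls back to an object-dtype array.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
mixed_dtype = mixed.values.dtype

# to_records keeps one dtype per column in a NumPy record array.
records = mixed.to_records(index=False)
```

The record array keeps the integer column as integers, which is why to_records is the usual answer for heterogeneous frames.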

Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like .values.

> We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays.

Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.

> My question is how to best handle that when thinking about something like .values.

Well, for the .values API in particular this concern doesn't come up. The contract is to promise a homogeneously typed array. We would upcast or err probably?

I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the .values contract may be a bit more specific: a homogeneously typed array that contains all of your data.
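One way to picture the "upcast or raise" rule is the sketch below. It uses NumPy's promotion rules on the host as a stand-in; `common_numeric_dtype` is a hypothetical helper written for illustration, not a cuDF function.

```python
import numpy as np


def common_numeric_dtype(dtypes):
    """Hypothetical helper: dtype that all columns can be upcast to.

    Raises TypeError when any column is non-numeric, since no
    homogeneous numeric array could then hold all of the data.
    """
    dtypes = [np.dtype(dt) for dt in dtypes]
    if not dtypes:
        raise ValueError("no columns given")
    # Kinds: b=bool, i=signed int, u=unsigned int, f=float.
    bad = [dt for dt in dtypes if dt.kind not in "biuf"]
    if bad:
        raise TypeError(
            f"cannot build a homogeneous array: non-numeric dtype(s) {bad}"
        )
    return np.result_type(*dtypes)
```

For example, `common_numeric_dtype(["int32", "float32"])` promotes to float64, while passing an object dtype raises the TypeError.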

@mrocklin @beckernick typically, calling .values against a Pandas Series/DataFrame gives its underlying data by reference. How important would it be to return by reference versus by copy? Given the user expects a dense buffer, what should we do with nulls?

@thomcom , @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.)

For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):

s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0    1.0
1       
2    3.0
3    4.0
dtype: float64

print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]

print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]

print(cudf.Series(s.data.mem))
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

print(cudf.Series(s.to_gpu_array())) # the dense buffer without the null
0    1.0
1    3.0
2    4.0
dtype: float64

This could lead to a bunch of unexpected downstream behavior for users that is silently incorrect. I think users expect to pass around data with nulls (or NaNs) that don't get left out or filled in as 0.

print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0    1.2
1       
2    3.0
dtype: float64

If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate the growing ecosystem is worth a copy in the short term.
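A minimal sketch of what "copying to fulfill the contract" could look like, assuming the data buffer and a boolean validity mask are both available. It uses NumPy on the host purely as a stand-in for a device array, and `dense_with_nan` is a hypothetical helper, not part of cuDF.

```python
import numpy as np


def dense_with_nan(data, valid):
    """Hypothetical helper: copy `data` to float64, NaN where `valid` is False."""
    out = np.array(data, dtype="float64")  # np.array copies by default
    out[~np.asarray(valid, dtype=bool)] = np.nan
    return out


# Raw buffer as in the example above: slot 1 holds a placeholder 0.
data = np.array([1.0, 0.0, 3.0, 4.0])
valid = np.array([True, False, True, True])

# Same length as the input; the null is kept as NaN instead of 0 or dropped.
result = dense_with_nan(data, valid)
```

The cost is one extra allocation, and the original buffer is left untouched.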

@jakirkham as well for visibility

Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit.

To give a first test for .values, I think that the following would be helpful

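import numpy.testing
import pytest

import cudf


def test_cupy_values():
    # Skip the test when CuPy is not installed.
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])

    # .values should hand back a GPU (CuPy) array ...
    assert isinstance(s.values, cupy.ndarray)

    # ... whose contents match what pandas returns on the host.
    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values
    )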

If one wanted to extend this we might try

  • parametrizing around dtype with @pytest.mark.parametrize("dtype", [float, int, "float32"])
  • Checking that calling .values on an array with missing values or strings or categoricals raises an informative TypeError or NotImplementedError

Just noticed this after someone pointed it out to me: the Pandas docs for .values say the following.

Warning: We recommend using DataFrame.to_numpy() instead.

ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values
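For reference, a short pandas-only sketch of to_numpy(), including its dtype and na_value parameters, which bear on the null-handling question raised earlier in the thread. The frame and series contents are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# For a plain numeric frame, to_numpy() gives the same array as .values.
arr = df.to_numpy()

# With a nullable integer Series, the caller chooses how missing
# values are represented in the resulting dense array.
s = pd.Series([1, None, 3], dtype="Int64")
filled = s.to_numpy(dtype="float64", na_value=np.nan)
```

Whether a GPU equivalent should accept similar parameters is not settled in this thread.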

Thanks Brandon! 😄
