What is the canonical way to convert a cudf dataframe into an array-like object?
There are the `as_matrix` and `as_gpu_matrix` methods. First, these should probably use the name array rather than matrix: NumPy matrix objects are a different thing and somewhat unpleasant. Also, note this:
```python
In [19]: df.to_pandas().as_matrix()
/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/ipython:1: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  #!/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/python
Out[19]:
array([[1, 4],
       [2, 5],
       [3, 6]])
```
I most often see people use `.values`, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or a numba device array, if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.
This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array
One question quickly jumps out: what will happen with dataframes that include non-numeric data? Should this fail? Should it return only the numeric values? Currently, `as_gpu_matrix` requires the data to be numeric (and all of the same type), as it returns a numba device ndarray. We can fairly easily solve the type-alignment issue (`.values` in pandas resolves this by upcasting the numeric types, as @mrocklin mentioned), but the other question is quite unclear.
In pandas you would use the `to_records` method in that case. I believe that in pandas both `.as_matrix()` and `.values` are supposed to return a homogeneously typed array. If you give them very heterogeneously typed data then you get back an object-dtype array.
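To illustrate the pandas behavior described above (shown with pandas itself, purely as a point of comparison for what cudf might mimic):

```python
import numpy as np
import pandas as pd

# Mixed numeric dtypes are upcast to a common dtype
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.values.dtype)  # float64: int64 and float64 upcast together

# Truly heterogeneous columns fall back to an object-dtype array
df2 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(df2.values.dtype)  # object

# to_records preserves per-column dtypes instead of upcasting
print(df2.to_records(index=False).dtype)
```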
Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like .values.
> We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays.
Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.
> My question is how to best handle that when thinking about something like `.values`.
Well, for the `.values` API in particular this concern doesn't come up. The contract is to promise a homogeneously typed array. We would probably either upcast or raise an error?
I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the .values contract may be a bit more specific: a homogeneously typed array that contains all of your data.
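A minimal sketch of that contract, using NumPy only (`values_like` is a hypothetical helper for illustration, not cudf's actual implementation):

```python
import numpy as np

def values_like(columns):
    """Hypothetical sketch: stack columns into one homogeneously typed
    array, upcasting numeric dtypes and raising an informative error
    when a column is non-numeric."""
    arrays = [np.asarray(c) for c in columns]
    if not all(np.issubdtype(a.dtype, np.number) for a in arrays):
        raise TypeError(
            "Cannot build a homogeneously typed array from non-numeric "
            f"columns (dtypes: {[a.dtype for a in arrays]}); "
            "consider to_records() instead"
        )
    # e.g. int64 + float64 -> float64
    common = np.result_type(*(a.dtype for a in arrays))
    return np.column_stack([a.astype(common) for a in arrays])

print(values_like([[1, 2, 3], [4.0, 5.0, 6.0]]).dtype)  # float64
```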
@mrocklin @beckernick typically, calling `.values` on a pandas Series/DataFrame gives its underlying data by reference. How important would it be to return by reference versus by copy? And given that the user expects a dense buffer, what should we do with nulls?
@thomcom , @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.)
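The reference-vs-copy distinction matters because mutations through a view are visible to the original owner of the buffer (illustrated here with NumPy, which has the same view/copy semantics in question):

```python
import numpy as np

buf = np.array([1.0, 2.0, 3.0])

view = buf[:]      # zero-copy view over the same memory
view[0] = 99.0
print(buf[0])      # 99.0: writes through the view reach the original

copy = buf.copy()  # independent allocation
copy[1] = -1.0
print(buf[1])      # 2.0: the original is untouched
```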
For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):
```python
s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0    1.0
1
2    3.0
3    4.0
dtype: float64

print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]

print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]

print(cudf.Series(s.data.mem))
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

print(cudf.Series(s.to_gpu_array()))  # the dense buffer without the null
0    1.0
1    3.0
2    4.0
dtype: float64
```
This could lead to a bunch of unexpected downstream behavior for users that is silently incorrect. I think users expect to pass around data with nulls (or NaNs) that don't get left out or filled in as 0.
```python
print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0    1.2
1
2    3.0
dtype: float64
```
If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate the growing ecosystem is worth a copy in the short term.
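A sketch of what such a copy could look like, assuming a boolean validity array stands in for cudf's null bitmask (`dense_with_nan` is a hypothetical helper name, not cudf API):

```python
import numpy as np

def dense_with_nan(data, valid):
    """Copy data into a float64 array, writing NaN wherever the
    validity mask is False, so nulls are not silently read back as 0."""
    out = np.array(data, dtype=np.float64)  # always a fresh copy
    out[~np.asarray(valid, dtype=bool)] = np.nan
    return out

result = dense_with_nan([1.0, 0.0, 3.0, 4.0], [True, False, True, True])
print(result)  # the second element comes back as NaN, not 0
```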
@jakirkham as well for visibility
Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit.
To give a first test for `.values`, I think that the following would be helpful:

```python
import numpy
import pytest

import cudf


def test_cupy_values():
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])
    assert isinstance(s.values, cupy.ndarray)
    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values,
    )
```
If one wanted to extend this we might try

```python
@pytest.mark.parametrize("dtype", [float, int, "float32"])
```

and check that `.values` on an array with missing values or strings or categoricals raises an informative `TypeError` or `NotImplementedError`.

Just noticed this after someone pointed it out to me: in the Pandas docs for `.values` it says the following.
> **Warning:** We recommend using `DataFrame.to_numpy()` instead.
Thanks Brandon! 😄