Cudf: [QST] Tracking dataframe casts & CPU / GPU boundary crossing

Created on 13 Oct 2020 · 5 comments · Source: rapidsai/cudf

I'm starting to work with cuDF and CuPy, but I'm having issues keeping track of the operations that happen to cudf DataFrames.
More to the point, I have trouble keeping track of whether the objects really are on the GPU and where computation takes place. Furthermore, casts happen behind the scenes when using basic operations, and I don't know whether I could write the code in such a way that they are avoided. There are also inconsistencies between Python's builtin type() and the CuPy ndarray dtype attribute.

Sample code:

Package versions

import sys; print('Python Version:', sys.version) # Python 3.8.5 from conda-forge
import cudf; print('Cudf Version:', cudf.__version__) # 0.16.0a+1950.g5452cb9b4e
import cupy as cp; print('Cupy Version:', cp.__version__) # 7.8.0

Dataframe creation & created type check

One DataFrame taken from the 10 Minutes to cuDF docs, plus a second one created from CuPy arrays instead of plain Python lists.

df = cudf.DataFrame({'a': list(range(20)),
                     'b': list(reversed(range(20))),
                     'c': list(range(20))
                    })
print(df.dtypes)
dfcp = cudf.DataFrame({'a': cp.arange(20),
                       'b': cp.arange(20)[::-1],
                       'c': cp.arange(20)
                      })
print(dfcp.dtypes)

This yields:
df -> a: int64, b: int64, c: int64, dtype: object
dfcp -> a: int64, b: int64, c: int64, dtype: object

This isn't very informative, because the dtypes don't say which library's types these are, so I tried an alternative formulation that is more specific:

[print("df python init",c, type(df[c].iloc[0])) for c in df.columns]
[print("dfcp cupy init",c, type(df[c].iloc[0])) for c in dfcp.columns]

I don't know whether this code itself triggers a cast, but the output shows numpy.int64 as the element type for both DataFrames, so the CuPy arrays seem to have been converted to NumPy.

df python init a <class 'numpy.int64'>
df python init b <class 'numpy.int64'>
df python init c <class 'numpy.int64'>
dfcp cupy init a <class 'numpy.int64'>
dfcp cupy init b <class 'numpy.int64'>
dfcp cupy init c <class 'numpy.int64'>

Cupy ndarrays or numpy ndarrays in CuDF DataFrames?

However, if you ask for the values attribute of each frame, which returns CuPy ndarrays, you do get the expected answer for both DataFrames. What happens when I ask for values: does it create the CuPy arrays on the spot, or were they already in the frames?

print(type(df.values))
print(df.values.device)
print(type(dfcp.values))
print(dfcp.values.device)

prints:

<class 'cupy.core.core.ndarray'>
<CUDA Device 0>
<class 'cupy.core.core.ndarray'>
<CUDA Device 0>

Python's type() doesn't report the correct type

If I take a look inside with Python's builtin type():

print("df python init")
for col in cp.arange(df.shape[1]):
    print(type(df.values[0, col]))
print("df cupy init")
for col in cp.arange(dfcp.shape[1]):
    print(type(dfcp.values[0, col]))

df python init
<class 'cupy.core.core.ndarray'>
<class 'cupy.core.core.ndarray'>
<class 'cupy.core.core.ndarray'>
df cupy init
<class 'cupy.core.core.ndarray'>
<class 'cupy.core.core.ndarray'>
<class 'cupy.core.core.ndarray'>

Then I get <class 'cupy.core.core.ndarray'> as output, which is weird because at those indices I expect a singular value, not an array. This checks out, because without the type() invocation it just prints the values. Adding to the confusion, if you use the CuPy ndarray dtype attribute, it prints the actual element type.
So why do Python's built-in type() and (cupy.core.core.ndarray).dtype yield different results?

With the CuPy ndarray dtype attribute:

print("df python init")
for col in cp.arange(df.shape[1]):
    print(df.values[0, col].dtype)
print("df cupy init")
for col in cp.arange(dfcp.shape[1]):
    print(dfcp.values[0, col].dtype)

prints as expected:
df python init
int64
int64
int64
df cupy init
int64
int64
int64

Simple division on single column leads to all-column cast

If you do a simple division across an entire column and then ask for the DataFrame's dtypes, everything works as intended; but if you ask for the types of the elements in the columns through the values attribute, suddenly they have all changed:

df['a'] = df['a'] / 3.0
dfcp['a'] = dfcp['a'] / 3.0
print(df.dtypes)
print(dfcp.dtypes)

Correctly reports:
df: a -> float64, b -> int64, c -> int64, dtype: object
dfcp: a -> float64, b -> int64, c -> int64, dtype: object

but this:

print("df python init")
for col in cp.arange(df.shape[1]):
    print(df.values[0, col].dtype)
print("df cupy init")
for col in cp.arange(dfcp.shape[1]):
    print(dfcp.values[0, col].dtype)

now reports:
df python init
float64
float64
float64
df cupy init
float64
float64
float64

The final question is: which is which? Are the columns of the DataFrames NumPy arrays (f64, i64, i64) stored in host memory, with the values also in GPU memory as CuPy arrays (f64, f64, f64)? Or are the CuPy arrays created temporarily, only because I ask for them?
If I changed cp.arange to np.arange, or just used Python's range(), for the list of column indices, where would computation happen?

TL;DR

The IPython magic commands for profiling don't help, because they don't show the CPU/GPU division.
cuDF DataFrame dtypes don't tell you exactly what type each column is (is it np.int64 or cp.int64, ...).
How can you tell what is stored where and how it is changed?
How can you profile CUDF code?
How can you be assured that computation doesn't constantly swap back-and-forth between CPU and GPU which would nullify all benefits?


All 5 comments

Currently, our scalar representation is backed by CPU memory, i.e.,

sr = cudf.Series([1, 2, 3])
sr  # Vector object

sr is stored in GPU memory, but when you access/retrieve a specific element of the series by index, the data is copied to CPU memory:

>>> sr.iloc[0] # Scalar object
1
>>> type(sr.iloc[0])
<class 'numpy.int64'>

But this is going to change in a future release of cudf. @brandon-b-miller can comment more on this.
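
As a minimal sketch of the difference (the variable names here are just illustrative):

import cudf

sr = cudf.Series([1, 2, 3])   # column data lives in GPU memory
host_scalar = sr.iloc[0]      # device-to-host copy; returns a numpy.int64
dev_element = sr.values[0]    # stays on device; a 0-d cupy ndarray
print(type(host_scalar))      # <class 'numpy.int64'>
print(type(dev_element))      # <class 'cupy.core.core.ndarray'>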

The final question is: which is which? Are the columns of the DataFrames NumPy arrays (f64, i64, i64) stored in host memory, with the values also in GPU memory as CuPy arrays (f64, f64, f64)? Or are the CuPy arrays created temporarily, only because I ask for them?

The DataFrame column dtypes will all show np.int.., np.float.., np.object, but the data is actually stored on the GPU, not the CPU. We currently use numpy dtypes for representation, but this will soon change to a dedicated cudf.Dtypes representation instead of displaying np.dtypes.
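
A quick way to see that split, as a sketch:

import cudf

df = cudf.DataFrame({'a': [1, 2, 3]})
print(df['a'].dtype)          # int64 -- a numpy dtype, used as the label
print(type(df['a'].values))   # <class 'cupy.core.core.ndarray'> -- the data itself
print(df['a'].values.device)  # <CUDA Device 0> -- resident in GPU memory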

If I changed cp.arange to np.arange, or just used Python's range(), for the list of column indices, where would computation happen?

The initial object creation by the np.arange or range() call will definitely happen on the CPU, but when you pass the result to cudf.DataFrame, the data is copied from CPU memory to GPU memory.
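
A sketch of that flow (names are illustrative):

import numpy as np
import cudf

host_arr = np.arange(20)                        # created in CPU memory
df_np = cudf.DataFrame({'a': host_arr})         # copied to GPU memory on construction
df_rg = cudf.DataFrame({'a': list(range(20))})  # same: list built on CPU, then copied
print(df_np['a'].values.device)                 # <CUDA Device 0> -- data now on the GPU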

TL;DR

The IPython magic commands for profiling don't help, because they don't show the CPU/GPU division.

The data in the above cudf.DataFrame is all backed by the GPU, except in the case of a RangeIndex; @kkraus14 can correct me here.

cuDF DataFrame dtypes don't tell you exactly what type each column is (is it np.int64 or cp.int64, ...).

The dtypes you currently see in df.dtypes are the correct dtypes. It is just that we use numpy dtypes for representation, but the actual data is in GPU memory, not CPU memory.
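
That should also explain the all-float64 output from df.values in the question: .values materializes the whole frame as a single 2-D cupy ndarray, and an ndarray can only have one dtype, so the integer columns get upcast to the common float64 type. Per-column access keeps the original dtypes; a sketch:

import cudf

df = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['a'] = df['a'] / 3.0
print(df.dtypes)             # a: float64, b: int64 -- per-column dtypes are intact
print(df.values.dtype)       # float64 -- one common dtype for the 2-D array
print(df['b'].values.dtype)  # int64 -- a single column keeps its own dtype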

How can you tell what is stored where and how it is changed?
Multiple ways:

  1. You can use nvidia-smi to watch memory usage continuously.
  2. You can use df.memory_usage to see an individual DataFrame's memory usage (see the sketch below).
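
A minimal sketch of the second option, assuming df.memory_usage follows the pandas calling convention:

import cudf

df = cudf.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(df.memory_usage())        # bytes used per column (device memory)
print(df.memory_usage().sum())  # total bytes for the whole frame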

How can you profile CUDF code?

  1. You can use %timeit, %memit in IPython (see the sketch after this list).
  2. Snakeviz is a great graphical tool: https://jiffyclub.github.io/snakeviz/
  3. You can use NVIDIA Nsight to see lower-level memory details too: https://developer.nvidia.com/tools-overview
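
For example, a rough sketch in an IPython session; GPU kernels launch asynchronously, so the synchronize call (an addition here, not from the thread) makes %timeit measure the kernels themselves rather than just the launches:

import cupy as cp
# df is a cudf DataFrame already resident on the GPU
%timeit df['a'] / 3.0; cp.cuda.Device(0).synchronize()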

How can you be assured that computation doesn't constantly swap back-and-forth between CPU and GPU which would nullify all benefits?

In very rare or trivial cases we currently move a tiny amount of memory to the CPU, e.g., column names are currently stored in CPU-backed memory. But to track memory moving back and forth between CPU and GPU, the best tool is Nsight (for great detail) or monitoring nvidia-smi and htop (for a high-level overview).

RE: scalars, and the result of cudf.Series([1,2,3]).iloc[1] for instance: right now, this translates into a call to libcudf that obtains the value of the underlying GPU-side column at that particular index location, transfers that value to the host, and then creates the corresponding numpy scalar from the result to serve back to the caller. This might give the impression that the data is stored in some kind of numpy-related structure, but that is not the case - it's a custom cuDF data structure written in C++ (columns). NumPy is only involved in the result because numpy scalars are the closest we can currently get to an existing host-side representation of what an individual "element" of a column would be.

In the future, hopefully shortly, accessing a single element of an object with iloc will give you back a new kind of scalar object that isn't a numpy scalar.

Alright, thank you both for the quick responses & detailed answers. I now have a much better view of what's actually happening.
I went through the documentation, but the closest I got was information at https://docs.rapids.ai/api/cudf/stable/basics.html & https://docs.rapids.ai/api/cudf/stable/internals.html and those pages don't have the full story.

Is it fair to summarize that (at least for now) cuDF DataFrames and Series are stored on the GPU as much as possible, that asking for a (partial) representation passes the values to the host wrapped in numpy data types so they can be delivered, and that asking for type information at that point is somewhat moot, because technically numpy.X is correct but the 'real' data is still in cuDF data structures on the GPU?

@galipremsagar Thank you for the tooling recommendations, I'll look into snakeviz and nsight.

Correct: you only get numpy structures back because that's the closest thing we can give you to the 'reality' of how the data is stored. The numpy dtypes are another example of this. Really, pyarrow dtypes are an even better conceptual representation of the underlying data, especially for some of the types that we support and numpy does not (list, for example). But the actual pyarrow dtype Python objects don't have the nice properties that numpy dtypes have, or the useful set of functions that numpy provides for its dtypes. This can actually create complicated programming situations on the backend, since neither type system fits exactly. There's even work right now to see whether a custom cuDF type system would make life easier.
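
A small illustration of that type-system gap (a sketch, not cuDF internals):

import numpy as np
import pyarrow as pa

print(pa.list_(pa.int64()))  # list<item: int64> -- pyarrow expresses nested types
print(np.dtype('int64'))     # int64 -- numpy has no native list-of-int64 dtype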

Closing as this is answered.
