Cudf: Mismatching __sizeof__ and alloc_size results

Created on 7 Jun 2019  ·  19 Comments  ·  Source: rapidsai/cudf

Mismatching results for __sizeof__() and alloc_size on dask_cudf.core.DataFrame partitions. The following snippet can be used to reproduce the issue:

import cudf
import dask_cudf
from numba import cuda

rows = int(1e6)

free_before = cuda.current_context().get_memory_info()[0]
df = cudf.DataFrame([('A', [8] * rows), ('B', [32] * rows)])
free_after = cuda.current_context().get_memory_info()[0]
print("df size:          ", free_before - free_after)
print("df __sizeof__():  ", df.__sizeof__())
print("df alloc_size:    ", df.as_gpu_matrix().alloc_size)

free_before = cuda.current_context().get_memory_info()[0]
cdf = dask_cudf.from_cudf(df, npartitions=16)
free_after = cuda.current_context().get_memory_info()[0]
print("cdf size:         ", free_before - free_after)
print("cdf __sizeof__(): ", sum(p.compute().__sizeof__() for p in cdf.partitions))
print("cdf alloc_size:   ", sum(p.compute().as_gpu_matrix().alloc_size for p in cdf.partitions))

The results I get are:

df size:           25165824
df __sizeof__():   16000064
df alloc_size:     16000000
cdf size:          25165824
cdf __sizeof__():  32000000
cdf alloc_size:    16000000

As we can see, for the dask_cudf dataframe __sizeof__() is double the alloc_size, while alloc_size matches that of the cudf.DataFrame; neither figure accounts for the overhead visible in the actual device memory allocated for both the cudf and dask_cudf dataframes.

bug cuDF (Python) dask

All 19 comments

The relevant code is here:

https://github.com/rapidsai/cudf/blob/0233e4b53205b65196e74889803d8e3a75d1893e/python/cudf/dataframe/dataframe.py#L296-L297

I'm moving this to the cudf issue tracker.

My first guess is that we're not handling the index?

Well, I raised it in dask-cudf because the issue only happens in dask-cudf; cudf itself works fine.

@pentschev included the index for 0.9 in this commit: https://github.com/rapidsai/cudf/commit/722dc2f69778cd9e68dd0bbe1b8cd6d53e584302. However, sizes are still inconsistent.

before:            _MemoryInfo(free=2463105024, total=4200988672)
after:             _MemoryInfo(free=2448424960, total=4200988672)
df size:           14680064
df __sizeof__():   16000032
df alloc_size:     16000000

before:            _MemoryInfo(free=2448424960, total=4200988672)
after:             _MemoryInfo(free=2423259136, total=4200988672)
cdf size:          25165824
cdf __sizeof__():  24000000
cdf alloc_size:    16000000

I decided to take a look at pandas, cudf, dask, and dask_cudf to see what similarities / differences there are for indicating memory footprint.

Running against 0.9 branch.

import sys

import pandas
import dask.dataframe

from numba import cuda
import cudf
import dask_cudf

rows = int(1e6) + 1
npartitions = 32

df = pandas.DataFrame({ "A": [8] * rows, "B": [32] * rows })
print('pandas')
print('  __sizeof__:    ', df.__sizeof__())
print('  sys.getsizeof: ', sys.getsizeof(df))
print()

ddf = dask.dataframe.from_pandas(df, npartitions=npartitions)
print('dask')
print('  __sizeof__:                 ', ddf.__sizeof__())
print('  sys.getsizeof:              ', sys.getsizeof(ddf))
print('  sum __sizeof__ computed:    ', sum(p.compute().__sizeof__() for p in ddf.partitions))
print('  sum sys.getsizeof computed: ', sum(sys.getsizeof(p.compute()) for p in ddf.partitions))
print()

x = cudf.DataFrame([("A", [0])]) # initialize cuda (could be lazy)
x = None

cdf_m0 = cuda.current_context().get_memory_info()
cdf = cudf.from_pandas(df)
cdf_m1 = cuda.current_context().get_memory_info()
print('cudf')
print('  before:                ', cdf_m0)
print('  after:                 ', cdf_m1)
print('  diff:                  ', cdf_m0[0] - cdf_m1[0])
print('  __sizeof__:            ', cdf.__sizeof__())
print('  sys.getsizeof:         ', sys.getsizeof(cdf))
print('  gpu matrix alloc_size: ', cdf.as_gpu_matrix().alloc_size)
print()

dcdf_m0 = cuda.current_context().get_memory_info()
dcdf = dask_cudf.from_cudf(cdf, npartitions=npartitions)
dcdf_m1 = cuda.current_context().get_memory_info()
print('dask_cudf')
print('  before:                     ', dcdf_m0)
print('  after:                      ', dcdf_m1)
print('  diff:                       ', dcdf_m0[0] - dcdf_m1[0])
print('  __sizeof__:                 ', dcdf.__sizeof__())
print('  sys.getsizeof:              ', sys.getsizeof(dcdf))
print('  sum __sizeof__ computed:    ', sum(p.compute().__sizeof__() for p in dcdf.partitions))
print('  sum sys.getsizeof computed: ', sum(sys.getsizeof(p.compute()) for p in dcdf.partitions))
print('  sum gpu matrix alloc_size:  ', sum(p.compute().as_gpu_matrix().alloc_size for p in dcdf.partitions))
print()

Results:

pandas
  __sizeof__:     16000096
  sys.getsizeof:  16000120

dask
  __sizeof__:                  32
  sys.getsizeof:               56
  sum __sizeof__ computed:     16002700
  sum sys.getsizeof computed:  16003468

cudf
  before:                 _MemoryInfo(free=2319712256, total=4200988672)
  after:                  _MemoryInfo(free=2294546432, total=4200988672)
  diff:                   25165824
  __sizeof__:             24000024
  sys.getsizeof:          24000048
  gpu matrix alloc_size:  16000016

dask_cudf
  before:                      _MemoryInfo(free=2294546432, total=4200988672)
  after:                       _MemoryInfo(free=2269380608, total=4200988672)
  diff:                        25165824
  __sizeof__:                  32
  sys.getsizeof:               56
  sum __sizeof__ computed:     24000024
  sum sys.getsizeof computed:  24000792
  sum gpu matrix alloc_size:   16000016

The only inference I can make is that alloc_size reports the underlying dataset size (as pandas would see it). That said, the underlying dataset is not the entire allocation: CUDA/cudf seems to be allocating some overhead (which makes sense). So these results have some internal consistency.

I'm probably missing something here, but aside from "why is the overhead +50%", do these results indicate a problem?
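For what it's worth, the figures above decompose neatly under the assumption of 8-byte (int64) elements: the extra ~50% is exactly one additional 8-byte value per row, and the driver-reported diff of 25165824 bytes is exactly 24 MiB, so it is plausibly just the same ~24 MB rounded up by the allocator. A quick check:

rows = int(1e6) + 1          # same as the script above
itemsize = 8                 # int64 elements

data_bytes  = 2 * rows * itemsize            # two columns -> 16000016, the alloc_size reported above
extra_bytes = rows * itemsize                # one more 8-byte value per row -> 8000008
print(data_bytes, data_bytes + extra_bytes)  # 16000016 24000024, matching alloc_size and __sizeof__
print(24 * 1024 * 1024)                      # 25165824, the driver-reported diff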

Ok, so I looked back, and the only difference between @pentschev's tests and mine is that I created my cudf dataframe using from_pandas. If I create it via the cudf.DataFrame constructor, I see the same allocations as @pentschev. Here's the fun part: dask_cudf.from_cudf is just an alias for dask.dataframe.from_pandas.

https://github.com/rapidsai/cudf/blob/branch-0.9/python/dask_cudf/dask_cudf/core.py#L883

So, again, there's internal consistency. The question becomes: why does from_pandas end up allocating _~50% more memory_?

@cwharris as far as I could understand, this 50% memory increase (8MB in the case you mentioned above) is the index memory buffer. If I recall correctly, a RangeIndex in Pandas takes only a few bytes instead of populating an entire array, whereas cudf will populate a memory buffer with length matching the number of rows. I'm not sure whether cudf always does that for indices, but that behavior seems to differ from Pandas in such cases. It may be a design choice to improve performance in CUDA kernels, but I can't confirm that; maybe @kkraus14 could say something about this?

Finally, from the use case I was interested in when I filed this issue, the most important part to me is that the memory allocation is reported correctly in cuDF, even if that differs internally from how Pandas handles memory.
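A quick way to see the materialized index described above is to compare the pandas and cuDF frames directly (index class names vary between cuDF versions, and current releases, where this issue is fixed, may preserve the RangeIndex and report roughly 16 MB for both):

import pandas as pd
import cudf

rows = int(1e6)
pdf = pd.DataFrame({"A": [8] * rows, "B": [32] * rows})   # pandas keeps a lightweight RangeIndex
gdf = cudf.from_pandas(pdf)

print(type(pdf.index).__name__, pdf.__sizeof__())   # RangeIndex, ~16 MB (data only)
print(type(gdf.index).__name__, gdf.__sizeof__())   # at the time: a materialized index, ~24 MB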

@pentschev a RangeIndex shouldn't have an underlying buffer and is created opportunistically when needed to be passed to a libcudf function. There should be 0 GPU memory usage, but I'm not sure what size you would expect us to report here as it would be CPU memory instead of GPU memory.

I’m wondering if sizeof should be returning GPU memory allocation. I’m leaning toward no: we should have a separate method for that.

1) sizeof has traditionally been used only to refer to memory allocations in RAM.
2) aggregating allocations from multiple sources obscures the source of those allocations.
3) creating a method to specifically indicate RAM usage seems counterintuitive when sizeof has been designed for this purpose.

I think this issue (of mismatched / misunderstood allocations) is an artifact of a much bigger question: how should memory usage be reported in a world with multiple physical / virtual memory interfaces?

Dask seems not to care too much what __sizeof__ reports for its top-level dataframe interface. Instead, it waits for the consumer to “compute” the contiguous dataframe representation, and that result correctly reports memory usage.

Is there a reason we include GPU memory in our sizeof result? Maybe we can just report RAM and add another method (I think we have it already) to report GPU memory.

I believe internally Dask uses the sizeof method to understand how much memory the Python objects are consuming to determine if / when it should start spilling things. I agree it's not a good long term approach to report sizeof arbitrarily for CPU/GPU memory, but in the short term it's needed for Dask to allow it to manage memory.
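For reference, the hook Dask uses for this is its sizeof dispatch, which libraries register handlers against. A minimal sketch with a made-up container class (this is not cuDF's actual registration, just an illustration of the mechanism):

from dask.sizeof import sizeof

class DeviceBuffer:                 # hypothetical object holding device memory, for illustration only
    def __init__(self, nbytes):
        self.nbytes = nbytes

@sizeof.register(DeviceBuffer)
def sizeof_device_buffer(buf):
    # Whatever is returned here is what the worker uses when deciding if/when to spill.
    return buf.nbytes

print(sizeof(DeviceBuffer(16_000_000)))   # 16000000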

Sorry for the late reply.

@pentschev a RangeIndex shouldn't have an underlying buffer and is created opportunistically when needed to be passed to a libcudf function.

Right, my comment was wrong. What actually happens is that a pd.RangeIndex is turned into a cudf.GenericIndex when you do cudf.from_pandas(df), such as in https://github.com/rapidsai/cudf/issues/1957#issuecomment-510728351. This will end up allocating a memory buffer to hold the indices. That's why we see only (roughly) 16MB with Pandas, but 24MB with cuDF.

There should be 0 GPU memory usage, but I'm not sure what size you would expect us to report here as it would be CPU memory instead of GPU memory.

For now what's important is to know how much GPU memory a cuDF dataframe is consuming in total, so ignoring the size of the index buffer isn't an option.

I believe internally Dask uses the sizeof method to understand how much memory the Python objects are consuming to determine if / when it should start spilling things. I agree it's not a good long term approach to report sizeof arbitrarily for CPU/GPU memory, but in the short term it's needed for Dask to allow it to manage memory.

If we have two different methods to get memory usage (e.g., sizeof/gpu_sizeof, for host and device respectively), we can definitely make use of the proper one on the Dask side. So if you consider it important to make that change soon, we can ensure dask-cuda will use the appropriate method.
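Hypothetically, such a device-side counterpart could reuse Dask's dispatch machinery. Nothing below exists in Dask or cuDF; it only sketches the sizeof/gpu_sizeof split proposed here:

from dask.utils import Dispatch

gpu_sizeof = Dispatch(name="gpu_sizeof")   # hypothetical device-memory counterpart to dask.sizeof.sizeof

@gpu_sizeof.register(object)
def gpu_sizeof_default(obj):
    return 0                               # plain Python objects hold no device memory

# A cuDF/dask-cuda handler would then return only device-side bytes here,
# while the existing sizeof keeps reporting host-side bytes.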

If we separated sizeof and __sizeof__ from gpu_sizeof and __gpu_sizeof__, how will Dask treat the two separate memories as far as spilling and monitoring? Would it only look at GPU memory for now?

Would it only look at GPU memory for now?

I would say this would be the easiest, yes. We could register both separately, and if we're running out of memory on either side, we would need to update both registries, which is doable, just some extra work. I don't know if for cuDF dataframes it's that important to look at the host memory consumed by the objects; is there a situation where large memory buffers are required? If not, we may be ok ignoring a few hundred bytes of memory consumed on the host side.

cuDF uses a very minimal amount of host memory so I think the risk of ignoring it is very small for now. The concern is if someone is combining both traditional Pandas / Numpy objects with cuDF objects in a single worker.

The concern is if someone is combining both traditional Pandas / Numpy objects with cuDF objects in a single worker.

These objects have their own ways of registering memory, so if cuDF is not wrapping Pandas/NumPy objects in any way (which would require us to check for sub-object sizes), I think it won't be a problem, unless someone has a very large number of very small cuDF objects, but in that case GPU memory would likely become a problem first anyway.

I'm not tracking with how cuDF is wrapping Pandas/NumPy objects, or why lots of small cuDF objects would be problematic with regard to accurately determining the size of currently allocated memory. I should take a look at the ETL between cuDF / Pandas / Dask, etc.

Before the last few comments, I was thinking Dask should know about GPU+CPU memory usage, à la @sizeof.register(...), but now it sounds like we only want Dask to know about the GPU memory usage?

I'm not tracking with how cuDF is wrapping Pandas/NumPy objects

I meant only _if_ that happens; I don't think it does, but I also don't know cuDF in detail, so I'm just raising awareness in case this may actually happen.

or why lots of small cuDF objects would be problematic with regards to accurately determining the size of currently allocated memory.

It would be a problem if we were ignoring the object size and only taking GPU memory into account. For example, if a cuDF object takes 1 kB of host memory but somehow there are 1M cuDF objects, that's 1 GB of host memory that isn't accounted for.

Before the last few comments, I was thinking Dask should know about GPU+CPU memory usage, al a @sizeof.register(...), but now it sounds like we only want Dask to know about the GPU memory usage?

Dask (more specifically dask-cuda) has two LRU caches, one for device memory (which can be spilled to host once it reaches a certain size) and another for host memory (which can be spilled to disk), therefore we need to know both GPU and CPU memory usage separately. In the case of cuDF, for example, we may be fine ignoring CPU memory usage for now (if it really uses just a few hundred bytes) and only take GPU memory usage into account, which is where the bulk of the memory lies.
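To make the two-cache idea concrete, here is a highly simplified sketch of device-to-host-to-disk spilling driven by separate size functions. It is not dask-cuda's implementation, and the callbacks (to_host, to_disk, the two sizeof functions) are placeholders the caller would supply:

from collections import OrderedDict

class TwoTierSpillCache:
    """Toy sketch: evict oldest device objects to host, and oldest host objects to disk."""

    def __init__(self, device_limit, host_limit, device_sizeof, host_sizeof, to_host, to_disk):
        self.device = OrderedDict()         # key -> device-backed object, oldest first
        self.host = OrderedDict()           # key -> host-backed object, oldest first
        self.device_limit = device_limit
        self.host_limit = host_limit
        self.device_sizeof = device_sizeof  # e.g. a gpu_sizeof-style function
        self.host_sizeof = host_sizeof      # e.g. dask.sizeof.sizeof
        self.to_host = to_host              # e.g. convert a cuDF DataFrame to pandas
        self.to_disk = to_disk              # e.g. pickle to a file

    def add(self, key, obj):
        self.device[key] = obj
        while sum(map(self.device_sizeof, self.device.values())) > self.device_limit:
            k, o = self.device.popitem(last=False)     # spill oldest device object to host
            self.host[k] = self.to_host(o)
        while sum(map(self.host_sizeof, self.host.values())) > self.host_limit:
            k, o = self.host.popitem(last=False)       # spill oldest host object to disk
            self.to_disk(k, o)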

I believe I have tracked this down to the way dask slices the incoming cuDF object using iloc, see https://github.com/dask/dask/blob/master/dask/dataframe/io/io.py#L208. Basically, if we have a cudf.DataFrame with a RangeIndex, df.iloc[a:b] will return a dataframe with a GenericIndex, as @pentschev said. I think the relevant line of code in cuDF is this one.

I will take a look at what can be done here so that Dask is aware of how much memory it will need post-conversion. I wonder if the cleanest way to deal with the problem is to just have iloc return a RangeIndex'ed object when the calling dataframe also has one.
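A minimal way to observe the behavior described above (index class names differ between cuDF versions, and once this was fixed the sliced frame should keep its RangeIndex):

import cudf

gdf = cudf.DataFrame({"A": [0] * 1000})   # starts out with a RangeIndex
part = gdf.iloc[0:500]                    # dask slices each partition roughly like this

print(type(gdf.index).__name__, gdf.__sizeof__())
print(type(part.index).__name__, part.__sizeof__())   # pre-fix: a materialized index, hence extra bytes per row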

@pentschev I believe this issue is fixed, are you able to verify on your end?

@brandon-b-miller yes, can confirm it looks correct now. Thanks for fixing this!
