Xgboost: Safe CUDA deallocation.

Created on 28 Feb 2019 · 13 Comments · Source: dmlc/xgboost

While working on #4162, I found out that CubMemory::Free is applied to the wrong device pointer. Just put dh::CudaCheckPointerDevice before cudaFree and run the unit tests; you should get some segfaults.

The problem lies within the destructors of the various DeviceShards: they are usually invoked by the destructor of a std::unique_ptr, which is not threaded and doesn't know how to handle the device id. One way to get this right is to add a Clear method to all DeviceShards and remember to call it via ExecuteIndexShard (for threading) before their respective destructors run.
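A rough sketch of that idea, purely illustrative (Clear, DeviceId and the shard container are assumptions here, not the real xgboost interface):

  #include <cuda_runtime.h>
  #include <memory>
  #include <vector>

  // Release device memory while the owning device is current, before the
  // default std::unique_ptr destructors run.
  template <typename ShardT>
  void ClearShards(std::vector<std::unique_ptr<ShardT>> *shards) {
    for (auto &shard : *shards) {
      cudaSetDevice(shard->DeviceId());  // make the owning device current
      shard->Clear();                    // free this shard's device buffers
    }
    // After this, the unique_ptr destructors have nothing left to cudaFree.
  }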

It's not clear to me how thrust deals with this yet. The solution above is quite fragile, I think. @RAMitchell @mt-jones Looking for discussion/suggestions. :)


All 13 comments

I thought you had a unified address space across multiple GPUs? Do you still have to call cudaSetDevice before freeing memory?

unified address space across multiple GPUs

@hcho3 I read the CUDA docs and didn't see any mention that one device can free another device's memory. Allocating memory is device dependent. And besides, #4161 is an example of reading the wrong address space.

@trivialfis Got it. Thanks for the explanation.

I'm not sure if you need to change the device to perform cudaFree(). Let me check.

@RAMitchell I think it does require setting the right device. See an example from NCCL:

https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html
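For reference, the allocation pattern in those examples looks roughly like this (paraphrased from the NCCL docs; the device count and buffer size here are arbitrary):

  const int nDev = 2;                    // assumed number of GPUs
  const size_t size = 32 * 1024 * 1024;  // arbitrary buffer length
  float *sendbuff[nDev];
  float *recvbuff[nDev];
  cudaStream_t streams[nDev];

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);  // select the owning device before every allocation
    cudaMalloc(&sendbuff[i], size * sizeof(float));
    cudaMalloc(&recvbuff[i], size * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }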

It is good practice to cudaSetDevice before calling cudaMalloc to avoid multithreading bugs and invalid-access errors. These errors can be masked by P2P-enabled devices, as the driver will try to resolve the issue with a dynamic transfer.

Similarly, cudaFree won't necessarily require a cudaSetDevice on P2P-enabled machines, or code compiled with managed memory. The driver will catch the page fault, and try to move the data/address the free. If the pointer fed to cudaFree is 0, the API call will return without doing anything; though I believe there is an expensive context switch in this case.

Short answer, it's always safer and more performant to run cudaSetDevice for allocations and frees.

unified address space across multiple GPUs

@hcho3 I read the CUDA docs and didn't see any mention that one device can free another device's memory. Allocating memory is device dependent. And besides, #4161 is an example of reading the wrong address space.

If you're using managed memory (e.g. cudaMallocManaged()), data can be stored in a single pointer across all devices within a node, and a call to free will release the memory on all processors (CPU and GPU).

Even with managed memory, you need to cudaSetDevice() to properly configure kernel launches, though.
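A minimal managed-memory sketch of those two points (illustrative only, assuming a two-GPU node; the kernel is made up for the example):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void Scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
  }

  int main() {
    const int n = 1024;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to host and all GPUs
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    cudaSetDevice(0);                          // launches still need the right device
    Scale<<<(n + 255) / 256, 256>>>(data, n);  // runs on device 0
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    Scale<<<(n + 255) / 256, 256>>>(data, n);  // same pointer, now on device 1
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // 4.000000 after both kernels
    cudaFree(data);                     // a single free releases it for all processors
    return 0;
  }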


As I understand it, the problem is really that the deleter for the smart pointer is left as the default. Could we not simply supply a custom deleter to the DeviceShard unique pointers which properly sets the device and then calls free? This may require a significant rewrite, though it seems straightforward.
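One possible shape for that deleter, as a sketch rather than actual xgboost code (the device id would have to be captured when the shard is created):

  #include <cuda_runtime.h>
  #include <memory>

  template <typename ShardT>
  struct DeviceShardDeleter {
    int device_id;  // device that owns the shard's allocations
    void operator()(ShardT *shard) const {
      cudaSetDevice(device_id);  // make the owning device current first
      delete shard;              // the destructor's cudaFree calls now hit the right device
    }
  };

  template <typename ShardT>
  using ShardPtr = std::unique_ptr<ShardT, DeviceShardDeleter<ShardT>>;

  // Hypothetical usage:
  //   ShardPtr<DeviceShard> shard{new DeviceShard{...},
  //                               DeviceShardDeleter<DeviceShard>{device_id}};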

@mt-jones Thanks for the explanations!

and try to move the data/address the free

I'm not sure what you meant by this.

If you're using managed memory,

Sadly no.

Could we not simply supply a custom deleter to DeviceShard unique pointers which properly sets the device and then calls free?

I am thinking about this. There have been requests for de-allocating GPU memory after a training session, which might require us to implement a memory pool. But one way or another, a plain C++ destructor can no longer be used to free GPU memory.
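A very rough sketch of that direction, illustrative only: track raw allocations per device so they can be released explicitly (e.g. after a training session) instead of relying on C++ destructor order.

  #include <cuda_runtime.h>
  #include <cstddef>
  #include <map>
  #include <vector>

  class DeviceMemoryPool {
   public:
    void *Allocate(int device, std::size_t bytes) {
      cudaSetDevice(device);  // allocate on the owning device
      void *ptr = nullptr;
      cudaMalloc(&ptr, bytes);
      allocations_[device].push_back(ptr);
      return ptr;
    }
    void ReleaseAll() {  // called explicitly, not from a destructor
      for (auto &kv : allocations_) {
        cudaSetDevice(kv.first);  // free on the device that owns the memory
        for (void *ptr : kv.second) cudaFree(ptr);
        kv.second.clear();
      }
    }
   private:
    std::map<int, std::vector<void *>> allocations_;
  };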

@mt-jones Thanks for the explanations!

and try to move the data/address the free

I'm not sure what you meant by this.


For illustration, suppose that you have two GPUs which are peer-connected. Now, suppose that one of the GPUs has data stored in void * dev_a. If both GPUs are coordinated through a collection of threads, and every thread calls cudaFree(dev_a), there are two possibilities:

  1. The device that owns dev_a calls cudaFree first
  2. The device that owns dev_a does not call cudaFree first.

With (1), dev_a is physically freed from device memory.
With (2), dev_a is manipulated via direct access and freed by the device that doesn't own dev_a.

As an example, consider the following:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
    float val = 7;  // the true value

    float h_val1;
    float h_val2;
    float *d_val;

    // Allocate on device 0 and round-trip the value.
    cudaSetDevice(0);
    cudaMalloc((void **)&d_val, sizeof(float));
    cudaMemcpy((void *)d_val, (const void *)&val, sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy((void *)&h_val1, (const void *)d_val, sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("the value after transfer is %f \n", h_val1);  // should print 7.000000

    // Free the allocation from device 1.
    cudaSetDevice(1);
    cudaFree((void *)d_val);

    // Read it back from device 0: the memory is gone.
    cudaSetDevice(0);
    cudaMemcpy((void *)&h_val2, (const void *)d_val, sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("the value after the transfer is %f \n", h_val2);  // should print garbage, 0.000000 if there is nothing at that place in memory
    return 0;
  }

In other words, peer-enabled systems with direct access give each GPU the ability to read and write its peers' memory.

Now, if a GPU calls cudaFree(0), nothing will happen. The call immediately passes. So, it's OK to have all GPUs in a peer-connected system call cudaFree provided that after the initial free, the pointer is null, prompting the call to cudaFree to simply pass.
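In code, that convention would look something like this (owning_device and d_ptr are placeholders):

  cudaSetDevice(owning_device);
  cudaFree(d_ptr);  // the one real free, issued by the owning device
  d_ptr = nullptr;  // later cudaFree(d_ptr) calls from any device simply pass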

I suspect the situation involving unique pointers and device shards is considerably more complex, and for that reason, I believe calling cudaSetDevice() in the destructor path is a better solution.


On the subject of memory pools, that might be a more robust solution in terms of dealing with errors of this flavor. It also nets you more performance on multi-GPU systems. That feels like a much larger volume of work, though.

@mt-jones That's really insightful. Thanks!

Glad to help!

Another point I forgot to address is the unified virtual address space on multi-GPU systems.

If you have a single GPU on your system, and you call cudaMalloc, the operation takes t0 seconds.
If you have a dual-GPU system, the same call will take 2*t0.

This is because the driver has to properly sequence the address space associated with both GPUs. There is, in general, a linear relation between the number of GPUs and the cost of cudaMalloc.

Making copies between GPUs on the same system will incur this cost if you need to allocate new memory for the operation.

Note: I've observed situations where the cudaMalloc is worse than linear, but linear is currently the lower-bound.
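A rough way to observe that cost yourself (a sketch; the absolute numbers depend on driver, hardware, and GPU count):

  #include <chrono>
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
    cudaFree(0);  // force context creation so it isn't included in the timing
    void *ptr = nullptr;
    auto start = std::chrono::high_resolution_clock::now();
    cudaMalloc(&ptr, 1 << 20);  // allocate 1 MiB
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> ms = end - start;
    printf("cudaMalloc took %.3f ms\n", ms.count());
    cudaFree(ptr);
    return 0;
  }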

Closing this for now since cudaFree is safe and rarely called. Thanks guys!
