Apex: Memory used on GPU0 is much more than others after amp.initialize

Created on 14 Mar 2019 · 3 Comments · Source: NVIDIA/apex

Hi all,
I found that when using amp with multiple GPUs, GPU 0 uses much more memory than the others.

| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:04:00.0 Off |                  N/A |
| 36%   30C    P8    15W / 250W |   6520MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 36%   31C    P8     1W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:08:00.0 Off |                  N/A |
| 36%   30C    P8    19W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 36%   30C    P8     4W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:85:00.0 Off |                  N/A |
| 36%   28C    P8     8W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 36%   30C    P8    17W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 36%   29C    P8     4W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 36%   29C    P8    21W / 250W |    897MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15079      C   python3                                      887MiB |
|    0     15080      C   python3                                      803MiB |
|    0     15081      C   python3                                      803MiB |
|    0     15082      C   python3                                      803MiB |
|    0     15083      C   python3                                      803MiB |
|    0     15084      C   python3                                      803MiB |
|    0     15085      C   python3                                      803MiB |
|    0     15086      C   python3                                      803MiB |
|    1     15080      C   python3                                      887MiB |
|    2     15081      C   python3                                      887MiB |
|    3     15082      C   python3                                      887MiB |
|    4     15083      C   python3                                      887MiB |
|    5     15084      C   python3                                      887MiB |
|    6     15085      C   python3                                      887MiB |
|    7     15086      C   python3                                      887MiB |
+-----------------------------------------------------------------------------+

Here is my toy code:

import torch
import torch.distributed as dist
import time
from mpi4py import MPI
from apex import amp
import torchvision.models as models

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if __name__=='__main__':
  dist_backend = 'nccl'
  dist_url = "tcp://localhost:23456"
  dist.init_process_group(backend=dist_backend, init_method=dist_url, rank=rank, world_size=comm.Get_size())
  device = torch.device("cuda:{}".format(rank))
  net = models.resnet18().to(device)

  optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
  net, optimizer = amp.initialize(net, optimizer, opt_level="O2")
  time.sleep(100)

The start command is mpirun -n 8 python3 toy.py

Is this by design, or am I using it the wrong way? And is there a way to make memory usage uniform across all GPUs?

flymark2010

Most helpful comment

How are you launching your scripts? When I follow the instructions in the Imagenet example with the upstream launcher torch.distributed.launch, I observe that after a few iterations memory usage across devices stabilizes and is relatively uniform.

By the way, I recommend using torch.nn.parallel.DistributedDataParallel instead of apex.parallel.DistributedDataParallel. The Torch version is really good these days, and the Apex version exists mostly to support internal use cases.
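
For illustration only, here is a minimal sketch of wrapping an amp-initialized model in torch.nn.parallel.DistributedDataParallel. It assumes the process group has already been initialized and that local_rank is this process's GPU index on the node. The helper names ddp_device_kwargs and build_ddp_model are hypothetical, and the resnet18/SGD setup is borrowed from the toy code in the question.

```python
def ddp_device_kwargs(local_rank):
    """Keyword arguments that bind one DDP replica to a single GPU."""
    return {"device_ids": [local_rank], "output_device": local_rank}


def build_ddp_model(local_rank):
    """Build resnet18 with amp O2 and wrap it in upstream DDP.

    Assumes dist.init_process_group() has already been called.
    """
    # Heavy imports live here so the helper above stays importable anywhere.
    import torch
    import torchvision.models as models
    from apex import amp
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Make this GPU the current device before allocating anything.
    torch.cuda.set_device(local_rank)
    net = models.resnet18().cuda()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

    # Apex's documentation asks for amp.initialize before the DDP wrapper.
    net, optimizer = amp.initialize(net, optimizer, opt_level="O2")
    net = DDP(net, **ddp_device_kwargs(local_rank))
    return net, optimizer
```

Passing device_ids and output_device explicitly keeps each replica on a single GPU, which is the one-process-per-device layout this thread is about.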

mcarilli on 30 Oct 2019

👍2

All 3 comments

This isn't the canonical PyTorch way of initializing distributed training. The upstream launch utility documentation provides a good description:
https://pytorch.org/docs/stable/distributed.html#launch-utility
I've also added a simple example showing how we typically set up distributed training. In particular, note that we call torch.cuda.set_device(args.local_rank) before any model creation, to ensure that each process only allocates tensors on its assigned device unless explicitly told otherwise. This may resolve your issue.
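
The linked example is not reproduced here. The following is only a rough sketch of the pattern described above, under the assumption that the script is started with torch.distributed.launch, which passes --local_rank to each process and sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment. The helper name parse_local_rank is hypothetical, and the model and optimizer are taken from the toy code in the question.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the per-node rank supplied by the launcher.

    torch.distributed.launch passes --local_rank=<n>; the LOCAL_RANK
    environment variable is used as a fallback when the flag is absent.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args, _ = parser.parse_known_args(argv)
    return args.local_rank


def main():
    import torch
    import torch.distributed as dist
    import torchvision.models as models
    from apex import amp

    local_rank = parse_local_rank()

    # Select the device before any model creation, so that implicit
    # allocations land on this process's GPU rather than on GPU 0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    net = models.resnet18().cuda()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    net, optimizer = amp.initialize(net, optimizer, opt_level="O2")


# Run only when started by the launcher, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

A matching launch command would be python3 -m torch.distributed.launch --nproc_per_node=8 toy.py. Whether this removes the extra allocations on GPU 0 in the reported setup has not been confirmed in this thread.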

The Imagenet example shows more "industrial grade" use of distributed data sampling for training and validation.

If the canonical PyTorch distributed initialization fails, or there's a reason you must use mpirun as opposed to torch.distributed.launch, reopen and we can try to figure that out.
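
If mpirun has to stay, the same set_device idea can in principle be applied to the original toy script. The sketch below is an untested adaptation, not a confirmed fix. It assumes Open MPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_LOCAL_RANK to each process, and falls back to rank modulo GPU count otherwise. The helper name local_rank_from_mpi is hypothetical.

```python
import os


def local_rank_from_mpi(global_rank, gpus_per_node, env=None):
    """Pick a GPU index for this MPI process.

    Prefers Open MPI's OMPI_COMM_WORLD_LOCAL_RANK (an assumption about
    the MPI implementation); otherwise uses rank modulo GPUs per node.
    """
    env = os.environ if env is None else env
    value = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if value is not None:
        return int(value)
    return global_rank % gpus_per_node


def main():
    import torch
    import torch.distributed as dist
    import torchvision.models as models
    from apex import amp
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    local_rank = local_rank_from_mpi(rank, torch.cuda.device_count())

    # Pin the process to its GPU before NCCL, the model, or amp touch CUDA.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://localhost:23456",
                            rank=rank, world_size=comm.Get_size())

    net = models.resnet18().cuda()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    net, optimizer = amp.initialize(net, optimizer, opt_level="O2")


# Run only under Open MPI's mpirun, which sets OMPI_COMM_WORLD_RANK.
if __name__ == "__main__" and "OMPI_COMM_WORLD_RANK" in os.environ:
    main()
```

The only change from the toy code is the torch.cuda.set_device call placed ahead of every other CUDA-related step; the start command mpirun -n 8 python3 toy.py stays the same.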

mcarilli on 14 Mar 2019

👍1

@flymark2010 Is it resolved? I encountered the same problem.