Hi all,
I found that when using amp with multiple GPUs, GPU 0 uses much more memory than the others.
| NVIDIA-SMI 410.73 Driver Version: 410.73 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:04:00.0 Off | N/A |
| 36% 30C P8 15W / 250W | 6520MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:05:00.0 Off | N/A |
| 36% 31C P8 1W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:08:00.0 Off | N/A |
| 36% 30C P8 19W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:09:00.0 Off | N/A |
| 36% 30C P8 4W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... Off | 00000000:85:00.0 Off | N/A |
| 36% 28C P8 8W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 36% 30C P8 17W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... Off | 00000000:89:00.0 Off | N/A |
| 36% 29C P8 4W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... Off | 00000000:8A:00.0 Off | N/A |
| 36% 29C P8 21W / 250W | 897MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15079 C python3 887MiB |
| 0 15080 C python3 803MiB |
| 0 15081 C python3 803MiB |
| 0 15082 C python3 803MiB |
| 0 15083 C python3 803MiB |
| 0 15084 C python3 803MiB |
| 0 15085 C python3 803MiB |
| 0 15086 C python3 803MiB |
| 1 15080 C python3 887MiB |
| 2 15081 C python3 887MiB |
| 3 15082 C python3 887MiB |
| 4 15083 C python3 887MiB |
| 5 15084 C python3 887MiB |
| 6 15085 C python3 887MiB |
| 7 15086 C python3 887MiB |
+-----------------------------------------------------------------------------+
Here is my toy code:
import torch
import torch.distributed as dist
import time
from mpi4py import MPI
from apex import amp
import torchvision.models as models
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if __name__ == '__main__':
    dist_backend = 'nccl'
    dist_url = "tcp://localhost:23456"
    dist.init_process_group(backend=dist_backend, init_method=dist_url, rank=rank, world_size=comm.Get_size())
    device = torch.device("cuda:{}".format(rank))
    net = models.resnet18().to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    net, optimizer = amp.initialize(net, optimizer, opt_level="O2")
    time.sleep(100)
The start command is mpirun -n 8 python3 toy.py
Is this by design, or am I using it incorrectly? And is there a way to make memory usage uniform across all GPUs?
This isn't the canonical PyTorch way of initializing distributed training. The upstream launch utility documentation provides a good description:
https://pytorch.org/docs/stable/distributed.html#launch-utility
I've also added a simple example showing how we typically set up distributed training. In particular, note that we call torch.cuda.set_device(args.local_rank) before any model creation, to ensure that each process only allocates tensors on its assigned device unless explicitly told otherwise. This may resolve your issue.
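For anyone who needs to stay on mpirun, here is a minimal sketch of the same idea: work out the local rank, then pin the process to that GPU before creating the model or calling amp.initialize. The helper names resolve_local_rank and pin_process_to_gpu are hypothetical, made up for illustration. It assumes LOCAL_RANK (exported by the PyTorch launcher in its environment-variable mode) or OMPI_COMM_WORLD_LOCAL_RANK (set by Open MPI) is present; other MPI implementations would need a different key. Treat it as untested against the toy script above.

```python
import os


def resolve_local_rank(env=None, default=0):
    """Return this process's rank on its own node (hypothetical helper).

    Assumption: the launcher exports LOCAL_RANK (PyTorch launcher) or
    OMPI_COMM_WORLD_LOCAL_RANK (Open MPI). LOCAL_RANK wins if both exist.
    """
    env = os.environ if env is None else env
    for key in ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return default


def pin_process_to_gpu(local_rank):
    """Make `local_rank` the default CUDA device for this process."""
    # Imported lazily so resolve_local_rank works without PyTorch installed.
    import torch

    # After this call, tensors created with a bare "cuda" device land on
    # this GPU instead of falling back to GPU 0.
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)
```

In the toy script this would mean calling device = pin_process_to_gpu(resolve_local_rank()) right after dist.init_process_group(...) and before models.resnet18().to(device).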
The ImageNet example shows more "industrial grade" use of distributed data sampling for training and validation.
If the canonical PyTorch distributed initialization fails, or there's a reason you must use mpirun instead of torch.distributed.launch, reopen the issue and we can try to figure that out.
@flymark2010 Was this resolved? I'm running into the same problem.
How are you launching your scripts? When I follow the instructions in the ImageNet example with the upstream launcher torch.distributed.launch, I observe that after a few iterations memory usage across devices stabilizes and is relatively uniform.
By the way I recommend using torch.nn.parallel.DistributedDataParallel instead of apex.parallel.DistributedDataParallel. The Torch version is really good these days and the Apex version is mostly to support internal use cases.
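A rough sketch of what that swap could look like, assuming the process group is already initialized and each process drives exactly one GPU. ddp_kwargs and wrap_with_torch_ddp are hypothetical helper names, not part of PyTorch or Apex.

```python
def ddp_kwargs(local_rank):
    """Keyword arguments for a one-GPU-per-process DDP wrapper (hypothetical helper)."""
    return {"device_ids": [local_rank], "output_device": local_rank}


def wrap_with_torch_ddp(model, local_rank):
    """Wrap `model` in torch.nn.parallel.DistributedDataParallel."""
    # Lazy imports so ddp_kwargs can be used without PyTorch installed.
    import torch
    from torch.nn.parallel import DistributedDataParallel

    # Make this GPU the default device, then move the model onto it.
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Assumes dist.init_process_group(...) has already been called.
    return DistributedDataParallel(model, **ddp_kwargs(local_rank))
```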