Getting this error when attempting to use ddp with the "getting started" autoencoder example:
Stack Trace:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
(Both DDP processes print the same traceback; their interleaved output is shown once below.)
Traceback (most recent call last):
  File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 66, in <module>
    modle, trainer = cli_main()
  File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 60, in cli_main
    trainer.fit(model, train_dl)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
    torch_backend, rank=global_rank, world_size=world_size
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Follow the code in the getting started guide with these parameters to Trainer:
model = LitAutoEncoder()
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
trainer.fit(model, train_dl)
for it to train on multiple GPUs :)
How you installed PyTorch (conda, pip, source): pip
Hi, thanks for reporting.
The autoencoder example runs fine for me.
Could you please let me know the Lightning version you are using?
We recently fixed a bug, please use 1.0.4 or newer.
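In case the environment is picking up an older install, a quick sanity-check sketch (run with the same interpreter that runs the training script):

# sanity check: make sure the interpreter running the script sees Lightning >= 1.0.4
import pytorch_lightning as pl
print(pl.__version__)  # expect '1.0.4' or newer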
Yeah, I'm using 1.0.4.
Here's the full source for my .py file:
import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from torch.utils.data import random_split


# define pl module
class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# define datasets/dataloaders
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dl = DataLoader(dataset)

# train
model = LitAutoEncoder()
trainer = pl.Trainer(gpus='0,1', distributed_backend='ddp')
trainer.fit(model, train_dl)
OK, I can confirm this is only happening on PyTorch 1.7.
I have the same issue on 1080 Ti; with V100 GPUs everything works fine.
@maxjeblick sounds like a driver issue?
Edit:
Certainly very odd that NCCL is bugging out only with 1080ti GPUs...
I tested the following with our examples:
ddp 1080ti pytorch 1.7: error
ddp 1080ti pytorch 1.6: good
ddp 2080ti pytorch 1.7: good
ddp 2080ti pytorch 1.6: good
so far I was not able to reproduce it with plain PyTorch examples :( need to dig deeper
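In case it helps with the digging: a minimal pure-PyTorch sketch of the code path the trace ends in (init_process_group with the NCCL backend, which on 1.7 finishes with the barrier() call that fails). The master address/port and the mp.spawn launch are illustrative choices, not taken from the Lightning examples:

# minimal sketch, assumes 2 visible GPUs on one machine
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # illustrative rendezvous settings
    os.environ["MASTER_PORT"] = "29500"
    torch.cuda.set_device(rank)               # pin each process to its own GPU
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dist.barrier()                            # same collective that raises the NCCL error above
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)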
I can confirm the same error using the latest Lightning and PyTorch using Tesla V100s. Does not happen on a single node with 2 GPUs, but once I go to multiple nodes the error happens.
Same error with A100 GPUs.
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
I have the same issue with 2x 2080 Ti on Ubuntu 20.04 using PyTorch 1.7 and CUDA 11.
Downgrading to PyTorch 1.6 and CUDA 10.2 fixes the issue.
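For anyone comparing working vs. failing setups, a small sketch to print the versions involved (on these builds torch.cuda.nccl.version() returns a plain integer, e.g. 2708 for NCCL 2.7.8; newer builds may report it differently):

# report the torch / CUDA / NCCL combination in use
import torch
print(torch.__version__)           # e.g. 1.7.0 vs. 1.6.0
print(torch.version.cuda)          # e.g. 11.0 vs. 10.2
print(torch.cuda.nccl.version())   # e.g. 2708 -> NCCL 2.7.8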
Could it be this fix in pytorch?
https://github.com/pytorch/pytorch/issues/47257
Exact same error (same line number and stack messages).
PyTorch closed their issue because this issue exists, and you close this issue because their issue exists...
@julian3xl are you referring to the one I posted? I was under the impression that the fix was merged into pytorch master.
I will check if it's fixed