Pytorch-lightning: NCCL error using DDP and PyTorch 1.7

Created on 29 Oct 2020 · 12 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

Getting this error when attempting to use ddp with the "getting started" autoencoder example:

Stack Trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
(both ranks fail with the same traceback)

Traceback (most recent call last):
  File "01_getting_started_autoencoder.py", line 66, in <module>
    modle, trainer = cli_main()
  File "01_getting_started_autoencoder.py", line 60, in cli_main
    trainer.fit(model, train_dl)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
    torch_backend, rank=global_rank, world_size=world_size
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

To Reproduce

Follow the code in the "getting started" guide, passing these parameters to Trainer:

model = LitAutoEncoder()
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
trainer.fit(model, train_dl)
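For context, the string '1,2' selects specific device indices rather than a count; if I read the 1.0.x Trainer docs correctly, these spellings relate as sketched below:

# assumption: per the 1.0.x Trainer docs, gpus accepts an int (a count),
# a comma-separated string of indices, or a list of device indices
trainer = pl.Trainer(gpus=[1, 2], distributed_backend='ddp')  # devices 1 and 2
trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')   # same devices, string form
trainer = pl.Trainer(gpus=2, distributed_backend='ddp')       # first two visible GPUs (indices 0 and 1)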

Expected behavior

For it to train on multiple GPUs :)

Environment

  • PyTorch Version: 1.7
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): n/a
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2/7.6.5
  • GPU models and configuration: 2x GTX 1080 Ti
  • Any other relevant information: n/a
Labels: 3rd-party, DDP, Priority P0, bug / fix, help wanted

All 12 comments

Hi, thanks for reporting.
The autoencoder example runs fine for me.
Could you please let me know the Lightning version you are using?
We recently fixed a bug, please use 1.0.4 or newer.
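(If you want to double-check which versions the active environment actually imports, a trivial sketch, nothing Lightning-specific:)

import pytorch_lightning as pl
import torch

print(pl.__version__)     # expect 1.0.4 or newer
print(torch.__version__)  # the thread below narrows the failure to 1.7.x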

Yah I'm using 1.0.4

Here's the full source for my .py file:

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from torch.utils.data import random_split


# define pl module
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the training loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)

        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# define datasets/dataloaders
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dl = DataLoader(dataset)


# train
model = LitAutoEncoder()
trainer = pl.Trainer(gpus='0,1', distributed_backend='ddp')
trainer.fit(model, train_dl)

OK, I can confirm this only happens on PyTorch 1.7.

I have the same issue on 1080 Ti; with V100 GPUs everything works fine.

@maxjeblick sounds like a driver issue?

Edit:
Certainly very odd that NCCL is bugging out only with 1080ti GPUs...

I tested the following with our examples:
ddp 1080ti pytorch 1.7: error
ddp 1080ti pytorch 1.6: good
ddp 2080ti pytorch 1.7: good
ddp 2080ti pytorch 1.6: good

so far I was not able to reproduce it with plain PyTorch examples :( need to dig deeper
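
If anyone wants more signal while digging, NCCL's own logging can be turned on; a minimal sketch (NCCL_DEBUG is a standard NCCL environment variable, though whether it pinpoints this particular "invalid usage" is an assumption):

import os

# Standard NCCL debug switches; set them before any DDP/NCCL initialization.
# Lightning's ddp backend re-launches the script, so values set at the very
# top of the script propagate to the child ranks as well.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional, very verbose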

I can confirm the same error with the latest Lightning and PyTorch on Tesla V100s. It does not happen on a single node with 2 GPUs, but once I go to multiple nodes the error appears.

Same error with A100 GPUs.

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

I have the same issue with 2x 2080 Ti on Ubuntu 20.04 using PyTorch 1.7 and CUDA 11.
Downgrading to PyTorch 1.6 and CUDA 10.2 fixes the issue.
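
(For anyone trying the same downgrade: pip install torch==1.6.0 torchvision==0.7.0 should be the matching pair, going by the official PyTorch/torchvision compatibility table.)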

Could it be this fix in pytorch?
https://github.com/pytorch/pytorch/issues/47257
Exact same error (line number and stack messages).
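
For what it's worth, the workaround that comes up in that PyTorch thread is to pin each rank to its own GPU before the process group is created. A minimal standalone sketch, outside of Lightning, assuming a launcher that exports LOCAL_RANK (e.g. torch.distributed.launch --use_env) and that the barrier-on-default-device behavior is indeed the culprit:

import os
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK are set by the
# launcher. Pinning the device *before* init_process_group keeps the barrier()
# that PyTorch 1.7 runs inside it from landing on cuda:0 for every rank.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl", init_method="env://")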

PyTorch closed their issue because this issue exists, and you closed this issue because their issue exists...

@julian3xl are you referring to the one I posted? I was under the impression that the fix was merged into pytorch master.
I will check if it's fixed
