Pytorch-lightning: Wrong AUROC when running training_step in DDP mode

Created on 20 Nov 2020  ·  3 Comments  ·  Source: PyTorchLightning/pytorch-lightning

πŸ› Bug


I run a binary classification model and compute AUC as a performance indicator. The first time I ran the code on a single GPU it worked well, but when I then tried 4 GPUs with the DDP backend, the AUC became very strange: it seemed to just sum the AUCs of the 4 GPUs. I use pl.metrics.AUROC() to compute the AUC, and my PL version is 0.9.0.

Please reproduce using the BoringModel and post here

Here is an example of my code
https://colab.research.google.com/drive/1d-3JTypoQdbPWQFFW_vqkBxDprqIVnFD?usp=sharing

I define a random dataset

import numpy as np
from torch.utils.data import Dataset


class RandomDataset(Dataset):
    def __init__(self):
        self.len = 8
        # 8 one-dimensional samples with binary labels
        self.data = np.array([1, 5, 2, 6, 3, 7, 4, 8], dtype=np.float32).reshape([-1, 1])
        self.label = np.array([1, 1, 0, 0, 0, 1, 0, 0], dtype=np.float32)

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return self.len

and use seed_everything(42).
In the example I first run on a single GPU with batch size 8 for 1 epoch and get an AUC of 0.5.
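For context, here is a minimal sketch of the LightningModule (an assumption based on the model summary and progress-bar keys in the logs below, not the exact Colab code): a Linear model, pl.metrics.AUROC(), and BCEWithLogitsLoss, logged via the 0.9.0-style dict return. Validation is analogous and omitted here.

import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader


class BoringClassifier(pl.LightningModule):
    def __init__(self, batch_size=8):
        super().__init__()
        self.batch_size = batch_size      # 8 in the single-GPU run, 4 with 2 GPUs
        self.model = nn.Linear(1, 1)      # exact layer shape is an assumption
        self.auc = pl.metrics.AUROC()     # the metric this issue is about
        self.loss = nn.BCEWithLogitsLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        # roughly produces the "Rank cuda:X batch: ..." lines in the logs below
        print('Rank', x.device, 'batch:', x.flatten(), 'batch_idx:', batch_idx)
        logits = self.model(x).squeeze(-1)
        loss = self.loss(logits, y)
        # in 0.9.x the metric result appears to be summed across DDP processes,
        # which is what produces the auc=1.33 seen in the 2-GPU run below
        auc = self.auc(torch.sigmoid(logits), y)
        print('loss', loss)
        print('auc', auc)
        return {
            'loss': loss,
            'progress_bar': {'train_loss': loss, 'auc': auc},
            'log': {'train_loss': loss, 'auc': auc},
        }

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=self.batch_size)

The single-GPU output: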

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)

  | Name  | Type              | Params
--------------------------------------------
0 | model | Linear            | 4
1 | auc   | AUROC             | 0
2 | loss  | BCEWithLogitsLoss | 0
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]
Val part rank cuda:0 batch: tensor([1., 5., 2., 6., 3., 7., 4., 8.], device='cuda:0') batch_idx: 0

loss tensor(24.4544, device='cuda:0')
auc tensor(0.5000, device='cuda:0')
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]
Rank cuda:0 batch: tensor([1., 5., 2., 6., 3., 7., 4., 8.], device='cuda:0') batch_idx: 0

loss tensor(24.4544, device='cuda:0',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
auc tensor(0.5000, device='cuda:0')
Epoch 0:  50%|████████████████████                    | 1/2 [00:00<00:00, 130.40it/s, loss=24.454, v_num=20, train_loss=24.5, auc=0.5]
Val part rank cuda:0 batch: tensor([1., 5., 2., 6., 3., 7., 4., 8.], device='cuda:0') batch_idx: 0

loss tensor(24.4479, device='cuda:0')
auc tensor(0.5000, device='cuda:0')
Epoch 0: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 82.69it/s, loss=24.454, v_num=20, train_loss=24.5, auc=0.5, vla_loss=24.4, val_auc=0.5]
Saving latest checkpoint..
end

Then I use 2 GPUs with batch size 4 for 1 epoch and get an AUC of 1.33.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [1,2]
2 GPU with DDP
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)

  | Name  | Type              | Params
--------------------------------------------
0 | model | Linear            | 4
1 | auc   | AUROC             | 0
2 | loss  | BCEWithLogitsLoss | 0
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]
Rank cuda:1 batch: tensor([1., 6., 7., 4.], device='cuda:1') batch_idx: 0

Rank cuda:1 batch: tensor([3., 8., 2., 5.], device='cuda:1') batch_idx: 0

loss tensor(28.9794, device='cuda:1',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
auc tensor(1.3333, device='cuda:1')

loss tensor(19.9294, device='cuda:1',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
auc tensor(1.3333, device='cuda:1')
Epoch 0:  50%|████████████████████                    | 1/2 [00:00<00:00, 81.04it/s, loss=28.979, v_num=21, train_loss=29, auc=1.33]
Val part rank cuda:1 batch: tensor([5., 6., 7., 8.], device='cuda:1') batch_idx: 0

Val part rank cuda:1 batch: tensor([1., 2., 3., 4.], device='cuda:1') batch_idx: 0

loss tensor(12.5368, device='cuda:1')
auc tensor(0.6667, device='cuda:1')

loss tensor(36.3589, device='cuda:1')
auc tensor(0.6667, device='cuda:1')
Saving latest checkpoint..
end
Epoch 0: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 51.66it/s, loss=28.979, v_num=21, train_loss=29, auc=1.33, vla_loss=12.5, val_auc=0.667]
end
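A quick sanity check on the numbers (my own illustration, not part of the original output): an AUC above 1 is impossible, and 1.3333 is exactly 2 × 0.6667, which fits the summing of per-process AUCs described in the comments below.

# illustration only: sum-reduction over 2 DDP processes turns a valid
# per-rank AUC of 2/3 into the impossible 1.3333 shown in the progress bar
per_rank_auc = 2.0 / 3.0
num_ddp_processes = 2
print(num_ddp_processes * per_rank_auc)  # 1.333...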

Expected behavior

Compute the correct AUC.

Environment

  • PyTorch Version: 1.6.0
  • OS: Ubuntu 18.04
  • How you installed PyTorch: pip
  • Python version: 3.6
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 2080Ti
Labels: Metrics, Working as intended, enhancement, help wanted

All 3 comments

Prior to v1.0, metrics did not do custom accumulation; instead they relied on taking the sum or mean over the processes.
From v1.0, all metrics are implemented with custom accumulation so that we get the correct result when running in DDP mode. However, as of now AUROC has not been updated to the v1.0 interface; it is definitely on the roadmap for the foreseeable future.
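For illustration, the v1.0-style custom accumulation looks roughly like this (a sketch built on the pl.metrics.Metric base class, using sklearn for the final computation; the eventual built-in AUROC may be implemented differently):

import torch
from pytorch_lightning.metrics import Metric
from sklearn.metrics import roc_auc_score


class DDPSafeAUROC(Metric):
    """Sketch of a DDP-safe AUROC via the v1.0 Metric API (not the official implementation)."""

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # list states with dist_reduce_fx="cat" are concatenated across processes
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("target", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        # each DDP process appends only its own shard
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        # after syncing, preds/target hold the data from all processes,
        # so the AUC is computed once over the full dataset
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        target = torch.cat(self.target) if isinstance(self.target, list) else self.target
        return torch.tensor(roc_auc_score(target.cpu().numpy(), preds.cpu().numpy()))

The key difference from the 0.9 behaviour is that the raw predictions and targets are synced across processes rather than the per-process AUC values, so nothing is summed that should not be.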

As stated, you seem to be using an old version of PL. We have reimplemented the whole metrics package, but at the moment AUROC is missing; we would be happy if you want to send a PR 🐰

@SkafteNicki @Borda Thank you, PL is a great training framework for new PyTorch users like me. Looking forward to the new updates!
