Pytorch-lightning: multi-gpu ddp calls validation and testing loops too many times

Created on 16 Mar 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

When using ddp with multiple GPUs, the validation and test loops each run over the entire validation/test dataset on every GPU, so results are duplicated across processes.

Expected behavior is that the dataset is divided appropriately across the GPUs, as illustrated below.
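For context, here is a rough illustration of how a DistributedSampler is expected to partition a dataset across ranks (the tiny dataset and the world size of 2 are just for demonstration):

```python
# Illustration only: how DistributedSampler splits indices across ranks.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
for rank in range(2):  # pretend world_size == 2
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(rank, list(sampler))
# rank 0 -> [0, 2, 4, 6], rank 1 -> [1, 3, 5, 7]
```

What I observe instead is that every GPU iterates over all of the samples.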

I am using current master (cloned Mar 14), Ubuntu 19.10, CUDA 10.1, Python 3.7.5, PyTorch 1.4, in a venv environment.

The problem appears to be in auto_add_sampler() in data_loading.py. It does not create a DistributedSampler for validation or test datasets.
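As a workaround until this is fixed, the sampler can be attached manually in the LightningModule. This is only a sketch for my setup: `self.val_dataset`, `self.batch_size`, and `num_workers` are placeholders, and it assumes Lightning's ddp backend has already initialized `torch.distributed` so the sampler can pick up the rank and world size.

```python
# Workaround sketch: explicitly wrap the validation set in a DistributedSampler.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def val_dataloader(self):
    # With no explicit sampler, each ddp process currently iterates over the
    # whole validation set; the sampler restricts it to this rank's shard.
    sampler = DistributedSampler(self.val_dataset, shuffle=False)
    return DataLoader(self.val_dataset,
                      batch_size=self.batch_size,
                      sampler=sampler,
                      num_workers=4)
```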

bug / fix help wanted

All 6 comments

Pulled the latest master about an hour ago; the behavior is no longer present. Closing.

Sorry - this issue still exists in some configurations. My proposed fix is not the total picture. Still investigating - will provide a reproducible example.

Testing underway. Will make PR tomorrow.

Don't want to clutter up the PR queue if no one is interested in this. Let me know...

That sounds like a good contribution to me... mind sending a PR?
Any suggestions, @PyTorchLightning/core-contributors?
On a technical note, when you refer to some state of master, please use the commit hash, as there can be multiple commits each day...

Will do on both the PR and the commit hash reference.
