Lightning automatically adds DistributedSampler when you turn on ddp, ddp2 or TPU: https://github.com/PyTorchLightning/pytorch-lightning/blob/17f58d2e1191d61bc5b2b0cfbf1a42dff714ab8e/pytorch_lightning/trainer/data_loading.py#L86
This seems to be a recent change.
This is surprising behavior and not always something that's warranted. For example, it is common (at least in several of our large-scale vision trainers) for each worker to read a specific partition of a large warehouse table. In this case, automatically adding a DistributedSampler means each worker only sees a fraction of its already-partitioned data, which is unintended.
Worse, there's no mechanism at all to override this.
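To make the use case concrete, here's roughly what these trainers do (the table name and the `read_partition` helper are placeholders, not real APIs); each rank's loader already covers exactly its own shard, so wrapping it in a DistributedSampler afterwards silently drops data:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset


class PartitionedTableDataset(Dataset):
    """Each rank reads only its own partition of a large warehouse table."""

    def __init__(self, table="warehouse.events"):
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        # read_partition is a made-up helper standing in for our warehouse reader;
        # it returns only the rows that belong to this rank's partition.
        self.rows = read_partition(table, partition=rank, num_partitions=world_size)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


# The data is already sharded per rank, so a plain shuffled loader is what we want.
# If a DistributedSampler gets added on top, each rank only sees 1/world_size of its
# own partition, i.e. part of the table is never read.
loader = DataLoader(PartitionedTableDataset(), batch_size=32, shuffle=True)
```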
Hi! Thanks for your contribution, great first issue!
on master we have a flag to disable this
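Something like this, assuming the flag is the `replace_sampler_ddp` Trainer argument (check master for the exact name and default):

```python
from pytorch_lightning import Trainer

# Assumes the flag is named replace_sampler_ddp; with it set to False,
# Lightning leaves the DataLoader and its sampler exactly as the user built them.
trainer = Trainer(gpus=2, distributed_backend="ddp", replace_sampler_ddp=False)
trainer.fit(model)  # model: your LightningModule
```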
Ah, thanks @williamFalcon. What about my comment around "if the dataset is iterable-style, never auto-add a distributed sampler" -- how would that even work when you don't have indexes for the dataset?
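With an iterable-style dataset there are no indices to hand to a sampler at all; the only way to shard is inside the dataset's own `__iter__`, roughly like this (a generic sketch, not Lightning code):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset


class ShardedStream(IterableDataset):
    """Iterable-style dataset: no __len__/__getitem__, so a DistributedSampler
    has nothing to index. Sharding has to happen inside __iter__ instead."""

    def __init__(self, records):
        self.records = records  # any iterable of records

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        for i, record in enumerate(self.records):
            if i % world_size == rank:
                yield record
```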
@tullie suggested defaulting to false. Maybe we should do that? I guess turned off is the more intuitive behavior?
“If the dataset is iterable-style, never auto-add a Sampler”
If we had the flag default to off, then this would be taken care of, no? That way we only add it when the user wants it?
@srush thoughts?
Yes, default=off works, and I would prefer that.
Yeah, I still think the add-sampler argument is confusing for anyone who has used PyTorch before. There's an expectation that when you explicitly pass a sampler into the DataLoader, it will use that sampler.
One way to make it automatic while keeping it intuitive to the user would be to add a LightningDataLoader which automatically does things like this. That way the user is buying into a new type of data loader with different expectations when specifying the type.
The "never add sampler when iterable-style dataset" rule makes sense for this specific issue but i'm concerned it doesn't solve the underlying surprising behavior.
+1 on the default. I like the idea of a LightningDataLoader potentially, but it's a new abstraction, so I'd recommend being very stingy with those :)
Yeah, I don't love adding new abstractions. I really want to keep it pure PyTorch or we'll just end up with another hard-to-read library, haha.
So we all agree that default=False is best for this flag?
@PyTorchLightning/core-contributors thoughts?
@srush?
I do like the ability to set gpus=2 and have everything just work, though. I'm torn, but leaning toward default=False.
I wouldn't introduce another data loader here. To make this work, the user would always have to use it, whereas now they can use a Lightning-independent torch DataLoader or even a custom one.
OK, in 0.7.4 we made these changes:
Thank you all for bringing this up!
Thank you!