Lightning automatically adds DistributedSampler when you turn on ddp, ddp2 or TPU: https://github.com/PyTorchLightning/pytorch-lightning/blob/17f58d2e1191d61bc5b2b0cfbf1a42dff714ab8e/pytorch_lightning/trainer/data_loading.py#L86
This seems to be a recent change.
This is surprising behavior and not always something that's warranted. For example, it is common (at least in several of our large-scale vision trainers) for each worker to read a specific partition of a large warehouse table. In this case, automatically adding a DistributedSampler means each worker only sees a fraction of its already-partitioned data, which is unintended.
Worse, there's no mechanism at all to override this.
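To make the use case concrete, here's roughly what these trainers do (the table name and the `read_partition` helper are placeholders, not real APIs); each rank's loader already covers exactly its own shard, so wrapping it in a DistributedSampler afterwards silently drops data:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset


class PartitionedTableDataset(Dataset):
    """Each rank reads only its own partition of a large warehouse table."""

    def __init__(self, table="warehouse.events"):
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        # read_partition is a made-up helper standing in for our warehouse reader;
        # it returns only the rows that belong to this rank's partition.
        self.rows = read_partition(table, partition=rank, num_partitions=world_size)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


# The data is already sharded per rank, so a plain shuffled loader is what we want.
# If a DistributedSampler gets added on top, each rank only sees 1/world_size of its
# own partition, i.e. part of the table is never read.
loader = DataLoader(PartitionedTableDataset(), batch_size=32, shuffle=True)
```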
Hi! Thanks for your contribution, great first issue!
on master we have a flag to disable this
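Something like this, assuming the flag is the `replace_sampler_ddp` Trainer argument (check master for the exact name and default):

```python
from pytorch_lightning import Trainer

# Assumes the flag is named replace_sampler_ddp; with it set to False,
# Lightning leaves the DataLoader and its sampler exactly as the user built them.
trainer = Trainer(gpus=2, distributed_backend="ddp", replace_sampler_ddp=False)
trainer.fit(model)  # model: your LightningModule
```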
Ah, thanks @williamFalcon. What about my comment around "if the dataset is iterable-style, never auto-add a distributed sampler" -- how would that even work when you don't have indexes for the dataset?
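With an iterable-style dataset there are no indices to hand to a sampler at all; the only way to shard is inside the dataset's own `__iter__`, roughly like this (a generic sketch, not Lightning code):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset


class ShardedStream(IterableDataset):
    """Iterable-style dataset: no __len__/__getitem__, so a DistributedSampler
    has nothing to index. Sharding has to happen inside __iter__ instead."""

    def __init__(self, records):
        self.records = records  # any iterable of records

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        for i, record in enumerate(self.records):
            if i % world_size == rank:
                yield record
```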
@tullie suggested defaulting to false. Maybe we should do that? I guess turned off is the more intuitive behavior?
“If the dataset is iterable-style, never auto-add a Sampler”
If we had the flag default to off, then this would be taken care of, no? That way we only add it when the user wants it?
@srush thoughts?
Yes, default=off works, and I would prefer that.
Yeah, I still think the add-sampler argument is confusing for anyone who has used PyTorch before. There's an expectation that when you explicitly pass a sampler into the DataLoader, it will use that sampler.
One way to make it automatic while keeping it intuitive to the user would be to add a LightningDataLoader which automatically does things like this. That way the user is buying into a new type of data loader with different expectations when specifying the type.
The "never add sampler when iterable-style dataset" rule makes sense for this specific issue but i'm concerned it doesn't solve the underlying surprising behavior.
+1 on the default. I like the idea of a LightningDataLoader potentially, but it's a new abstraction, so I'd recommend being very stingy with those :)
Yeah, I don't love adding new abstractions. I really want to keep it pure PyTorch or we'll just end up with another hard-to-read library, haha.
So we all agree that default=False is best for this flag?
@PyTorchLightning/core-contributors thoughts?
@srush?
I do like the ability to set gpus=2 and have everything just work, though. I'm torn, but leaning toward default=False.
I wouldn't introduce another data loader here. To make this work, the user would always have to use it, whereas now they can use a Lightning-independent torch DataLoader or even a custom one.
OK, in 0.7.4 we made these changes:
Thank you all for bringing this up!
Thank you!