I see that the LightningTemplateModel in 0.7.5 (this is no longer the case on master) manually averages the metrics in validation_epoch_end for DP and DDP2:
https://github.com/PyTorchLightning/pytorch-lightning/blob/694f1d789dfa56b365b68dd4f3c6f5f7a4c8970a/pl_examples/models/lightning_template.py#L167-L168
But what about DDP? I get that each device can have its own loss for backward, but we want a single metric across devices. How is that achieved? (And is averaging the best way to aggregate most metrics anyway?)
My guess was that only the train dataloader uses a DistributedSampler, not val/test. In other words, each process evaluates the entire val/test sets and only rank 0 reports (e.g. logs) the metrics. Apparently this used to be the case, but #1192 changed the val/test sets to use DistributedSampler too. So I think some aggregation must be done?
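For concreteness, this is roughly what I imagine manual aggregation would have to look like (just my own sketch, not something from the docs): all_reduce the per-process mean in validation_epoch_end and divide by the world size.

    import torch
    import torch.distributed as dist

    def validation_epoch_end(self, outputs):
        # after #1192 each DDP process only sees its own shard of the val set,
        # so this mean is a per-process value
        val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        # sum the per-process means and divide by the world size
        # (assumes every process sees roughly the same number of batches)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
            val_loss = val_loss / dist.get_world_size()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}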
@alexeykarnachev mind having a look, pls? ^^
I found this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/bd49b07fbba09b1e7d8851ee5a1ffce3d5925e9e/pytorch_lightning/metrics/metric.py#L46-L54
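If I read that right, the idea is to subclass it, implement forward, and let the base class handle the DDP sync. Something like this, I think (the constructor arguments are just my reading of the linked code, so take it as a sketch):

    import torch
    from pytorch_lightning.metrics.metric import TensorMetric

    class RMSE(TensorMetric):
        def __init__(self):
            # name/reduce_op arguments as I understand them from the linked base class
            super().__init__(name='rmse')

        def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
            # the returned tensor should get reduced across DDP processes by the base class
            return torch.sqrt(torch.mean((pred - target) ** 2))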
But if I don't want the overhead of creating a class for simple one-liner metrics, and/or have metrics that can't be easily reduced, is there a way to let the dev/test dataloaders load the entire datasets as pre-#1192? The only way I can think of is to set replace_sampler_ddp=False and manually add the DistributedSampler to the training dataloader, with something like
    def load_dataset(self, mode, batch_size):
        ...  # build the dataset/dataloader here
        if mode == 'train':
            # temporarily re-enable sampler replacement so that only the
            # training dataloader gets wrapped in a DistributedSampler
            self.trainer.replace_sampler_ddp = True
            dataloader = self.trainer.auto_add_sampler(dataloader, True)
            self.trainer.replace_sampler_ddp = False
        return dataloader
This feels kind of hacky though. If there were an option like replace_evaluation_sampler_ddp, it would be much more straightforward.
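For what it's worth, a slightly more direct version of the same workaround would be to keep replace_sampler_ddp=False and attach the DistributedSampler myself instead of going through auto_add_sampler (a sketch; self.train_dataset, self.val_dataset and self.batch_size are placeholder attributes):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def train_dataloader(self):
        # shard only the training set across DDP processes
        sampler = DistributedSampler(self.train_dataset)
        return DataLoader(self.train_dataset, batch_size=self.batch_size, sampler=sampler)

    def val_dataloader(self):
        # every process evaluates the full validation set, as pre-#1192
        return DataLoader(self.val_dataset, batch_size=self.batch_size)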
@ZhaofengWu could you please provide a minimal runnable script that reproduces the problem?
Sorry, but it's not a problem/bug in the code. It's just a question: what's the proper way to aggregate metrics under DDP if we don't want the overhead of subclassing the TensorMetric mentioned above? If "letting the dev/test dataloaders read the entire datasets" is the answer, what's the best way to do that?
I have the same problem
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
My current workaround is to use pl.metrics.converters._sync_ddp_if_available.
You can also use the pl.metrics.converters.sync_ddp decorator, but this means your metric will sync at each forward pass.
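Roughly like this (a sketch; I believe the helper all_reduces with a sum by default and is a no-op outside DDP, but double-check against your version):

    import torch
    import torch.distributed as dist
    from pytorch_lightning.metrics.converters import _sync_ddp_if_available

    def validation_epoch_end(self, outputs):
        # per-process mean over this process's shard of the val set
        val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        # sum the per-process values across DDP processes (no-op without DDP)
        val_acc = _sync_ddp_if_available(val_acc)
        if dist.is_available() and dist.is_initialized():
            # turn the sum back into an average over processes
            val_acc = val_acc / dist.get_world_size()
        return {'log': {'val_acc': val_acc}}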
Actually, the lightning_template (https://github.com/PyTorchLightning/pytorch-lightning/blob/7cca3859a7b97a9ab4a6c6fb5f36ff94bff7f218/pl_examples/models/lightning_template.py) doesn't subclass Metric, which, if I understand correctly, means that it only logs the metrics computed on rank 0, and the same goes for the loss.
@aaronma2020 mind providing a minimal running example?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!