Hi,
I need the whole validation set's outputs to compute the validation result.
Currently, validation_epoch_end only gets the outputs from the current process.
Can I gather the outputs from the different GPUs and then run validation_epoch_end? Also, I don't necessarily need it to run on all processes; I only need it to run once.
How can I achieve that?
I found an ugly solution.
I use detectron2.utils.comm.all_gather and gather to achieve what I want.
But it would be better if something built into Lightning worked for this purpose.
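A minimal sketch of that workaround, assuming detectron2 is installed (comm.all_gather gathers arbitrary picklable objects across ranks, and comm.is_main_process is detectron2's rank-0 check):

```python
from detectron2.utils import comm

def validation_epoch_end(self, outputs):
    # gather each process's output list; returns one entry per rank
    all_outputs = comm.all_gather(outputs)
    if not comm.is_main_process():
        return {}
    # flatten into one list covering the whole validation set
    outputs = [x for rank_outputs in all_outputs for x in rank_outputs]
    # ... run the full-set evaluation on rank 0 only ...
```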
I have something like this in my validation loop:

```python
def validation_epoch_end(self, outputs):
    loss_val = torch.stack([x['val_loss'] for x in outputs]).mean()
    log_dict = {'validation_loss': loss_val, 'step': self.current_epoch}
    return {'log': log_dict, 'val_loss': log_dict['validation_loss'], 'progress_bar': log_dict}
```
but this takes only one process into account. Is there a way to aggregate all the batches when using DDP? Is using dist.all_gather the way to do it?
I also see some references to training_step_end and validation_step_end. Are those something we can use for this? I haven't found many examples of them. Could someone be so kind as to post an example of how one might use these to collect the data from all batches?
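For tensor outputs, here is a minimal sketch of the dist.all_gather approach, assuming every process produces a same-shaped scalar loss and a process group is initialized:

```python
import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # per-process mean over the batches this process saw
    loss_val = torch.stack([x['val_loss'] for x in outputs]).mean()

    # gather the per-process means into a list, one tensor per rank
    gathered = [torch.zeros_like(loss_val) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, loss_val)

    # every process now computes the same global mean
    loss_val = torch.stack(gathered).mean()
    return {'val_loss': loss_val, 'log': {'validation_loss': loss_val}}
```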
dist.all_gather only supports torch tensors, I believe. detectron2's all_gather can gather anything. It works well for me now.
I see a few other posts that have been closed, so I am assuming this has somehow been sorted and is possible. I also opened another issue about this: https://github.com/PyTorchLightning/pytorch-lightning/issues/2435
Hope someone can help :)
Currently I use this to gather the other processes' outputs:

```python
def test_epoch_end(self, outputs):
    # average over the batches seen by this process
    avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
    avg_acc = torch.stack([x['test_acc'] for x in outputs]).mean()
    # sum across processes, then divide by world size to get the global mean
    dist.all_reduce(avg_acc, op=dist.ReduceOp.SUM)
    avg_acc = avg_acc / self.trainer.world_size
```
Now the new TensorMetric class can do dist_all_reduce automatically, but it only supports dist.ReduceOp, so you have to divide by self.trainer.world_size yourself if you want the MEAN. If you have dict outputs, you can iterate over all the values and use dist_all_reduce on each, as in the sketch below.
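A minimal sketch of that dict reduction, assuming all the values are CUDA tensors and a process group is initialized:

```python
import torch.distributed as dist

def reduce_dict_mean(metrics, world_size):
    # all-reduce every tensor value in-place, then divide to get the mean
    for key, value in metrics.items():
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        metrics[key] = value / world_size
    return metrics
```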
this is fixed on master
@williamFalcon
I don't think it fixes my original question. I asked whether there is a way for validation_epoch_end to gather the outputs from all the GPUs and then do the evaluation.
The reason is that, in my case, the output of the validation set has to be evaluated as a whole, so I can't simply calculate the score in each process and all-reduce the results.
this is fixed on master
@williamFalcon
When you say fixed on master, what do you mean? Will it do this automatically, or do we need to do something ourselves? Can you point us to the relevant bit in the Lightning source code?
@dagap @ruotianluo install the latest version!
pip install pytorch-lightning==0.8.4 --upgrade
@williamFalcon Thanks for that. So we do not need to do anything on the caller's end? It will do this automagically in validation/test modes?
yup!
We handle all the cross GPU syncing automagically
@williamFalcon I am still confused.
What is the expected behavior? Is there any doc related to this?
I guess a concrete example would be helpful. If I have 2 GPUs with DDP and the validation set has 500 batches, each GPU gets 250 batches.
What is the size of the input argument outputs for validation_epoch_end?
The behavior is transparent to you: if you calculate accuracy on 2 GPUs, we average the accuracy across the GPUs for you.
```python
# GPU 0
def validation_step(...):
    return {'acc': 10}

# GPU 1
def validation_step(...):
    return {'acc': 20}
```
What is logged, printed, etc. is 15.0.
Thanks. That matches my understanding.
The reason I said it doesn't solve my question is that, in my case, validation_step does not return an accuracy, or even a number, but a set of captions.
What I want to achieve is to collect all the captions of the validation images, so that the input of validation_epoch_end is all the captions, and validation_epoch_end can call an external library to evaluate the results.
I think it might be fine to evaluate each part separately, but it is just safer to evaluate everything as a whole. That's why I asked the question in the first place.
For now, I think my workaround is fine.
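In case it helps others, here is a minimal sketch of that whole-set evaluation, assuming a recent PyTorch (torch.distributed.all_gather_object requires >= 1.8) and a hypothetical evaluate_captions helper standing in for the external scoring library:

```python
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # each process contributes the captions for its shard of the data
    local_captions = [cap for x in outputs for cap in x['captions']]

    # gather the python objects from every process
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_captions)

    # evaluate once, on rank 0 only, over the full validation set
    if dist.get_rank() == 0:
        all_captions = [cap for part in gathered for cap in part]
        score = evaluate_captions(all_captions)  # hypothetical external scorer
        print(f'score over {len(all_captions)} captions: {score}')
```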
@williamFalcon Now the outputs of validation_epoch_end have to be CUDA tensors. You may want to improve that?
yes! that shouldn't happen, that's an oversight
@williamFalcon
This does not seem to be there on the latest master. Has this been removed again?
it's disabled on master at the moment; we are cleaning it up and adding it back in
Now my tensors are not getting the cross-GPU syncing automagically.