Hi,
I need the whole validation set's outputs to compute the validation result.
Currently, validation_epoch_end only gets the outputs from the current process.
Can I gather the outputs from the different GPUs and then run validation_epoch_end? Also, I don't necessarily need it to run on all processes; I only need it to run once.
How can I achieve that?
I found an ugly solution.
I use detectron2.utils.comm.all_gather and gather to achieve what I want.
But it would be better if something built into Lightning worked for this purpose.
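A minimal sketch of that workaround, assuming detectron2 is installed (comm.all_gather gathers arbitrary picklable objects across ranks, and comm.is_main_process is detectron2's rank-0 check):

```python
from detectron2.utils import comm

def validation_epoch_end(self, outputs):
    # gather each process's output list; returns one entry per rank
    all_outputs = comm.all_gather(outputs)
    if not comm.is_main_process():
        return {}
    # flatten into one list covering the whole validation set
    outputs = [x for rank_outputs in all_outputs for x in rank_outputs]
    # ... run the full-set evaluation on rank 0 only ...
```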
I have something like this in my validation loop:

```python
def validation_epoch_end(self, outputs):
    loss_val = torch.stack([x['val_loss'] for x in outputs]).mean()
    log_dict = {'validation_loss': loss_val, 'step': self.current_epoch}
    return {'log': log_dict, 'val_loss': log_dict['validation_loss'], 'progress_bar': log_dict}
```
but this takes only one process into account. Is there a way to aggregate all the batches when using DDP? Is using dist.all_gather the way to do it?
I also see some references to training_step_end and validation_step_end. Are those something we can use for this? I haven't found many examples of them. Could someone be so kind as to post an example of how one might use these to collect the data from all batches?
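For tensor outputs, here is a minimal sketch of the dist.all_gather approach, assuming every process produces a same-shaped scalar loss and a process group is initialized:

```python
import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # per-process mean over the batches this process saw
    loss_val = torch.stack([x['val_loss'] for x in outputs]).mean()

    # gather the per-process means into a list, one tensor per rank
    gathered = [torch.zeros_like(loss_val) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, loss_val)

    # every process now computes the same global mean
    loss_val = torch.stack(gathered).mean()
    return {'val_loss': loss_val, 'log': {'validation_loss': loss_val}}
```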
dist.all_gather only supports torch tensors, I believe. detectron2's all_gather can gather anything. It works well for me now.
I see a few other posts that have been closed, so I am assuming this has somehow been sorted and is possible. I also opened another issue about this: https://github.com/PyTorchLightning/pytorch-lightning/issues/2435
Hope someone can help :)
Currently I use this to gather the other processes' outputs:

```python
def test_epoch_end(self, outputs):
    # average over the batches seen by this process
    avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
    avg_acc = torch.stack([x['test_acc'] for x in outputs]).mean()
    # sum across processes, then divide by world size to get the global mean
    dist.all_reduce(avg_acc, op=dist.ReduceOp.SUM)
    avg_acc = avg_acc / self.trainer.world_size
```
Now the new TensorMetric class can do dist_all_reduce automatically, but it only supports dist.ReduceOp, so you have to divide by self.trainer.world_size yourself if you want the MEAN. If you have dict outputs, you can iterate over all the values and use dist_all_reduce on each, as in the sketch below.
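A minimal sketch of that dict reduction, assuming all the values are CUDA tensors and a process group is initialized:

```python
import torch.distributed as dist

def reduce_dict_mean(metrics, world_size):
    # all-reduce every tensor value in-place, then divide to get the mean
    for key, value in metrics.items():
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        metrics[key] = value / world_size
    return metrics
```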
this is fixed on master
@williamFalcon
I don't think it fixes my original question. I asked whether there is a way for validation_epoch_end to gather the outputs from all the GPUs and then do the evaluation.
The reason is that, in my case, the output of the validation set has to be evaluated as a whole, so I can't simply calculate the score in each process and all-reduce the results.
this is fixed on master
@williamFalcon
When you say fixed on master, what do you mean? Will it do this automatically, or do we need to do something ourselves? Can you point us to the relevant bit in the Lightning source code?
@dagap @ruotianluo install the latest version!
pip install pytorch-lightning==0.8.4 --upgrade
@williamFalcon Thanks for that. So we do not need to do anything on the caller's end? It will do this automagically in validation/test modes?
yup!
We handle all the cross GPU syncing automagically
@williamFalcon I am still confused.
What is the expected behavior? Is there any doc related to this?
I guess a concrete example would be helpful. If I have 2 GPUs with DDP and the validation set has 500 batches, each GPU gets 250 batches.
What is the size of the input argument outputs for validation_epoch_end?
The behavior is transparent to you: if you calculate accuracy on 2 GPUs, we average the accuracy across the GPUs for you.
```python
# GPU 0
def validation_step(...):
    return {'acc': 10}

# GPU 1
def validation_step(...):
    return {'acc': 20}
```
What is logged, printed, etc. is 15.0.
Thanks. That matches my understanding.
The reason I said it doesn't solve my question is that, in my case, validation_step does not return an accuracy, or even a number, but a set of captions.
What I want to achieve is to collect all the captions of the validation images, so that the input of validation_epoch_end is all the captions, and validation_epoch_end can call an external library to evaluate the results.
I think it might be fine to evaluate each part separately, but it is just safer to evaluate everything as a whole. That's why I asked the question in the first place.
For now, I think my workaround is fine.
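In case it helps others, here is a minimal sketch of that whole-set evaluation, assuming a recent PyTorch (torch.distributed.all_gather_object requires >= 1.8) and a hypothetical evaluate_captions helper standing in for the external scoring library:

```python
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # each process contributes the captions for its shard of the data
    local_captions = [cap for x in outputs for cap in x['captions']]

    # gather the python objects from every process
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_captions)

    # evaluate once, on rank 0 only, over the full validation set
    if dist.get_rank() == 0:
        all_captions = [cap for part in gathered for cap in part]
        score = evaluate_captions(all_captions)  # hypothetical external scorer
        print(f'score over {len(all_captions)} captions: {score}')
```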
@williamFalcon Now the outputs of validation_epoch_end have to be CUDA tensors. You may want to improve that?
yes! that shouldn't happen, that's an oversight
@williamFalcon
This does not seem to be there on the latest master. Has this been removed again?
it's disabled on master at the moment; we are cleaning it up and adding it back in
Now my tensors are not getting the cross-GPU syncing automagically.