Pytorch-lightning: How to combine all the values passed from validation_step in validation_epoch_end for multi gpus case

Created on 26 Sep 2020  ·  12 Comments  ·  Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

```python
def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])

    accuracy = self.compute_accuracy(logits, batch["labels"])

    return {
        "val_loss": loss,
        "accuracy": accuracy,
    }

def validation_epoch_end(self, outputs_of_validation_steps):
    avg_loss = torch.cat([x['val_loss'].view(1) for x in outputs_of_validation_steps]).mean()
    val_accuracy = torch.cat([x['accuracy'].view(1) for x in outputs_of_validation_steps]).mean()
    print("val accuracy: ", val_accuracy)
    log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}
    return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}
```

What is your question?

What is the optimal way of combining the outputs from validation_step?
The code above doesn't work for the multi-GPU case because I'm using .view(1),
but if I remove .view(1), it fails for the single-GPU case.
Is there a way that works for both cases?

What have you tried?

What's your environment?

  • OS: [e.g. iOS, Linux, Win]
  • Packaging [e.g. pip, conda]
  • Version [e.g. 0.5.2.1]

All 12 comments

@nrjvarshney good question. We have a Results API for that.
Check this out:

```python
def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    accuracy = self.compute_accuracy(logits, batch["labels"])  # or use our metrics (metrics.classification.accuracy)!
    result = EvalResult()
    result.log("val_accuracy", accuracy, sync_dist=True)  # reduce across GPUs
    result.log("val_loss", loss, reduce_fx=torch.sum)  # or choose a custom reduction (default torch.mean)
    return result

def validation_epoch_end(self, outputs):
    # actually not needed anymore!!!
    # the logging above already reduces / aggregates the metrics at validation epoch end
```

https://pytorch-lightning.readthedocs.io/en/latest/results.html
https://pytorch-lightning.readthedocs.io/en/latest/metrics.html#accuracy-f

Thanks, @awaelchli for the quick reply.
I want to perform some computation at the end of validation, so I need the outputs (from all GPUs) in the validation_epoch_end() method.
Is there a way to do that?

What kind of outputs are these?
Is it just a metric?
Can't you do what I shared above in the example and define a custom reduce_fx?

@awaelchli, I intend to use the softmax values of the predictions for the entire validation set.
I couldn't find a complete example using the new Results API; that's why I'm a bit hesitant to use it.
It would be great if you could provide an example of a custom reduce function that takes parameters.

Can you refer me to a complete example of the Results API that covers using a callback to monitor a metric, specifying the mode, aggregating the outputs from the steps, etc.?
Also, does this API make validation_epoch_end useless?

No, validation_epoch_end is not useless; it's just only needed for custom stuff that is not handled by the result.
OK, so return your logits in the dict and flatten them with flatten(1) so that the batch dim is preserved.
Then in your epoch end, use torch.cat on dim=0; now you have one big batch and you can apply your computation on that.
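A minimal sketch of that pattern (the dict keys and the flatten(1) call are just illustrative; adapt them to whatever your forward returns):

```python
import torch

def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    # keep the batch dimension, flatten everything else
    return {"val_loss": loss, "softmax_logits": softmax_logits.flatten(1)}

def validation_epoch_end(self, outputs):
    # one big (num_val_samples, ...) tensor covering the whole validation set
    all_softmax = torch.cat([x["softmax_logits"] for x in outputs], dim=0)
    # run the custom end-of-validation computation on all_softmax here
    avg_loss = torch.stack([x["val_loss"].view(-1).mean() for x in outputs]).mean()
    return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}
```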

Anyway, I find it strange that you have
x['accuracy'].view(1)
and not view(-1)
is that a bug?

In `validation_step()` I do `softmax_logits = softmax_logits.squeeze(1)` and return it, then in `validation_epoch_end()`:
`softmax_logits = torch.cat([x['softmax_logits'] for x in outputs_of_val_steps])`
This works for a single GPU.
Will it work for multi-GPU as well?

And yes, using .view(1) instead of .view(-1) was a silly mistake.
It should solve my multi-GPU / single-GPU issue.

Yes, this looks fine.
Just to clarify, by "not working" / "it fails" you mean it gives an error, or the result is not correct?

With .view(1) and multiple GPUs it was failing because the per-step loss came back as 4 values (one per GPU), which can't be viewed as a single element,
and without .view(1) it was failing on a single GPU because the loss is a 0-d tensor, which torch.cat can't handle.
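For reference, a small sketch of the .view(-1) version: view(-1) turns both a 0-d loss (single GPU) and a per-GPU loss vector (dp) into a 1-d tensor, so torch.cat works in either case.

```python
import torch

def validation_epoch_end(self, outputs):
    # works for 0-d tensors (single GPU) and 1-d per-GPU tensors (dp) alike
    avg_loss = torch.cat([x["val_loss"].view(-1) for x in outputs]).mean()
    val_accuracy = torch.cat([x["accuracy"].view(-1) for x in outputs]).mean()
    log = {"val_loss": avg_loss, "val_accuracy": val_accuracy}
    return {"val_loss": avg_loss, "val_accuracy": val_accuracy, "log": log}
```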

Thanks a lot.
Also, do I need to specify the metric to monitor in the EvalResult()?
Then what about the mode?

There isn't a concrete example of using the results API.

The metric to monitor for what?
If you would like to monitor, for example, val_loss for checkpointing, do
EvalResult(checkpoint_on=val_loss)
The mode is "min" by default, but you can change that in the ModelCheckpoint callback.

For early stopping it is the same.
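A rough sketch of how the pieces fit together, assuming the 0.9/1.0-era EvalResult API (checkpoint_on / early_stop_on); the exact way callbacks are passed to the Trainer differs between versions, so treat the Trainer line as illustrative:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    # tell Lightning which tensors to monitor for checkpointing / early stopping
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result

# mode is "min" by default; switch to "max" if you monitor e.g. accuracy
checkpoint_cb = ModelCheckpoint(mode="min")
early_stop_cb = EarlyStopping(mode="min")
trainer = pl.Trainer(checkpoint_callback=checkpoint_cb, callbacks=[early_stop_cb])
```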

Did you make progress on this? Is it working now?

Yes, it worked smoothly.
Thanks a lot
I'll go ahead and close this issue.
