Pytorch-lightning: How to combine all the values passed from validation_step in validation_epoch_end for multi gpus case

Created on 26 Sep 2020  ·  12 Comments  ·  Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

```python
def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])

    accuracy = self.compute_accuracy(logits, batch["labels"])

    return {
        "val_loss": loss,
        "accuracy": accuracy,
    }

def validation_epoch_end(self, outputs_of_validation_steps):
    avg_loss = torch.cat([x['val_loss'].view(1) for x in outputs_of_validation_steps]).mean()
    val_accuracy = torch.cat([x['accuracy'].view(1) for x in outputs_of_validation_steps]).mean()
    print("val accuracy: ", val_accuracy)
    log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}
    return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}
```

What is your question?

What is the optimal way of combining the outputs from validation_step?
The code above doesn't work for the multi-GPU case because I'm using .view(1),
but if I remove .view(1), it fails for the single-GPU case.
Is there a way that works for both cases?

What have you tried?

What's your environment?

  • OS: [e.g. iOS, Linux, Win]
  • Packaging [e.g. pip, conda]
  • Version [e.g. 0.5.2.1]

All 12 comments

@nrjvarshney good question. We have a Results API for that.
Check this out:

```python
def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    accuracy = self.compute_accuracy(logits, batch["labels"])  # or use our metrics (metrics.classification.accuracy)!
    result = EvalResult()
    result.log("val_accuracy", accuracy, sync_dist=True)  # reduce across GPUs
    result.log("val_loss", loss, reduce_fx=torch.sum)  # or choose a custom reduction (default torch.mean)
    return result

def validation_epoch_end(self, outputs):
    # actually not needed anymore!!!
    # the logging above already reduces / aggregates the metrics at validation epoch end
```

https://pytorch-lightning.readthedocs.io/en/latest/results.html
https://pytorch-lightning.readthedocs.io/en/latest/metrics.html#accuracy-f

Thanks, @awaelchli for the quick reply.
I want to perform some computation at the end of validation, so I need the outputs (from all GPUs) in the validation_epoch_end() method.
Is there a way to do that?

What kind of outputs are these?
Is it just a metric?
Can't you do what I shared above in the example and define a custom reduce_fx?

@awaelchli, I intend to use the softmax values of the predictions for the entire validation set.
I couldn't find a complete example using the new Results API; that's why I'm a bit hesitant to use it.
It would be great if you could provide an example of a custom reduce function that takes parameters.

Can you refer me to a complete example of the Results API that covers using a callback to monitor a metric, specifying the mode, aggregating the outputs from the steps, etc.?
Also, does this API make validation_epoch_end useless?

No, validation_epoch_end is not useless; it's just only needed for custom stuff that is not handled by the result.
OK, so return your logits in the dict and flatten them with flatten(1) so that the batch dim is preserved.
Then in your epoch end, use torch.cat on dim=0; now you have one big batch and you can apply your computation on that.
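A minimal sketch of that pattern (the dict keys and the flatten(1) call are just illustrative; adapt them to whatever your forward returns):

```python
import torch

def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    # keep the batch dimension, flatten everything else
    return {"val_loss": loss, "softmax_logits": softmax_logits.flatten(1)}

def validation_epoch_end(self, outputs):
    # one big (num_val_samples, ...) tensor covering the whole validation set
    all_softmax = torch.cat([x["softmax_logits"] for x in outputs], dim=0)
    # run the custom end-of-validation computation on all_softmax here
    avg_loss = torch.stack([x["val_loss"].view(-1).mean() for x in outputs]).mean()
    return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}
```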

Anyway, I find it strange that you have
x['accuracy'].view(1)
and not view(-1)
is that a bug?

In `validation_step()` I do `softmax_logits = softmax_logits.squeeze(1)` and return it, then in `validation_epoch_end()`:
`softmax_logits = torch.cat([x['softmax_logits'] for x in outputs_of_val_steps])`
This works for a single GPU.
Will it work for multi-GPU as well?

And yes, using .view(1) instead of .view(-1) was a silly mistake.
It should solve my multi-GPU / single-GPU issue.

Yes, this looks fine.
Just to clarify, by "not working" / "it fails" you mean it gives an error, or the result is not correct?

With .view(1) and multiple GPUs it was failing because the per-step loss came back as 4 values (one per GPU), which can't be viewed as a single element,
and without .view(1) it was failing on a single GPU because the loss is a 0-d tensor, which torch.cat can't handle.
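For reference, a small sketch of the .view(-1) version: view(-1) turns both a 0-d loss (single GPU) and a per-GPU loss vector (dp) into a 1-d tensor, so torch.cat works in either case.

```python
import torch

def validation_epoch_end(self, outputs):
    # works for 0-d tensors (single GPU) and 1-d per-GPU tensors (dp) alike
    avg_loss = torch.cat([x["val_loss"].view(-1) for x in outputs]).mean()
    val_accuracy = torch.cat([x["accuracy"].view(-1) for x in outputs]).mean()
    log = {"val_loss": avg_loss, "val_accuracy": val_accuracy}
    return {"val_loss": avg_loss, "val_accuracy": val_accuracy, "log": log}
```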

Thanks a lot.
Also, do I need to specify the metric to monitor in the EvalResult()?
Then what about the mode?

There isn't a concrete example of using the results API.

The metric to monitor for what?
If you would like to monitor, for example, val_loss for checkpointing, do
EvalResult(checkpoint_on=val_loss)
The mode is "min" by default, but you can change that in the ModelCheckpoint callback.

For early stopping it is the same.
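A rough sketch of how the pieces fit together, assuming the 0.9/1.0-era EvalResult API (checkpoint_on / early_stop_on); the exact way callbacks are passed to the Trainer differs between versions, so treat the Trainer line as illustrative:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss = self.loss_function(logits, batch["labels"])
    # tell Lightning which tensors to monitor for checkpointing / early stopping
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result

# mode is "min" by default; switch to "max" if you monitor e.g. accuracy
checkpoint_cb = ModelCheckpoint(mode="min")
early_stop_cb = EarlyStopping(mode="min")
trainer = pl.Trainer(checkpoint_callback=checkpoint_cb, callbacks=[early_stop_cb])
```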

Did you make progress on this? Is it working now?

Yes, it worked smoothly.
Thanks a lot
I'll go ahead and close this issue.
