Pytorch-lightning: TrainResult not writing values

Created on 26 Aug 2020 · 13 comments · Source: PyTorchLightning/pytorch-lightning

I'm converting a project to pytorch-lightning and I'm having problems logging the loss in training_step(...). Neither the old dict-based reporting nor the new TrainResult class writes anything to the event file. I can only report the values by calling self.logger.experiment.add_scalar('loss/train', loss.item(), self.global_step) directly. I tested the exact same conda environment with one of the examples in the docs and it works fine. I suspected it had something to do with tensors being on different GPU devices, but even running on the CPU yields the same outcome.

This is the code; x, y, z, o, and h are all on the CPU, and loss is tensor(0.2871, grad_fn=<L1LossBackward>).

    def training_step(self, batch, batch_nb):
        x, y, z, corrupt_locs = batch
        o, h = self.forward(x, y, z)

        loss = self.L1_loss(o, x)
        result = pl.TrainResult(loss, early_stop_on=loss, checkpoint_on=loss)

        # manual workaround: writes directly to the TensorBoard SummaryWriter
        self.logger.experiment.add_scalar('loss/train', loss.item(), self.global_step)
        # expected to end up in the event file, but nothing is written
        result.log('myloss', loss)
        return result

What could be the issue?

Label: question

All 13 comments

Hi! Thanks for your contribution, great first issue!

This is super weird if it works with the docs example and not in your code. Can you mention which example you used from the docs?

It's the LitClassifier on your main page: https://github.com/PyTorchLightning/pytorch-lightning. Could it be something with the devices or the computational graph? Why doesn't TrainResult also accept simple scalars instead of requiring tensors with gradients?

@fperazzi Try adding row_log_interval=5 or some other small value to pl.Trainer.
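
For example, a minimal sketch (the model variable is a placeholder and the remaining Trainer arguments are left at their defaults):

    import pytorch_lightning as pl

    # write rows to the logger every 5 steps instead of the default interval,
    # so short runs still get entries in the event file
    trainer = pl.Trainer(row_log_interval=5)
    trainer.fit(model)  # model: any LightningModule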

I also have the same problem. Setting row_log_interval=1 and log_save_interval=1 does not fix this (for my project).
Logging worked fine in version 0.7.

The workaround shown above does work, however.

@ykoneee, thanks! Setting row_log_interval=1 fixed it for me.

@elPistolero, by the workaround shown above, do you mean row_log_interval=5? I'd like to track this just to make sure whether this is a bug or not.

Can I use TrainResult to log extra scalar values that do not have gradients attached?

@fperazzi Yes! It would be a bug if you couldn't.
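
A minimal sketch of that, building on the training_step above (assumes torch is imported; the mean-absolute-error metric is an assumed example, not from the original code):

    def training_step(self, batch, batch_nb):
        x, y, z, corrupt_locs = batch
        o, h = self.forward(x, y, z)

        loss = self.L1_loss(o, x)
        result = pl.TrainResult(loss, early_stop_on=loss, checkpoint_on=loss)
        result.log('myloss', loss)

        # an extra scalar with no gradients attached
        with torch.no_grad():
            mae = (o - x).abs().mean()
        result.log('mae/train', mae)

        return result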

> @elPistolero, by the workaround shown above, do you mean row_log_interval=5? I'd like to track this just to make sure whether this is a bug or not.

Apologies for my previous statement. Indeed, setting row_log_interval to a smaller value works as expected. The workaround is not necessary.

Is there any specific way to flush log data?

@fperazzi closing this if there are no further issues!

I had the same problem. Logging with EvalResult works, but TrainResult does not, unless I specify row_log_interval=5. It is weird to me that logging does not work with the default Trainer settings; can someone explain why this happens and why row_log_interval=5 fixes it?
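
For comparison, a rough sketch of the two paths (compute_loss is an assumed helper; the comments describe the 0.9 logging defaults as I understand them, not confirmed behaviour):

    def training_step(self, batch, batch_nb):
        loss = self.compute_loss(batch)  # assumed helper
        result = pl.TrainResult(loss)
        # logged per step, so rows only appear every row_log_interval steps
        result.log('loss/train', loss)
        return result

    def validation_step(self, batch, batch_nb):
        loss = self.compute_loss(batch)  # assumed helper
        result = pl.EvalResult(checkpoint_on=loss)
        # aggregated over the epoch and written once at epoch end
        result.log('loss/val', loss)
        return result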
