Pytorch-lightning: In Multi GPU DDP, pytorch-lightning creates several tfevents files

Created on 21 Sep 2019 · 20 comments · Source: PyTorchLightning/pytorch-lightning

Describe the bug

Right now pytorch-lightning seems to create several tfevents files when training with multi-GPU DDP:
e.g. for 2 GPUs:

-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880714.google2-compute82.3156.0
-rw-rw-r--. 1 sam sam 165K Sep 19 08:22 events.out.tfevents.1568880716.google2-compute82.3186.0
-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880718.google2-compute82.3199.0

I suppose the first one is created by the main process and the next 2 are created by the 2 DDP processes (one per GPU). Unfortunately, the actual events are not logged in the last created one, and that confuses tensorboard, cf https://github.com/tensorflow/tensorboard/issues/1011

I have to restart tensorboard if I want to see the new data.


To Reproduce
Launch any training on multi GPU DDP.

Expected behavior
Only one tfevents file should be created, by the master process.

bug / fix

All 20 comments

thanks for bringing this up. this has been reported a few times already. the problem is what you described.

the solution is to init the logger from proc zero only. want to take a look at how we can approach this? @neggert is working on an abstraction that will need this fix

Are you thinking we should make sure that the test tube experiment doesn't even get initialized unless we're on process 0? Right now I have it initialize, but never log, but that's easy enough to change.

yeah, i think the best thing is to make sure it's only initialized once. This will save a ton of space in the experiment file as well

@williamFalcon Starting to take a look at this. This turns out to affect the MLFlow logger as well when doing multi-node DDP. I think what I'd like to do is make constructing the experiment / MLFlow run inside the logger lazy, so that it doesn't get created until a method that needs it is called.
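
A minimal sketch of that lazy-construction idea (LazyLoggerSketch is a hypothetical class using torch's SummaryWriter as a stand-in backend, not the real MLFlow/TestTube logger internals):

from torch.utils.tensorboard import SummaryWriter

class LazyLoggerSketch:
    """Hypothetical logger that defers creating its backend until first use."""
    def __init__(self, save_dir):
        self._save_dir = save_dir
        self._experiment = None  # nothing is created at construction time

    @property
    def experiment(self):
        # created on first access, i.e. only in processes that actually log
        if self._experiment is None:
            self._experiment = SummaryWriter(log_dir=self._save_dir)
        return self._experiment

    def log_metrics(self, metrics, step=None):
        for name, value in metrics.items():
            self.experiment.add_scalar(name, value, step)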

One consequence of this is that users shouldn't call log_hyperparams themselves, since that would happen on multiple nodes in multi-node DDP. To make up for this, we should call it for them when they train. I'm thinking we check whether they've defined model.hparams, and if so, we log it for them. In code, it looks like adding this to __run_pretrain_routine:

if hasattr(ref_model, "hparams"):
    self.logger.log_hyperparams(ref_model.hparams)

Thoughts? I guess we should document somewhere that we're expecting a hparams attribute, although I think most of the examples follow that convention already.

(Side note: the docs claim this is already done automatically, but I don't see it in the code anywhere.)

Makes sense, i think hparams is the right approach. I wonder if there's a way to automatically do it even if users don't define hparams. Maybe argparse has some sort of global state we can inspect? or look at vars in the current frame? I'd love to remove the need for users to remember to use hparams.

I'm concerned about people who don't use argparse and/or init their models using actual args.

Case 1:

MyObj(lr=0.1, ..., arg_2=0.3)

Case 2:

MyObj(hparams)

Case 3:

MyObj(hparams, lr=0.1, ...)

Yeah, I'd definitely welcome other ideas that would cover those cases. Maybe ask users to define hparams as a property if they're not doing case 2, but they still want lightning to log their parameters for them?
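
For example, a rough sketch of that property approach against the API discussed in this thread, where hparams is just an attribute Lightning reads (the class and argument names here are illustrative):

from argparse import Namespace
import pytorch_lightning as pl

class MyObj(pl.LightningModule):
    def __init__(self, lr=0.1, arg_2=0.3):
        super().__init__()
        self.lr = lr
        self.arg_2 = arg_2

    @property
    def hparams(self):
        # expose the constructor args so the trainer can log them automatically
        return Namespace(lr=self.lr, arg_2=self.arg_2)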

in 0.5.1.3 multiple tfevents files are still being created with ddp (I'm not logging any hparams if that matters), did that PR fix it? or is there more work to be done?

Are you interacting with the logger manually at all before training starts? Are you doing single-node or multi-node DDP?

single node, and the only time I manually call the logger is in optimizer_step (self.logger.log_metrics); otherwise I only return log entries in training_step and validation_end

I removed that call and I'm still getting multiple tfevents, no other calls to logging besides metrics returned by train and val steps. Currently using the experimental --reload_multifile=true in tensorboard to get around the issue.
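
(For reference, that flag goes on the TensorBoard command line, e.g. tensorboard --logdir lightning_logs --reload_multifile=true; the logdir here is just an example path.)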

is this problem solved?

I see same problem, why was this closed?

multiple tfevents files are still created, but tensorboard updates made this a non-issue for me; everything displays and updates correctly

@s-rog
Thank you for your answer.
What version of tensorboard are you using?
I'm using tensorboard-2.2.1, but when I set logdir to a folder that contains multiple tfevents, I get the following error:
E0702 08:34:19.903362 140036689770240 directory_watcher.py:262] File logs/Deepfakes/version_0/events.out.tfevents.1593678836.e5a0de705cfa.24157.0 updated even though the current file is logs/Deepfakes/version_0/events.out.tfevents.1593678841.e5a0de705cfa.24169.0

I'm also on 2.2.1, but I'm using jupyter_tensorboard within the NGC pytorch container, so I don't manually set up the logdir

@s-rog

Oh, it (https://github.com/PyTorchLightning/pytorch-lightning/issues/241#issuecomment-549722983) meant adding --reload_multifile=true to the tensorboard command line!
I solved my problem. Thank you very much.

I would like to ask another question.
Do you get no error if you don't specify a Trainer logger using pl_loggers.TensorBoardLogger()?
Unless I specify the Trainer() logger explicitly, I get an error that the tensorboard path already exists.

I don't remember what I did back then to get logging working... but currently I use TestTubeLogger and call it in trainer as logger=TestTubeLogger(".", "lightning_logs")

this logs losses/metrics and hparams in self.hparams correctly (this method logs hparams under TEXT in tensorboard)
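
Roughly, that setup looks like this (the 0.8.x-era Trainer arguments and paths here are illustrative):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TestTubeLogger

# log into ./lightning_logs using the TestTube backend
logger = TestTubeLogger(".", "lightning_logs")
trainer = Trainer(logger=logger, gpus=2, distributed_backend="ddp")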

are you guys on 0.8.4?

@awaelchli

multiple tfevents files do not mean they come from different gpus, it's just a tensorboard thing.
0.8.4 logs only on rank 0.
previously we had the problem that other ranks would log as well, which led to multiple directories (version0, version1, ...), but this is fixed now.

if you manually log things, then do this:

if self.trainer.is_global_zero:
    # your custom non-Lightning logging
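
For instance, a manual SummaryWriter guarded this way only writes from global rank 0 (a sketch; the hook, log path, and tag are just examples):

import pytorch_lightning as pl
from torch.utils.tensorboard import SummaryWriter

class MyModel(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        if self.trainer.is_global_zero:
            # only global rank 0 creates/writes the extra event file
            writer = SummaryWriter(log_dir="extra_logs")  # example path
            writer.add_scalar("custom/metric", 0.5, self.current_epoch)
            writer.close()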
