Pytorch-lightning: In Multi GPU DDP, pytorch-lightning creates several tfevents files

Created on 21 Sep 2019 · 20 comments · Source: PyTorchLightning/pytorch-lightning

Describe the bug

Right now pytorch-lightning seems to create several tfevents files when training with multi-GPU DDP:
e.g. for 2 GPUs:

-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880714.google2-compute82.3156.0
-rw-rw-r--. 1 sam sam 165K Sep 19 08:22 events.out.tfevents.1568880716.google2-compute82.3186.0
-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880718.google2-compute82.3199.0

I suppose the first one is created by the main process and the next 2 are created by the 2 DDP processes (one per GPU). Unfortunately, the actual events are not logged in the last created one, and that confuses tensorboard, cf https://github.com/tensorflow/tensorboard/issues/1011

I have to restart tensorboard if I want to see the new data.


To Reproduce
Launch any training on multi GPU DDP.

Expected behavior
Only one tfevents file should be created, by the master process.

bug / fix

All 20 comments

thanks for bringing this up. this has been reported a few times already. the problem is what you described.

the solution is to init the logger from proc zero only. want to take a look at how we can approach this? @neggert is working on an abstraction that will need this fix

Are you thinking we should make sure that the test tube experiment doesn't even get initialized unless we're on process 0? Right now I have it initialize, but never log, but that's easy enough to change.

yeah, i think the best thing is to make sure it's only initialized once. This will save a ton of space in the experiment file as well

@williamFalcon Starting to take a look at this. This turns out to affect the MLFlow logger as well when doing multi-node DDP. I think what I'd like to do is make constructing the experiment / MLFlow run inside the logger lazy, so that it doesn't get created until a method that needs it is called.
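
A minimal sketch of that lazy-construction idea (LazyLoggerSketch is a hypothetical class using torch's SummaryWriter as a stand-in backend, not the real MLFlow/TestTube logger internals):

from torch.utils.tensorboard import SummaryWriter

class LazyLoggerSketch:
    """Hypothetical logger that defers creating its backend until first use."""
    def __init__(self, save_dir):
        self._save_dir = save_dir
        self._experiment = None  # nothing is created at construction time

    @property
    def experiment(self):
        # created on first access, i.e. only in processes that actually log
        if self._experiment is None:
            self._experiment = SummaryWriter(log_dir=self._save_dir)
        return self._experiment

    def log_metrics(self, metrics, step=None):
        for name, value in metrics.items():
            self.experiment.add_scalar(name, value, step)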

One consequence of this is that users shouldn't call log_hyperparams themselves, since that would happen on multiple nodes in multi-node DDP. To make up for this, we should call it for them when they train. I'm thinking we check whether they've defined model.hparams, and if so, we log it for them. In code, it looks like adding this to __run_pretrain_routine:

if hasattr(ref_model, "hparams"):
    self.logger.log_hyperparams(ref_model.hparams)

Thoughts? I guess we should document somewhere that we're expecting a hparams attribute, although I think most of the examples follow that convention already.

(Side note: the docs claim this is already done automatically, but I don't see it in the code anywhere.)

Makes sense, i think hparams is the right approach. I wonder if there's a way to automatically do it even if users don't define hparams. Maybe argparse has some sort of global state we can inspect? or look at vars in the current frame? I'd love to remove the need for users to remember to use hparams.

I'm concerned about people who don't use argparse and/or init their models using actual args.

Case 1:

MyObj(lr=0.1, ..., arg_2=0.3)

Case 2:

MyObj(hparams)

Case 3:

MyObj(hparams, lr=0.1, ...)

Yeah, I'd definitely welcome other ideas that would cover those cases. Maybe ask users to define hparams as a property if they're not doing case 2, but they still want lightning to log their parameters for them?
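
For example, a rough sketch of that property approach against the API discussed in this thread, where hparams is just an attribute Lightning reads (the class and argument names here are illustrative):

from argparse import Namespace
import pytorch_lightning as pl

class MyObj(pl.LightningModule):
    def __init__(self, lr=0.1, arg_2=0.3):
        super().__init__()
        self.lr = lr
        self.arg_2 = arg_2

    @property
    def hparams(self):
        # expose the constructor args so the trainer can log them automatically
        return Namespace(lr=self.lr, arg_2=self.arg_2)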

in 0.5.1.3 multiple tfevents files are still being created with ddp (I'm not logging any hparams if that matters), did that PR fix it? or is there more work to be done?

Are you interacting with the logger manually at all before training starts? Are you doing single-node or multi-node DDP?

single node, and the only time I manually call the logger is in optimizer_step (self.logger.log_metrics); otherwise I only return log entries in training_step and validation_end

I removed that call and I'm still getting multiple tfevents, no other calls to logging besides metrics returned by train and val steps. Currently using the experimental --reload_multifile=true in tensorboard to get around the issue.
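
(For reference, that flag goes on the TensorBoard command line, e.g. tensorboard --logdir lightning_logs --reload_multifile=true; the logdir here is just an example path.)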

is this problem solved?

I see same problem, why was this closed?

multiple tfevents files are still created, but tensorboard updates made this a non-issue for me; everything displays and updates correctly

@s-rog
Thank you for your answer.
What version of tensorboard are you using?
I'm using tensorboard-2.2.1, but when I set logdir to a folder that contains multiple tfevents, I get the following error:
E0702 08:34:19.903362 140036689770240 directory_watcher.py:262] File logs/Deepfakes/version_0/events.out.tfevents.1593678836.e5a0de705cfa.24157.0 updated even though the current file is logs/Deepfakes/version_0/events.out.tfevents.1593678841.e5a0de705cfa.24169.0

I'm also on 2.2.1, but I'm using jupyter_tensorboard within the NGC pytorch container, so I don't manually set up the logdir

@s-rog

Oh, it (https://github.com/PyTorchLightning/pytorch-lightning/issues/241#issuecomment-549722983) meant adding --reload_multifile=true to the tensorboard command line!
I solved my problem. Thank you very much.

I would like to ask another question.
Do you get no error if you don't specify a Trainer logger using pl_loggers.TensorBoardLogger()?
Unless I specify the Trainer() logger explicitly, I get an error that the tensorboard path already exists.

I don't remember what I did back then to get logging working... but currently I use TestTubeLogger and call it in trainer as logger=TestTubeLogger(".", "lightning_logs")

this logs losses/metrics and hparams in self.hparams correctly (this method logs hparams under TEXT in tensorboard)
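
Roughly, that setup looks like this (the 0.8.x-era Trainer arguments and paths here are illustrative):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TestTubeLogger

# log into ./lightning_logs using the TestTube backend
logger = TestTubeLogger(".", "lightning_logs")
trainer = Trainer(logger=logger, gpus=2, distributed_backend="ddp")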

are you guys on 0.8.4?

@awaelchli

multiple tfevents files do not mean they come from different gpus, it's just a tensorboard thing.
0.8.4 logs only on rank 0.
previously we had the problem that other ranks would log as well, which led to multiple directories (version0, version1, ...), but this is fixed now.

if you manually log things, then do this:

if self.trainer.is_global_zero:
    # your custom non-Lightning logging
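
For instance, a manual SummaryWriter guarded this way only writes from global rank 0 (a sketch; the hook, log path, and tag are just examples):

import pytorch_lightning as pl
from torch.utils.tensorboard import SummaryWriter

class MyModel(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        if self.trainer.is_global_zero:
            # only global rank 0 creates/writes the extra event file
            writer = SummaryWriter(log_dir="extra_logs")  # example path
            writer.add_scalar("custom/metric", 0.5, self.current_epoch)
            writer.close()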
