Pytorch-lightning: WandbLogger cannot be used with 'ddp'

Created on 29 Feb 2020  ·  17 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

wandb's init is written such that a call from a child process returns None if the master process has already called init. This seems to cause a bug with ddp: rank zero ends up with experiment = None, which crashes the program.

To Reproduce

Can be reproduced with the basic MNIST GPU template: simply add a WandbLogger and pass 'ddp' as the distributed backend, as in the sketch below.
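A minimal sketch of the setup (the project name is illustrative, and model stands for any LightningModule, e.g. the MNIST GPU template model; on 0.6 the logger lives under pytorch_lightning.logging rather than pytorch_lightning.loggers):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

def run_repro(model):
    # model: any LightningModule, e.g. the basic MNIST GPU template model.
    logger = WandbLogger(project='mnist-repro')  # project name is illustrative
    trainer = Trainer(gpus=2, distributed_backend='ddp', logger=logger)
    trainer.fit(model)  # rank 0 crashes: 'NoneType' object has no attribute 'config'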

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 757, in run_pretrain_routine
    self.logger.log_hyperparams(ref_model.hparams)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/base.py", line 14, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/wandb.py", line 79, in log_hyperparams
    self.experiment.config.update(params)
AttributeError: 'NoneType' object has no attribute 'config'

This occurs with the latest wandb version and with pytorch-lightning 0.6.

Labels: logger · bug / fix · help wanted

All 17 comments

Hi! Thanks for your contribution, great first issue!

Some hacky solutions: calling wandb.init with reinit=True, or this terrible hack:

# Override in your LightningModule: after the ddp connection is set up,
# make rank zero forget the run object inherited from the parent process.
def init_ddp_connection(self, *args, **kwargs):
    super().init_ddp_connection(*args, **kwargs)

    if torch.distributed.get_rank() == 0:
        import wandb
        wandb.run = None

Both of these only sort of work, and they result in multiple independent calls to wandb.init. I think the ideal solution is for the experiment to only ever be initialized on rank zero. Doing that, however, means that wandb cannot be initialized in the master thread at all.

Anything better than this probably requires changes to the wandb API.

Following up slightly: my hacky solution doesn't really work. It's easy enough, though, to get the rank-zero-only solution working. If that seems like a reasonable approach, let me know and I'll submit a PR.

Well, we have observed some issues with wandb earlier in #906. Could you check it?

Hmm, I think this is a slightly different issue (I'm running on Ubuntu, so that's not the problem). Pickling also works correctly.

I think this particular problem stems from this part of the wandb.init(...) code:

def init(...):
    ...
    # If a thread calls wandb.init() it will get the same Run object as
    # the parent. If a child process with distinct memory space calls
    # wandb.init(), it won't get an error, but it will get a result of
    # None.
    # This check ensures that a child process can safely call wandb.init()
    # after a parent has (only the parent will create the Run object).
    # This doesn't protect against the case where the parent doesn't call
    # wandb.init but two children do.
    if run or os.getenv(env.INITED):
        return run
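The effect is easy to reproduce outside of wandb. Here is a self-contained sketch that mimics the guard above; fake_init and FAKE_WANDB_INITED are made-up stand-ins, not wandb internals:

import multiprocessing as mp
import os

run = None  # module-level: a spawned child re-imports this as None

def fake_init():
    # Mimics the guard quoted above (a stand-in, not actual wandb code).
    global run
    if run or os.getenv('FAKE_WANDB_INITED'):
        return run  # in a spawned child process this is None
    run = object()
    os.environ['FAKE_WANDB_INITED'] = '1'  # inherited by child processes
    return run

def child():
    print('child got:', fake_init())  # prints: child got: None

if __name__ == '__main__':
    print('parent got:', fake_init())  # prints a real object
    p = mp.get_context('spawn').Process(target=child)
    p.start()
    p.join()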

Child processes end up getting None for the wandb run object, which causes logging to fail. There are probably two reasonable and complementary solutions:

  1. The main thread should avoid creating a wandb experiment unless absolutely necessary.

Right now, this is the only part of the logging code that the parent thread calls (I assume it's called when pickling):

def __getstate__(self):
    state = self.__dict__.copy()
    # cannot be pickled
    state['_experiment'] = None
    # args needed to reload correct experiment
    state['_id'] = self.experiment.id
    return state

If this is changed to:

def __getstate__(self):
    state = self.__dict__.copy()
    # args needed to reload correct experiment
    if self._experiment is not None:
        state['_id'] = self._experiment.id
    else:
        state['_id'] = None

    # cannot be pickled
    state['_experiment'] = None
    return state

That ensures that, unless the user explicitly logs something or creates the wandb experiment first, the main thread will not try to create an experiment. Since the subsequent logging / saving code is wrapped in the @rank_zero_only decorator, this generally solves the issue in the base case.

It's also possible that these properties are called by the master process elsewhere. Ideally they would be wrapped so that they never create the experiment unless it already exists, i.e. the experiment should only ever be created inside a function wrapped with the @rank_zero_only decorator (see the sketch after the second option below).

  2. If the main thread has created an experiment, rank zero should be passed the reinit argument.

wandb does allow you to reinitialize the experiment. I tried to play around with this a little bit and got some errors, but in theory adding this:

wandb.init(..., reinit=dist.is_available() and dist.is_initialized() and dist.get_rank() == 0)

should force a re-initialization on rank zero when wandb has already been initialized.
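To make the first option concrete, here is a minimal sketch of the lazy-creation idea (the class and attribute names are illustrative stand-ins, not the actual WandbLogger internals):

import pickle
import wandb

class LazyExperimentLogger:
    # Sketch only: create the wandb run on first real use instead of eagerly,
    # so pickling for ddp never forces wandb.init in the parent process.
    def __init__(self, **wandb_kwargs):
        self._wandb_kwargs = wandb_kwargs
        self._experiment = None

    @property
    def experiment(self):
        if self._experiment is None:
            self._experiment = wandb.init(**self._wandb_kwargs)
        return self._experiment

    def log_metrics(self, metrics, step=None):
        # In pytorch-lightning this would be guarded by @rank_zero_only, so
        # only rank zero ever triggers the lazy wandb.init above.
        self.experiment.log(metrics, step=step)

# Quick check: constructing and pickling the logger creates no run.
logger = LazyExperimentLogger(project='demo')  # kwargs are illustrative
pickle.dumps(logger)  # safe: _experiment is still None, wandb.init never ran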

@rmrao I just made a PR. It should be safe to always set reinit=True. I wasn't able to reproduce the specific environment you mentioned. Can you share some code or a Colab with me to be sure this works ok?

I think I had tried setting reinit without also fixing the other calls so that the experiment is only created when necessary. I don't think this will produce errors.

I haven't experimented with reinit much, and I'm not sure what it does in terms of creating a new run in wandb, but I don't think I have a setup that explicitly triggers an error with reinit.

It would also be good to add a test for the logger in ddp mode.

Has there been any progress on this? What is the fix to enable WandbLogger on 'ddp'?

Is there any update on this? I'm also hoping to use wandb with ddp. I've upgraded to the most recent release, 0.7.5, but the problem persists.

@vanpelt @shawnlewis

Can someone provide a small script to reproduce this? We can work on a fix and include it in the next release.

@julianmack try 0.7.3

I found a similar bug with the WandB logger, where the checkpointing callback expects the logger.name property not to be None:

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/trainer/callback_config.py#L54-L59

However, the WandbLogger sets its name and version properties to None if the experiment is not being used:

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/loggers/wandb.py#L124-L133

I worked around this issue by adding the following line at the top of my training job:

setattr(WandbLogger, 'name', property(lambda self: self._name))

Just to clarify @parasj's solution: presumably you also have to pass a name as a keyword argument when constructing the logger.

Unfortunately this means you're now responsible for generating unique names for subsequent runs.
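Concretely, the workaround plus an explicit name looks something like this sketch (the run name is illustrative, and as noted you are responsible for keeping it unique):

from pytorch_lightning.loggers import WandbLogger

# Restore a usable logger.name so the checkpoint callback never sees None.
setattr(WandbLogger, 'name', property(lambda self: self._name))

logger = WandbLogger(name='mnist-ddp-run-001')  # illustrative; must be unique per run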
