The Comet logger cannot be pickled after an experiment (at least an OfflineExperiment) has been created.
Steps to reproduce the behavior:
1. Initialize the logger object (works fine):
from pytorch_lightning.loggers import CometLogger
import tests.base.utils as tutils
from pytorch_lightning import Trainer
import pickle
model, _ = tutils.get_default_model()
logger = CometLogger(save_dir='test')
pickle.dumps(logger)
2. Initialize a Trainer object with the logger (works fine):
trainer = Trainer(
    max_epochs=1,
    logger=logger,
)
pickle.dumps(logger)
pickle.dumps(trainer)
3. Access the experiment attribute, which creates the OfflineExperiment object (fails):
logger.experiment
pickle.dumps(logger)
>> TypeError: can't pickle _thread.lock objects
We should be able to pickle loggers for distributed training.
@ceyzaguirre4 pls ^^
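Not a fix, but one way the logger could survive pickling is to drop the live experiment handle on serialization and recreate it lazily afterwards. A minimal sketch, assuming the experiment is cached in a private _experiment attribute (an assumption about the logger's internals, not confirmed against the source):

from pytorch_lightning.loggers import CometLogger

class PicklableCometLogger(CometLogger):
    def __getstate__(self):
        state = self.__dict__.copy()
        # the (Offline)Experiment holds thread locks and can't be pickled;
        # drop it and let the .experiment property recreate it after unpickling
        state['_experiment'] = None  # assumed internal attribute name
        return state

In the spawned process, the first access to logger.experiment would then create a fresh experiment instead of failing on a stale one.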
I don't know if it can help or if this is the right place, but a similar error occurs when running in ddp mode with the WandB logger.
WandB uses a lambda function at some point.
Does the logger have to be pickled? Couldn't it log only on rank 0 at epoch_end?
Traceback (most recent call last):
File "../train.py", line 140, in <module>
main(args.gpus, args.nodes, args.fast_dev_run, args.mixed_precision, project_config, hparams)
File "../train.py", line 117, in main
trainer.fit(model)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
process.start()
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TorchHistory.add_log_hooks_to_pytorch_module.<locals>.<lambda>'
also related: #1704
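Regarding the "log only on rank 0" suggestion above: this doesn't make the logger picklable, but logging calls can be guarded so that only the main process ever touches the experiment object. A minimal sketch, assuming rank_zero_only can be imported from pytorch_lightning.utilities (the exact module has moved between versions):

from pytorch_lightning.utilities import rank_zero_only  # import path may differ by version

@rank_zero_only
def log_metrics_main_process(logger, metrics, step=None):
    # only the rank-0 process executes this body, so the experiment
    # object never needs to travel into the spawned ddp workers
    logger.log_metrics(metrics, step=step)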
I had the same error as @jeremyjordan: can't pickle _thread.lock objects. This happened when I added the logger and additional callbacks in from_argparse_args, as explained here: https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html
trainer = pl.Trainer.from_argparse_args(hparams, logger=logger, callbacks=[PrinterCallback()])
I could make the problem go away by directly overwriting the members of Trainer:
trainer = pl.Trainer.from_argparse_args(hparams)
trainer.logger = logger
trainer.callbacks.append(PrinterCallback())
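For anyone trying to pin down which attribute is actually unpicklable, a throwaway helper like this (hypothetical, not part of Lightning) narrows it down quickly:

import pickle

def find_unpicklable_attrs(obj):
    # try to pickle each attribute separately and report the offenders
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f'{name}: {type(err).__name__}: {err}')

# e.g. find_unpicklable_attrs(trainer) or find_unpicklable_attrs(logger)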
Same issue as @F-Barto using a wandb logger across 2 nodes with ddp.
Same issue when using the wandb logger with ddp.
Same here. @joseluisvaz your workaround doesn't solve the callback issue: when I try to add a callback like this it is simply ignored :/ but adding it in the Trainer init call works normally, so I'm pretty sure the error is thrown by the logger (I'm using TB), not the callbacks.
Same issue, using the wandb logger with 8 GPUs on an AWS p2.8xlarge machine.
With CometLogger, I get this error only when an experiment name is declared. If it is not declared, there is no issue.
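For reference, the difference seems to be that passing experiment_name forces the experiment object to be created eagerly at construction time (my reading of the behavior, not verified against the source):

from pytorch_lightning.loggers import CometLogger
import pickle

logger = CometLogger(save_dir='test')
pickle.dumps(logger)  # fine: no experiment object has been created yet

named = CometLogger(save_dir='test', experiment_name='my-run')
pickle.dumps(named)   # presumably fails: setting the name touches logger.experiment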