Pytorch-lightning: Comet logger cannot be pickled after creating an experiment

Created on 1 May 2020 · 8 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The Comet logger cannot be pickled after an experiment (at least an OfflineExperiment) has been created.

To Reproduce

Steps to reproduce the behavior:

1. Initialize the logger object (works fine):

from pytorch_lightning.loggers import CometLogger
import tests.base.utils as tutils
from pytorch_lightning import Trainer
import pickle

model, _ = tutils.get_default_model()
logger = CometLogger(save_dir='test')
pickle.dumps(logger)

2. Initialize a Trainer object with the logger (works fine):

trainer = Trainer(
    max_epochs=1,
    logger=logger
)
pickle.dumps(logger)
pickle.dumps(trainer)

3. Access the experiment attribute, which creates the OfflineExperiment object (fails):

logger.experiment
pickle.dumps(logger)
>> TypeError: can't pickle _thread.lock objects
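
A quick way to narrow this down is to try pickling each attribute of the logger on its own; a generic debugging sketch (not part of the original report) like the following should single out the attribute holding the live experiment:

import pickle

# Try pickling each attribute individually to find the one holding the thread lock.
for name, value in vars(logger).items():
    try:
        pickle.dumps(value)
    except Exception as err:
        print(f"unpicklable attribute {name!r} ({type(value).__name__}): {err}")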

Expected behavior

We should be able to pickle loggers for distributed training.
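
One common way to meet that expectation is for the logger to drop the live experiment handle when it is serialized and recreate it lazily in each spawned process. A minimal sketch of that pattern follows; PicklableLogger is a hypothetical class for illustration, not the actual CometLogger implementation:

class PicklableLogger:
    """Hypothetical logger illustrating the drop-and-recreate pattern."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self._experiment = None  # created lazily; the real object holds thread locks

    @property
    def experiment(self):
        if self._experiment is None:
            # Stand-in for creating the real comet_ml OfflineExperiment here.
            self._experiment = object()
        return self._experiment

    def __getstate__(self):
        # Copy the state but drop the unpicklable experiment handle;
        # each process recreates it on first access after unpickling.
        state = self.__dict__.copy()
        state["_experiment"] = None
        return state

With something like this in place, pickle.dumps(logger) succeeds even after logger.experiment has been accessed.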

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: None
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.4.0
    - pytorch-lightning: 0.7.5
    - tensorboard: 2.1.0
    - tqdm: 4.42.0
  • System:
    - OS: Darwin
    - architecture: 64bit
    - processor: i386
    - python: 3.7.6
    - version: Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64
Labels: logger, bug / fix, help wanted

All 8 comments

@ceyzaguirre4 pls ^^

I don't know if it can help or if it is the right place, but a similar error occurs when running in ddp mode with the WandB logger.

WandB uses a lambda function at some point.

Does the logger have to be pickled? Couldn't it log only on rank 0 at epoch_end? (See the sketch after the traceback below.)

Traceback (most recent call last):
  File "../train.py", line 140, in <module>
    main(args.gpus, args.nodes, args.fast_dev_run, args.mixed_precision, project_config, hparams)
  File "../train.py", line 117, in main
    trainer.fit(model)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
    return Popen(process_obj)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TorchHistory.add_log_hooks_to_pytorch_module.<locals>.<lambda>'

Also related: #1704
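
On the "log only on rank 0" idea: one way to avoid shipping the experiment object to worker processes at all is to keep every call that touches it behind a rank-zero guard. A rough sketch, assuming rank_zero_only is importable from pytorch_lightning.utilities (its exact module path has moved between versions):

from pytorch_lightning.utilities import rank_zero_only  # import path may differ by version

@rank_zero_only
def log_epoch_metrics(logger, metrics, epoch):
    # Only the rank-0 process executes this body; the other ranks return immediately,
    # so the wandb experiment never needs to cross a process boundary.
    logger.experiment.log({"epoch": epoch, **metrics})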

I had the same error as @jeremyjordan (can't pickle _thread.lock objects). This happened when I added the logger and additional callbacks in from_argparse_args, as explained here: https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html

trainer = pl.Trainer.from_argparse_args(hparams, logger=logger, callbacks=[PrinterCallback(), ])

I could make the problem go away by directly overwriting the members of the Trainer:

trainer = pl.Trainer.from_argparse_args(hparams)
trainer.logger = logger
trainer.callbacks.append(PrinterCallback())

Same issue as @F-Barto using a wandb logger across 2 nodes with ddp.

same issue when using wandb logger with ddp

Same here. @joseluisvaz, your workaround doesn't solve the callback issue: when I try to add a callback like this it is simply ignored, but adding it to the Trainer init call works normally. So I'm pretty sure the error is thrown by the logger (I'm using TensorBoard), not the callbacks.

Same issue, using the wandb logger with 8 GPUs on an AWS p2.8xlarge machine.

With CometLogger, I get this error only when the experiment name is declared. If it is not declared, I get no issue.
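
For reference, "declared" here presumably means passing the name at construction time, along these lines (my-experiment is just a placeholder value):

from pytorch_lightning.loggers import CometLogger

# Per the comment above, declaring a name triggers the pickling error...
logger = CometLogger(save_dir='test', experiment_name='my-experiment')

# ...while omitting experiment_name reportedly avoids it.
logger = CometLogger(save_dir='test')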
