The Comet logger cannot be pickled after an experiment (at least an OfflineExperiment) has been created.
Steps to reproduce the behavior:
1. Initialize the logger object (works fine):
from pytorch_lightning.loggers import CometLogger
import tests.base.utils as tutils
from pytorch_lightning import Trainer
import pickle
model, _ = tutils.get_default_model()
logger = CometLogger(save_dir='test')
pickle.dumps(logger)
2. Initialize a Trainer object with the logger (works fine):
trainer = Trainer(
    max_epochs=1,
    logger=logger,
)
pickle.dumps(logger)
pickle.dumps(trainer)
3. Access the experiment attribute, which creates the OfflineExperiment object (fails):
logger.experiment
pickle.dumps(logger)
>> TypeError: can't pickle _thread.lock objects
We should be able to pickle loggers for distributed training.
@ceyzaguirre4 pls ^^
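Not a fix, but one way the logger could survive pickling is to drop the live experiment handle on serialization and recreate it lazily afterwards. A minimal sketch, assuming the experiment is cached in a private _experiment attribute (an assumption about the logger's internals, not confirmed against the source):

from pytorch_lightning.loggers import CometLogger

class PicklableCometLogger(CometLogger):
    def __getstate__(self):
        state = self.__dict__.copy()
        # the (Offline)Experiment holds thread locks and can't be pickled;
        # drop it and let the .experiment property recreate it after unpickling
        state['_experiment'] = None  # assumed internal attribute name
        return state

In the spawned process, the first access to logger.experiment would then create a fresh experiment instead of failing on a stale one.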
I don't know if it can help or if this is the right place, but a similar error occurs when running in ddp mode with the WandB logger.
WandB uses a lambda function at some point.
Does the logger have to be pickled? Couldn't it log only on rank 0 at epoch_end?
Traceback (most recent call last):
File "../train.py", line 140, in <module>
main(args.gpus, args.nodes, args.fast_dev_run, args.mixed_precision, project_config, hparams)
File "../train.py", line 117, in main
trainer.fit(model)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
process.start()
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/clear/fbartocc/miniconda3/envs/Depth_env/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TorchHistory.add_log_hooks_to_pytorch_module.<locals>.<lambda>'
also related: #1704
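Regarding the "log only on rank 0" suggestion above: this doesn't make the logger picklable, but logging calls can be guarded so that only the main process ever touches the experiment object. A minimal sketch, assuming rank_zero_only can be imported from pytorch_lightning.utilities (the exact module has moved between versions):

from pytorch_lightning.utilities import rank_zero_only  # import path may differ by version

@rank_zero_only
def log_metrics_main_process(logger, metrics, step=None):
    # only the rank-0 process executes this body, so the experiment
    # object never needs to travel into the spawned ddp workers
    logger.log_metrics(metrics, step=step)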
I had the same error as @jeremyjordan: can't pickle _thread.lock objects. This happened when I added the logger and additional callbacks in from_argparse_args, as explained here: https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html
trainer = pl.Trainer.from_argparse_args(hparams, logger=logger, callbacks=[PrinterCallback()])
I could make the problem go away by directly overwriting the members of Trainer:
trainer = pl.Trainer.from_argparse_args(hparams)
trainer.logger = logger
trainer.callbacks.append(PrinterCallback())
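For anyone trying to pin down which attribute is actually unpicklable, a throwaway helper like this (hypothetical, not part of Lightning) narrows it down quickly:

import pickle

def find_unpicklable_attrs(obj):
    # try to pickle each attribute separately and report the offenders
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f'{name}: {type(err).__name__}: {err}')

# e.g. find_unpicklable_attrs(trainer) or find_unpicklable_attrs(logger)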
Same issue as @F-Barto using a wandb logger across 2 nodes with ddp.
Same issue when using the wandb logger with ddp.
Same here. @joseluisvaz your workaround doesn't solve the callback issue: when I try to add a callback like this it is simply ignored :/ but adding it in the Trainer init call works normally, so I'm pretty sure the error is thrown by the logger (I'm using TB), not the callbacks.
Same issue, using the wandb logger with 8 GPUs on an AWS p2.8xlarge machine.
With CometLogger, I get this error only when an experiment name is declared. If it is not declared, there is no issue.
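For reference, the difference seems to be that passing experiment_name forces the experiment object to be created eagerly at construction time (my reading of the behavior, not verified against the source):

from pytorch_lightning.loggers import CometLogger
import pickle

logger = CometLogger(save_dir='test')
pickle.dumps(logger)  # fine: no experiment object has been created yet

named = CometLogger(save_dir='test', experiment_name='my-run')
pickle.dumps(named)   # presumably fails: setting the name touches logger.experiment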