Pytorch-lightning: DDP cannot start due to pickle problem

Created on 27 Apr 2020  ·  5 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

DDP cannot start; it fails with the following error. This started after I upgraded from 0.7.1 to 0.7.5.

Traceback (most recent call last):                                                                                        
  File "train.py", line 365, in <module>                                                                                  
    fire.Fire(train)                                                                                                      
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/site-packages/fire/core.py", line 138, in Fire                       
    component_trace = _Fire(component, args, parsed_flag_args, context, name)                                             
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/site-packages/fire/core.py", line 468, in _Fire                      
    target=component.__name__)                                                                                            
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace        
    component = fn(*varargs, **kwargs)                                                                                    
  File "train.py", line 348, in train                                                                                     
    trainer.fit(model)                                                                                                    
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))                                                    
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn    
    process.start()                                                                                                       
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/process.py", line 112, in start                      
    self._popen = self._Popen(self)                                                                                       
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/context.py", line 284, in _Popen                     
    return Popen(process_obj)                                                                                             
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__          
    super().__init__(process_obj)                                                                                         
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__                 
    self._launch(process_obj)                                                                                             
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch           
    reduction.dump(process_obj, fp)                                                                                       
  File "/home/jacobz/.conda/envs/lidar/lib/python3.7/multiprocessing/reduction.py", line 60, in dump                      
    ForkingPickler(file, protocol).dump(obj)                                                                              
TypeError: cannot serialize '_io.TextIOWrapper' object   

To Reproduce

Sorry I don't have a short example to reproduce this yet.

Environment

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           10.1
* Packages:
        - numpy:             1.18.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.4.0
        - pytorch-lightning: 0.7.5
        - tensorboard:       2.2.0
        - tqdm:              4.45.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.7.7
        - version:           #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020

Labels: bug / fix, help wanted

All 5 comments

I found a related issue, https://github.com/hyperopt/hyperopt-sklearn/issues/74, but I'm sure there's no logger in my module.

Also, if I try to pickle my model directly:

import pickle
pickle.dumps(model)

No error is raised.

@cmpute try pickling the Trainer; that's what usually fails. See #1628 for a similar error I debugged yesterday. It is probably something custom (or just non-default!) that you pass to the Trainer, i.e. in args.
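
One way to act on this suggestion is to pickle each attribute separately and see which one fails. The sketch below is a generic helper, not part of pytorch-lightning: `find_unpicklable` and `FakeTrainer` are hypothetical names used only for illustration. With a real run you would pass your Trainer object (or anything you hand to it) instead of the stand-in.

```python
import os
import pickle


def find_unpicklable(obj):
    """Return {attribute_name: error_message} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle may raise TypeError, PicklingError, AttributeError, ...
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class FakeTrainer:
    """Hypothetical stand-in for a Trainer that holds an open text file."""

    def __init__(self, log_path):
        self.max_epochs = 10
        self.log_file = open(log_path, "w")  # an _io.TextIOWrapper, which cannot be pickled


trainer = FakeTrainer(os.devnull)
for attr, error in find_unpicklable(trainer).items():
    print(f"{attr}: {error}")
trainer.log_file.close()
```

This only inspects top-level attributes. If the reported attribute is itself a large object, calling the helper again on that attribute narrows things down further.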

I haven't hit this problem for a while, so it may have been fixed by the latest commits. I'll close it for now.

I'm also experiencing this now. Not fixed yet!

Same issue here, please reopen.

pl.__version__='0.9.0rc1'

In my case, it happens when I pass the output_filename parameter to pytorch_lightning.profiler.SimpleProfiler and run in ddp_spawn mode.
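
This would be consistent with the profiler holding an open file handle once an output file is given, although that is an assumption and not something confirmed in this thread. As a rough sketch of the general workaround for objects that must survive spawn-style pickling, the class below stores only the path, opens the file lazily, and drops the handle in `__getstate__`. `PicklableFileWriter` is a hypothetical helper, not a pytorch-lightning API.

```python
import os
import pickle


class PicklableFileWriter:
    """Hypothetical helper: keeps a path and opens the file on first write,
    so the object can be pickled and sent to a spawned process."""

    def __init__(self, path):
        self.path = path
        self._file = None  # opened lazily, in whichever process writes first

    def write(self, text):
        if self._file is None:
            self._file = open(self.path, "a")
        self._file.write(text)
        self._file.flush()

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __getstate__(self):
        # Drop the open handle; the unpickled copy reopens the file on demand.
        state = self.__dict__.copy()
        state["_file"] = None
        return state


writer = PicklableFileWriter(os.devnull)
writer.write("profiling started\n")          # the handle is open in this process
clone = pickle.loads(pickle.dumps(writer))   # no TypeError, the handle is not pickled
clone.write("written from the copy\n")
clone.close()
writer.close()
```

Opening in append mode is a deliberate choice here, so that several processes writing to the same path do not truncate each other's output.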
