🐛 Bug
I switched to the master branch to test the bugfix for #1919, but the same code that ran on the stable version 0.7.6 no longer runs.
Simply switching to the current master branch may be enough to reproduce the issue.
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
File "/home/mycode/src/vae.py", line 457, in <module>
main()
File "/home/mycode/src/vae.py", line 450, in main
run_model(hparams)
File "/home/mycode/src/vae.py", line 386, in run_model
trainer.fit(vae)
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 894, in fit
self.single_gpu_train(model)
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 502, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in run_pretrain_routine
self.logger.save()
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 10, in wrapped_fn
return fn(*args, **kwargs)
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 161, in save
save_hparams_to_yaml(hparams_file, self.hparams)
File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 151, in save_hparams_to_yaml
yaml.dump(hparams, fp)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/__init__.py", line 290, in dump
return dump_all([data], stream, Dumper=Dumper, **kwds)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/__init__.py", line 278, in dump_all
dumper.represent(data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 27, in represent
node = self.represent_data(data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 48, in represent_data
node = self.yaml_representers[data_types[0]](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 207, in represent_dict
return self.represent_mapping('tag:yaml.org,2002:map', data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 48, in represent_data
node = self.yaml_representers[data_types[0]](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 207, in represent_dict
return self.represent_mapping('tag:yaml.org,2002:map', data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
return self.represent_mapping(
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
node_value = self.represent_data(item_value)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
node = self.yaml_multi_representers[data_type](self, data)
File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 317, in represent_object
reduce = data.__reduce_ex__(2)
TypeError: cannot pickle '_thread.lock' object
Error in atexit._run_exitfuncs:
TypeError: run_training_teardown() missing 1 required positional argument: 'self'
I expected it to run exactly as it does on version 0.7.6.
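The last frame of the traceback fails inside `data.__reduce_ex__(2)`, which suggests that something reachable from the hparams holds a thread lock. Which object that is cannot be read from the log, so the following is only a minimal standard-library sketch of the final failing call, applied to a bare lock:

```python
import threading

# Assumption: some object nested inside the hparams owns a lock like this one.
lock = threading.Lock()

try:
    # Same call as the last traceback frame in yaml/representer.py.
    lock.__reduce_ex__(2)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

If this is the cause, removing the lock-holding object from the hparams (or converting it to plain values) before logging should let the YAML dump succeed.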
* CUDA:
- GPU:
- GeForce GTX 1050 Ti with Max-Q Design
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.3
- pyTorch_debug: False
- pyTorch_version: 1.5.0
- pytorch-lightning: 0.7.7-dev
- tensorboard: 2.2.1
- tqdm: 4.46.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.2
Error in atexit._run_exitfuncs:
TypeError: run_training_teardown() missing 1 required positional argument: 'self'
conda, pip, source): pip
The same error is present while running the collect_env_details.py script.
@justusschock ^^
Probably this line here
https://github.com/PyTorchLightning/pytorch-lightning/blob/ceecf1cea92dc2d8c29b1364237ac9467abf2f9b/pytorch_lightning/trainer/training_loop.py#L309
should be
self.run_training_teardown()
Not sure
Investigated.
It is because the atexit.register decorator can only be applied to functions, not methods.
The self argument is not passed in.
This error should have shown up in the tests.
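A minimal sketch of the failure mode described above, using a hypothetical `Trainer` class rather than the real Lightning one. The decorator runs while the class body is being evaluated, so `atexit` stores the plain function and later calls it with no arguments:

```python
import atexit


class Trainer:  # hypothetical stand-in, not the Lightning Trainer
    # atexit.register receives the undecorated function; no instance exists yet.
    @atexit.register
    def run_training_teardown(self):
        print("teardown")


try:
    # Mimic what the interpreter does at exit: call the function without arguments.
    Trainer.run_training_teardown()
except TypeError as err:
    message = str(err)
    print(message)
finally:
    # Remove the handler so this demo does not print the same error at real exit.
    atexit.unregister(Trainer.run_training_teardown)
```

The printed message ends with the same "missing 1 required positional argument: 'self'" text as the report.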
@awaelchli so changing this to self.run_training_teardown would fix it?
No, I tested it, the problem is not there. The problem is that the atexit.register is applied to a method (which has self as an argument) but the decorator is meant for functions which don't get self as input.
It seems this is causing the problem.
Yes, I think that's why I explicitly passed the self argument :) I'll try to come up with another solution for this :)
Maybe we can wrap the cleanup code into a closure that binds self and then the decorator can be applied to this closure function. Not sure though, have not looked at the details
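One way the closure idea could look, as a sketch only. The `Trainer` class, the `torn_down` flag and the `_exit_hook` attribute are hypothetical names for illustration, not the actual Lightning code:

```python
import atexit


class Trainer:  # hypothetical stand-in, not the Lightning Trainer
    def __init__(self):
        self.torn_down = False

        # The closure captures `self`, so atexit can call it with no arguments.
        def _teardown_at_exit():
            self.run_training_teardown()

        self._exit_hook = _teardown_at_exit
        atexit.register(self._exit_hook)

    def run_training_teardown(self):
        self.torn_down = True


trainer = Trainer()
trainer._exit_hook()                   # simulate the call made at interpreter exit
atexit.unregister(trainer._exit_hook)  # keep the demo free of side effects at exit
print(trainer.torn_down)
```

Registering the bound method directly with `atexit.register(self.run_training_teardown)` inside `__init__` should behave the same way. Either variant keeps the trainer alive until exit unless the hook is unregistered.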
Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :)
more broadly, why is that function needed? we already have teardown that works on ctrl+c no?
is this to teardown with a SIGUSR1 as well?
And we already have that signal registered...
This would start teardown for all kills except SIGKILL (like SIGTERM etc.).
There are clusters (like non-SLURM) that also need this kind of signal handling. And I think we should do cleanup whenever a job ends if possible (either exit after program end or exit on error). Otherwise you may get issues with checkpointing etc.
Also this would maybe enable proper hparam logging with metrics (not sure about that though).
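For reference, Python's `atexit` documentation notes that registered functions are not called when the program is killed by a signal Python does not handle, so teardown on SIGTERM would likely need an explicit handler. A rough sketch, where `install_teardown_handler` and the teardown callback are hypothetical names and not part of Lightning:

```python
import signal


def install_teardown_handler(teardown, signum=signal.SIGTERM):
    """Run `teardown` when `signum` arrives, then exit. Returns the previous handler."""
    previous = signal.getsignal(signum)

    def handler(received_signum, frame):
        teardown()
        # Conventional exit status for "terminated by signal N".
        raise SystemExit(128 + received_signum)

    signal.signal(signum, handler)  # only allowed from the main thread
    return previous


calls = []
previous = install_teardown_handler(lambda: calls.append("teardown"))
try:
    # Invoke the installed handler directly instead of killing the process.
    signal.getsignal(signal.SIGTERM)(signal.SIGTERM, None)
except SystemExit as exit_request:
    exit_code = exit_request.code
finally:
    # Restore whatever was installed before (fall back to the default action).
    signal.signal(signal.SIGTERM, previous if previous is not None else signal.SIG_DFL)
```

SIGKILL cannot be caught this way, and how such a handler interacts with SLURM or ddp is not covered by this sketch.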
Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :)
I think I better keep my hands away from it. I can't test the SLURM signals anyway due to lack of this setup.
SLURM should also run on CPU :]
@awaelchli please see comments. i'm not sure we should have this handler thing. i think it'll also break ddp