I'm trying to train the Levenshtein transformer with the suggested dataset and settings (but with max-tokens set to 4000) on 1 machine with 4 V100 32GB GPUs. I'm using PyTorch 1.2 and Python 3.6 on a Scientific Linux 7.6 distribution.
The process always fails with:
| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 66251776 (num. trained: 66251776)
| training on 4 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.en
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.de
| data-bin/joint-bpe-37k train en-de 3961179 examples
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "fairseq_cli/train.py", line 342, in <module>
cli_main()
File "fairseq_cli/train.py", line 334, in cli_main
nprocs=args.distributed_world_size,
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
len(cache))
When I try to run it on 1 GPU with --distributed-world-size 1, I simply get a segmentation fault without any traceback.
What could cause this issue?
Thank you!
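As an aside on the "segmentation fault without any traceback" symptom: a SIGSEGV in a C extension normally kills the interpreter before Python can print anything, but the standard-library faulthandler module can dump a Python-level traceback for every thread when the signal arrives; traces of the "Fatal Python error: Segmentation fault" form, like the ones later in this thread, are usually obtained this way. A minimal sketch (the training arguments below are illustrative placeholders, not the exact command used here):

```shell
# Enable faulthandler so a crash inside a C extension still dumps
# Python-level tracebacks for all threads on SIGSEGV.
python -X faulthandler fairseq_cli/train.py data-bin/joint-bpe-37k \
    --task translation_lev --arch levenshtein_transformer  # placeholder args

# Equivalent via the environment, useful when the entry point is a
# console script rather than "python some_file.py":
PYTHONFAULTHANDLER=1 fairseq-train data-bin/joint-bpe-37k  # placeholder args
```

Both switches are part of CPython itself (since 3.3), so they work regardless of which library is crashing.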
As reported in https://github.com/pytorch/fairseq/issues/1220#issuecomment-545016796: is this due to fairseq's code, or to the Levenshtein variant itself?
Another related report: https://github.com/pytorch/fairseq/issues/1308.
@MultiPath this is easily reproducible.
Training with the same command (same data and model options) but with --cpu works fine on CPU (in fp32, since fp16 is not supported on CPU and throws RuntimeError: _th_index_select not supported on CPUType for Half), yet it fails on both single- and multi-GPU setups @myleott.
Also, as per https://github.com/pytorch/fairseq/issues/1294, this does not occur with PyTorch 1.1.0. Fixing this might allow a few recent issues to be closed.
Thank you for the suggestions @gvskalyan. Unfortunately the same error occurs with pytorch 1.1.0 as well.
In addition: training the standard Transformer on the same machine with the same dataset works fine, but I still get the SIGSEGV error when I try to train the Levenshtein Transformer.
So far I have tried training with PyTorch 1.1, 1.2, 1.3, and 1.3.1, always with the latest commit from the fairseq master branch. I have also written my own implementation of LevT, largely based on the fairseq one, and it does work on my machine; however, I expect the fairseq version to be many times faster, so I would much rather rely on that.
@gvskalyan @myleott @MultiPath are there plans to address this issue? Do you have any suggestions that might solve it?
Please refer to https://github.com/pytorch/fairseq/issues/1350#issuecomment-550476700
Please refer to https://github.com/pytorch/fairseq/issues/1308#issuecomment-554887274
@Shujian2015 thank you for the suggestion, it was very helpful! It seems the problem in my case arises whenever libnat.suggested_ed2_path is executed, so the issue is probably that the compiled C/C++ extension cannot run on my machine.
| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 65821696 (num. trained: 65821696)
| training on 4 GPUs
| max tokens per GPU = 8000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 1129207 examples from: data-bin/gec/train.cor-wrg.cor
| loaded 1129207 examples from: data-bin/gec/train.cor-wrg.wrg
| data-bin/gec train cor-wrg 1129207 examples
| NOTICE: your device may support faster training with --fp16
Fatal Python error: Segmentation fault
Thread 0x00007f28bbfff700 (most recent call first):
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 295 in wait
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/queues.py", line 229 in _feed
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 864 in run
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 884 in _bootstrap
Current thread 0x00007f2970c08740 (most recent call first):
File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 110 in _get_ins_targets
File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 361 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/legacy_distributed_data_parallel.py", line 86 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/criterions/nat_loss.py", line 92 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/tasks/translation_lev.py", line 145 in train_step
File "/zhome/60/6/124738/repos/fairseq/fairseq/trainer.py", line 306 in train_step
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 131 in train
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 90 in main
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 332 in distributed_main
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19 in _wrap
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 93 in run
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 249 in _bootstrap
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 118 in _main
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105 in spawn_main
File "<string>", line 1 in <module>
Fatal Python error: Segmentation fault
Thread 0x00007f9ea1fff700 (most recent call first):
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 295 in wait
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/queues.py", line 229 in _feed
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 864 in run
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/appl/python/3.6.2/lib/python3.6/threading.py", line 884 in _bootstrap
Current thread 0x00007f9f17258740 (most recent call first):
File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 110 in _get_ins_targets
File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 361 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/legacy_distributed_data_parallel.py", line 86 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/criterions/nat_loss.py", line 92 in forward
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
File "/zhome/60/6/124738/repos/fairseq/fairseq/tasks/translation_lev.py", line 145 in train_step
File "/zhome/60/6/124738/repos/fairseq/fairseq/trainer.py", line 306 in train_step
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 131 in train
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 90 in main
File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 332 in distributed_main
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19 in _wrap
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 93 in run
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 249 in _bootstrap
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 118 in _main
File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105 in spawn_main
File "<string>", line 1 in <module>
Traceback (most recent call last):
File "fairseq_cli/train.py", line 373, in <module>
cli_main()
File "fairseq_cli/train.py", line 365, in cli_main
nprocs=args.distributed_world_size,
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 48 leaked semaphores to clean up at shutdown
len(cache))
I tried a script that runs only this one libnat function, and I do get the segmentation fault.
I'll see if I can work around this somehow.
The problem seems to be that PyTorch cannot run the compiled C/C++ extension on my machine, so I doubt this is an issue with fairseq itself. I rewrote that code in Python and the model works fine (presumably at a substantial cost in speed).
I would appreciate any suggestions on how to get PyTorch to run the compiled code, but otherwise I'm closing this issue.
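For illustration, a pure-Python fallback along these lines could look like the following. This is a hedged sketch, not the fairseq implementation and not the exact rewrite mentioned above: a standard insertion/deletion-only Levenshtein dynamic program that recovers one optimal edit path between two token sequences, which is the kind of computation libnat.suggested_ed2_path performs for _get_ins_targets. The function name and op labels are illustrative.

```python
def edit_ops(src, tgt):
    """Return one minimal list of ('keep', tok), ('del', tok), ('ins', tok)
    operations transforming src into tgt, allowing only insertions and
    deletions (no substitutions), as in the Levenshtein Transformer setup."""
    n, m = len(src), len(tgt)
    # dp[i][j] = minimum number of ins/del ops to turn src[:i] into tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all remaining source tokens
    for j in range(m + 1):
        dp[0][j] = j          # insert all remaining target tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]               # free match
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],          # delete src token
                                   dp[i][j - 1])          # insert tgt token
    # Backtrack from (n, m) to recover one optimal edit path.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1]:
            ops.append(("keep", src[i - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", src[i - 1])); i -= 1
        else:
            ops.append(("ins", tgt[j - 1])); j -= 1
    return ops[::-1]
```

A pure-Python version like this runs in O(n·m) per sentence pair inside the training loop, which is consistent with the substantial slowdown mentioned above compared to the compiled extension.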
I got this too, even after uninstalling apex.