Fairseq: SIGSEGV error while trying to train the Levenshtein transformer

Created on 25 Oct 2019 · 8 comments · Source: pytorch/fairseq

I'm trying to train the Levenshtein transformer with the suggested dataset and settings (but with max-tokens set to 4000) on 1 machine with 4 V100 32GB GPUs. I'm using pytorch 1.2 and python 3.6, on a Scientific Linux 7.6 distribution.

The process always fails with:

| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 66251776 (num. trained: 66251776)
| training on 4 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.en
| loaded 3961179 examples from: data-bin/joint-bpe-37k/train.en-de.de
| data-bin/joint-bpe-37k train en-de 3961179 examples
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "fairseq_cli/train.py", line 342, in <module>
    cli_main()
  File "fairseq_cli/train.py", line 334, in cli_main
    nprocs=args.distributed_world_size,
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGSEGV
~/repos/fairseq
n-62-20-9(s172185) $ 
~/repos/fairseq
n-62-20-9(s172185) $ Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
  len(cache))

When I try to run it on 1 GPU with --distributed-world-size 1, I simply get a segmentation fault without any traceback.
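If it helps with debugging: wrapping the entry point with the standard-library faulthandler (or setting PYTHONFAULTHANDLER=1, which spawned workers inherit) at least turns the bare segmentation fault into a per-thread Python traceback. A minimal sketch of such a wrapper, assuming fairseq_cli is importable from the repo root:

# Rough sketch: run fairseq training with faulthandler enabled so a SIGSEGV
# still prints Python-level tracebacks for every thread. For spawned multi-GPU
# workers, PYTHONFAULTHANDLER=1 in the environment has the same effect.
import faulthandler
faulthandler.enable()

from fairseq_cli.train import cli_main  # same entry point as fairseq_cli/train.py

if __name__ == "__main__":
    cli_main()  # parses the usual fairseq-train arguments from sys.argv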

What could cause this issue?

Thank you!

All 8 comments

As reported in https://github.com/pytorch/fairseq/issues/1220#issuecomment-545016796, it is unclear whether this is due to fairseq's code in general or to the Levenshtein variant specifically.
Another related report is https://github.com/pytorch/fairseq/issues/1308.
@MultiPath this is easily reproducible.

Training with the same command (same data and model options) but with --cpu works fine on CPU (in fp32, since fp16 is not supported on CPU and throws RuntimeError: _th_index_select not supported on CPUType for Half), but fails on both single-GPU and multi-GPU runs. @myleott

Also, as per https://github.com/pytorch/fairseq/issues/1294, this does not occur with pytorch 1.1.0. Fixing this would probably allow a few recent issues to be closed.

Thank you for the suggestions @gvskalyan. Unfortunately the same error occurs with pytorch 1.1.0 as well.

In addition: running the standard Transformer model on the same machine and the same dataset works fine, but I still get the SIGSEGV error when I try to train the Levenshtein Transformer.

So far I have tried training with PyTorch 1.1, 1.2, 1.3 and 1.3.1, always with the latest commit from the Fairseq master branch. I also made my own implementation of LevT, largely based on the Fairseq implementation, and that one does work on my machine - although I expect the Fairseq version to be many times faster, so I would much rather rely on it.

@gvskalyan @myleott @MultiPath are there plans to address this issue? Do you have any suggestions that might solve it?

@Shujian2015 thank you for the suggestion, it was very helpful! It seems the problem in my case occurs whenever libnat.suggested_ed2_path is executed, so the issue is probably that the compiled C extension cannot run on my machine.

| model levenshtein_transformer, criterion LabelSmoothedDualImitationCriterion
| num. model params: 65821696 (num. trained: 65821696)
| training on 4 GPUs
| max tokens per GPU = 8000 and max sentences per GPU = None
| no existing checkpoint found checkpoints/levt/checkpoint_last.pt
| loading train data for epoch 0
| loaded 1129207 examples from: data-bin/gec/train.cor-wrg.cor
| loaded 1129207 examples from: data-bin/gec/train.cor-wrg.wrg
| data-bin/gec train cor-wrg 1129207 examples
| NOTICE: your device may support faster training with --fp16
Fatal Python error: Segmentation fault

Thread 0x00007f28bbfff700 (most recent call first):
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 295 in wait
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/queues.py", line 229 in _feed
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 864 in run
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f2970c08740 (most recent call first):
  File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 110 in _get_ins_targets
  File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 361 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/legacy_distributed_data_parallel.py", line 86 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/criterions/nat_loss.py", line 92 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/tasks/translation_lev.py", line 145 in train_step
  File "/zhome/60/6/124738/repos/fairseq/fairseq/trainer.py", line 306 in train_step
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 131 in train
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 90 in main
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 332 in distributed_main
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19 in _wrap
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 93 in run
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 249 in _bootstrap
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 118 in _main
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105 in spawn_main
  File "<string>", line 1 in <module>
Fatal Python error: Segmentation fault

Thread 0x00007f9ea1fff700 (most recent call first):
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 295 in wait
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/queues.py", line 229 in _feed
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 864 in run
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/appl/python/3.6.2/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f9f17258740 (most recent call first):
  File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 110 in _get_ins_targets
  File "/zhome/60/6/124738/repos/fairseq/fairseq/models/levenshtein_transformer.py", line 361 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/legacy_distributed_data_parallel.py", line 86 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/criterions/nat_loss.py", line 92 in forward
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547 in __call__
  File "/zhome/60/6/124738/repos/fairseq/fairseq/tasks/translation_lev.py", line 145 in train_step
  File "/zhome/60/6/124738/repos/fairseq/fairseq/trainer.py", line 306 in train_step
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 131 in train
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 90 in main
  File "/zhome/60/6/124738/repos/fairseq/fairseq_cli/train.py", line 332 in distributed_main
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19 in _wrap
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 93 in run
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/process.py", line 249 in _bootstrap
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 118 in _main
  File "/appl/python/3.6.2/lib/python3.6/multiprocessing/spawn.py", line 105 in spawn_main
  File "<string>", line 1 in <module>
Traceback (most recent call last):
  File "fairseq_cli/train.py", line 373, in <module>
    cli_main()
  File "fairseq_cli/train.py", line 365, in cli_main
    nprocs=args.distributed_world_size,
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/zhome/60/6/124738/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV
~/repos/fairseq
/appl/python/3.6.2/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 48 leaked semaphores to clean up at shutdown
  len(cache))

I tried calling just this one libnat function from a standalone script and I do get the segmentation fault.
I'll see if I can get around this somehow.
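A minimal standalone reproduction along those lines (a rough sketch: the token ids below are placeholders, and the arguments mirror what _get_ins_targets passes to the extension):

# Rough repro sketch: call the compiled libnat extension directly, the same way
# fairseq's _get_ins_targets does. Token ids are arbitrary placeholders.
from fairseq import libnat

padding_idx = 1
# lists of token ids per sentence, with padding already stripped
in_tokens = [[2, 10, 11, 12, 2]]
out_tokens = [[2, 10, 13, 11, 14, 12, 2]]

# On the affected machine this call segfaults before returning.
full_labels = libnat.suggested_ed2_path(in_tokens, out_tokens, padding_idx)
print(full_labels)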

The problem seems to be that pytorch cannot run the compiled C extension on my machine, so I doubt this is an issue with fairseq itself. I rewrote the C code in Python and the model works fine (presumably at a substantial cost in speed).
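For reference, the pure-Python replacement is essentially a standard edit-distance DP over the two token sequences. A minimal sketch (not the exact code I used, and without the extension's optimizations or its output format):

# Minimal pure-Python sketch of the edit-distance alignment that libnat's
# suggested_ed2_path computes in C++ (simplified; output format differs).
def edit_path(src, tgt):
    """Return a keep/insert/delete/substitute edit script turning src into tgt."""
    n, m = len(src), len(tgt)
    # dp[i][j] = minimal number of edits to turn src[:i] into tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete src[i-1]
                dp[i][j - 1] + 1,         # insert tgt[j-1]
                dp[i - 1][j - 1] + cost,  # keep or substitute
            )
    # backtrace into a sequence of edit operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            ops.append(("keep", src[i - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append(("insert", tgt[j - 1]))
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", src[i - 1]))
            i -= 1
        else:
            ops.append(("substitute", tgt[j - 1]))
            i, j = i - 1, j - 1
    ops.reverse()
    return ops

# e.g. edit_path([2, 10, 11, 12, 2], [2, 10, 13, 11, 14, 12, 2])
# -> two "insert" operations plus "keep" operations for the shared tokens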

I would appreciate any suggestions on how to get pytorch to run the C extension, but otherwise I'm closing this issue.

I got this too, even after uninstalling apex.
