When I train "speech recognition" on multi GPU,
I got this error codes:
| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 2 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./checkpoints/checkpoint_last.pt
| loading train data for epoch 0
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/bin/fairseq-train", line 11, in
sys.exit(cli_main())
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 317, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 284, in distributed_main
main(args, init_distributed=True)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 80, in main
train(args, trainer, task, epoch_itr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 120, in train
for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tqdm/_tqdm.py", line 955, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 286, in __next__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 40, in __next__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 35, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 323, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
I think it may be a dataloader problem: when I set the --num-workers argument to 0, the error does not appear, but training becomes very slow.
Thank you!
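For example, the workaround looks like this (the data path and the other flags stand in for my real ones):

```bash
# Keep data loading in the main process so no batches cross process
# boundaries through shared memory. Avoids the error, but is slower.
fairseq-train /path/to/data --task speech_recognition \
    --arch vggtransformer_2 --criterion cross_entropy_acc \
    --max-tokens 5000 --num-workers 0
```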
Can you include the command you are using to launch the job?
This is my script (./run.sh):
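(The script body did not survive formatting; what follows is a hypothetical reconstruction assembled from the Namespace output printed below, not the original file.)

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of ./run.sh based on the printed Namespace.
python train.py /workspace/lustre/tmp/librispeech_final \
    --user-dir ./examples/speech_recognition/ \
    --task speech_recognition \
    --arch vggtransformer_2 \
    --criterion cross_entropy_acc \
    --optimizer adadelta --lr 1.0 --clip-norm 10.0 \
    --max-tokens 5000 --max-epoch 80 --update-freq 8 \
    --num-workers 1 --log-format json --log-interval 1 \
    --save-dir ./examples/speech_recognition/checkpoints
```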

This is my result:
root@todebug-1924478367-rcgnb:/workspace/lustre/fairseq# ./run.sh
| NOTE: you may get better performance with: --ddp-backend=no_c10d
| distributed init (rank 0): tcp://localhost:19270
| distributed init (rank 1): tcp://localhost:19270
| initialized host todebug-1924478367-rcgnb as rank 1
| initialized host todebug-1924478367-rcgnb as rank 0
Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss',bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='/workspace/lustre/tmp/librispeech_final', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19270', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./examples/speech_recognition/checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, task='speech_recognition', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[8], use_bmuf=False, user_dir='./examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
| dictionary: 5001 types
VGGTransformerModel(
...
...
)
| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 2 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./examples/speech_recognition/checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| NOTICE: your device may support faster training with --fp16
Traceback (most recent call last):
File "train.py", line 352, in
cli_main()
File "train.py", line 344, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
File "/workspace/lustre/fairseq/train.py", line 311, in distributed_main
main(args, init_distributed=True)
File "/workspace/lustre/fairseq/train.py", line 107, in main
train(args, trainer, task, epoch_itr)
File "/workspace/lustre/fairseq/train.py", line 147, in train
for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
File "/workspace/lustre/fairseq/fairseq/progress_bar.py", line 125, in __iter__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 290, in __next__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 41, in __next__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 36, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 323, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
root@todebug-1924478367-rcgnb:/workspace/lustre/fairseq#
When I add this code at the top of the fairseq/train.py file to disable shared memory, the problem is solved,
but I think it may cause a loss of performance.

Other threads ([[1](https://github.com/facebookresearch/maskrcnn-benchmark/issues/103)][[2](https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object/22641)][3]) seem to suggest that this is an out-of-memory issue. Try setting --num-workers 0. I'm going to close this, as it doesn't seem to be an issue with fairseq.
I'm getting the same issue. @lematt1991 Setting workers to 0 works, but training is unreasonably slow and GPU utilization is low. It shouldn't be an OOM issue, because I'm using a Kubernetes pod with 256Gi of RAM. I believe it is likely an issue with fairseq, because when I load speech datasets with plain PyTorch I am able to get it working.
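A minimal sketch of that kind of plain-PyTorch check (the dataset and feature shapes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Stand-in for a speech dataset: random "log-mel" feature matrices.
class DummySpeech(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(500, 80)  # ~500 frames x 80 features per utterance

# num_workers > 0 exercises the same shared-memory transfer that fails above.
loader = DataLoader(DummySpeech(), batch_size=8, num_workers=2)
for batch in loader:
    pass  # finishing the loop without a RuntimeError means shm transfer works
```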
Here's a version of @BoKaiChen's hack to fix the issue that can be copied into train.py (though it did not work for me):
```python
import sys

import torch
from torch.utils.data import dataloader
from torch.multiprocessing import reductions  # noqa: registers torch's reducers
from multiprocessing.reduction import ForkingPickler

# Wrap the default collate function so the DataLoader stops placing
# collated batches in shared memory.
def_collate_fn = dataloader.default_collate

def def_collate_override(batch):
    dataloader._use_shared_memory = False
    return def_collate_fn(batch)

setattr(dataloader, 'default_collate', def_collate_override)

# Remove torch's custom reducers so storages are pickled by value instead
# of being passed to workers as shared memory objects.
for t in torch._storage_classes:
    if sys.version_info[0] == 2:
        if t in ForkingPickler.dispatch:
            del ForkingPickler.dispatch[t]
    else:
        if t in ForkingPickler._extra_reducers:
            del ForkingPickler._extra_reducers[t]
```
I modified the train.py code:
```python
import sys

import torch
from torch.utils.data import dataloader
from torch.multiprocessing import reductions  # noqa: registers torch's reducers
from multiprocessing.reduction import ForkingPickler

default_collate_func = dataloader.default_collate

def default_collate_override(batch, *args, **kwargs):
    dataloader._use_shared_memory = False  # disable shared memory for batches
    return default_collate_func(batch, *args, **kwargs)

setattr(dataloader, 'default_collate', default_collate_override)

# Strip torch's shared-memory reducers so storages are pickled by value.
for t in torch._storage_classes:
    if sys.version_info[0] == 2:
        if t in ForkingPickler.dispatch:
            del ForkingPickler.dispatch[t]
    else:
        if t in ForkingPickler._extra_reducers:
            del ForkingPickler._extra_reducers[t]
```
Try this fix; it works for me.
FYI, we now have a new implementation for speech-to-text tasks (speech recognition, speech translation, etc.): https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text. Data loading is optimized there and does not have this issue. We will merge this ASR example (and VGGTransformer) into it soon.
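As a rough sketch, training there looks like this (flags follow the speech_to_text example's README; treat the paths, subsets, and hyperparameters as placeholders for your setup):

```bash
# Assumes manifests/features were prepared per the speech_to_text example docs.
fairseq-train ${DATA_ROOT} --save-dir ${SAVE_DIR} \
    --task speech_to_text --config-yaml config.yaml \
    --train-subset train --valid-subset dev \
    --arch s2t_transformer_s \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --max-tokens 40000 --num-workers 4 --clip-norm 10.0
```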
Hi @kahne, I am not in the speech-to-text field, but I am curious how you optimized the data loading to avoid the above issue. Can you share some pointers on that? Thanks!
@hiyyg It reduces shared memory usage and allows faster online feature extraction with pyKaldi, as well as offline feature extraction.
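An illustrative sketch of the offline side (not fairseq's actual code): precompute features once, then memory-map the file in the dataset, so workers read pages from the OS page cache instead of shipping large freshly allocated tensors through shared memory.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Features were precomputed offline, e.g. np.save("feats.npy", all_frames),
# with per-utterance (start, end) frame ranges stored alongside them.
class PrecomputedFeatures(Dataset):
    def __init__(self, feats_path, offsets):
        # mmap_mode="r" maps the file read-only; nothing is loaded up front.
        self.feats = np.load(feats_path, mmap_mode="r")
        self.offsets = offsets

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        start, end = self.offsets[idx]
        # Copy only this utterance out of the memory map into a tensor.
        return torch.from_numpy(np.array(self.feats[start:end], dtype=np.float32))
```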