When I train "speech recognition" on multi GPU,
I got this error codes:
| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 2 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./checkpoints/checkpoint_last.pt
| loading train data for epoch 0
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/bin/fairseq-train", line 11, in
sys.exit(cli_main())
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 317, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 284, in distributed_main
main(args, init_distributed=True)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 80, in main
train(args, trainer, task, epoch_itr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq_cli/train.py", line 120, in train
for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tqdm/_tqdm.py", line 955, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 286, in __next__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 40, in __next__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/fairseq/data/iterators.py", line 35, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 323, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
I think it may be a dataloader problem: when I set the --num-workers argument to 0, the error does not appear, but training becomes very slow.
Thank you!
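For example, the workaround looks like this (the data path and the other flags stand in for my real ones):

```bash
# Keep data loading in the main process so no batches cross process
# boundaries through shared memory. Avoids the error, but is slower.
fairseq-train /path/to/data --task speech_recognition \
    --arch vggtransformer_2 --criterion cross_entropy_acc \
    --max-tokens 5000 --num-workers 0
```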
Can you include the command you are using to launch the job?
This is my script (./run.sh):
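(The script body did not survive formatting; what follows is a hypothetical reconstruction assembled from the Namespace output printed below, not the original file.)

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of ./run.sh based on the printed Namespace.
python train.py /workspace/lustre/tmp/librispeech_final \
    --user-dir ./examples/speech_recognition/ \
    --task speech_recognition \
    --arch vggtransformer_2 \
    --criterion cross_entropy_acc \
    --optimizer adadelta --lr 1.0 --clip-norm 10.0 \
    --max-tokens 5000 --max-epoch 80 --update-freq 8 \
    --num-workers 1 --log-format json --log-interval 1 \
    --save-dir ./examples/speech_recognition/checkpoints
```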

This is my result:
root@todebug-1924478367-rcgnb:/workspace/lustre/fairseq# ./run.sh
| NOTE: you may get better performance with: --ddp-backend=no_c10d
| distributed init (rank 0): tcp://localhost:19270
| distributed init (rank 1): tcp://localhost:19270
| initialized host todebug-1924478367-rcgnb as rank 1
| initialized host todebug-1924478367-rcgnb as rank 0
Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss',bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='/workspace/lustre/tmp/librispeech_final', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19270', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./examples/speech_recognition/checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, task='speech_recognition', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[8], use_bmuf=False, user_dir='./examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
| dictionary: 5001 types
VGGTransformerModel(
...
...
)
| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 2 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./examples/speech_recognition/checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| NOTICE: your device may support faster training with --fp16
Traceback (most recent call last):
File "train.py", line 352, in
cli_main()
File "train.py", line 344, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
File "/workspace/lustre/fairseq/train.py", line 311, in distributed_main
main(args, init_distributed=True)
File "/workspace/lustre/fairseq/train.py", line 107, in main
train(args, trainer, task, epoch_itr)
File "/workspace/lustre/fairseq/train.py", line 147, in train
for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
File "/workspace/lustre/fairseq/fairseq/progress_bar.py", line 125, in __iter__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 290, in __next__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 41, in __next__
File "/workspace/lustre/fairseq/fairseq/data/iterators.py", line 36, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 278, in __iter__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 682, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 323, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
root@todebug-1924478367-rcgnb:/workspace/lustre/fairseq#
When I add this code at the top of the fairseq/train.py file to disable shared memory, the problem is solved,
but I think it may cause a loss of performance.

Other threads ([[1](https://github.com/facebookresearch/maskrcnn-benchmark/issues/103)][[2](https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object/22641)][3]) seem to suggest that this is an out-of-memory issue. Try setting --num-workers 0. I'm going to close this, as it doesn't seem to be an issue with fairseq.
I'm getting the same issue. @lematt1991 Setting workers to 0 works, but training is unreasonably slow and GPU utilization is low. It shouldn't be an OOM issue, because I'm using a Kubernetes pod with 256Gi of RAM. I believe it is likely an issue with fairseq, because when I load speech datasets with plain PyTorch I am able to get it working.
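A minimal sketch of that kind of plain-PyTorch check (the dataset and feature shapes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Stand-in for a speech dataset: random "log-mel" feature matrices.
class DummySpeech(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(500, 80)  # ~500 frames x 80 features per utterance

# num_workers > 0 exercises the same shared-memory transfer that fails above.
loader = DataLoader(DummySpeech(), batch_size=8, num_workers=2)
for batch in loader:
    pass  # finishing the loop without a RuntimeError means shm transfer works
```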
Here's a version of @BoKaiChen's hack to fix the issue that can be copied into train.py (though it did not work for me):
```python
import sys

import torch
from torch.utils.data import dataloader
from torch.multiprocessing import reductions  # noqa: registers torch's reducers
from multiprocessing.reduction import ForkingPickler

# Wrap the default collate function so the DataLoader stops placing
# collated batches in shared memory.
def_collate_fn = dataloader.default_collate

def def_collate_override(batch):
    dataloader._use_shared_memory = False
    return def_collate_fn(batch)

setattr(dataloader, 'default_collate', def_collate_override)

# Remove torch's custom reducers so storages are pickled by value instead
# of being passed to workers as shared memory objects.
for t in torch._storage_classes:
    if sys.version_info[0] == 2:
        if t in ForkingPickler.dispatch:
            del ForkingPickler.dispatch[t]
    else:
        if t in ForkingPickler._extra_reducers:
            del ForkingPickler._extra_reducers[t]
```
I modified the train.py code:
```python
import sys

import torch
from torch.utils.data import dataloader
from torch.multiprocessing import reductions  # noqa: registers torch's reducers
from multiprocessing.reduction import ForkingPickler

default_collate_func = dataloader.default_collate

def default_collate_override(batch, *args, **kwargs):
    dataloader._use_shared_memory = False  # disable shared memory for batches
    return default_collate_func(batch, *args, **kwargs)

setattr(dataloader, 'default_collate', default_collate_override)

# Strip torch's shared-memory reducers so storages are pickled by value.
for t in torch._storage_classes:
    if sys.version_info[0] == 2:
        if t in ForkingPickler.dispatch:
            del ForkingPickler.dispatch[t]
    else:
        if t in ForkingPickler._extra_reducers:
            del ForkingPickler._extra_reducers[t]
```
Try this fix; it works for me.
FYI, we now have a new implementation for speech-to-text tasks (speech recognition, speech translation, etc.): https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text. Data loading is optimized there and does not have this issue. We will merge this ASR example (and VGGTransformer) into it soon.
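As a rough sketch, training there looks like this (flags follow the speech_to_text example's README; treat the paths, subsets, and hyperparameters as placeholders for your setup):

```bash
# Assumes manifests/features were prepared per the speech_to_text example docs.
fairseq-train ${DATA_ROOT} --save-dir ${SAVE_DIR} \
    --task speech_to_text --config-yaml config.yaml \
    --train-subset train --valid-subset dev \
    --arch s2t_transformer_s \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --max-tokens 40000 --num-workers 4 --clip-norm 10.0
```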
Hi @kahne, I am not in the speech-to-text field, but I am curious how you optimized the data loading to avoid the above issue. Can you share some pointers on that? Thanks!
@hiyyg It reduces shared memory usage and allows faster online feature extraction with pyKaldi, as well as offline feature extraction.
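An illustrative sketch of the offline side (not fairseq's actual code): precompute features once, then memory-map the file in the dataset, so workers read pages from the OS page cache instead of shipping large freshly allocated tensors through shared memory.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Features were precomputed offline, e.g. np.save("feats.npy", all_frames),
# with per-utterance (start, end) frame ranges stored alongside them.
class PrecomputedFeatures(Dataset):
    def __init__(self, feats_path, offsets):
        # mmap_mode="r" maps the file read-only; nothing is loaded up front.
        self.feats = np.load(feats_path, mmap_mode="r")
        self.offsets = offsets

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        start, end = self.offsets[idx]
        # Copy only this utterance out of the memory map into a tensor.
        return torch.from_numpy(np.array(self.feats[start:end], dtype=np.float32))
```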