Fairseq: bug :: base_transformer :: fairseq-train fails with 'RuntimeError: received 0 items of ancdata'

Created on 11 Nov 2019 · 4 comments · Source: pytorch/fairseq

I trained the Transformer on Gigaword with the following preprocessing:
python3 $FAIRSEQ_PATH/preprocess.py \
    --source-lang articles \
    --target-lang summaries \
    --trainpref $DATA_PATH/train.gigaword \
    --validpref $DATA_PATH/valid.gigaword \
    --testpref $DATA_PATH/test.gigaword \
    --destdir $DEST_DIR \
    --workers 70 \
    --bpe gpt2 \
    --joined-dictionary

Then I ran the following training script:
python $FAIRSEQ_PATH/train.py $DATA_PATH --clip-norm 0.1 \
    --fp16 --optimizer adam --adam-betas '(0.9, 0.98)' --skip-invalid-size-inputs-valid-test \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 \
    --min-lr 1e-09 --clip-norm 0.0 --dropout 0.3 --weight-decay 0.0 \
    --max-tokens 1500 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-epoch 200 --arch transformer --save-dir $LM_CHECKPOINT_PATH --bpe gpt2 \
    --source-lang articles --target-lang summaries --num-workers 70 \
    --memory-efficient-fp16 \
    --save-interval-updates 5000 --keep-interval-updates 10

Training starts, but it crashes when it gets to computing scores on the validation set, with the following traceback:
set: 0%| | 2/9370 [00:03<6:43:42, 2.59s/it]
Traceback (most recent call last):
  File "/home/whiteRa2bit/fairseq//train.py", line 423, in <module>
    cli_main()
  File "/home/whiteRa2bit/fairseq//train.py", line 417, in cli_main
    main(args, config=config)
  File "/home/whiteRa2bit/fairseq//train.py", line 121, in main
    train(args, trainer, task, epoch_itr, experiment)
  File "/home/whiteRa2bit/fairseq//train.py", line 210, in train
    valid_losses = validate(args, trainer, task, epoch_itr, valid_subsets)
  File "/home/whiteRa2bit/fairseq//train.py", line 306, in validate
    for sample in progress:
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/tqdm/std.py", line 1087, in __iter__
    for obj in iterable:
  File "/home/whiteRa2bit/fairseq/fairseq/data/iterators.py", line 36, in __iter__
    for x in self.iterable:
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 771, in _get_data
    success, data = self._try_get_data()
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

Could you please tell me how to fix this?


All 4 comments

This seems like a widespread issue, and might be related to the number of available file descriptors.

cc: @myleott
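If you want to confirm whether the descriptor limit is the culprit, a quick diagnostic sketch (my own suggestion, not part of fairseq) is to read the limit from Python in the same environment that runs training:

import resource

# Soft/hard caps on open file descriptors for the current process.
# PyTorch's default "file_descriptor" sharing strategy uses one descriptor
# per tensor shared between DataLoader workers, so a low soft limit with
# many workers can surface as "received 0 items of ancdata".
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")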


Thanks a lot!
It helped!

I am getting the same issue.

@whiteRa2bit, where did you place torch.multiprocessing.set_sharing_strategy('file_system')?

I also went ahead and did:

echo "ulimit -n 4096" >> ~/.bashrc
echo "ulimit -n 4096" >> ~/.bash_profile
source ~/.bashrc

as suggested here.
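For anyone trying the sharing-strategy fix: the call has to run in the main process before the first DataLoader with num_workers > 0 is created, e.g. near the top of the training entry point. A minimal sketch (the placement is my suggestion; the call itself is the standard PyTorch API):

import torch.multiprocessing

# Pass shared tensors through files in /dev/shm ("file_system") instead of
# open file descriptors ("file_descriptor", the Linux default), which is the
# resource that runs out when recvfds() returns 0 items of ancdata.
torch.multiprocessing.set_sharing_strategy('file_system')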

In case somebody still has this problem and lands here: I encountered this when submitting training jobs to SLURM. Increasing the file descriptor limit didn't help. I solved the problem by reducing the number of CPUs that I allocate for the job with sbatch.
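If you would rather not shrink the SLURM allocation, a related workaround (my own suggestion, not something the commenter did) is to lower fairseq's --num-workers to roughly the CPUs the job can actually use, since every extra worker adds shared-memory descriptors. A small sketch for picking that number on Linux:

import os

# CPUs this process is actually allowed to run on (respects the SLURM/cgroup
# affinity mask); a saner upper bound for --num-workers than a hard-coded 70.
usable_cpus = len(os.sched_getaffinity(0))
print(f"try --num-workers {usable_cpus} or lower")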
