I trained the Transformer on Gigaword with the following preprocessing:
python3 $FAIRSEQ_PATH/preprocess.py \
--source-lang articles \
--target-lang summaries \
--trainpref $DATA_PATH/train.gigaword \
--validpref $DATA_PATH/valid.gigaword \
--testpref $DATA_PATH/test.gigaword \
--destdir $DEST_DIR \
--workers 70 \
--bpe gpt2 \
--joined-dictionary
Then I ran the following training script (note: the original command passed --clip-norm twice, 0.1 and then 0.0; since the last value wins, only --clip-norm 0.0 is kept here):
python $FAIRSEQ_PATH/train.py $DATA_PATH \
--fp16 --optimizer adam --adam-betas '(0.9, 0.98)' --skip-invalid-size-inputs-valid-test \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 \
--min-lr 1e-09 --clip-norm 0.0 --dropout 0.3 --weight-decay 0.0 \
--max-tokens 1500 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-epoch 200 --arch transformer --save-dir $LM_CHECKPOINT_PATH --bpe gpt2 \
--source-lang articles --target-lang summaries --num-workers 70 \
--memory-efficient-fp16 \
--save-interval-updates 5000 --keep-interval-updates 10
Training starts fine, but it crashes when it comes to computing scores on the validation set, with the following traceback:
Traceback (most recent call last):
  File "/home/whiteRa2bit/fairseq//train.py", line 423, in <module>
    cli_main()
  File "/home/whiteRa2bit/fairseq//train.py", line 417, in cli_main
    main(args, config=config)
  File "/home/whiteRa2bit/fairseq//train.py", line 121, in main
    train(args, trainer, task, epoch_itr, experiment)
  File "/home/whiteRa2bit/fairseq//train.py", line 210, in train
    valid_losses = validate(args, trainer, task, epoch_itr, valid_subsets)
  File "/home/whiteRa2bit/fairseq//train.py", line 306, in validate
    for sample in progress:
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/tqdm/std.py", line 1087, in __iter__
    for obj in iterable:
  File "/home/whiteRa2bit/fairseq/fairseq/data/iterators.py", line 36, in __iter__
    for x in self.iterable:
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 771, in _get_data
    success, data = self._try_get_data()
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/whiteRa2bit/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
Could you please tell me how to fix this?
This seems like a widespread issue, and might be related to the number of available file descriptors.
cc: @myleott
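If you want to check whether the descriptor limit is indeed the problem, you can query (and raise) it from Python with the standard resource module; a minimal sketch, Unix-only:
import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit to the hard ceiling (same effect as `ulimit -n`);
# raising the hard limit itself requires root.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
Each DataLoader worker passes tensors back to the main process through file descriptors, so a low soft limit combined with many workers (--num-workers 70 here) can exhaust it.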
Thanks a lot!
It helped!
I am getting the same issue.
@whiteRa2bit, where did you place torch.multiprocessing.set_sharing_strategy('file_system')?
I also went ahead and did:
echo "ulimit -n 4096" >> .bashrc
echo "ulimit -n 4096" >> .bash_profile
source ~/.bashrc
as suggested here
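For anyone else wondering about placement: the call has to run in the main process before any DataLoader workers are spawned, so the top of the script that launches training (fairseq's train.py in this thread) works; a minimal sketch, not an official fairseq hook:
import torch.multiprocessing

# Share tensors between DataLoader workers via the filesystem instead of
# file descriptors, sidestepping "received 0 items of ancdata".
torch.multiprocessing.set_sharing_strategy('file_system')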
In case somebody still has this problem and lands here: I encountered this when submitting training jobs to SLURM. Increasing the file descriptor limit didn't help. I solved the problem by reducing the number of CPUs that I allocate for the job with sbatch.
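Concretely, that means requesting fewer CPUs in the sbatch script and keeping --num-workers correspondingly small; the numbers below are hypothetical:
#!/bin/bash
#SBATCH --cpus-per-task=8    # down from a much larger allocation
#SBATCH --gres=gpu:1

python $FAIRSEQ_PATH/train.py $DATA_PATH --num-workers 4 ...  # remaining flags as above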