Transformers: Unexpected preprocessing behavior when using multiple GPUs

Created on 1 Aug 2019 · 8 comments · Source: huggingface/transformers

When running examples/run_squad.py with more than one GPU, the preprocessing does not work as expected. For example, the unique_id values are not sequential, and a KeyError then occurs when writing the results to the JSON file.

https://github.com/huggingface/pytorch-transformers/blob/f2a3eb987e1fc2c85320fc3849c67811f5736b50/examples/utils_squad.py#L511

I have not yet checked whether there is other unexpected behavior.
I'll update the issue after checking.

wontfix

Most helpful comment

The problem is that evaluate() distributes the evaluation under DDP:
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L216
Meaning each process collects a subset of all_results
but then write_predictions() expects all_results to have all the results.

Specifically, unique_id_to_result only maps a subset of ids
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L489-L491
but the code expects an entry for every feature
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L510-L511

For evaluation under DDP to work, all_results needs to be collected from all the processes. Otherwise, don't allow args.do_eval and args.local_rank != -1 at the same time.
edit: or get rid of the DistributedSampler and use SequentialSampler in all cases.
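
For reference, a minimal sketch of the "don't distribute evaluation" option. The argument names (args.local_rank, args.do_eval) and the evaluate() call follow examples/run_squad.py, but the helper below is an illustration, not a tested patch:

```python
from torch.utils.data import DataLoader, SequentialSampler

def build_eval_dataloader(eval_dataset, eval_batch_size):
    # Always iterate the full eval set in order, even after DDP training,
    # so all_results ends up with one entry per feature.
    sampler = SequentialSampler(eval_dataset)
    return DataLoader(eval_dataset, sampler=sampler, batch_size=eval_batch_size)

# ...and guard the call site so only the main process evaluates:
# if args.do_eval and args.local_rank in [-1, 0]:
#     results = evaluate(args, model, tokenizer)
```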

All 8 comments

Why do you think the unique_id values are not sequential?
Each process should convert ALL of the dataset.
Only the PyTorch dataset should be split among processes.

By the way, it would be cleaner if the other processes waited for the first process to pre-process the dataset and then used the cache, so the dataset is only converted once instead of several times in parallel (a waste of compute). I'll add this option.
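
A rough sketch of that option, assuming the torch.distributed setup used by the example scripts (args.local_rank is -1 for single-process runs and 0 for the first distributed process); the cache path handling and the build_features callable are placeholders, not the actual run_squad.py code:

```python
import os
import torch
import torch.distributed as dist

def load_features_with_cache(args, cache_file, build_features):
    # Every process except the first waits here, so only one process
    # converts the examples to features and writes the cache.
    if args.local_rank not in [-1, 0]:
        dist.barrier()

    if os.path.exists(cache_file):
        features = torch.load(cache_file)
    else:
        features = build_features()  # the expensive conversion, done once
        if args.local_rank in [-1, 0]:
            torch.save(features, cache_file)

    # The first process only reaches this barrier after the cache is on disk,
    # releasing the waiting processes, which then load from the cache above.
    if args.local_rank == 0:
        dist.barrier()

    return features
```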

@thomwolf Hi, thanks for your reply

For example, the unique_id values should be [1000000, 1000001, 1000002, ...].
However, with multiple processes I got [1000000, 100001, 1000004, ...].

I have not checked what causes the error.
As a result, when predicting the answers with multiple GPUs, the KeyError happens.

[screenshot of the error]

Had the same problem: a KeyError 1000000 after doing distributed training. Does anyone know how to fix it?

@ayrtondenner I have the same problem as you with distributed training; after evaluation completes, while writing predictions:

_" File "/media/dn/dssd/nlp/transformers/examples/utils_squad.py", line 511, in write_predictions
result = unique_id_to_result[feature.unique_id]
KeyError: 1000000000"_

Setup: transformers 2.0.0; pytorch 1.2.0; python 3.7.4; NVIDIA 1080Ti x 2
_" python -m torch.distributed.launch --nproc_per_node=2 ./run_squad.py \ "_

Data parallel with the otherwise identical shell script works fine, producing the results below, but of course it takes longer and GPU memory for batch sizes is more limited.
Results:
{
"exact": 81.06906338694418,
"f1": 88.57343698391432,
"total": 10570,
"HasAns_exact": 81.06906338694418,
"HasAns_f1": 88.57343698391432,
"HasAns_total": 10570
}

Data parallel shell script:

python ./run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file=${SQUAD_DIR}/train-v1.1.json \
--predict_file=${SQUAD_DIR}/dev-v1.1.json \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--learning_rate=3e-5 \
--num_train_epochs=2 \
--max_seq_length=384 \
--doc_stride=128 \
--adam_epsilon=1e-6 \
--save_steps=2000 \
--output_dir=./runs/bert_base_squad1_ft_2

The problem is that evaluate() distributes the evaluation under DDP:
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L216
Meaning each process collects a subset of all_results
but then write_predictions() expects all_results to have all the results.

Specifically, unique_id_to_result only maps a subset of ids
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L489-L491
but the code expects an entry for every feature
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L510-L511

For evaluation under DDP to work, all_results needs to be collected from all the processes. Otherwise, don't allow args.do_eval and args.local_rank != -1 at the same time.
edit: or get rid of the DistributedSampler and use SequentialSampler in all cases.
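
A sketch of the first option (collecting all_results from every process) follows. It assumes a newer PyTorch than the 1.2.0 reported above, since it relies on torch.distributed.all_gather_object; on older versions the per-rank lists would have to be merged by hand (e.g. via per-rank files). The unique_id attribute follows the RawResult namedtuple in utils_squad.py:

```python
import torch.distributed as dist

def gather_all_results(all_results, args):
    """Merge the per-process result lists so write_predictions() sees every feature."""
    if args.local_rank == -1:
        return all_results  # single process: nothing to merge

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, all_results)

    # DistributedSampler may pad the last batches by repeating samples,
    # so deduplicate by unique_id while merging.
    merged = {}
    for rank_results in gathered:
        for result in rank_results:
            merged[result.unique_id] = result
    return list(merged.values())
```

Every process would call this right after its evaluation loop, and only the main process would then run write_predictions() on the merged list.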

Makes sense. Do you have a fix in mind, @immawatson? Happy to welcome a PR that would fix that.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The problem is that evaluate() distributes the evaluation under DDP:
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L216

Meaning each process collects a subset of all_results
but then write_predictions() expects all_results to have _all the results_.
Specifically, unique_id_to_result only maps a subset of ids
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L489-L491

but the code expects an entry for every feature
https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/utils_squad.py#L510-L511

For evaluation under DDP to work, all_results needs to be collected from all the processes. Otherwise, don't allow args.do_eval and args.local_rank != -1 at the same time.
edit: or get rid of the DistributedSampler and use SequentialSampler in all cases.

That didn't work for me.
