Transformers: [s2s trainer] tests fail on multi-gpu machine

Created on 15 Oct 2020 · 9 comments · Source: huggingface/transformers

Command

RUN_SLOW=1 USE_CUDA=1 pytest examples/seq2seq/test_finetune_trainer.py

Traceback

=========================================================== test session starts ===========================================================
platform linux -- Python 3.7.4, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/shleifer/transformers_fork, inifile: pytest.ini
plugins: forked-1.1.3, hydra-core-1.0.0, xdist-1.31.0, requests-mock-1.8.0
collected 2 items

examples/seq2seq/test_finetune_trainer.py /home/shleifer/transformers_fork/src/transformers/training_args.py:339: FutureWarning: The `evaluate_during_training` argument is deprecated in favor of `evaluation_strategy` (which has more options)
  FutureWarning,
F/home/shleifer/transformers_fork/src/transformers/training_args.py:339: FutureWarning: The `evaluate_during_training` argument is deprecated in favor of `evaluation_strategy` (which has more options)
  FutureWarning,
F

================================================================ FAILURES =================================================================
__________________________________________________________ test_finetune_trainer __________________________________________________________

    def test_finetune_trainer():
>       output_dir = run_trainer(1, "12", MBART_TINY, 1)

examples/seq2seq/test_finetune_trainer.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
examples/seq2seq/test_finetune_trainer.py:105: in run_trainer
    main()
examples/seq2seq/finetune_trainer.py:294: in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
src/transformers/trainer.py:583: in train
    train_dataloader = self.get_train_dataloader()
src/transformers/trainer.py:386: in get_train_dataloader
    train_sampler = self._get_train_sampler()
examples/seq2seq/seq2seq_trainer.py:108: in _get_train_sampler
    self.args.per_device_train_batch_size, distributed=self.args.n_gpu > 1
examples/seq2seq/utils.py:156: in make_sortish_sampler
    return DistributedSortishSampler(self, batch_size, shuffle=shuffle, **kwargs)
examples/seq2seq/utils.py:368: in __init__
    num_replicas = dist.get_world_size()
../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:582: in get_world_size
    return _get_group_size(group)
../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:196: in _get_group_size
    _check_default_pg()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def _check_default_pg():
        """
        Helper that checks if the default ProcessGroup has been initialized, with
        assertion

        """
        assert _default_pg is not None, \
>           "Default process group is not initialized"
E       AssertionError: Default process group is not initialized

../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:187: AssertionError
_______________________________________________________ test_finetune_trainer_slow ________________________________________________________

    @slow
    def test_finetune_trainer_slow():
        # TODO(SS): This will fail on devices with more than 1 GPU.
        # There is a missing call to __init__process_group somewhere
>       output_dir = run_trainer(eval_steps=2, max_len="128", model_name=MARIAN_MODEL, num_train_epochs=3)

examples/seq2seq/test_finetune_trainer.py:30:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
examples/seq2seq/test_finetune_trainer.py:105: in run_trainer
    main()
examples/seq2seq/finetune_trainer.py:294: in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
src/transformers/trainer.py:583: in train
    train_dataloader = self.get_train_dataloader()
src/transformers/trainer.py:386: in get_train_dataloader
    train_sampler = self._get_train_sampler()
examples/seq2seq/seq2seq_trainer.py:108: in _get_train_sampler
    self.args.per_device_train_batch_size, distributed=self.args.n_gpu > 1
examples/seq2seq/utils.py:156: in make_sortish_sampler
    return DistributedSortishSampler(self, batch_size, shuffle=shuffle, **kwargs)
examples/seq2seq/utils.py:368: in __init__
    num_replicas = dist.get_world_size()
../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:582: in get_world_size
    return _get_group_size(group)
../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:196: in _get_group_size
    _check_default_pg()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def _check_default_pg():
        """
        Helper that checks if the default ProcessGroup has been initialized, with
        assertion

        """
        assert _default_pg is not None, \
>           "Default process group is not initialized"
E       AssertionError: Default process group is not initialized

../miniconda3/envs/nb/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:187: AssertionError
========================================================= short test summary info =========================================================
FAILED examples/seq2seq/test_finetune_trainer.py::test_finetune_trainer - AssertionError: Default process group is not initialized
FAILED examples/seq2seq/test_finetune_trainer.py::test_finetune_trainer_slow - AssertionError: Default process group is not initialized
=========================================================== 2 failed in 11.51s ============================================================

All 9 comments

@stas00 would you be interested in taking a look at this, possibly reusing the fix in https://github.com/huggingface/transformers/pull/7281 ?
If that doesn't work we can hack it like tests/test_trainer.py:

https://github.com/huggingface/transformers/blob/a1d1b332d07a40177ae1959609ab70dab34018b8/tests/test_trainer.py#L245

cc @patil-suraj

Yes, I will work on it today, Sam.

The other temporary fix option is to use @require_non_multigpu.
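
If that decorator isn't yet available in the shared testing utilities, an ad-hoc equivalent is easy to sketch. The snippet below is illustrative only (the skip condition and reason string are assumptions, and run_trainer / MBART_TINY are the helpers already defined in the existing test file); it is not the actual transformers helper:

    import pytest
    import torch

    # Ad-hoc stand-in for the @require_non_multigpu decorator suggested above
    # (a sketch, not the real transformers testing utility): skip whenever more
    # than one GPU is visible, until the process-group initialization is fixed.
    require_non_multigpu = pytest.mark.skipif(
        torch.cuda.device_count() > 1,
        reason="fails with >1 GPU until the default process group init is fixed",
    )

    @require_non_multigpu
    def test_finetune_trainer():
        # run_trainer and MBART_TINY come from the existing test module.
        output_dir = run_trainer(1, "12", MBART_TINY, 1)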

This is not the test's issue but the script's: the command below fails with the same error.

python examples/seq2seq/finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --data_dir examples/seq2seq/test_data/wmt_en_ro --output_dir /tmp/test_outputsarhj9od --overwrite_output_dir --n_train 8 --n_val 8 --max_source_length 12 --max_target_length 12 --val_max_target_length 12 --do_train --do_eval --do_predict --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --learning_rate 3e-4 --warmup_steps 8 --evaluate_during_training --predict_with_generate --logging_steps 0 --save_steps 1 --eval_steps 1 --sortish_sampler --label_smoothing 0.1 --adafactor --task translation --tgt_lang ro_RO --src_lang en_XX

I just dumped the args the test was invoking.

The error "AssertionError: Default process group is not initialized" means that the distributed setup was never performed.

I will look more into it tomorrow morning.

On the other hand, if we sort it out, perhaps we could do the same for distributed eval!? It would be much, much better to delegate all that forking, etc. to PL.
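
For context, the failing call is the unconditional dist.get_world_size() inside DistributedSortishSampler.__init__ (examples/seq2seq/utils.py:368 in the traceback above). A rough illustration of the kind of guard that would avoid the assertion when no process group has been initialized, purely as a sketch rather than the fix that was eventually applied:

    import torch.distributed as dist

    # Illustration only: fall back to a single replica when torch.distributed
    # has not been initialized, instead of calling get_world_size()
    # unconditionally.
    if dist.is_available() and dist.is_initialized():
        num_replicas = dist.get_world_size()
        rank = dist.get_rank()
    else:
        num_replicas = 1
        rank = 0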

If that doesn't work we can hack it like tests/test_trainer.py: line 245

Can you please clarify how you think it could help? That line of code you quoted does nothing; it's just used for testing, and it'll result in n_gpu=2 anyway. Perhaps you meant somewhere else in that file?

You need to launch with

python -m torch.distributed.launch --nproc_per_node=2  finetune_trainer.py

That caught me up as well.

In which case, yes, this would be 100% the same as https://github.com/huggingface/transformers/pull/7281 - let's finish it first, then refactor all that new code and use it here.

Until then you can use @require_non_multigpu so that it doesn't interfere.
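
For reference, launching under the distributed launcher from Python (e.g. from a test helper) could look roughly like the sketch below; the argument list is the one dumped earlier in the thread, and this is not the helper that the referenced PR introduces:

    import subprocess
    import sys

    import torch

    # Sketch only: spawn finetune_trainer.py under torch.distributed.launch so
    # that each worker initializes its own default process group. This mirrors
    # the manual command above rather than any existing transformers helper.
    n_gpu = torch.cuda.device_count()
    cmd = [
        sys.executable,
        "-m", "torch.distributed.launch",
        f"--nproc_per_node={n_gpu}",
        "examples/seq2seq/finetune_trainer.py",
        # ... same arguments as the single-process command quoted earlier ...
    ]
    subprocess.run(cmd, check=True)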

I thought PL had a way of handling distributed training internally, without the user needing to call -m torch.distributed.launch. Is it not working, or did I misread it?

These tests don't use PL.
