Fairseq: Cannot generate translation using m2m_100

Created on 26 Oct 2020 · 6 comments · Source: pytorch/fairseq

Greetings,

I was trying to deploy and use the m2m_100 model. I did so on a p3.8xlarge instance, since loading the model eats all of the memory on a p3.2xlarge.

I am running CUDA 11.0 on the Deep Learning Base AMI (Ubuntu 18.04) Version 30.0, with the latest pip version of torch, and I installed fairseq and fairscale from git. When trying to reproduce the generation results, I keep getting CUDA out-of-memory errors. Even when I add --cpu, it keeps trying to use the GPUs and exits because of CUDA memory issues.
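A minimal sketch of forcing CPU-only execution, assuming this fairseq version respects CUDA_VISIBLE_DEVICES (the checkpoint, dictionary, and data paths below are placeholders, not necessarily the exact files from this report):

# Hide all GPUs so PyTorch cannot allocate CUDA memory, then generate on CPU.
export CUDA_VISIBLE_DEVICES=""
fairseq-generate data_bin \
    --cpu \
    --task translation_multi_simple_epoch \
    --path 418M_last_checkpoint.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test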

Please advise,
Thanks,

Labels: needs triage, question

All 6 comments

At present, the model checkpoint works on 2 V100 GPUs with 32GB of memory each. I will soon release checkpoints that work with V100 GPUs that have 16GB of memory each.

@shruti-bh Thanks for your reply.
Could you please include specific instructions on how to fine-tune the model on other sources of parallel data, such as domain-specific data?
Thank you so much for your efforts :)
Best wishes,

@AbdallahNasir Please refer to the latest version of the README for checkpoints and pipeline arguments to use with different hardware configurations.

About fine-tuning, that's a great point. We don't have support yet for loading a pretrained model and fine-tuning it, but we will try to add this soon.
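In the meantime, the generic fairseq pattern for warm-starting training from an existing checkpoint is --restore-file together with the --reset-* flags. A hedged sketch only (untested with the m2m_100 checkpoints; the data paths, architecture, and hyperparameters below are placeholders that would need to match the pretrained model):

# Warm-start from a pretrained checkpoint on new parallel data.
# --restore-file loads the weights; the --reset-* flags drop the stored
# optimizer, dataloader, and meter state so training restarts on the new data.
# --arch must match the pretrained model's architecture (placeholder here).
fairseq-train data_bin_finetune \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --arch transformer \
    --restore-file 418M_last_checkpoint.pt \
    --reset-optimizer --reset-dataloader --reset-meters --reset-lr-scheduler \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --lr 1e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 2048 --fp16 \
    --save-dir checkpoints_finetune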

Hello,

I was also trying to deploy and use the m2m_100 model, also on a p3.8xlarge instance, with the Deep Learning AMI (Ubuntu 18.04) Version 36.0 and PyTorch version 1.7.0. I likewise installed fairseq and fairscale from git, and I am running everything in a virtual environment.

But when I run the following command:

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_chk_4_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[1,15,10]' \
    --pipeline-encoder-devices '[0,1,0]' \
    --pipeline-decoder-balance '[3,11,11,1]' \
    --pipeline-decoder-devices '[0,2,3,0]' > gen_out

I get the following error:

/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/hydra.py:71: UserWarning: 
@hydra.main(strict) flag is deprecated and will removed in the next version.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/strict_mode_flag_deprecated
  warnings.warn(message=msg, category=UserWarning)
Traceback (most recent call last):
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config
    OmegaConf.update(cfg, key, value, merge=True)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/omegaconf/omegaconf.py", line 613, in update
    root.__setattr__(last_key, value)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 278, in __setattr__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
    type_override=type_override,
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
    _raise(ex, cause)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ValidationError: Invalid value 'simple', expected one of [c10d, no_c10d]
    full_key: distributed_training.ddp_backend
    reference_type=DistributedTrainingConfig
    object_type=DistributedTrainingConfig

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq_cli/generate.py", line 389, in cli_main
    main(args)
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq_cli/generate.py", line 50, in main
    return _main(cfg, sys.stdout)
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq_cli/generate.py", line 103, in _main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq/checkpoint_utils.py", line 264, in load_model_ensemble
    num_shards,
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq/checkpoint_utils.py", line 288, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq/checkpoint_utils.py", line 240, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq/checkpoint_utils.py", line 458, in _upgrade_state_dict
    state["cfg"] = convert_namespace_to_omegaconf(state["args"])
  File "/home/ubuntu/fb_mt/test_env_1/fairseq/fairseq/dataclass/utils.py", line 298, in convert_namespace_to_omegaconf
    composed_cfg = compose("config", overrides=overrides, strict=False)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/experimental/compose.py", line 37, in compose
    with_log_configuration=False,
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 512, in compose_config
    from_shell=from_shell,
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 156, in load_configuration
    from_shell=from_shell,
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 277, in _load_configuration
    ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
  File "/home/ubuntu/fb_mt/test_env_1/test_env_1/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 522, in _apply_overrides_to_config
    ) from ex
hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'

I see that the problem comes from "Invalid value 'simple', expected one of [c10d, no_c10d]".
Has anyone encountered a similar error? Do you have any suggestions on how to resolve it?
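One hedged workaround, if updating fairseq is not an option: per the traceback, load_checkpoint_to_cpu applies --model-overrides to the checkpoint's stored args before upgrading them, so overriding the stale value might avoid the validation error (untested against this exact version; pipeline-parallel and fp16 flags from the original command would carry over unchanged):

# Override the checkpoint's stored ddp_backend before config upgrading runs.
fairseq-generate data_bin \
    --path 12b_last_chk_4_gpus.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --model-overrides "{'ddp_backend': 'c10d'}"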

@dk-et Can you check if you get the same error with commit 6debe291 on the master branch?
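For reference, updating an existing git install to that commit would look something like this (assuming fairseq was cloned and installed with pip install -e .):

# Check out the referenced master commit and reinstall the editable package.
cd fairseq
git fetch origin
git checkout 6debe291
pip install -e .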

Thank you, @shruti-bh. It worked. I don't get that error anymore.
