Fairseq: "argument --distributed-world-size: conflicting option string: --distributed-world-size" Error

Created on 30 Apr 2020  ·  4 comments  ·  Source: pytorch/fairseq

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

After training my model, I would like to evaluate it; however, I run into an argument-parsing error, shown below. I am using the command lines from here, slightly modified when training: a --patience of 3, --no-epoch-checkpoints, --fp16 removed, and --distributed-world-size 1. I also changed the paths to reflect my own directory structure. These are the only changes I have made from the link, and I am sure they are properly formatted. Any help is appreciated. :)

Code


Traceback (most recent call last):
File "/home/e/miniconda3/envs/eshaan/bin/fairseq-eval-lm", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-eval-lm')()
File "/srv/home/e/eshaan/fairseq/fairseq_cli/eval_lm.py", line 251, in cli_main
add_distributed_training_args(parser)
File "/srv/home/e/eshaan/fairseq/fairseq/options.py", line 356, in add_distributed_training_args
help='total number of GPUs across all nodes (default: all visible GPUs)')
File "/home/e/miniconda3/envs/eshaan/lib/python3.6/argparse.py", line 1352, in add_argument
return self._add_action(action)
File "/home/e/miniconda3/envs/eshaan/lib/python3.6/argparse.py", line 1556, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "/home/e/miniconda3/envs/eshaan/lib/python3.6/argparse.py", line 1366, in _add_action
self._check_conflict(action)
File "/home/e/miniconda3/envs/eshaan/lib/python3.6/argparse.py", line 1505, in _check_conflict
conflict_handler(action, confl_optionals)
File "/home/e/miniconda3/envs/eshaan/lib/python3.6/argparse.py", line 1514, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --distributed-world-size: conflicting option string: --distributed-world-size
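For context, this error is plain argparse behavior rather than anything checkpoint-related: registering the same option string twice on one parser raises argparse.ArgumentError. A minimal sketch reproducing it outside fairseq:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--distributed-world-size', type=int, default=1)

try:
    # Registering the same option string a second time triggers the conflict.
    parser.add_argument('--distributed-world-size', type=int, default=1)
except argparse.ArgumentError as e:
    print(e)  # argument --distributed-world-size: conflicting option string: --distributed-world-size
```

So the traceback points at the second registration site, not at whichever call added the option first.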

What have you tried?

I have tried retraining my model in case the issue was with how my checkpoints were stored, even though the output always reported a distributed world size of 1. I have also looked at this similar error to make sure that no other Python processes are running.

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 0.9.0
  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Ubuntu 16.04.6 LTS (Xenial Xerus)
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install -e fairseq/
  • Python version: 3.6.10
  • CUDA/cuDNN version: CUDA release 10.1, V10.1.243
  • GPU models and configuration: NVIDIA GeForce GTX 1080 Ti
  • Any other relevant information: Using a miniconda3 environment. There are 8 GPUs on the server that I am SSH'd into, but I am only connected to 1.
Labels: bug


All 4 comments

I encountered this bug as well. Seems like commenting out line 251 (add_distributed_training_args(parser)) in fairseq_cli/eval_lm.py fixes it.
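Commenting out the duplicate call works because argparse only objects when the same option string is registered twice on one parser. As a general illustration of the pattern (stand-in names, not fairseq's actual code or its eventual fix), argparse's conflict_handler='resolve' lets a helper be called twice without crashing, with the later definition winning:

```python
import argparse

def add_distributed_training_args(parser):
    # Stand-in for fairseq's helper in fairseq/options.py.
    group = parser.add_argument_group('distributed_training')
    group.add_argument('--distributed-world-size', type=int, default=1)

# With the default conflict handler ('error'), calling the helper twice
# raises the ArgumentError from the original report; 'resolve' replaces
# the earlier registration instead.
parser = argparse.ArgumentParser(conflict_handler='resolve')
add_distributed_training_args(parser)
add_distributed_training_args(parser)  # no crash: later definition wins
args = parser.parse_args(['--distributed-world-size', '1'])
print(args.distributed_world_size)  # 1
```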

Fixed by b2ee110c853c5effdd8d21f50a8437485bafb285

Hi Myle!
I think there might still be an issue here. When I run eval_lm with the argument "--distributed-world-size 1" it fails:

File "eval_lm.py", line 11, in <module>
cli_main()
File "fairseq_cli/eval_lm.py", line 252, in cli_main
distributed_utils.call_main(args, main)
File "fairseq/distributed_utils.py", line 173, in call_main
main(args, kwargs)
TypeError: main() takes 1 positional argument but 2 were given
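That TypeError is the usual symptom of a kwargs dict being forwarded positionally instead of unpacked. A sketch with hypothetical stand-ins (not fairseq's actual call_main/main) showing the difference:

```python
def main(args):
    # Accepts a single positional argument, like eval_lm's main(args).
    return args

def call_main(args, main, **kwargs):
    try:
        main(args, kwargs)       # buggy: the dict becomes a second positional arg
    except TypeError as e:
        print(e)                 # "takes 1 positional argument but 2 were given"
    return main(args, **kwargs)  # correct: unpack as keyword arguments

call_main('dummy-args', main)
```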

This should actually be fixed now :)
