Fairseq: Process 0 terminated with exit code 17

Created on 25 Aug 2020 · 15 comments · Source: pytorch/fairseq

@myleott Hi, I am training my model on a TPU device but am encountering this error:

2020-08-25 13:00:20 | WARNING | root | TPU has started up successfully with version pytorch-1.6

WARNING:root:TPU has started up successfully with version pytorch-1.6
Exception in device=TPU:0:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 150, in distributed_main
    args.distributed_rank = distributed_init(args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 112, in distributed_init
    assert xm.xrt_world_size() == args.distributed_world_size
AssertionError
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/fairseq/fairseq_cli/train.py", line 333, in cli_main
    distributed_utils.call_main(args, main)
  File "/content/fairseq/fairseq/distributed_utils.py", line 185, in call_main
    nprocs=8,  # use all 8 TPU cores
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17

I am trying to solve a seq2seq problem using data preprocessed with the commands provided in the fairseq docs. The data was preprocessed successfully, I am running my code on what I believe is a TPU v2, and I have PyTorch along with XLA installed. In both tracebacks, the error does not clearly specify what exactly went wrong. This is the training code:

%%bash
fairseq-train /content/drive/'My Drive'/HashPro/preprocessed/ \
--lr 0.02 --clip-norm 0.1 --optimizer sgd --bpe characters --dropout 0.2 --tpu --bf16 \
--arch bart_large --save-dir /content/drive/'My Drive'/HashPro/Checkpoints/

Any ideas on what the problem might be?

question

All 15 comments

@myleott Any help would be highly appreciated!

@myleott Apparently, #2503 has the same error, which also hasn't been resolved. The problem may be in XLA, preventing it from working on TPUs (since it gives an AssertionError for me), but please understand that these large models need a lot of processing power, requiring multiple GPUs or a few TPUs. GPU costs beyond four become prohibitive, so TPU support is critical for users to be able to run their code.

I am attaching my traceback in case it helps:

2020-09-03 14:49:24 | WARNING | root | Waiting for TPU to be start up with version pytorch-1.6...
2020-09-03 14:49:34 | WARNING | root | Waiting for TPU to be start up with version pytorch-1.6...
2020-09-03 14:49:44 | WARNING | root | Waiting for TPU to be start up with version pytorch-1.6...
2020-09-03 14:49:54 | WARNING | root | TPU has started up successfully with version pytorch-1.6

WARNING:root:TPU has started up successfully with version pytorch-1.6
Exception in device=TPU:0: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 148, in distributed_main
    args.distributed_rank = distributed_init(args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 112, in distributed_init
    assert xm.xrt_world_size() == args.distributed_world_size
AssertionError
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/fairseq/fairseq_cli/train.py", line 343, in cli_main
    distributed_utils.call_main(args, main)
  File "/content/fairseq/fairseq/distributed_utils.py", line 183, in call_main
    nprocs=8,  # use all 8 TPU cores
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 296, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17

EDIT: I tried the same code after removing the --tpu and --bf16 flags. I don't understand the intricacies of fairseq, so I am not sure whether this is helpful, but here is the subsequent error:

2020-09-03 14:58:20 | INFO | fairseq_cli.train | Namespace(activation_fn='gelu', adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='bart_large', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.1, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='/content/drive/My Drive/HashPro/New/', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layerdrop=0, decoder_layers=12, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.2, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=12, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=True, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format=None, log_interval=100, lr=[0.02], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=0, max_sentences=8, max_sentences_valid=8, max_source_positions=1024, max_target_positions=1024, max_tokens=None, max_tokens_valid=None, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, momentum=0.0, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=True, no_seed_provided=True, no_token_positional_embeddings=False, nprocs_per_node=1, num_batch_buckets=0, num_workers=1, optimizer='sgd', optimizer_overrides='{}', patience=-1, pooler_activation_fn='tanh', pooler_dropout=0.0, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='/content/drive/My Drive/HashPro/Checkpoints/', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], 
upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_updates=0, weight_decay=0.0, zero_sharding='none')
2020-09-03 14:58:20 | INFO | fairseq.tasks.translation | [input] dictionary: 21936 types
2020-09-03 14:58:20 | INFO | fairseq.tasks.translation | [output] dictionary: 9216 types
2020-09-03 14:58:20 | INFO | fairseq.data.data_utils | loaded 1 examples from: /content/drive/My Drive/HashPro/New/valid.input-output.input
2020-09-03 14:58:20 | INFO | fairseq.data.data_utils | loaded 1 examples from: /content/drive/My Drive/HashPro/New/valid.input-output.output
2020-09-03 14:58:20 | INFO | fairseq.tasks.translation | /content/drive/My Drive/HashPro/New/ valid input-output 1 examples

Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/fairseq/fairseq_cli/train.py", line 343, in cli_main
    distributed_utils.call_main(args, main)
  File "/content/fairseq/fairseq/distributed_utils.py", line 187, in call_main
    main(args, **kwargs)
  File "/content/fairseq/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/content/fairseq/fairseq/tasks/translation.py", line 279, in build_model
    model = super().build_model(args)
  File "/content/fairseq/fairseq/tasks/fairseq_task.py", line 248, in build_model
    model = models.build_model(args, self)
  File "/content/fairseq/fairseq/models/__init__.py", line 48, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
  File "/content/fairseq/fairseq/models/transformer.py", line 198, in build_model
    raise ValueError("--share-all-embeddings requires a joined dictionary")
ValueError: --share-all-embeddings requires a joined dictionary
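Looking at the message, this second error seems unrelated to the TPU problem: bart_large sets --share-all-embeddings, which needs a single joint source/target dictionary. If I understand the docs right, re-running fairseq-preprocess with --joined-dictionary should build one. A sketch only, with hypothetical train/valid file prefixes for the input-output language pair shown in the logs:

fairseq-preprocess --source-lang input --target-lang output \
    --trainpref train --validpref valid \
    --joined-dictionary \
    --destdir /content/drive/'My Drive'/HashPro/New/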

@alexeib @myleott @lematt1991 Any update on TPU support? I still have the same error!

You need to explicitly pass --distributed-world-size and it should work (this should equal the number of TPU devices).
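For example, with the original command (a sketch only; the value should match the number of TPU cores, 8 on a v2-8 or v3-8):

fairseq-train /content/drive/'My Drive'/HashPro/preprocessed/ \
    --lr 0.02 --clip-norm 0.1 --optimizer sgd --bpe characters --dropout 0.2 --tpu --bf16 \
    --arch bart_large --save-dir /content/drive/'My Drive'/HashPro/Checkpoints/ \
    --distributed-world-size 8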

@myleott No, same error as before:

2020-09-14 16:00:27 | WARNING | root | TPU has started up successfully with version pytorch-1.6
WARNING:root:TPU has started up successfully with version pytorch-1.6
Exception in device=TPU:0: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 235, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 229, in _start_fn
    fn(gindex, *args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 218, in distributed_main
    args.distributed_rank = distributed_init(args)
  File "/content/fairseq/fairseq/distributed_utils.py", line 182, in distributed_init
    assert xm.xrt_world_size() == args.distributed_world_size
AssertionError
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/fairseq/fairseq_cli/train.py", line 350, in cli_main
    distributed_utils.call_main(args, main)
  File "/content/fairseq/fairseq/distributed_utils.py", line 250, in call_main
    nprocs=8,  # use all 8 TPU cores
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 300, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17

@myleott I had the exact same problem running the BART model on a TPU v3-8, and resolved it by setting --distributed-world-size=8 for the v3-8. However, I have another question about dynamic shapes in a new issue.

@JunjieHu Can you tell me the number of workers on your TPU? I am using a Colab TPU, which I think has 8 workers, but it doesn't work there at all.

@myleott I set --distributed-world-size=1 to get the code to work and the model to train, but training is extremely slow. I have 2.5 million iterations in one epoch, and each iteration takes about half a minute. Clearly, my TPU is not being used fully. Can you suggest a fix?

@neel04, I use a TPU v3-8, which I think has 8 workers. I ended up modifying a lot of fairseq code, referring to the code under /usr/share/torch-xla-1.6/tpu-examples/deps/fairseq/ on a GCP instance, following GCP's tutorial.

@myleott I fixed some code that wasn't installing NVIDIA's Apex properly, and now it all works: I am able to pass --distributed-world-size=8, but the speed doesn't improve at all (the epoch is estimated to take about 70 hours). Do you have any idea why? I am using a TPU v2-8.

@neel04 Did installing NVIDIA Apex properly solve this issue for you? I seem to have the same issue: it works with --distributed-world-size=1 but not with =8.

@dhruvrnaik I don't think the TPU code is really ready or working; better to ping the maintainers about that. It was working for me before, but I updated my repo and now I can only use --distributed-world-size=1, not =8. Also note that the TPU flags like --bf16 no longer seem to be available, so I think they may be dropping the TPU feature entirely...

Yeah, my code seems to work without the --bf16 flag

Of course, since it won't use the TPU to its full capacity then, it would likely be slower than your CPU. Better to use a GPU.

I am getting the same error for this

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(
        f"Rank {rank + 1}/{world_size} process initialized.\n"
    )
    # rest of the training script goes here!

WORLD_SIZE = torch.cuda.device_count()
NUM_EPOCHS = 8

mp.spawn(
        train, args=(NUM_EPOCHS, WORLD_SIZE),
        nprocs=1, 
        join=True)

This is the error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 114, in _main
    prepare(preparation_data)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/CS/pvp0001/sample.py", line 24, in <module>
    join=True)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "sample.py", line 24, in <module>
    join=True)
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/CS/pvp0001/.conda/envs/praveen_tf/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 1
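The hint is in the traceback itself: with the 'spawn' start method, the mp.spawn call has to sit under an if __name__ == '__main__': guard, otherwise each child process re-imports the script and hits this bootstrapping check. A minimal sketch of the guarded version of the snippet above (same functions; as an assumption on my part, nprocs is set to the world size so ranks and world_size stay consistent, instead of the original nprocs=1):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(f"Rank {rank + 1}/{world_size} process initialized.\n")
    # rest of the training script goes here!

if __name__ == '__main__':
    # The guard keeps spawned child processes from re-executing the
    # spawn call when they import this module.
    WORLD_SIZE = torch.cuda.device_count()
    NUM_EPOCHS = 8
    mp.spawn(
        train, args=(NUM_EPOCHS, WORLD_SIZE),
        nprocs=WORLD_SIZE,  # one process per GPU
        join=True)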