fairseq stuck during training

Created on 6 May 2019 · 10 comments · Source: pytorch/fairseq

Since the latest fairseq versions, during training of a transformer_vaswani_wmt_en_de_big the process gets stuck, usually after an OOM batch but not necessarily.

It is reproducible with pytorch 1.0.1, 1.1.0 and nightly as of today, all with either CUDA 9 or CUDA 10, and the latest master of fairseq (39cd4ce2723d550dd54f6b14b0ed2878e10427f8).

This is the command-line invocation I'm using:

fairseq-train $DATA_DIR \
  --tensorboard-logdir $CHECKPOINTS_DIR/tb \
  -s en -t de \
  --arch transformer_vaswani_wmt_en_de_big \
  --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --clip-norm 0.0 \
  --update-freq 8 \
  --dropout 0.3 --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 3000 \
  --save-dir $CHECKPOINTS_DIR

The problem happens with multiple GPUs (I reproduced it with 4 GPUs and with 2 GPUs).

Python version is 3.6. OS is Ubuntu 16.04.2 on one machine and 18.04 on the other. GPUs are 1080 Ti's.

After it gets stuck for a while with no new log lines, I Ctrl+C it, getting this stack trace:

WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 354.00 MiB (GPU 0; 10.91 GiB total capacity; 9.27 GiB already allocated; 207.38 MiB free; 913.54 MiB cached);
 Skipping batch

^CTraceback (most recent call last):                                                                                                     
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>                                           
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()                                                                    
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 419, in cli_main                             
    nprocs=args.distributed_world_size,                                                                                                  
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn    
    while not spawn_context.join():                                                                                                      
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 73, in join      
    timeout=timeout,                                                                                                                     
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/connection.py", line 911, in wait                    
    ready = selector.select(timeout)                                                          
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/selectors.py", line 376, in select                                   
    fd_event_list = self._poll.poll(timeout)                                                                                             
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

After Ctrl+C, I systematically need to kill the child processes manually, as they are still occupying GPU memory.

When I run with --ddp-backend no_c10d, the process does not get stuck but crashes with the following stack trace:

WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 332.00 MiB (GPU 0; 10.91 GiB total capacity; 9.33 GiB already allocated; 299.38 MiB free; 756.70 MiB cached);
 Skipping batch                           
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))                             
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))                             
/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))                             
Traceback (most recent call last):        
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/bin/fairseq-train", line 11, in <module>                                                                             
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()                
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 439, in cli_main                                                               
    nprocs=args.distributed_world_size,   
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn                                      
    while not spawn_context.join():       
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join                                       
    raise Exception(msg)                  
Exception:                                

-- Process 0 terminated with the following error:                                    
Traceback (most recent call last):        
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 169, in all_gather_list                                                
    result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))            
_pickle.UnpicklingError: pickle data was truncated                                   

During handling of the above exception, another exception occurred:                  

Traceback (most recent call last):        
  File "/mnt/md0/home/noe/miniconda3/envs/nlp_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap                                       
    fn(i, *args)                          
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 406, in distributed_main                                                       
    main(args, init_distributed=True)     
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 100, in main                                                                   
    train(args, trainer, task, epoch_itr) 
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq_cli/train.py", line 159, in train                                                                  
    log_output = trainer.train_step(samples)                                         
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/trainer.py", line 253, in train_step                                                               
    [logging_outputs, sample_sizes, ooms, self._prev_grad_norm],                     
  File "/mnt/md0/home/noe/devel/nmt-word-subword/3party/fairseq/fairseq/distributed_utils.py", line 173, in all_gather_list                                                
    'Unable to unpickle data from other workers. all_gather_list requires all '      
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.

The last message is clear:

    Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.

So, if a batch causes an OOM, then distributed training is doomed? This wasn't happening a few weeks ago.

Most helpful comment

We try to catch OOMs by skipping the batch, but sometimes it doesn't work (often in the multi-GPU case). Usually this causes training to get stuck when the workers are not in sync.

All 10 comments

I have a similar problem to yours; however, when I Ctrl+C I get a different error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 219, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 215, in main
    process.wait()
  File "/usr/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)

@noe I have also encountered the problems you described above. I think they were caused by the out-of-memory errors, so I had to reduce the batch size so that the program could work properly.
How can such problems be avoided?

I also reduce the batch size until I get absolutely no OOM errors, so that I can keep training from hanging/crashing. Nevertheless, not all OOMs seem to be _fatal_.

We try to catch OOMs by skipping the batch, but sometimes it doesn't work (often in the multi-GPU case). Usually this causes training to get stuck when the workers are not in sync.
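The pattern described here, catching the OOM and skipping the batch, can be sketched in plain Python. This is an illustrative sketch only (the function and variable names are hypothetical, not fairseq's API; the real logic lives in fairseq/trainer.py), and the OOM is simulated so the sketch runs without a GPU:

```python
# Hypothetical sketch of the "catch OOM and skip the batch" pattern.
# fairseq's actual implementation is in fairseq/trainer.py; the names
# train_step_with_oom_skip / step_fn are illustrative.

def train_step_with_oom_skip(step_fn, batch):
    """Run one training step; return None (batch skipped) on CUDA OOM."""
    try:
        return step_fn(batch)
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("WARNING: ran out of memory, skipping batch")
            # Real code would also call torch.cuda.empty_cache() and
            # zero any partially accumulated gradients here.
            return None
        raise  # unrelated RuntimeErrors still propagate

# Simulated step that "OOMs" on batches longer than 2 elements
def fake_step(batch):
    if len(batch) > 2:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")
    return sum(batch)

print(train_step_with_oom_skip(fake_step, [1, 2, 3]))  # batch skipped -> None
print(train_step_with_oom_skip(fake_step, [1, 2]))     # runs normally -> 3
```

The catch itself is easy; the hard part, as the `all_gather_list` error above shows, is that in multi-GPU training every worker has to take the same skip decision on the same step, otherwise the collective communication falls out of sync.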

Yes @huihuifan, in trainer.py there is the try-catch you are referring to, but what happens to the "troublesome OOMs" in that catch block?

what happens to the "troublesome OOMs" in that catch block?

If you're using --ddp-backend=c10d then troublesome OOMs can cause hangs. This is because the c10d DistributedDataParallel module communicates gradients during the backward pass, so we can't really recover from an OOM during the backward pass. The no_c10d backend is more robust since it only communicates at the end of the backward pass, but there are still limits to this kind of recovery.

The solution is usually to reduce batch size (and possibly compensate for this with --update-freq).
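The two flags trade off against each other: the effective number of tokens per optimizer step is roughly max_tokens × world_size × update_freq, so lowering --max-tokens while raising --update-freq keeps the optimization behaviour roughly the same while needing less GPU memory per forward/backward pass. A quick sanity check (the function name is illustrative), using the numbers from the original report:

```python
def effective_tokens_per_update(max_tokens, world_size, update_freq):
    # Approximate tokens consumed per optimizer step: each of the
    # world_size workers accumulates update_freq micro-batches of
    # up to max_tokens tokens before the gradients are applied.
    return max_tokens * world_size * update_freq

# Original report: --max-tokens 3000 --update-freq 8 on 4 GPUs
print(effective_tokens_per_update(3000, 4, 8))   # 96000
# Halving --max-tokens while doubling --update-freq keeps the same
# effective batch size with less memory per pass.
print(effective_tokens_per_update(1500, 4, 16))  # 96000
```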

If you're using --ddp-backend=c10d then troublesome OOMs can cause hangs... The no_c10d backend is more robust since it only communicates at the end of the backward pass, but there are still limits to this kind of recovery.

I'm experiencing a similar issue to this bug. If I change to --ddp-backend=no_c10d, should I expect the same results? (AKA, are models trained with and without c10d equivalent?)

Yes, no_c10d is equivalent, just a slightly more robust DDP backend (and a small amount slower).

Ok - do you also recommend no_c10d on a single GPU? I'm going to run one GPU with --update-freq 4 -- am trying to avoid the frequent freezes I saw on 2 GPUs.

It's just for distributed training, so it's irrelevant on a single GPU :)
