Fairseq: speed up beam search by ~2x

Created on 17 Mar 2020 · 1 comment · Source: pytorch/fairseq

🚀 Feature Request


Speed up beam search by ~2x by removing unnecessary state reorders and merging small ops.

Motivation


GPU utilization is only ~40% during BART model inference. Profiling shows two issues in incremental generation.

  1. Half of the time is spent transferring small tensors between GPU and CPU when no_repeat_ngram_size > 0. This pattern likely affects other seq2seq models as well, because the code causing the small transfers is in the beam search, not in the model code (see the sketch after this list).
  2. State reordering takes as much time as the model forward computation, and many of these reorders are unnecessary.
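
To make the first issue concrete, here is a hypothetical illustration (not the actual fairseq code) of the slow pattern: every hypothesis calls `.tolist()` on a small slice of the token tensor inside a Python loop, so each decoding step issues bsz * beam_size tiny, synchronizing GPU-to-CPU copies.

```python
import torch

# Hypothetical sketch of the slow pattern: each hypothesis triggers its own
# tiny, synchronizing GPU->CPU copy inside a Python loop.
def collect_generated_ngrams_slow(tokens, step, no_repeat_ngram_size):
    # tokens: (bsz * beam_size, max_len) LongTensor on GPU
    gen_ngrams = []
    for hypo_idx in range(tokens.size(0)):
        # .tolist() on a small slice forces a device-to-host sync per hypothesis
        hypo = tokens[hypo_idx, : step + 1].tolist()
        ngrams = {}
        for i in range(len(hypo) - no_repeat_ngram_size + 1):
            prefix = tuple(hypo[i : i + no_repeat_ngram_size - 1])
            ngrams.setdefault(prefix, []).append(hypo[i + no_repeat_ngram_size - 1])
        gen_ngrams.append(ngrams)
    return gen_ngrams
```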

Pitch


I created PR #1852 with the changes below.

  1. Copy the whole token tensor from GPU to CPU once per step, instead of doing it inside a for loop (a rough sketch of all three changes follows this list).
  2. Ban ngram tokens in one kernel call, instead of in a for loop.
  3. Remove unnecessary reorders.
    In encoder_decoder_attention, a reorder is only needed when the batch size changes, because the encoder state is shared across the beam dimension.
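
Below is a minimal sketch of the three changes, with assumed helper names and tensor shapes; the actual code in PR #1852 may differ.

```python
import torch

# Change 1 (sketch): one bulk GPU->CPU copy per decoding step, then do the
# per-hypothesis ngram bookkeeping on plain Python lists.
def gather_token_history(tokens, step):
    # tokens: (bsz * beam_size, max_len) on GPU
    cpu_tokens = tokens[:, : step + 1].cpu()            # single transfer
    return [cpu_tokens[i].tolist() for i in range(cpu_tokens.size(0))]

# Change 2 (sketch): mask all banned tokens with one indexing call instead of
# writing -inf hypothesis by hypothesis.
def ban_tokens(lprobs, banned_tokens):
    # banned_tokens: list (length bsz * beam_size) of lists of token ids to ban
    rows = [i for i, toks in enumerate(banned_tokens) for _ in toks]
    cols = [t for toks in banned_tokens for t in toks]
    if rows:
        rows = torch.tensor(rows, device=lprobs.device)
        cols = torch.tensor(cols, device=lprobs.device)
        lprobs[rows, cols] = -float("inf")               # one kernel call
    return lprobs

# Change 3 (sketch): in encoder_decoder_attention the cached encoder keys and
# values are identical across beams, so the incremental-state reorder is only
# needed when finished sentences are dropped and the batch shrinks.
def maybe_reorder_encoder_state(saved_state, new_order, prev_bsz, beam_size):
    if new_order.size(0) // beam_size == prev_bsz:
        return saved_state                               # skip the no-op reorder
    for key, value in saved_state.items():
        if value is not None:
            saved_state[key] = value.index_select(0, new_order)
    return saved_state
```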

Additional context


Inference speed (samples/s) on the CNN-DM dataset using a V100:

| | Before change | After change | Speedup |
|------------------------|---------------|--------------|---------|
| no_repeat_ngram_size=3 | 3.6 | 6.8 | 1.9x |
| no_repeat_ngram_size=0 | 5.3 | 8.3 | 1.6x |

(beam=4, lenpen=2.0, max_len_b=140, min_len=55)

Profiling data comparing before and after the change:
(profiling screenshot attached to the issue)

To benchmark the speed, run `CUDA_VISIBLE_DEVICES=0 python generation_speed_test.py`.
The benchmark code is modified from here:
cnndm_128.txt
generation_speed_test.py.txt

Labels: enhancement, help wanted

Most helpful comment

Very nice!

