Speed up beam search ~2x by removing an unnecessary reorder and merging small ops
GPU utilization is only ~40% during BART model inference. Profiling shows two issues in incremental generation.
I created PR #1852 with the changes below.
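The first change, skipping the unnecessary reorder, can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `maybe_reorder_incremental_state` is hypothetical, and the idea is simply that when the beam order is the identity permutation, reordering every cached key/value tensor launches many small GPU kernels for a no-op.

```python
import torch

def maybe_reorder_incremental_state(model, incremental_state, new_order):
    """Reorder cached decoder state only when the beam order actually changed.

    If new_order is the identity permutation (every beam kept its own
    hypothesis), the reorder moves nothing but still pays kernel-launch
    overhead for each cached tensor, so we skip it entirely.
    (Hypothetical helper for illustration; not the PR's exact code.)
    """
    identity = torch.arange(new_order.numel(), device=new_order.device)
    if torch.equal(new_order, identity):
        return incremental_state  # nothing moved; avoid the reorder
    model.reorder_incremental_state(incremental_state, new_order)
    return incremental_state
```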
Inference speed (samples/s) on the CNN-DM dataset using a V100:
| | Before change | After change | Speed up |
|------------------------|-----------------------------------|----------------------------------|----------|
| no_repeat_ngram_size=3 | 3.6 | 6.8 | 1.9X |
| no_repeat_ngram_size=0 | 5.3 | 8.3 | 1.6X |
(beam=4, lenpen=2.0, max_len_b=140, min_len=55)
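The second change, merging small ops, can be illustrated with a toy example (assumed function names; the PR applies the same idea inside the sequence generator). Replacing a per-beam Python loop of tiny GPU writes with one vectorized write reduces kernel-launch overhead without changing the result.

```python
import torch

def apply_min_len_penalty_looped(lprobs, step, min_len, eos_idx):
    # Naive version: one small GPU write per beam (one kernel launch each).
    if step < min_len:
        for i in range(lprobs.size(0)):
            lprobs[i, eos_idx] = -float("inf")
    return lprobs

def apply_min_len_penalty_merged(lprobs, step, min_len, eos_idx):
    # Merged version: a single vectorized write across all beams.
    if step < min_len:
        lprobs[:, eos_idx] = -float("inf")
    return lprobs
```

Both produce identical results; the merged form simply does the work in one launch.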
Profile data comparing before and after the change.

To benchmark the speed, run `CUDA_VISIBLE_DEVICES=0 python generation_speed_test.py`. The benchmark code is modified from here.
cnndm_128.txt
generation_speed_test.py.txt
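The attached script's samples/s metric can be reproduced with a generic timing harness like the sketch below. This is an assumption about the measurement, not the attached code: `generate_fn` and `samples` are placeholders for the real BART generate call and CNN-DM inputs.

```python
import time

def measure_throughput(generate_fn, samples, warmup=2):
    """Rough samples/s measurement (placeholder harness, not the attached script).

    A few warmup iterations are excluded so one-time costs (CUDA context,
    cudnn autotuning) do not skew the rate.
    """
    for s in samples[:warmup]:
        generate_fn(s)
    start = time.perf_counter()
    for s in samples[warmup:]:
        generate_fn(s)
    elapsed = time.perf_counter() - start
    n = len(samples) - warmup
    return n / elapsed if elapsed > 0 else float("inf")
```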
Most helpful comment
Very nice!