Tensor2tensor: Decoding speed per sentence

Created on 29 Jun 2017 · 17 Comments · Source: tensorflow/tensor2tensor

Hi,

I have trained a transformer_big model for the wmt_ende_tokens_32k problem.
After 37118 steps, I found that it gives a decent result:

INFO:tensorflow:Saving dict for global step 37118:
    global_step = 37118,
    loss = 0.980365,
    metrics-wmt_ende_tokens_32k/accuracy = 0.789868,
    metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0,
    metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.90224,
    metrics-wmt_ende_tokens_32k/bleu_score = 0.493593,
    metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.11336,
    metrics/accuracy = 0.789868,
    metrics/accuracy_per_sequence = 0.0,
    metrics/accuracy_top5 = 0.90224,
    metrics/bleu_score = 0.493593,
    metrics/neg_log_perplexity = -1.11336

I then tried to translate a newstest2014-deen-src.en file which consists of 10008 lines.
I followed the default HPARAMS setting for transformer_big, and set BEAM_SIZE=3 and ALPHA=0.6.

However, as the decoding process seemed to be taking forever, I retried the same process with a smaller file of just 10 lines. This time, decoding took about 30 seconds, not counting the time to load the learned model parameters.
At roughly three seconds per source sentence, this seems too slow: it suggests that translating the full newstest2014-deen-src.en file would take several hours.
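
As a rough back-of-the-envelope check, using the numbers above (10 lines in about 30 seconds, 10008 lines in the full file). This is only a sketch: it assumes the per-sentence cost stays constant and ignores model loading, and the function name is made up for illustration.

```python
def estimated_decode_hours(num_sentences, seconds_per_sentence):
    """Estimate wall-clock decoding time in hours, assuming a constant per-sentence cost."""
    return num_sentences * seconds_per_sentence / 3600.0


# Measured on the small file: 10 sentences in about 30 seconds.
seconds_per_sentence = 30.0 / 10

hours = estimated_decode_hours(10008, seconds_per_sentence)
print(f"{hours:.1f} hours")  # prints "8.3 hours"
```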

Am I missing some options here?

Hi @zaemyung
Yes, the decoding process for transformer_big is slow.

Besides, the decoder seems to reload the model for every batch. This is strange to me, because a decoder should only need to load the model once.

My workaround is to use a larger batch size, which reduces the number of model reloads:
--decode_batch_size=${bs}
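
To see why a larger batch helps, here is a small sketch of the reload overhead. It assumes one model reload per decode batch, as observed above; the 3-second reload time is a placeholder rather than a measurement, and the function names are hypothetical.

```python
import math


def num_decode_batches(num_sentences, batch_size):
    """Number of batches (and therefore model reloads) needed for the input."""
    return math.ceil(num_sentences / batch_size)


def reload_overhead_seconds(num_sentences, batch_size, reload_seconds):
    """Total time spent reloading, assuming one reload per batch."""
    return num_decode_batches(num_sentences, batch_size) * reload_seconds


ASSUMED_RELOAD_SECONDS = 3.0  # placeholder value, not measured

for batch_size in (32, 128, 512):
    batches = num_decode_batches(10008, batch_size)
    overhead = reload_overhead_seconds(10008, batch_size, ASSUMED_RELOAD_SECONDS)
    print(batch_size, batches, overhead)
```

Larger batches also need more memory, so the batch size cannot be raised indefinitely.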

Hi @cshanbo
Yes, I noticed the reloading of the model too.
While reloading the model does take a few extra seconds for every batch, the real bottleneck is the time taken to decode a single sentence, which is at least 2 seconds.

I am wondering whether the slow decoding time is inherent in the transformer architecture, or an engineering issue of this library? I noticed that the multi-gpu support for decoding process is not ready yet - would it make a big difference on the per-sentence-decoding time?

The training time of the transformer models seems to be a lot quicker than that of encoder-decoder models with attention mechanism. However, the decoding time seems to be the other way around.

Hi @zaemyung,

Yes, I noticed the bottleneck, too.

> I am wondering whether the slow decoding time is inherent in the transformer architecture, or an engineering issue of this library? I noticed that the multi-gpu support for decoding process is not ready yet - would it make a big difference on the per-sentence-decoding time?

I think there might be some tricks, like decoding different sentences on different GPUs simultaneously, that could improve speed. I agree with you that this may be an engineering issue for now. I believe later updates will provide faster decoding.

> The training time of the transformer models seems to be a lot quicker than that of encoder-decoder models with attention mechanism. However, the decoding time seems to be the other way around.

As I understand it, the transformer trains quickly because of its architecture. Without an RNN:

  1. Model parallelism is easier to incorporate.
  2. There is no BPTT, which makes training faster.

In decoding, however, neither the transformer nor seq2seq runs BPTT, so that advantage disappears. And the number of parameters is not much smaller, because of the deep architecture and the large hidden dimensions of transformer_big. (That's my understanding for now, you can judge me :-) ). But I do think decoding should be quicker than with the RNN seq2seq architecture. We can look into the code to see if there's anything to do.

The reload-model-per-batch is a problem with tf.learn or now tf.estimator. We're working with TF guys to improve this, but it'll probably only come in TF 1.3.

The slow decoding problem is due to the fact that we're re-running the whole computation in the decoder at every step. This is not needed; it'd be better to cache intermediate results instead, as shown in this repo/paper: https://github.com/PrajitR/fast-pixel-cnn. It just needs to be designed and implemented nicely in tensor2tensor so that you can change the model without rewriting the inference code every time (that's too slow for research work).

With proper caching, Transformer should run a bit faster than a comparable LSTM in inference, as it has fewer multiply/add instructions and a similar (lack of) parallelism.
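
To make the caching idea concrete, here is a toy sketch in plain Python. It is not the tensor2tensor implementation: it uses a single causal self-attention layer with identity projections (an assumption made to keep the example short), and all function names are made up. It compares recomputing every prefix position at each step against keeping a key/value cache and computing only the new position.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def decode_recompute(tokens):
    """At every step, re-run attention for all positions of the prefix."""
    outputs, work = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        # All t positions are recomputed although only the last one is new.
        layer_out = [attend(prefix[i], prefix[:i + 1], prefix[:i + 1]) for i in range(t)]
        work += t
        outputs.append(layer_out[-1])
    return outputs, work


def decode_cached(tokens):
    """Keep keys/values from earlier steps and compute only the new position."""
    outputs, work = [], 0
    key_cache, value_cache = [], []
    for tok in tokens:
        key_cache.append(tok)
        value_cache.append(tok)
        outputs.append(attend(tok, key_cache, value_cache))
        work += 1
    return outputs, work


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_work = decode_recompute(tokens)
fast_out, fast_work = decode_cached(tokens)
print(slow_work, fast_work)  # position evaluations: 10 vs 4
```

Both variants produce the same outputs; the number of position evaluations drops from n(n+1)/2 to n for a sequence of length n. In a multi-layer model the saving comes from not recomputing the hidden states of earlier positions in every layer.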

@lukaszkaiser I have also noticed that the decoding in the fairseq conv seq2seq Lua/Torch code https://github.com/facebookresearch/fairseq has a similar optimization.

With T2T 1.1, the decode speed with transformer_big_single_gpu was 2 sentences (40 words) per second.
With T2T 1.2, the decode speed (same type of model, same machine) dropped to 0.33 sentences (6 words) per second.
Also, with T2T 1.1 the default decode_batch_size=32 was OK; now it fails with OOM, so I had to decrease it to decode_batch_size=24.
I had to retrain the model and used different data, so the two runs are not exactly comparable, but I still didn't expect such a slowdown.
Any ideas? Can someone else compare the decoding speed of v1.2 vs. some older version?

> The reload-model-per-batch is a problem with tf.learn or now tf.estimator. We're working with TF guys to improve this, but it'll probably only come in TF 1.3.

Now, I have TF 1.3 and the model is still restoring parameters from the checkpoint for each decode batch.
(I guess some work is needed on the T2T side as well. I am not complaining; I just want to know whether I am the only one affected and should look for the bug elsewhere.)

We're seeing similar memory problems since we switched to using Dataset as our input pipeline. We're investigating and trying to fix that.

I guess proper speed will come only when 1.3 is released?

Hi,
I am also experiencing speed issues. I'm trying interactive translation with the transformer model, using both the transformer_big and transformer_base settings.

Although transformer_base is faster than transformer_big (as expected, given the smaller number of parameters), translation is extremely slow in both cases: ~3 secs for transformer_base and ~4.5 secs for transformer_big.

So, as far as I understand, this should be fixed by the improvement suggested by @lukaszkaiser: caching the decoder computation.

Do you have a release date forecast for this feature? I just want to get an idea, because right now this is a big issue for me and it's basically preventing me from using it in a real scenario.

Thanks for your help!

We have fast decoding, should be in a release soon :).

Fast decoding as well as avoiding model reloads have both been released.

How do I use fast decoding? Is it enabled by default, or do I need to set any options?
I am using transformer base setting and following steps in walkthrough for en-de translation.

For translation it is enabled by default.

Thanks!!

With the latest code (presumably, with fast decoding enabled), with model 'transformer' hparams 'transformer_base', on 2x GTX 1080 Ti (T2T only seems to use one of them), the highest translation performance I can get is around 25 sentences per second.

beam_size  batch_size  sentences/sec
1          80          20.2
1          100         22.3
1          200         25.0
1          500         25.5
2          100         12.7
2          150         13.8
2          200         OOM
4          80          8.1
4          100         OOM

It basically looks like it takes less time to run 50,000 steps of training on a 500k line dataset, than to translate that same dataset even once.

I don't understand how this is possible. Doesn't it need to run the entire inference process on the dataset many times in 50,000 steps? Can someone explain this to me?

Hi, this is because decoding takes a number of sequential steps that is linear in the target length, whereas training handles all target positions in a single parallel pass. In training, the whole target sequence is available, so all positions are trained at the same time. During inference, you predict autoregressively: if the target sentence has 5 tokens, you repeat the computation 5 times, whereas in training, whether the target has 10 tokens or 1, you need only a single pass. Google has recently introduced a flexible insertion model that decodes in logarithmic time; I wonder if it will be included in tensor2tensor.
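
A toy sketch of that difference, with a made-up stand-in for the model (`toy_next_token` and its rule are purely hypothetical). With teacher forcing, every prefix is known in advance, so no prediction depends on another; in autoregressive decoding, each step has to wait for the previous one.

```python
def toy_next_token(prefix):
    """Stand-in for a decoder: predicts the next token from the prefix (hypothetical rule)."""
    return (sum(prefix) + 1) % 10


def teacher_forced_predictions(target, start=0):
    """Training-style: inputs are the known target shifted right by one token."""
    inputs = [start] + target[:-1]
    prefixes = [inputs[:i + 1] for i in range(len(target))]
    # No prefix depends on a model output, so these evaluations are independent
    # and are counted here as one parallelizable pass.
    return [toy_next_token(p) for p in prefixes], 1


def autoregressive_decode(length, start=0):
    """Inference-style: step t needs the token produced at step t-1."""
    prefix, sequential_steps = [start], 0
    for _ in range(length):
        prefix.append(toy_next_token(prefix))
        sequential_steps += 1
    return prefix[1:], sequential_steps


decoded, steps = autoregressive_decode(5)
predictions, passes = teacher_forced_predictions(decoded)
print(decoded, steps, passes)
```

The two functions yield the same tokens here, but the decode loop needs five dependent steps where teacher forcing needs one pass.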

@lukaszkaiser I noticed there is a parallel_transformer in the research models. I wonder if it is from the latent variable paper, the blockwise parallel paper, or something new, since there is no pointer in that script.

P.S. I think t2t decoding uses only 1 GPU, as observed. You might want to parallelize it within or outside the library.

Makes sense.

I tried to parallelize it by running two instances of t2t-decoder in parallel tied to each GPU, and discovered that it won't load on one of my GPUs at all. This seems to have something to do with memory usage. One of the 1080s (#0) is hooked up to a monitor (and the X server uses a small amount of graphics memory), and the other (#1) is used purely for computation.
t2t-decoder would load fine on #1 (CUDA_VISIBLE_DEVICES=1), but would fail, throwing varying CUDA errors (I've seen CUDNN_STATUS_INTERNAL_ERROR, CUBLAS_STATUS_ALLOC_FAILED, and CUBLAS_STATUS_NOT_INITIALIZED), on #0.

I've managed to fix it by changing line 127 of utils/trainer_lib.py from

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_mem_fraction)
to
gpu_options = tf.GPUOptions(allow_growth=True)
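
For anyone who wants to try the same two-instance setup, here is a rough sketch of the sharding side in plain Python. The helper names are hypothetical, and the actual decoder command is deliberately left out because its flags depend on the installed T2T version.

```python
import os


def split_into_shards(lines, num_shards):
    """Split lines into num_shards contiguous, near-equal shards, preserving order."""
    base, extra = divmod(len(lines), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(lines[start:start + size])
        start += size
    return shards


def env_for_gpu(gpu_id):
    """Copy the current environment, exposing only one GPU to the child process."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


lines = [f"sentence {i}" for i in range(7)]
shards = split_into_shards(lines, 2)
for gpu_id, shard in enumerate(shards):
    env = env_for_gpu(gpu_id)
    # A real run would write `shard` to its own file and start one decoder
    # per GPU, e.g. with subprocess.Popen(decoder_command, env=env).
    print(gpu_id, len(shard), env["CUDA_VISIBLE_DEVICES"])
```

Because the shards are contiguous, concatenating the per-GPU outputs in shard order restores the original sentence order.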
