Tensor2tensor: Is it possible to make Encoder or Decoder models variable length?

Created on 25 Feb 2018 · 7 Comments · Source: tensorflow/tensor2tensor

The current model requires padding the sentences, which costs a lot of memory. Is it possible to use a dynamic graph to make either the encoder or the decoder variable length?

question

All 7 comments

It does require padding, but we bucket sequences in the input pipeline so that padding is minimized per batch. We have no immediate plans for the Transformer to support fully dynamic variable length sequence handling.

@rsepassi: Could you elaborate on "we bucket sequences in the input pipeline so that padding is minimized per batch"?
Also, a quick question: I thought stochastic gradient descent requires the dataset to be shuffled properly. With this bucketing, I believe the order is no longer random. Can you explain why it still works?
Thx!

Within a window of several batches, sentences are sorted by length so that each batch contains sentences of similar length, which minimizes padding. (The actual implementation is quite complex... with bits of black magic :-).)
The training data is still shuffled at the global level (the TFRecord files should be fully shuffled at the sentence level; the sorting/bucketing described above is done on the fly during training).
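
To make the idea concrete, here is a minimal pure-Python sketch of window-based length sorting. It is not the T2T input pipeline (which is more involved); the function names `windowed_length_batches` and `padding_tokens`, the fixed batch size, and the window size are all assumptions made for illustration.

```python
import random

def windowed_length_batches(sentences, batch_size, window_batches, seed=0):
    """Shuffle globally, then sort by length only inside each window of batches.

    Hypothetical helper for illustration; not part of tensor2tensor.
    """
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)  # global shuffle, standing in for shuffled record files
    window = batch_size * window_batches
    batches = []
    for start in range(0, len(data), window):
        # Sort only within the window, so the global order stays random.
        chunk = sorted(data[start:start + window], key=len)
        chunk_batches = [chunk[i:i + batch_size]
                         for i in range(0, len(chunk), batch_size)]
        rng.shuffle(chunk_batches)  # avoid always emitting short batches first
        batches.extend(chunk_batches)
    return batches

def padding_tokens(batches):
    """Count pad positions when each batch is padded to its longest sentence."""
    return sum(max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
               for b in batches)

if __name__ == "__main__":
    rng = random.Random(1)
    sents = [[0] * rng.randint(1, 50) for _ in range(256)]
    print("no bucketing:", padding_tokens(windowed_length_batches(sents, 8, 1)))
    print("window of 8 batches:", padding_tokens(windowed_length_batches(sents, 8, 8)))
```

With `window_batches=1` the sort has no effect on padding, so that call serves as the unbucketed baseline. Larger windows trade some randomness within the window for less padding.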

BTW: there are many cases when fully shuffling the training data for SGD/Adam/... is not the best option for convergence speed and final results (one of the better-known examples is curriculum learning).

@martinpopel: Thx for your comments. Yes, that makes sense, and meanwhile I should read more about curriculum learning.
I have another quick question based on your answer. You mentioned: "so that each batch contains sentences of similar length, so that padding is minimized". This confuses me. I thought the Transformer was implemented with a maximum number of tokens (e.g. 512 as in BERT), and that any input shorter than that is padded to that length. So there should be no such thing as "minimizing padding". Did I miss anything here?
Thx.

Yes, T2T has a max_length hparam to limit the maximum size of training sequences (which are usually sentences). It depends on the task: masked LM (e.g. BERT) or seq2seq/NMT. For NMT, you want to set max_length high enough because the Transformer does not generalize well to sequences longer than those seen in training (cf. Training Tips for the Transformer Model, Section 4.4). However, the lengths of natural sentences follow a Poisson-like distribution, so most sentences will be much shorter than the max_length limit. So minimizing padding still makes sense for NMT (or any other seq2seq / encoder-decoder task) with variable sentence lengths. T2T also supports "packed" problems, where multiple sentences can be packed together into a single sequence.
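
As a rough illustration of packing, the sketch below greedily concatenates tokenized sentences into sequences of at most `max_length` tokens and records a segment id per token. This is an assumption-laden toy, not T2T's packing code: the function name `pack_sentences`, the greedy first-fit order, and the choice to drop over-long sentences are all hypothetical.

```python
def pack_sentences(sentences, max_length):
    """Greedily pack token lists into sequences of at most max_length tokens.

    Returns a list of (tokens, segment_ids) pairs. Segment ids start at 1
    within each packed sequence and mark which original sentence a token
    came from, so a model could mask attention across sentence boundaries.
    Hypothetical helper for illustration; not part of tensor2tensor.
    """
    packed = []
    tokens, segments = [], []
    for sent in sentences:
        if len(sent) > max_length:
            continue  # cannot fit in any sequence; dropped in this sketch
        if len(tokens) + len(sent) > max_length:
            # Current sequence is full: emit it and start a new one.
            packed.append((tokens, segments))
            tokens, segments = [], []
        seg_id = segments[-1] + 1 if segments else 1
        tokens = tokens + list(sent)
        segments = segments + [seg_id] * len(sent)
    if tokens:
        packed.append((tokens, segments))
    return packed

if __name__ == "__main__":
    for toks, segs in pack_sentences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], 5):
        print(toks, segs)
```

Packing fills each sequence close to `max_length`, so little space is wasted on padding, at the cost of needing segment information to keep sentences separate.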

@martinpopel: Thx. Just to be clear: by minimizing padding, do you mean setting the max_length hparam so that we do the minimum amount of padding? In that case, I guess max_length should be somewhere near the mean sentence length in the training data?
Thank you again. I did not know this.

No, I meant minimizing the padding by sorting the sentences by length. Of course, setting max_length very low also helps minimize the padding (there are many shorter sentences in a given window of training data, so there is a higher chance of similar-length sentences in a batch), but at the cost of building a useless model (not capable of translating longer sentences).
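
The trade-off can be seen with a little arithmetic. The sketch below assumes that sentences longer than `max_length` are simply dropped and that the remaining ones form a single batch padded to its longest member; the function name `max_length_tradeoff` and the example lengths are made up for illustration.

```python
def max_length_tradeoff(lengths, max_length):
    """Return (fraction of sentences kept, fraction of positions that are padding).

    Assumes one batch padded to the longest kept sentence. Hypothetical
    helper for illustration; not part of tensor2tensor.
    """
    kept = [n for n in lengths if n <= max_length]
    if not kept:
        return 0.0, 0.0
    padded_positions = max(kept) * len(kept)
    padding = padded_positions - sum(kept)
    return len(kept) / len(lengths), padding / padded_positions

if __name__ == "__main__":
    lengths = [5, 8, 10, 12, 40]  # made-up sentence lengths
    for limit in (40, 12):
        kept, pad = max_length_tradeoff(lengths, limit)
        print(f"max_length={limit}: kept={kept:.2f}, padding={pad:.2f}")
```

Lowering the limit from 40 to 12 cuts the padding share, but only because the one long sentence is no longer trained on, which is exactly the cost described above.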
