Tensor2tensor: Out Of Memory (OOM) when having long lines of text to translate.

Created on 28 Jan 2018 · 14 Comments · Source: tensorflow/tensor2tensor

Hi, I get an OOM error when running training (see below). Some of the lines in my source and target translation data are long. When I crop or limit the line length, training works, but I want to know the maximum line length I can permit.

My question is how the calculation works. I want to know (approximately) the longest line I can have with 12GB of GPU memory, a vocabulary of 32k tokens, and a batch size of 4096 (single-GPU machine). In other words, how much memory do I need to fetch batch size of B, with number of vocabulary tokens V, and a hidden state size of H (in my case 512) in a single machine, in the transformer_base model?

In addition, I would like to propose that the trainer first validate the training files to make sure they fit in memory. Right now the training stops somewhere in the middle, which is very problematic and does not allow me to crop the sentences to the maximum possible length.

=====

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[336840,2807]
[[Node: transformer/parallel_0_5/transformer/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/Softmax = Softmax[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/parallel_0_5/transformer/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/Reshape)]]
[[Node: transformer/parallel_0_5/transformer/body/encoder/layer_3/self_attention/multihead_attention/output_transform/Tensordot/Shape/_1503 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1862_transformer/parallel_0_5/transformer/body/encoder/layer_3/self_attention/multihead_attention/output_transform/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
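
As a rough sanity check on the trace above, the size of the single tensor that failed to allocate can be worked out from its shape. This is a minimal sketch, assuming 4 bytes per element (the node reports T=DT_FLOAT, i.e. 32-bit floats); the helper name tensor_bytes is made up for illustration and is not part of tensor2tensor.

```python
from math import prod

FLOAT32_BYTES = 4  # assumption: DT_FLOAT in the trace means 32-bit floats


def tensor_bytes(shape, bytes_per_element=FLOAT32_BYTES):
    """Return the memory needed for one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


# Shape copied from the ResourceExhaustedError message.
size = tensor_bytes((336840, 2807))
print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

That single Softmax input is roughly 3.5 GiB, before counting weights, optimizer state and all other activations on a 12GB card. The shape also factors as (120 x 2807) x 2807, which would be consistent with attention logits over sequences 2807 subwords long, although that reading is an inference from the shape and not something the trace states.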

ndvbd

All 14 comments

I did experiments with transformer_big_single_gpu, 32k vocabulary and 11GB GPU memory. In the default setting (without any sentence length restriction, which I think means that only sentences longer than the batch size are dropped), I was able to train with batch_size=2040 and got OOM with batch_size=2050. With --hparams="max_length=150", I was able to use batch_size=2200 and got OOM with batch_size=2250. With max_length=100, I was able to use batch_size=2350 and got OOM with batch_size=2400 (after 2401 training steps). Finding the exact maximum batch_size is problematic because the training may fail after several hours when the batch size is a bit above the maximum.

"how much memory do I need to fetch batch size of B, with ..."

I am not sure if it is possible to provide such a formula. If yes, it depends also on other parameters, e.g. with optimizer=Adafactor instead of the default Adam, I was able to use batch_size=2800.
I think the easiest way is just to try it and see (there is also tensorflow profiler, but I haven't tried it).
A batch consists of a variable number of sentences, such that the total number of subwords in the batch approximately equals the batch_size. I guess the problem is that "approximately equals" does not mean that it is always lower or equal. Thus we observe OOM errors even after several hours of training. My guess is that these are caused by a sentence which is almost max_length subwords long and comes at the moment when the batch is almost full. I have not inspected the source code, so I am not sure. If this is the case, I would vote for changing the algorithm to make sure each batch is lower than or equal to batch_size, so that the training either fails immediately or never.
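
The guess above can be illustrated with a toy batcher. This is only a sketch of the hypothesised behaviour, not tensor2tensor's actual batching code (which was not inspected here); greedy_batches and its strict flag are made-up names, and the sentence lengths are invented.

```python
def greedy_batches(lengths, batch_size, strict):
    """Group sentence lengths (in subwords) into batches of about batch_size subwords.

    strict=False: add a sentence whenever the batch is not yet full (may overshoot).
    strict=True:  start a new batch if adding the sentence would exceed batch_size.
    """
    batches, current, total = [], [], 0
    for n in lengths:
        if strict and current and total + n > batch_size:
            batches.append(current)
            current, total = [], 0
        current.append(n)
        total += n
        if total >= batch_size:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)
    return batches


# Hypothetical data: many short sentences, then one that is almost max_length long.
lengths = [30] * 68 + [150]
loose = greedy_batches(lengths, batch_size=2050, strict=False)
tight = greedy_batches(lengths, batch_size=2050, strict=True)
print(max(sum(b) for b in loose))  # largest batch when overshooting is allowed
print(max(sum(b) for b in tight))  # largest batch when the budget is a hard cap
```

In the non-strict variant the long sentence lands in an almost full batch and pushes it well past the budget, which is the kind of rare event that would only show up after many training steps.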

My current approach is:

1. Set a reasonable max_length (so excluding at most x percent of the training data).
2. Find the lowest batch_size which fails immediately after the start of training.
3. Decrease this batch_size by max_length (plus a small reserve).
4. Now start the real training from scratch; it should not fail with OOM.
My goal is to use the highest possible safe batch size. I think it's better than fixing the batch size and finding the highest possible safe max_length.
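
The arithmetic of the approach above is simple enough to write down. This is a minimal sketch: safe_batch_size is a made-up helper, the reserve of 50 subwords is an arbitrary assumption, and the input numbers are borrowed from the experiments quoted earlier purely as illustration.

```python
def safe_batch_size(lowest_failing_batch_size, max_length, reserve=50):
    """Subtract max_length and a small reserve from the lowest failing batch size."""
    return lowest_failing_batch_size - max_length - reserve


# Illustrative input: batch_size=2250 gave OOM with max_length=150.
print(safe_batch_size(2250, max_length=150))
```

The idea behind the subtraction is that a batch should still fit even when one sentence of almost max_length subwords is added to a batch that is already nearly full.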

martinpopel on 28 Jan 2018

Can you explain what exactly max_length does? Is there documentation for it?

When I set max_length = 100, does it mean that the trainer will simply ignore all the training cases containing more than 100 subwords?

ndvbd on 28 Jan 2018

Yes, hp.max_length: For variable length features, sequences with length longer than this will be dropped during training.
See https://github.com/tensorflow/tensor2tensor/blob/master/docs/overview.md#batching
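
The effect of max_length described above can be pictured as a simple filter over the training pairs. This is a sketch and not the tensor2tensor implementation: drop_long_examples is a made-up name, and it assumes a pair is dropped as soon as either the source or the target is longer than max_length.

```python
def drop_long_examples(examples, max_length):
    """Keep only (source, target) pairs where both sides have at most max_length subwords."""
    return [
        (src, tgt)
        for src, tgt in examples
        if len(src) <= max_length and len(tgt) <= max_length
    ]


# Toy subword-id sequences of lengths (5, 7), (120, 90) and (100, 100).
examples = [([1] * 5, [1] * 7), ([1] * 120, [1] * 90), ([1] * 100, [1] * 100)]
kept = drop_long_examples(examples, max_length=100)
print(len(kept), "of", len(examples), "pairs kept")
```

Running the same filter over a real corpus before training is one way to see what percentage of the data a given max_length would exclude.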

martinpopel on 28 Jan 2018

In your current approach, you fixed the max_length and then found the maximum batch size.
Do you know which is more important for a better BLEU score, the length or the batch size?
I saw you say in another thread here that increasing the batch size is important for a better BLEU score (and therefore so is increasing the number of GPUs, to get a bigger effective batch size), but is there similar research regarding max_length?

ndvbd on 28 Jan 2018

I haven't seen any effect of using max_length=150 or 100 (except for the ability to use a bigger batch and thus faster training convergence), but it depends on your training data, vocabulary size and the expected length of the test set sentences.
Overly long sentences in the training data are often noisy or from a different domain than my target, so excluding them completely may improve the quality. Also, training on much longer sentences than I am actually going to translate does not make much sense to me. E.g. the Nematus authors use a limit of 50 subwords.
On the other hand, excluding a big portion of the training data or leaving only sentences much shorter than the expected test-set sentences will lead to lower quality.

martinpopel on 28 Jan 2018

Martin, I found what was the problem.
Apparently reducing the batch_size and the max_length didn't help.
I had to set eval_drop_long_sequences=true, and this solved the problem.

For some reason, when you don't set this (as it is written in the docs), it takes the batch size as the max length for the eval stage. This is a bit weird, because if your batch is 4096, and the default max_length is 256, then in the eval stage it goes from 256->4096, and you get OOM....
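
To see why a jump from 256 to 4096 matters so much, note that self-attention logits grow with the square of the sequence length. The sketch below counts only the attention logits of one layer for one sequence, assuming 8 attention heads and 32-bit floats; attention_logits_bytes is a made-up helper, and real memory use includes many other tensors.

```python
def attention_logits_bytes(seq_len, num_heads=8, bytes_per_element=4):
    """Memory for one layer's self-attention logits for a single sequence.

    Assumptions: num_heads=8 and float32; the logits form a
    seq_len x seq_len matrix per head.
    """
    return num_heads * seq_len * seq_len * bytes_per_element


short = attention_logits_bytes(256)
long_ = attention_logits_bytes(4096)
print(f"256 subwords:  {short / 2**20:.0f} MiB")
print(f"4096 subwords: {long_ / 2**20:.0f} MiB ({long_ // short}x more)")
```

Under these assumptions a sequence 16 times longer needs 256 times more memory for this one tensor, so a few very long eval sentences can exhaust the GPU even when training batches fit comfortably.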

ndvbd on 28 Jan 2018

"I had to set eval_drop_long_sequences=true, and this solved the problem."

So I guess the error was during the internal evaluation (which is on when you run t2t-trainer with the default --schedule=continuous_train_and_eval). Am I right?
I use --schedule=train for other reasons, so I don't face these problems.

"This is a bit weird, because if your batch is 4096, and the default max_length is 256, then in the eval stage it goes from 256->4096, and you get OOM...."

Does your eval set really contain sentences longer than 256 subwords?
And if you keep them the batch_size is 4096 and if you delete them it is 256? This sounds really weird.

martinpopel on 28 Jan 2018

Yes, I don't use the --schedule param, so I guess I get the default, which, as you say, is training with an eval every 2000 steps.

Yes, my train and eval contain sentences much longer than 256 tokens/subwords.

Now I run --hparams="max_length=256,batch_size=4096,eval_drop_long_sequences=true" on one GPU with 12GB and 32k vocab with no problem.

ndvbd on 28 Jan 2018

Have you checked out pack_examples during data generation?
https://github.com/tensorflow/tensor2tensor/blob/a1d7ed7e7d96ea0d89ca153ba46b7f51d394580f/tensor2tensor/data_generators/generator_utils.py#L521

Its documentation says:
"If chop_long_sequences is set, then any input sequence longer than packed_length gets chopped up into multiple examples. Otherwise, long sequences are emitted as singletons."

mehmedes on 30 Jan 2018

This issue is about translation (seq2seq, so has_inputs=True) and the pack_examples documentation says:
"If has_inputs=True, then we are packing sequence-to-sequence examples. [...] Chopping of long sequences is not supported."
So I think chop_long_sequences cannot help here.

martinpopel on 30 Jan 2018

Well, I guess I'd have to chop the long examples into multiple examples myself, in order not to lose training data.

Do you happen to know (roughly) the expected difference in final BLEU score between training on 1 GPU with a batch of 4096 and training on 8 GPUs in parallel, each with a batch of 4096 (and therefore with an effective batch size of 4096*8)? Or does it not matter?

ndvbd on 30 Jan 2018

@nadavb: effective batch size does matter, at least with regard to the convergence speed (but it seems also with regard to the highest achievable BLEU after infinite training), see e.g. https://github.com/tensorflow/tensor2tensor/issues/444#issuecomment-351391778

martinpopel on 30 Jan 2018

So what you are essentially saying is that with X times more GPUs, we both cut the training time by roughly a factor of X and reach a higher BLEU.

In that case, everyone should work with as many GPUs in parallel as possible. When working on cloud-based GPUs, there is no major difference in price between 1 GPU for 10 days and 10 GPUs for 1 day, so for the Transformer model, according to what you say, we should always aim for larger batches.

ndvbd on 30 Jan 2018

Yes, my experiments suggest that using more GPUs is beneficial if you have a fixed budget and 1 GPU for 10 days is the same price as 10 GPUs for 1 day. But my experiments are still a work in progress.
I am now experimenting with higher batch sizes thanks to max_length and Adafactor, and I want to see whether an 8 times bigger effective batch size is beneficial even here.
There are some difficulties, e.g. when scaling from 1 GPU to 8 GPUs, it seems you need to adapt the hyperparameters, mostly warmup steps and learning rate (as discussed in #444), but other techniques may be useful as well (e.g. adapting Adam's beta1 or beta2, or using ghost batch normalization instead of standard batch normalization).
Also I have noticed that some Nvidia drivers (384.69) are slow/buggy when training on 8 GPUs (twice as slow as 375.66).

To keep these GitHub issues tidy, I suggest closing this issue (unless you have more questions on the original topic) and possibly continuing the discussion in #444.

martinpopel on 30 Jan 2018
