Transformers: GPU out of memory with Reformer enwik8 model

Created on 18 Jun 2020 · 9 comments · Source: huggingface/transformers

โ“ Questions & Help

I'm trying to run the pretrained model google/reformer-enwik8, but I'm getting CUDA out-of-memory errors unless I limit the sequences to one-fourth of the model capacity (~16k tokens instead of 65k).

This happens on a Titan Xp with 12 GB of memory; I expected the Reformer's memory-saving tricks to let the full original sequence length fit.

The code I'm running:

import torch
from torch.utils.data import DataLoader
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained('google/reformer-enwik8')
model.cuda()
model.eval()  # disable dropout for evaluation

config = model.config
max_len = config.max_position_embeddings
dataset = Enwik8Dataset(
    path, max_len, pad_id=config.pad_token_id,
    eos_id=config.eos_token_id)
loader = DataLoader(dataset, batch_size=1, shuffle=False)

acc_loss = 0
for batch in loader:
    batch = batch.cuda()  # move the batch to the same device as the model
    with torch.no_grad():
        batch_loss = model(input_ids=batch, labels=batch)[0]
    acc_loss += batch_loss.mean().item()

acc_loss /= len(dataset)

The Enwik8Dataset class inherits from torch.utils.data.Dataset and does the basic data preprocessing; I can post the code if necessary.
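For reference, here is a minimal sketch of what such a dataset might look like; the byte +2 shift follows the model card's encoding example, but the chunking and padding details are assumptions, not the asker's actual code.

import torch
from torch.utils.data import Dataset

class Enwik8Dataset(Dataset):
    """Hypothetical byte-level dataset that splits enwik8 into fixed-length chunks."""

    def __init__(self, path, max_len, pad_id=0, eos_id=1):
        with open(path, 'rb') as f:
            data = f.read()
        # Raw bytes shifted by 2, so the lowest ids stay free for special tokens.
        ids = [b + 2 for b in data]
        self.examples = []
        for start in range(0, len(ids), max_len):
            chunk = ids[start:start + max_len]
            if len(chunk) < max_len:
                # Pad the last, shorter chunk up to the full sequence length.
                chunk = chunk + [eos_id] + [pad_id] * (max_len - len(chunk) - 1)
            self.examples.append(torch.tensor(chunk, dtype=torch.long))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]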


A link to original question on Stack Overflow: https://stackoverflow.com/questions/62373033/gpu-out-of-memory-with-enwik8-reformer-from-huggingface

Label: wontfix

Most helpful comment

Ok, so I found that the main culprit was that the Trainer was storing all model predictions in GPU memory during evaluation at https://github.com/huggingface/transformers/blob/c01480bba3b2f0bd8516679476235f4701c21b3b/src/transformers/trainer.py#L775

Passing prediction_loss_only=True avoided that. By the way, I believe that should be the default value in the Trainer, and that the torch.cat operation could use CPU tensors in case the validation dataset is big.

All 9 comments

Did you try to train with half precision using the apex/amp package?

No, I didn't. Anyway, this was only evaluating the pretrained model.

I have now installed NVIDIA apex and tried training a new model with fp16. It ran out of memory after around 10% of the evaluation loop, which I didn't expect, since the model shouldn't need more memory partway through evaluation when every batch has the full sequence length (16k in my case).

Is there anything else I can do to optimize GPU memory usage?

Hmm, lemme check...
What is the sequence length and batch size you use exactly? Also, you can reduce num_hashes to save memory.
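For instance, a hedged sketch of overriding num_hashes when loading the pretrained checkpoint (num_hashes=1 is only an illustrative value; fewer hash rounds use less memory at some cost in LSH attention quality):

from transformers import ReformerModelWithLMHead

# Config values passed to from_pretrained override the stored config,
# so this loads the checkpoint with a single LSH hash round.
model = ReformerModelWithLMHead.from_pretrained(
    'google/reformer-enwik8', num_hashes=1)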

I'm using this training setup with the Trainer from huggingface:

    from transformers import (ReformerConfig, ReformerModelWithLMHead,
                              TrainingArguments)

    axial_pos_emb_dim = 128, 384
    hidden_dim = sum(axial_pos_emb_dim)  # 512

    axial_pos_max = 64, 256  # the product of this is the maximum length
    max_length = axial_pos_max[0] * axial_pos_max[1]

    hidden_dropout = 0.2
    attn_dropout = 0.1
    ff_dim = 2 * hidden_dim
    num_heads = 8
    dim_per_head = hidden_dim // num_heads
    num_layers = 6
    layers = ['local', 'lsh'] * (num_layers // 2)
    chunk_size = 0
    bucket_size = 64
    num_hashes = 2
    vocab_size = 258

    config = ReformerConfig(
        attention_head_size=dim_per_head, attn_layers=layers,
        chunk_size_feed_forward=chunk_size,
        axial_pos_embds_dim=axial_pos_emb_dim,
        axial_pos_shape=axial_pos_max,
        max_position_embeddings=max_length,
        eos_token_id=1, feed_forward_size=ff_dim,
        hidden_dropout_prob=hidden_dropout, hidden_size=hidden_dim,
        lsh_attention_probs_dropout_prob=attn_dropout,
        local_attention_probs_dropout_prob=attn_dropout,
        num_attention_heads=num_heads, num_buckets=None,
        pad_token_id=0, lsh_attn_chunk_length=bucket_size,
        num_hashes=num_hashes, vocab_size=vocab_size)
    model = ReformerModelWithLMHead(config)

    training_args = TrainingArguments(
        'model', do_train=True, do_eval=True,
        do_predict=False, evaluate_during_training=True,
        gradient_accumulation_steps=1,
        learning_rate=0.001,
        logging_dir='model/tensorboard',
        logging_steps=5,
        save_steps=1000,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        fp16=True)

Unless I'm doing something wrong, it's a batch size of 1 (both for training and evaluation) and a sequence length of 16384 (64 * 256).
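For completeness, the Trainer wiring is not shown above; presumably it looks roughly like this, where train_data and eval_data are placeholders for the enwik8 datasets:

from transformers import Trainer

# Hypothetical wiring of the model and arguments defined above;
# train_data / eval_data stand in for the asker's enwik8 datasets.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
)
trainer.train()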

Ok, so I found that the main culprit was that the Trainer was storing all model predictions in GPU memory during evaluation at https://github.com/huggingface/transformers/blob/c01480bba3b2f0bd8516679476235f4701c21b3b/src/transformers/trainer.py#L775

Passing prediction_loss_only=True avoided that. By the way, I believe that should be the default value in the Trainer, and that the torch.cat operation could use CPU tensors in case the validation dataset is big.
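In code, with the Trainer version referenced above (where prediction_loss_only is a constructor argument), the change amounts to something like this hedged sketch, reusing the placeholder names from the wiring shown earlier:

from transformers import Trainer

# prediction_loss_only=True keeps the evaluation loop from concatenating
# every logits tensor on the GPU; only the loss is accumulated.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    prediction_loss_only=True,
)
eval_output = trainer.evaluate()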

Out of curiosity, how big is your validation dataset, and how large is the in-memory size of preds in that prediction loop?

@julien-c this happened with the enwik8 test set, 5M characters.
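For a rough sense of scale (a back-of-the-envelope estimate, assuming fp32 logits and the 258-token vocabulary from the config above):

# Rough estimate of the memory needed to keep every prediction on the GPU.
seq_len = 16384                       # 64 * 256
vocab_size = 258
bytes_per_value = 4                   # fp32 logits
num_batches = 5_000_000 // seq_len    # ~305 eval batches at batch size 1

per_batch = seq_len * vocab_size * bytes_per_value   # ~16 MiB per batch
total_gib = num_batches * per_batch / 1024 ** 3
print(f"~{total_gib:.1f} GiB of logits held on the GPU during evaluation")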

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

