I'm trying to fine-tune Reformer for a language generation task. I padded the sequence lengths to a multiple of the least common multiple of the chunk lengths (64), and now I'm asked to pad the sequence to 524288 (512 * 1024), which gives me an out-of-memory error.
I would like to know a workaround for this, since the error message also offers an alternative to 'pad_to_max_length', namely 'changing config.axial_pos_shape', and especially since Reformer is known to be a memory-efficient transformer. Thank you.
Link to the original question on Stack Overflow: https://stackoverflow.com/questions/61986452/fine-tuning-reformer-gives-out-of-memory-error-when-sequence-length-is-padded-t
I would not recommend setting axial_pos_shape to (512 * 1024). In the notebook I just used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.
Make sure you have read and understood how AxialPositionEmbeddings work: https://huggingface.co/transformers/model_doc/reformer.html#axial-positional-encodings
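If you do want to go the 'changing config.axial_pos_shape' route instead of padding to 524288, a minimal sketch (my own example, not from the notebook) would look roughly like the following; the factors just have to multiply to whatever padded length you train with, and note that a new shape means the pretrained axial position embeddings cannot be reused:

from transformers import ReformerConfig, ReformerModelWithLMHead

# Sketch: shrink the axial position embedding shape so its factors multiply
# to a much shorter padded sequence length (here 64 * 64 = 4096).
config = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment")
config.axial_pos_shape = (64, 64)
config.max_position_embeddings = 64 * 64

# Initializing from the config gives fresh, untrained weights; loading the
# pretrained checkpoint with a different axial_pos_shape would fail, because
# its axial position embeddings were trained for the shape (512, 1024).
model = ReformerModelWithLMHead(config)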
For "normal" language modeling it might make much more sense to start from the reformer-enwik8 model and fine-tune it: https://huggingface.co/google/reformer-enwik8
Greetings,
Would fine-tuning https://huggingface.co/google/reformer-enwik8 work normally with the run_language_modeling.py script?
Thanks
Hmm, for the most part, but you will have to define your own tokenizer function, as can be seen here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8
So instead of sticking to the script, I would recommend slightly changing this notebook: https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb. Instead of creating the dataset by using a tokenizer, you should use the function linked above. Does that make sense? Also linking: https://github.com/huggingface/transformers/pull/4480. If someone has an easy script for Reformer Char LM it'd be great to post it here or add a notebook.
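For reference, the character-level encoding in the enwik8 model card works along these lines (a paraphrase from memory, so please double-check the card for the exact definition): every byte of the input is shifted by 2 so the lowest ids stay free for special tokens, and a matching attention mask marks the non-padded positions.

import torch

def encode(list_of_strings, pad_token_id=0):
    # The longest string in the batch determines the padded length.
    max_length = max(len(string) for string in list_of_strings)

    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # Work on raw bytes; shift every byte value by 2 so that the lowest
        # ids remain reserved for special tokens such as padding.
        if not isinstance(string, bytes):
            string = str.encode(string)
        input_ids[idx, : len(string)] = torch.tensor([b + 2 for b in string])
        attention_masks[idx, : len(string)] = 1

    return input_ids, attention_masks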
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
I would not recommend setting axial_pos_shape to (512 * 1024). In the notebook I just used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.
I've been using the 'google/reformer-crime-and-punishment' model from https://huggingface.co/transformers/model_doc/reformer.html#reformermodelwithlmhead
I get this error after padding the sequence lengths to a multiple of the least common multiple of the chunk lengths (64).
...
for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    for idx, article in tqdm_notebook(enumerate(article_loader)):
        article_tens = tokenizer.encode(article[0], return_tensors='pt').to(device)
        print(article_tens.shape)

        # pad to a multiple of the least common multiple of the chunk lengths (64);
        # getNoOfPads returns how many pad tokens are needed for that
        pads_to_be_filled = getNoOfPads(article_tens.size()[1])
        padded_tens = torch.cat((article_tens[0], torch.zeros(pads_to_be_filled, dtype=torch.long).to(device)))
        print(padded_tens.unsqueeze(0).shape)

        outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
...
EPOCH 0 started==============================
0/? [00:00<?, ?it/s]
torch.Size([1, 131])
torch.Size([1, 192])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-11-81c445097515> in <module>()
29 print(padded_tens.unsqueeze(0).shape)
30
---> 31 outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
32 print(outputs)
33
7 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py in forward(self, position_ids)
127 reduce(mul, self.axial_pos_shape) == sequence_length
128 ), "If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.".format(
--> 129 self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)
130 )
131 if self.dropout > 0:
AssertionError: If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 192. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.
If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 384. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.
So I guess that is because axial_pos_shape defaults to (512, 1024); if so, how can I change it to a smaller value?
ReformerConfig {
  "architectures": [
    "ReformerModelWithLMHead"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    512,
    1024
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 524288,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}
Shown above is the default configuration of the model before training/fine-tuning.
For "normal" language modeling it might make much more sense to start from the reformer-enwik8 model and fine-tune it: https://huggingface.co/google/reformer-enwik8
Will try that too.
Thank you.
Yeah, google/reformer-crime-and-punishment is not a good model for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.
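As a quick sanity check before padding your batches (again just a sketch, not from the notebook), you can read the checkpoint's axial_pos_shape and derive the sequence length it expects during training:

from functools import reduce
from operator import mul

from transformers import ReformerConfig, ReformerModelWithLMHead

config = ReformerConfig.from_pretrained("google/reformer-enwik8")
train_len = reduce(mul, config.axial_pos_shape)
print(config.axial_pos_shape, train_len)  # the product should be on the order of ~65K tokens

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
# When training with axial position embeddings, pad each batch to exactly
# prod(axial_pos_shape) tokens; that is what the assertion above enforces.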
Yeah, google/reformer-crime-and-punishment is not a good model for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.
Oh okay. Thank you very much for the clarification. Will try fine-tuning reformer-enwik8.
It would be awesome if you could upload your training script here - people seem very interested in it :-)
@patrickvonplaten, Sure, will do when everything is sorted.
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.
From the notebook, I am struggling to adapt the DataCollator; how should I define it properly in this context?
Thanks
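Not an official answer, but one possible sketch: if each training example is already a fixed-length 1-D tensor of character ids produced by an encode function like the one above, the collator only needs to stack them and reuse the inputs as labels (the model shifts the labels internally for causal LM):

import torch

class CharLMDataCollator:
    # Sketch of a collator for pre-encoded, equal-length character sequences.
    def __call__(self, examples):
        # examples: a list of 1-D LongTensors, all padded to the same length
        input_ids = torch.stack(examples, dim=0)
        return {"input_ids": input_ids, "labels": input_ids.clone()}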
Has anyone effectively fine-tuned the enwik8 pre-trained model? Using Colab with a P100 GPU, I was not able to load the model yet due to memory limitations.
Unfortunately, I'm facing the same issue now.
Can you add a link to your notebook here @lucashueda ?
@lucashueda Did you manage to fine-tune the enwik8 pre-trained model, or other datasets? Would you mind sharing your Colab?
@epetros Did you manage to perform the fine-tuning?