Transformers: changing config.axial_pos_shape for 'ReformerModelWithLMHead' when fine-tuning

Created on 25 May 2020 · 18 comments · Source: huggingface/transformers

โ“ Questions & Help

I'm trying to fine-tune the Reformer for the language generation task. I padded the sequence lengths to be a multiple of the least common multiple of the chunk lengths (64), and now I'm asked to pad the sequence to 524288 (512 * 1024), which gives me an out-of-memory error.

I would like to know a workaround for this, since the error message also offers an alternative to padding to the maximum length, namely 'changing config.axial_pos_shape', and especially since the Reformer is known to be a memory-efficient transformer. Thank you.
A link to original question on Stack Overflow: https://stackoverflow.com/questions/61986452/fine-tuning-reformer-gives-out-of-memory-error-when-sequence-length-is-padded-t

Labels: reformer, wontfix

Most helpful comment

It would be awesome if you could upload your training script here - people seem very interested in it :-)

All 18 comments

I would not recommend setting axial_pos_shape to (512, 1024). In the notebook I only used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.

Make sure you have read and understood how AxialPositionEmbeddings work: https://huggingface.co/transformers/model_doc/reformer.html#axial-positional-encodings.
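
As a rough illustration of the constraints described there (a minimal sketch with made-up numbers, not from the thread): the factors in config.axial_pos_shape must multiply to the padded sequence length used in training, and config.axial_pos_embds_dim must sum to config.hidden_size.

from transformers import ReformerConfig

# Hypothetical example: a model trained on sequences padded to 4096 = 64 * 64 tokens.
config = ReformerConfig(
    axial_pos_shape=(64, 64),       # 64 * 64 == 4096 tokens per training sequence
    axial_pos_embds_dim=(64, 192),  # 64 + 192 == hidden_size
    hidden_size=256,
    max_position_embeddings=4096,
)

assert config.axial_pos_shape[0] * config.axial_pos_shape[1] == 4096
assert sum(config.axial_pos_embds_dim) == config.hidden_size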

For "normal" language modeling it might make much more sense to start from the Reformer-wiken8 model and finetune it: https://huggingface.co/google/reformer-enwik8

Greetings,
Would fine-tuning https://huggingface.co/google/reformer-enwik8 work normally with the run_language_modeling.py script?
Thanks

Hmm, for the most part, but you will have to define your own tokenizer function, as can be seen here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8

So instead of sticking to the script, I would recommend slightly changing this notebook: https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb. Instead of creating the dataset by using a tokenizer, you should use the function linked above. Does that make sense? Also linking: https://github.com/huggingface/transformers/pull/4480. If someone has an easy script for Reformer Char LM it'd be great to post it here or add a notebook.
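
For reference, a character-level encoder along the lines of the one on the enwik8 model card might look roughly like this (a sketch assuming the byte-value-plus-2 offset described on that card; check the model card for the exact function):

import torch

def encode(list_of_strings, pad_token_id=0):
    # Pad every string in the batch to the length of the longest one.
    max_length = max(len(s) for s in list_of_strings)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        if not isinstance(string, bytes):
            string = string.encode("utf-8")
        # Character-level ids: raw byte value shifted by 2 to leave room for special tokens.
        input_ids[idx, : len(string)] = torch.tensor([b + 2 for b in string], dtype=torch.long)
        attention_masks[idx, : len(string)] = 1

    return input_ids, attention_masks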

Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following correctly?

I would not recommend setting axial_pos_shape to (512, 1024). In the notebook I only used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.

I've been using 'google/reformer-crime-and-punishment' model from https://huggingface.co/transformers/model_doc/reformer.html#reformermodelwithlmhead

I get this error after padding the sequence lengths to be a multiple of the least common multiple of the chunk lengths (64).

...
for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    for idx, article in tqdm_notebook(enumerate(article_loader)):

        article_tens = tokenizer.encode(article[0], return_tensors='pt').to(device)
        print(article_tens.shape)

        # Pad the sequence length up to a multiple of the chunk length 64.
        pads_to_be_filled = getNoOfPads(article_tens.size()[1])
        padded_tens = torch.cat(
            (article_tens[0], torch.zeros(pads_to_be_filled, dtype=torch.long).to(device))
        )
        print(padded_tens.unsqueeze(0).shape)

        outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
        ...
EPOCH 0 started==============================
0/? [00:00<?, ?it/s]
torch.Size([1, 131])
torch.Size([1, 192])

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-81c445097515> in <module>()
     29         print(padded_tens.unsqueeze(0).shape)
     30 
---> 31         outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
     32         print(outputs)
     33 

7 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py in forward(self, position_ids)
    127                 reduce(mul, self.axial_pos_shape) == sequence_length
    128             ), "If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.".format(
--> 129                 self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)
    130             )
    131             if self.dropout > 0:

AssertionError: If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 192. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.

If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 384. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.

So I guess that is because axial_pos_shape is set to (512, 1024) by default, and if so, how can I change it to a smaller value?

ReformerConfig {
  "architectures": [
    "ReformerModelWithLMHead"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    512,
    1024
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 524288,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}

The above is the default configuration of the model before training/fine-tuning.
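
For reference, overriding axial_pos_shape to a smaller value would look roughly like the following sketch (the 64 * 64 = 4096 target length is an arbitrary example, not from the thread; note that changing the shape means the pretrained axial position embeddings no longer match, so this mainly makes sense when training from scratch):

from transformers import ReformerConfig, ReformerModelWithLMHead

# Hypothetical target: sequences padded to 4096 = 64 * 64 tokens instead of 524288.
config = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment")
config.axial_pos_shape = (64, 64)        # factors must multiply to the padded sequence length
config.max_position_embeddings = 64 * 64

# A fresh model built from the modified config; the pretrained axial position
# embeddings have shapes tied to (512, 1024) and would not fit the new shape.
model = ReformerModelWithLMHead(config)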

For "normal" language modeling it might make much more sense to start from the Reformer-wiken8 model and finetune it: https://huggingface.co/google/reformer-enwik8

Will try that too.

Thank you.

Yeah, the google/reformer-crime-and-punishment model is not a good one for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.

Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following correctly?

Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.
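
For example, the model's position limits can be checked directly from its config (a quick sketch, assuming access to the model hub):

from transformers import ReformerConfig

config = ReformerConfig.from_pretrained("google/reformer-enwik8")
print(config.max_position_embeddings)  # maximum sequence length the model was set up for
print(config.axial_pos_shape)          # axial factors; their product equals the value above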

Yeah, the google/reformer-crime-and-punishment model is not a good one for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.

Oh okay. Thank you very much for the clarification. Will try finetuning reformer-enwik8.

It would be awesome if you could upload your training script here - people seem very interested in it :-)

@patrickvonplaten, Sure, will do when everything is sorted.

Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following correctly?

Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.

From the notebook, I am struggling to adapt the DataCollator. How should it be defined properly in this context?
Thanks
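
One possible direction (not from the thread, just a sketch) is to skip the library data collator entirely and pad the char-level ids to a multiple of the chunk length 64 in a small collate function passed to the DataLoader:

import torch

def collate_to_multiple_of_64(batch, pad_token_id=0):
    # batch: list of 1-D LongTensors of char-level ids (e.g. from an encode() as above).
    max_len = max(t.size(0) for t in batch)
    # Round up so every sequence length is a multiple of the chunk length 64.
    padded_len = ((max_len + 63) // 64) * 64

    input_ids = torch.full((len(batch), padded_len), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros((len(batch), padded_len), dtype=torch.long)
    for i, t in enumerate(batch):
        input_ids[i, : t.size(0)] = t
        attention_mask[i, : t.size(0)] = 1

    # For causal LM training, labels are typically the input ids with pad positions ignored.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}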

Has anyone effectively fine-tuned the enwik8 pre-trained model? Using Colab with a P100 GPU, I have not been able to load the model yet due to memory limitations.

Unfortunately facing the same issue now.

Can you add a link to your notebook here @lucashueda ?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@lucashueda Did you manage to fine-tune the enwik8 pre-trained model, or other datasets? Would you mind sharing your Colab?

@epetros Did you manage to perform the fine-tuning?
