I'm trying to fine-tune Reformer for a language generation task. I padded the sequence lengths to a multiple of the least common multiple of the chunk lengths (64), and now I'm asked to pad the sequence to 524288 (512 * 1024), which gives me an out-of-memory error.
I would like to know a workaround for this, since the error message also offers an alternative to 'pad_to_max_length', namely 'changing config.axial_pos_shape', and especially since Reformer is known to be a memory-efficient transformer. Thank you.
Link to the original question on Stack Overflow: https://stackoverflow.com/questions/61986452/fine-tuning-reformer-gives-out-of-memory-error-when-sequence-length-is-padded-t
I would not recommend setting axial_pos_shape to (512 * 1024). In the notebook I just used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.
Make sure you have read and understood how AxialPositionEmbeddings work: https://huggingface.co/transformers/model_doc/reformer.html#axial-positional-encodings
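If you do want to go the 'changing config.axial_pos_shape' route instead of padding to 524288, a minimal sketch (my own example, not from the notebook) would look roughly like the following; the factors just have to multiply to whatever padded length you train with, and note that a new shape means the pretrained axial position embeddings cannot be reused:

from transformers import ReformerConfig, ReformerModelWithLMHead

# Sketch: shrink the axial position embedding shape so its factors multiply
# to a much shorter padded sequence length (here 64 * 64 = 4096).
config = ReformerConfig.from_pretrained("google/reformer-crime-and-punishment")
config.axial_pos_shape = (64, 64)
config.max_position_embeddings = 64 * 64

# Initializing from the config gives fresh, untrained weights; loading the
# pretrained checkpoint with a different axial_pos_shape would fail, because
# its axial position embeddings were trained for the shape (512, 1024).
model = ReformerModelWithLMHead(config)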
For "normal" language modeling it might make much more sense to start from the reformer-enwik8 model and fine-tune it: https://huggingface.co/google/reformer-enwik8
Greetings,
Would fine-tuning https://huggingface.co/google/reformer-enwik8 work normally with the run_language_modeling.py script?
Thanks
Hmm, for the most part, but you will have to define your own tokenizer function, as can be seen here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8
So instead of sticking to the script, I would recommend slightly changing this notebook: https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb. Instead of creating the dataset by using a tokenizer, you should use the function linked above. Does that make sense? Also linking: https://github.com/huggingface/transformers/pull/4480. If someone has an easy script for Reformer Char LM it'd be great to post it here or add a notebook.
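For reference, the character-level encoding in the enwik8 model card works along these lines (a paraphrase from memory, so please double-check the card for the exact definition): every byte of the input is shifted by 2 so the lowest ids stay free for special tokens, and a matching attention mask marks the non-padded positions.

import torch

def encode(list_of_strings, pad_token_id=0):
    # The longest string in the batch determines the padded length.
    max_length = max(len(string) for string in list_of_strings)

    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # Work on raw bytes; shift every byte value by 2 so that the lowest
        # ids remain reserved for special tokens such as padding.
        if not isinstance(string, bytes):
            string = str.encode(string)
        input_ids[idx, : len(string)] = torch.tensor([b + 2 for b in string])
        attention_masks[idx, : len(string)] = 1

    return input_ids, attention_masks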
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
I would not recommend setting axial_pos_shape to (512 * 1024). In the notebook I just used that to demonstrate how far the limits can be pushed for Reformer. Half a million tokens is extremely long and usually unnecessary.
I've been using the 'google/reformer-crime-and-punishment' model from https://huggingface.co/transformers/model_doc/reformer.html#reformermodelwithlmhead
I get this error after padding the sequence lengths to a multiple of the least common multiple of the chunk lengths (64).
...
for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    for idx, article in tqdm_notebook(enumerate(article_loader)):
        article_tens = tokenizer.encode(article[0], return_tensors='pt').to(device)
        print(article_tens.shape)

        # pad to a multiple of the least common multiple of the chunk lengths (64);
        # getNoOfPads returns how many pad tokens are needed for that
        pads_to_be_filled = getNoOfPads(article_tens.size()[1])
        padded_tens = torch.cat((article_tens[0], torch.zeros(pads_to_be_filled, dtype=torch.long).to(device)))
        print(padded_tens.unsqueeze(0).shape)

        outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
...
EPOCH 0 started==============================
0/? [00:00<?, ?it/s]
torch.Size([1, 131])
torch.Size([1, 192])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-11-81c445097515> in <module>()
29 print(padded_tens.unsqueeze(0).shape)
30
---> 31 outputs = model(padded_tens.unsqueeze(0), labels=padded_tens.unsqueeze(0))[0]
32 print(outputs)
33
7 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_reformer.py in forward(self, position_ids)
127 reduce(mul, self.axial_pos_shape) == sequence_length
128 ), "If training, make sure that config.axial_pos_shape factors: {} multiply to sequence length. Got prod({}) != sequence_length: {}. You might want to consider padding your sequence length to {} or changing config.axial_pos_shape.".format(
--> 129 self.axial_pos_shape, self.axial_pos_shape, sequence_length, reduce(mul, self.axial_pos_shape)
130 )
131 if self.dropout > 0:
AssertionError: If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 192. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.
If training, make sure that config.axial_pos_shape factors: (512, 1024) multiply to sequence length. Got prod((512, 1024)) != sequence_length: 384. You might want to consider padding your sequence length to 524288 or changing config.axial_pos_shape.
So I guess that is because axial_pos_shape defaults to (512, 1024); if so, how can I change it to a smaller value?
ReformerConfig {
  "architectures": [
    "ReformerModelWithLMHead"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    512,
    1024
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 524288,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}
Shown above is the default configuration of the model before training/fine-tuning.
For "normal" language modeling it might make much more sense to start from the reformer-enwik8 model and fine-tune it: https://huggingface.co/google/reformer-enwik8
Will try that too.
Thank you.
Yeah, google/reformer-crime-and-punishment is not a good model for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.
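As a quick sanity check before padding your batches (again just a sketch, not from the notebook), you can read the checkpoint's axial_pos_shape and derive the sequence length it expects during training:

from functools import reduce
from operator import mul

from transformers import ReformerConfig, ReformerModelWithLMHead

config = ReformerConfig.from_pretrained("google/reformer-enwik8")
train_len = reduce(mul, config.axial_pos_shape)
print(config.axial_pos_shape, train_len)  # the product should be on the order of ~65K tokens

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
# When training with axial position embeddings, pad each batch to exactly
# prod(axial_pos_shape) tokens; that is what the assertion above enforces.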
Yeah, google/reformer-crime-and-punishment is not a good model for fine-tuning. It assumes you use a sequence length of > 500K tokens, which is not really reasonable.
Oh okay. Thank you very much for the clarification. Will try fine-tuning reformer-enwik8.
It would be awesome if you could upload your training script here - people seem very interested in it :-)
@patrickvonplaten, Sure, will do when everything is sorted.
Ok, thanks. So the function _flatten_and_tokenize_ (in the notebook) should be replaced by the _encode_ function (in the enwik8 model card), am I following this right?
Exactly. You should be able to just use the enwik8 encode function I linked above. The enwik8 model has a maximum length of ~65K tokens, which is very long but very feasible for Reformer.
From the notebook, I am struggling to adapt the DataCollator; how should I define it properly in this context?
Thanks
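Not an official answer, but one possible sketch: if each training example is already a fixed-length 1-D tensor of character ids produced by an encode function like the one above, the collator only needs to stack them and reuse the inputs as labels (the model shifts the labels internally for causal LM):

import torch

class CharLMDataCollator:
    # Sketch of a collator for pre-encoded, equal-length character sequences.
    def __call__(self, examples):
        # examples: a list of 1-D LongTensors, all padded to the same length
        input_ids = torch.stack(examples, dim=0)
        return {"input_ids": input_ids, "labels": input_ids.clone()}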
Has anyone effectively fine-tuned the enwik8 pre-trained model? Using Colab with a P100 GPU, I was not able to load the model yet due to memory limitations.
Unfortunately, I'm facing the same issue now.
Can you add a link to your notebook here @lucashueda ?
@lucashueda Did you manage to fine-tune the enwik8 pre-trained model, or other datasets? Would you mind sharing your Colab?
@epetros Did you manage to perform the fine-tuning?