I am trying to train a GPT2 model from scratch, but by looking into the code here https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py I noticed that there doesn't seem to be an implementation of a causal mask. Maybe it is in another repo and I missed it; I also couldn't find resources on this in the docs.
I could write an ugly for loop and feed each of my sequences to the network one token at a time, which would be super inefficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch, which is probably faster but doesn't feel very satisfying.
Do you know if there is a standard implementation of a causal mask that I missed, or another way to do what I am describing?
PS: I have already read Hugging Face's blog post on training from scratch, but unfortunately it doesn't say much about the implementation of said training :/
You'd want to look at the run_language_modeling.py script, which implements causal language modeling (just don't pass the --mlm flag).
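To illustrate the point, here is a minimal sketch (not the run_language_modeling.py script itself) of causal language modeling with a freshly initialized GPT-2: when you pass `labels`, the model shifts them internally and computes the LM loss itself, so you never have to build a mask or feed tokens one at a time. The exact return type differs between transformers versions; indexing the output for the loss is assumed here.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
config = GPT2Config()            # fresh config -> training from scratch
model = GPT2LMHeadModel(config)  # randomly initialised weights

# For causal LM the labels are just the input ids; the model shifts them
# internally and applies its own triangular (causal) attention mask.
input_ids = tokenizer.encode("Some training text goes here.", return_tensors="pt")
outputs = model(input_ids, labels=input_ids)

loss = outputs[0]                # first element of the output is the LM loss
loss.backward()
```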
I'm thinking some edits to the run_language_modeling.py script might make it work. I don't think simply not passing the --mlm flag solves the problem, @julien-c. Have you found any solution, @johncwok? I'm looking for the same thing.
@johncwok GPT2 always uses a causal mask. It's quite hidden in the code. This line https://github.com/huggingface/transformers/blob/0a4b1068e1d6c46525082b91a4ba00a09c9270ac/src/transformers/modeling_gpt2.py#L145 creates the causal mask that is then applied to the attention weights. The naming can definitely be improved here! So no matter what mask you pass in, it will only be applied in combination with the causal mask.
Also take a look at this line that creates the mask:
https://github.com/huggingface/transformers/blob/0a4b1068e1d6c46525082b91a4ba00a09c9270ac/src/transformers/modeling_gpt2.py#L107
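For anyone else looking for this, here is a small standalone sketch of the mechanism those lines implement: a lower-triangular ("bias") matrix hides attention to future positions before the softmax. This is an illustration of the idea, not a copy of the GPT-2 code (which stores the mask as a registered buffer and masks with a large negative constant instead of -inf).

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (q @ k^T)

# True on and below the diagonal: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future tokens
weights = torch.softmax(scores, dim=-1)                   # rows only attend to the past

print(weights)  # the upper triangle is exactly zero
```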
After https://github.com/huggingface/transformers/pull/2715/files is merged, I will do some renaming in the code - it seems like a lot of people look for the causal mask in GPT2, CTRL and GPT.