I use attention_mask when I call bert.forward(input, attention_mask). But with GPT, when I try to pass a batch of inputs to OpenAIGPTModel to extract a batch of features, and the sentences in the batch have different lengths, I don't know how to handle this. Or maybe no mask needs to be given at all? If so, is zero the padding index?
For reference, this is the code I use with BERT to extract embeddings:

```python
# att_mask is 1 for real tokens and 0 for padding positions
all_encoder_layers, pooled_output = self.bert(inputs[:, :seq_max_len],
                                              token_type_ids=None,
                                              attention_mask=att_mask.to(device))
# concatenate the last n encoder layers along the hidden dimension
embeds = torch.cat(all_encoder_layers[-self.bert_n_layers:], -1)
```
GPT is a causal model, so each token only attends to its left context and an attention mask is not really needed.
Just mask the output according to your sequence lengths (and make sure each input sample starts at the very first left position, i.e., pad on the right).
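Here is a minimal sketch of batched feature extraction following that advice. It assumes the `pytorch_pretrained_bert` package (the same one the BERT snippet above appears to use); the sentences, the pad id of 0, and the variable names are placeholders:

```python
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()

sentences = ["the cat sat on the mat", "hello world"]  # placeholder inputs
batch_ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in sentences]
lengths = torch.tensor([len(ids) for ids in batch_ids])
max_len = int(lengths.max())

# Right-pad so every sample starts at the leftmost position. The pad id
# (0 here) is arbitrary: causal attention means real tokens never attend
# to positions on their right, and the padded outputs are zeroed below.
input_ids = torch.zeros(len(batch_ids), max_len, dtype=torch.long)
for i, ids in enumerate(batch_ids):
    input_ids[i, :len(ids)] = torch.tensor(ids)

with torch.no_grad():
    hidden_states = model(input_ids)  # [batch, max_len, hidden_size]

# Mask the outputs according to the true sequence lengths.
mask = torch.arange(max_len)[None, :] < lengths[:, None]  # [batch, max_len]
features = hidden_states * mask.unsqueeze(-1).to(hidden_states.dtype)
```

No attention mask is passed to the model here: because the padding sits on the right of each sample, the real tokens see exactly the context they would see in an unpadded run, and the pad id does not matter once the corresponding outputs are masked out.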