The documentation does a great job of explaining the particularities of the BERT input features (input_ids, token_type_ids, etc.), however for some (if not most) tasks other input features are required, and I think it would help users if they were explained with examples.
Could we add examples to the documentation of how to get position_ids and head_mask for a given text input?
I have seen that they are accepted by the BertForSequenceClassification class (in pytorch_transformers/modeling_bert) and that they are explained in BERT_INPUTS_DOCSTRING, but I have not seen an example of how to build them.
The documentation says
position_ids: Indices of positions of each input sequence tokens in the position embeddings. Selected in the range : [0, config.max_position_embeddings - 1]
head_mask: Mask to nullify selected heads of the self-attention modules.
0 for masked and 1 for not masked
but it is not clear to me how to get them from a given text input.
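To make the question concrete, below is a rough sketch of what I currently assume these two inputs would look like for a batch of already-encoded sentences (this is only my guess, and the number of layers/heads for bert-base-uncased is an assumption on my part); an official example would confirm or correct it.
import torch

batch_size, seq_len = 2, 6
num_layers, num_heads = 12, 12  # assumed values for bert-base-uncased

# position_ids: the position index of each token, repeated for every example in the batch
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0).expand(batch_size, seq_len)

# head_mask: 1.0 keeps a head, 0.0 nullifies it; one row per layer
head_mask = torch.ones(num_layers, num_heads)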
I experimented with creating input features from a dataframe and came up with the function below, which tries to make each step of building the input features explicit. I think it could be useful for a tutorial, and I would like to add position_ids and head_mask to it.
import pandas as pd
import torch
from torch.utils.data import TensorDataset
from pytorch_transformers import BertTokenizer

q1 = {'text': ["Who was Jim Henson ?",
               "Jim Henson was an American puppeteer",
               "I love Mom's cooking",
               "I love you too !",
               "No way",
               "This is the kid",
               "Yes"],
      'label': [1, 0, 1, 1, 0, 1, 0]}

xdf = pd.DataFrame(q1)
xtokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def text_to_bertfeatures(df,
                         col_text,
                         col_labels=None,
                         max_length=6,
                         cls_token='[CLS]',
                         sep_token='[SEP]'):
    '''Create a TensorDataset with BERT input features.
    input:
        - data frame with a column for text and (optionally) a column for labels
        - maximum sequence length
        - special tokens
    output:
        TensorDataset with
        **input_ids**: Indices of input sequence tokens in the vocabulary.
        **token_type_ids**: Segment token indices to indicate first and second portions
            of the inputs. 0 for sentence A and 1 for sentence B.
            In the glue example they are called *segment_ids*.
        **attention_mask**: Mask to avoid performing attention on padding token indices.
            0 for masked and 1 for not masked.
            In the glue example they are called *input_mask*.
        **labels** (if specified)
    TO DO:
        This is for tasks requiring a single "sequence/sentence" input,
        like classification; it could be modified for two-sentence tasks.
        Eventually add an option to pad on the left.
    '''
    xlst_text = df[col_text]
    # input text with special tokens
    x_input_txt_sptokens = [cls_token + ' ' + x + ' ' + sep_token for x in xlst_text]
    # input tokens
    x_input_tokens = [xtokenizer.tokenize(x_text) for x_text in x_input_txt_sptokens]
    # input ids
    x_input_ids_int = [xtokenizer.convert_tokens_to_ids(xtoks) for xtoks in x_input_tokens]
    # truncate to the maximal length (note: this can cut off the trailing [SEP] token)
    x_input_ids_maxlen = [xtoks[0:max_length] for xtoks in x_input_ids_int]
    # input padded with zeros on the right
    x_input_ids_padded = [xtoks + [0] * (max_length - len(xtoks)) for xtoks in x_input_ids_maxlen]
    # token_type_ids: all 0 since there is only one sentence (sentence A)
    token_type_ids_int = [[0 for x in tok_ids] for tok_ids in x_input_ids_padded]
    # attention mask: 1 for real tokens, 0 for padding
    attention_mask_int = [[int(x > 0) for x in tok_ids] for tok_ids in x_input_ids_padded]
    # inputs to tensors
    input_ids = torch.tensor(x_input_ids_padded, dtype=torch.long)
    token_type_ids = torch.tensor(token_type_ids_int, dtype=torch.long)
    attention_mask = torch.tensor(attention_mask_int, dtype=torch.long)
    # labels if any
    if col_labels:
        labels_int = [int(x) for x in list(df[col_labels])]
        labels = torch.tensor(labels_int, dtype=torch.long)
        xdset = TensorDataset(input_ids, token_type_ids, attention_mask, labels)
    else:
        xdset = TensorDataset(input_ids, token_type_ids, attention_mask)
    return xdset
text_to_bertfeatures(df=xdf,
                     col_text='text',
                     col_labels='label',
                     max_length=6,
                     cls_token='[CLS]',
                     sep_token='[SEP]')
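For completeness, this is roughly how I would then feed the resulting dataset to a model (just a sketch; the model class, batch size and num_labels are arbitrary choices for this toy example):
from torch.utils.data import DataLoader
from pytorch_transformers import BertForSequenceClassification

xdset = text_to_bertfeatures(df=xdf, col_text='text', col_labels='label', max_length=6)
xloader = DataLoader(xdset, batch_size=2)

xmodel = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
xmodel.eval()

for input_ids, token_type_ids, attention_mask, labels in xloader:
    with torch.no_grad():
        outputs = xmodel(input_ids,
                         token_type_ids=token_type_ids,
                         attention_mask=attention_mask,
                         labels=labels)
    loss, logits = outputs[:2]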
Hi,
If you read the documentation here, you will see that position_ids and head_mask are not required inputs; they are optional.
No need to give them if you don't want to (and you probably don't, unless you are doing complex stuff like custom positioning or head masking).
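For example, a plain forward pass without them works out of the box; this quick illustration (roughly following the README example) only passes input_ids, and the model falls back to its default positions and no head masking:
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# only input_ids are passed; position_ids and head_mask keep their default (None)
input_ids = torch.tensor([tokenizer.encode("Who was Jim Henson ?")])
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_state = outputs[0]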
Thanks Thomas.
Very helpful comment about this being needed only for custom positioning.
In my case I indeed do not need it.
I am closing the issue to avoid clogging the list of open issues.
P.S.: I also take this occasion to thank you (and all the other contributors) for this amazing work.
We do not take for granted that the most advanced models are made accessible so soon after their publication. Thank you.