The documentation does a great job of explaining the particularities of the BERT input features (input_ids, token_type_ids, etc.), however for some (if not most) tasks other input features are required, and I think it would help users if they were explained with examples.
Could we add examples to the documentation of how to get position_ids and head_mask for a given text input?
I have seen that they are accepted by the BertForSequenceClassification class (in pytorch_transformers/modeling_bert) and that they are explained in BERT_INPUTS_DOCSTRING, but I have not seen an example of how to build them.
The documentation says
position_ids: Indices of positions of each input sequence tokens in the position embeddings. Selected in the range : [0, config.max_position_embeddings - 1]
head_mask: Mask to nullify selected heads of the self-attention modules.
0 for masked and 1 for not masked
but it is not clear to me how to get them from a given text input.
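To make the question concrete, below is a rough sketch of what I currently assume these two inputs would look like for a batch of already-encoded sentences (this is only my guess, and the number of layers/heads for bert-base-uncased is an assumption on my part); an official example would confirm or correct it.
import torch

batch_size, seq_len = 2, 6
num_layers, num_heads = 12, 12  # assumed values for bert-base-uncased

# position_ids: the position index of each token, repeated for every example in the batch
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0).expand(batch_size, seq_len)

# head_mask: 1.0 keeps a head, 0.0 nullifies it; one row per layer
head_mask = torch.ones(num_layers, num_heads)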
I experimented with creating input features from a dataframe and came up with the function below, which tries to make each step of building the input features explicit. I think it could be useful for a tutorial, and I would like to add position_ids and head_mask to it.
import pandas as pd
import torch
from torch.utils.data import TensorDataset
from pytorch_transformers import BertTokenizer

q1 = {'text': ["Who was Jim Henson ?",
               "Jim Henson was an American puppeteer",
               "I love Mom's cooking",
               "I love you too !",
               "No way",
               "This is the kid",
               "Yes"],
      'label': [1, 0, 1, 1, 0, 1, 0]}

xdf = pd.DataFrame(q1)
xtokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def text_to_bertfeatures(df,
                         col_text,
                         col_labels=None,
                         max_length=6,
                         cls_token='[CLS]',
                         sep_token='[SEP]'):
    '''Create a TensorDataset with BERT input features.
    input:
        - data frame with a column for text and (optionally) a column for labels
        - maximum sequence length
        - special tokens
    output:
        TensorDataset with
        **input_ids**: Indices of input sequence tokens in the vocabulary.
        **token_type_ids**: Segment token indices to indicate first and second portions
            of the inputs. 0 for sentence A and 1 for sentence B.
            In the glue example they are called *segment_ids*.
        **attention_mask**: Mask to avoid performing attention on padding token indices.
            0 for masked and 1 for not masked.
            In the glue example they are called *input_mask*.
        **labels** (if specified)
    TO DO:
        This is for tasks requiring a single "sequence/sentence" input,
        like classification; it could be modified for two-sentence tasks.
        Eventually add an option to pad on the left.
    '''
    xlst_text = df[col_text]
    # input text with special tokens
    x_input_txt_sptokens = [cls_token + ' ' + x + ' ' + sep_token for x in xlst_text]
    # input tokens
    x_input_tokens = [xtokenizer.tokenize(x_text) for x_text in x_input_txt_sptokens]
    # input ids
    x_input_ids_int = [xtokenizer.convert_tokens_to_ids(xtoks) for xtoks in x_input_tokens]
    # truncate to the maximal length (note: this can cut off the trailing [SEP] token)
    x_input_ids_maxlen = [xtoks[0:max_length] for xtoks in x_input_ids_int]
    # input padded with zeros on the right
    x_input_ids_padded = [xtoks + [0] * (max_length - len(xtoks)) for xtoks in x_input_ids_maxlen]
    # token_type_ids: all 0 since there is only one sentence (sentence A)
    token_type_ids_int = [[0 for x in tok_ids] for tok_ids in x_input_ids_padded]
    # attention mask: 1 for real tokens, 0 for padding
    attention_mask_int = [[int(x > 0) for x in tok_ids] for tok_ids in x_input_ids_padded]
    # inputs to tensors
    input_ids = torch.tensor(x_input_ids_padded, dtype=torch.long)
    token_type_ids = torch.tensor(token_type_ids_int, dtype=torch.long)
    attention_mask = torch.tensor(attention_mask_int, dtype=torch.long)
    # labels if any
    if col_labels:
        labels_int = [int(x) for x in list(df[col_labels])]
        labels = torch.tensor(labels_int, dtype=torch.long)
        xdset = TensorDataset(input_ids, token_type_ids, attention_mask, labels)
    else:
        xdset = TensorDataset(input_ids, token_type_ids, attention_mask)
    return xdset
text_to_bertfeatures(df=xdf,
                     col_text='text',
                     col_labels='label',
                     max_length=6,
                     cls_token='[CLS]',
                     sep_token='[SEP]')
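For completeness, this is roughly how I would then feed the resulting dataset to a model (just a sketch; the model class, batch size and num_labels are arbitrary choices for this toy example):
from torch.utils.data import DataLoader
from pytorch_transformers import BertForSequenceClassification

xdset = text_to_bertfeatures(df=xdf, col_text='text', col_labels='label', max_length=6)
xloader = DataLoader(xdset, batch_size=2)

xmodel = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
xmodel.eval()

for input_ids, token_type_ids, attention_mask, labels in xloader:
    with torch.no_grad():
        outputs = xmodel(input_ids,
                         token_type_ids=token_type_ids,
                         attention_mask=attention_mask,
                         labels=labels)
    loss, logits = outputs[:2]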
Hi,
If you read the documentation here, you will see that position_ids and head_mask are not required inputs; they are optional.
No need to give them if you don't want to (and you probably don't, unless you are doing complex stuff like custom positioning or head masking).
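For example, a plain forward pass without them works out of the box; this quick illustration (roughly following the README example) only passes input_ids, and the model falls back to its default positions and no head masking:
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# only input_ids are passed; position_ids and head_mask keep their default (None)
input_ids = torch.tensor([tokenizer.encode("Who was Jim Henson ?")])
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_state = outputs[0]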
Thanks Thomas.
Very helpful comment about this being needed only for custom positioning.
In my case I indeed do not need it.
I am closing the issue to avoid clogging the list of open issues.
P.S.: I also take this occasion to thank you (and all the other contributors) for this amazing work.
We do not take for granted that the most advanced models are made accessible so soon after their publication. Thank you.