Bert: What does the type token mean in modeling.py

Created on 1 Nov 2018 · 5 Comments · Source: google-research/bert

In modeling.py, the BertModel class calls "embedding_postprocessor", which uses a token type. Does this correspond to segment A and segment B in next sentence prediction? If so, the token type vocabulary size ("type_vocab_size") should be 2, is that right? Thank you.

Most helpful comment

Yes, it's "Segment A" and "Segment B" embeddings. We changed the name for clarity in the paper but forgot to update the code. They are also used for multi-sentence tasks like MultiNLI and SQuAD.

Yes, type_vocab_size is 2 in all cases. You shouldn't need to set this manually unless you're constructing BertConfig from scratch rather than from a JSON file. If you look at bert_config.json, you'll see a line that says "type_vocab_size": 2.
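
To make the segment A / segment B mapping concrete, here is a minimal sketch of how token type ids are typically laid out for a sentence pair: 0 for [CLS], segment A and its [SEP], and 1 for segment B and its [SEP]. The helper name build_token_type_ids and the example tokens are illustrative assumptions, not code from the repository.

```python
def build_token_type_ids(tokens_a, tokens_b=None):
    """Return (tokens, token_type_ids) for one segment or a segment pair.

    Hypothetical helper for illustration: segment A positions get id 0,
    segment B positions get id 1.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    token_type_ids = [0] * len(tokens)
    if tokens_b:
        segment_b = list(tokens_b) + ["[SEP]"]
        tokens += segment_b
        token_type_ids += [1] * len(segment_b)
    return tokens, token_type_ids


TYPE_VOCAB_SIZE = 2  # the value stated in this thread

tokens, token_type_ids = build_token_type_ids(
    ["is", "this", "segment", "a"], ["yes", "it", "is"])
for token, segment_id in zip(tokens, token_type_ids):
    print(f"{token:10s} {segment_id}")

# Every id must index a row of the token type embedding table.
assert max(token_type_ids) < TYPE_VOCAB_SIZE
```

With only two segments, the ids never exceed 1, which is why a type vocabulary of size 2 is sufficient.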

All 5 comments

Thank you for your kind answer, but there is no JSON file in the GitHub repository.

The file is found in the pre-trained BERT models, once unzipped.

Yes, if you're pre-training a model from scratch, you technically don't need to download anything.

However, I would recommend downloading a pre-trained model from this section (choose BERT-Base, Uncased) and then making sure that you can reproduce the fine-tuning results for at least one task. This will also give you a bert_config.json file that you can use as the basis for your model.
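
As a rough sketch of checking the value once you have the file, the snippet below reads type_vocab_size with the standard json module. The function name read_type_vocab_size is hypothetical, and the temporary file is a stand-in that contains only the one field quoted in this thread; a real bert_config.json has more fields.

```python
import json
import os
import tempfile


def read_type_vocab_size(config_path):
    """Read the "type_vocab_size" field from a BERT-style JSON config."""
    with open(config_path, "r", encoding="utf-8") as config_file:
        config = json.load(config_file)
    return config["type_vocab_size"]


# Stand-in config for demonstration only; point config_path at the
# bert_config.json from an unzipped pre-trained model instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "bert_config.json")
    with open(config_path, "w", encoding="utf-8") as config_file:
        json.dump({"type_vocab_size": 2}, config_file)
    type_vocab_size = read_type_vocab_size(config_path)

print(type_vocab_size)
```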

Then why is the default value of token_type_vocab_size 16 in the code below? Do you think it makes sense to change it to 2?
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
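
One way to reason about this question: if the token type ids are only ever 0 or 1, a 16-row embedding table behaves the same as a 2-row one, because rows 2 through 15 are never selected. The sketch below illustrates this with a one-hot lookup in plain Python. The table values, width, and helper names are assumptions for illustration, not the repository's implementation. My understanding is that the value read from the config (2, per the answer above) is what the model uses in practice, so the 16 default would only matter when the function is called without that argument.

```python
import random


def make_table(num_rows, width, seed=0):
    """Build a small random embedding table (hypothetical values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.02, 0.02) for _ in range(width)]
            for _ in range(num_rows)]


def one_hot(index, depth):
    """Return a one-hot row vector of length `depth`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def lookup(token_type_ids, table):
    """Look up embeddings by multiplying one-hot vectors with the table."""
    depth, width = len(table), len(table[0])
    output = []
    for type_id in token_type_ids:
        selector = one_hot(type_id, depth)
        output.append([sum(selector[row] * table[row][col]
                           for row in range(depth))
                       for col in range(width)])
    return output


table_16 = make_table(num_rows=16, width=4)  # default size from the signature
table_2 = table_16[:2]                       # only rows 0 and 1

segment_ids = [0, 0, 0, 1, 1]  # segment A, then segment B

# Both tables give identical embeddings for ids in {0, 1}.
assert lookup(segment_ids, table_16) == lookup(segment_ids, table_2)
```

Under these assumptions, a larger default costs a few unused parameters but does not change the result, whereas a size smaller than the largest id in use would break the lookup.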
