BERT: What does the type token mean in modeling.py?

Created on 1 Nov 2018 · 5 Comments · Source: google-research/bert

In modeling.py, the BertModel class calls "embedding_postprocessor", which uses a token type. Is this the Segment A and Segment B distinction from next sentence prediction? If so, the token type vocabulary size ("type_vocab_size") should be 2, is that right? Thank you.

liweitj47

Most helpful comment

Yes, it's "Segment A" and "Segment B" embeddings. We changed the name for clarity in the paper but forgot to update the code. They are also used for multi-sentence tasks like MultiNLI and SQuAD.

Yes, type_vocab_size is 2 in all cases. You shouldn't need to set this manually unless you're constructing BertConfig from scratch rather than from a JSON file. If you look at bert_config.json, you'll see a line that says "type_vocab_size": 2.

jacobdevlin-google on 1 Nov 2018

👍9
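
To make the answer above concrete, here is a minimal sketch of how segment ids (the "token type" ids) are usually laid out for a sentence pair: 0 for [CLS], Segment A and its [SEP], and 1 for Segment B and its [SEP]. The helper name build_segment_ids and the example tokens are hypothetical and written for illustration; they are not copied from the repository.

```python
from typing import List, Optional, Tuple


def build_segment_ids(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Return (tokens, token_type_ids) for one sentence or a sentence pair.

    Hypothetical helper for illustration only.
    """
    # [CLS], Segment A, and the [SEP] that closes Segment A all get type 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)
    if tokens_b:
        # Segment B and the [SEP] that closes it get type 1.
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


# Made-up, already-tokenized example pair.
tokens, token_type_ids = build_segment_ids(
    ["is", "this", "jack", "##son", "##ville", "?"],
    ["no", "it", "is", "not", "."],
)
for token, type_id in zip(tokens, token_type_ids):
    print(type_id, token)
```

Because only the ids 0 and 1 ever appear, a token type vocabulary of size 2 is enough to cover every position.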

All 5 comments

Thank you for your kind answer, but there is no JSON file in the GitHub repository.

liweitj47 on 1 Nov 2018

The file is found in the pre-trained BERT models, once unzipped.

CapitalZe on 1 Nov 2018

Yes, if you're pre-training a model from scratch, you technically don't need to download anything.

However, I would recommend downloading a pre-trained model from this section (choose BERT-Base, Uncased) and then making sure that you can reproduce the fine-tuning results for at least one task. This will also give you a bert_config.json file that you can use as the basis for your model.

jacobdevlin-google on 1 Nov 2018

👍1
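
Once a bert_config.json is available, checking the value takes a few lines of standard-library Python. The sketch below writes an abbreviated, illustrative config to a temporary directory so that it runs on its own. The hidden_size entry is included only as an example field, and a real file from an unzipped model contains more settings.

```python
import json
import os
import tempfile

# Abbreviated, illustrative config (an assumption for this sketch);
# a real bert_config.json has more fields.
example_config = {"hidden_size": 768, "type_vocab_size": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the path to the file inside an unzipped model directory.
    config_path = os.path.join(tmp_dir, "bert_config.json")
    with open(config_path, "w") as f:
        json.dump(example_config, f)

    # Read the config back the way you would read a downloaded one.
    with open(config_path) as f:
        config = json.load(f)

type_vocab_size = config["type_vocab_size"]
print("type_vocab_size =", type_vocab_size)
```

With a downloaded model, you would point config_path at the bert_config.json in the unzipped directory instead of writing the file yourself.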

Then why is the default value of token_type_vocab_size 16 in the code below? Do you think it makes sense to change it to 2?
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,