Hi,
I was wondering whether the team could extend BERT so that fine-tuning with newly defined special tokens would be possible, just like GPT allows.
@thomwolf Could you share your thoughts on that?
Regards,
Adrian.
Hi Adrian, BERT already has a few unused tokens that can be used similarly to the special_tokens
of GPT/GPT-2.
For more details see https://github.com/google-research/bert/issues/9#issuecomment-434796704 and issue #405 for instance.
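For anyone who lands here later, here is a minimal sketch of how one of those unused tokens can be repurposed with the tokenizer in this repo. It assumes the standard bert-base-uncased vocabulary, which contains a block of [unused*] entries ([unused0], [unused1], ...); the id is looked up directly in tokenizer.vocab because the basic tokenizer would otherwise split the bracketed string. Since these entries received essentially no signal during pretraining, their embedding rows only become meaningful after fine-tuning with the embedding layer left trainable.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Repurpose one of the [unused*] vocabulary entries as a custom special token.
# Look the id up directly in the vocab so the basic tokenizer cannot split the
# bracketed string into '[', 'unused0', ']'.
my_special_id = tokenizer.vocab['[unused0]']
cls_id = tokenizer.vocab['[CLS]']
sep_id = tokenizer.vocab['[SEP]']

text_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('hello world'))

# Splice the repurposed token into the input by id.
input_ids = torch.tensor([[cls_id, my_special_id] + text_ids + [sep_id]])

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids)
```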
If we use an unused special token from the vocabulary, is it enough to fine-tune on a classification task, or do we need to train an embedding from scratch? Has anyone already done this?
Two different but somewhat related questions I had when looking into the implementation:
1) The BERT paper mentions a (learned) positional embedding. How is this implemented here? examples/extract_features/convert_examples_to_features() defines tokens (the representation), input_type_ids (distinguishing the first and second sequence) and an input_mask (distinguishing padding from real tokens), but no positional embedding. Is this handled internally?
2) Can I use a special token as input_type_ids for BERT? In the classification example, only the values [0, 1] are possible, and I'm wondering what would happen if I chose a special token instead. Is this possible with a pretrained embedding, or would I need to retrain the whole embedding as a consequence?
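Not an authoritative answer, but a quick way to see both of these in the model itself, as a sketch against this repo's BertModel and assuming bert-base-uncased: the learned position embeddings are created inside the embedding layer from the token positions, which is why convert_examples_to_features never builds them, and the token-type (segment) embedding table was pretrained with only two rows, so input_type_ids outside {0, 1} would index out of range unless that table were resized and retrained.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# 1) The learned position embeddings live inside the model itself: the
#    embedding layer adds them based on each token's position, so the
#    feature-conversion code never has to supply them explicitly.
print(model.embeddings.position_embeddings)    # Embedding(512, 768)

# 2) token_type_ids index a separate, very small embedding table that was
#    pretrained with only two rows (sentence A / sentence B), so values
#    outside {0, 1} fall outside the pretrained weights.
print(model.embeddings.token_type_embeddings)  # Embedding(2, 768)

tokens = ['[CLS]', 'hello', '[SEP]', 'world', '[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
token_type_ids = torch.tensor([[0, 0, 0, 1, 1]])  # only 0 or 1 is valid here

with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids, token_type_ids=token_type_ids)
```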