Transformers: issue when adding vocabulary to the OpenAI GPT2 tokenizer

Created on 27 Jul 2019 · 2 comments · Source: huggingface/transformers

Hi,
I am trying to add a few vocabulary tokens to the GPT-2 tokenizer,
but there seem to be a few problems with adding vocab.

Let's say I want to make a sequence like

"__bos__" + sequence A + "__seperator__" + sequence B + "__seperator__" + sequence C + "__eos__"

This means that I have to add the "__bos__", "__seperator__", and "__eos__" tokens to the tokenizer.
I found that the <|endoftext|> token is already in the vocab list, but I wanted to use my own special
symbols to mark how the input sequence should be treated.
However, when I successfully added the tokens to the vocab list by changing some of the code
in the 'tokenization_utils.py' file as shown below,

# mark this line of code as a comment
# if self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):

it works fine at the training stage, but the index mapping turns out completely different in the
evaluation phase.
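
For reference, a minimal sketch of the supported way to register such tokens without editing tokenization_utils.py, assuming a recent transformers release with the GPT2Tokenizer / GPT2LMHeadModel classes and the public add_special_tokens API (the token names simply mirror the ones above):

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the custom markers as special tokens; this appends new ids
# to the end of the vocab instead of patching the library code.
tokenizer.add_special_tokens({
    "bos_token": "__bos__",
    "eos_token": "__eos__",
    "additional_special_tokens": ["__seperator__"],
})

# Grow the embedding matrix so the new ids have rows to look up.
model.resize_token_embeddings(len(tokenizer))

# The sequence layout from above can then be encoded directly.
ids = tokenizer.encode("__bos__ sequence A __seperator__ sequence B __eos__")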

Should I instead pick a random but unused token that is already in the tokenizer's vocab list
and repurpose it as my special token?
For example, if some random "^&*" token existed in the vocab list,
I would use that token as my __bos__ token instead.

Anyway, thank you for open-sourcing such a legendary library!
Thank you very much :)


All 2 comments

What specifically did you change in tokenization_utils.py?

it works fine at the training stage, but the index mapping turns out completely different in the
evaluation phase.

Can you elaborate on what you mean? Perhaps post some output? Is it hanging? Or are you just getting wildly poor performance once you move to eval?

@brendanxwhitaker
thanks for asking!! :)
I found a solution thanks to #799.
The problem was solved by adding the line

model.resize_token_embeddings(len(tokenizer))

when reloading my model!
The problem was that I had skipped the part
where the model's embedding matrix has to be resized to match
the vocab size after the new tokens are added.
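
In case it helps, here is roughly what the evaluation side looks like with that line in place. This is only a sketch: the checkpoint path is a placeholder, and the tokenizer has to be rebuilt with the same added tokens as at training time.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Rebuild the tokenizer with the same special tokens used for training.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({
    "bos_token": "__bos__",
    "eos_token": "__eos__",
    "additional_special_tokens": ["__seperator__"],
})

# Reload the base model, resize it to the enlarged vocab first,
# and only then load the fine-tuned weights (placeholder path).
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
model.load_state_dict(torch.load("finetuned_gpt2.pt"))
model.eval()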

Thank you ! :)
