Is it possible to add special tokens to the pre-trained BART model? My text uses <s> as a sequence separator between sentences, and I would like the encoder to handle it as a single token; otherwise the tokenizer will break it into pieces like <s or s> and learn those instead. Can this be done in the same way we did it for other tokenizers like GPT2Tokenizer?
tokenizer = GPT2Tokenizer.from_pretrained(
    args.out,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    additional_special_tokens=["<startoflyrics>", "<endoflyrics>", "<nl>"],
)
Hi!
Two points that might be helpful:
1. The add_special_tokens functionality should work the same as for RobertaTokenizer.
2. <s> is already the bos token, so I don't expect it to be broken up.
Let me know if that resolves your issue, thanks!
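A quick way to check is to tokenize a string that contains <s> and confirm it survives as a single piece (a minimal sketch; exact outputs may differ slightly across transformers versions):
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# "<s>" is BART's bos token, so it should not be split into sub-pieces
print(tokenizer.tokenize("first sentence <s> second sentence"))
print(tokenizer.convert_tokens_to_ids("<s>"))  # should equal tokenizer.bos_token_id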
Not sure how to go about doing this. @sshleifer, any code example?
I see BartTokenizer is essentially RobertaTokenizer, which in turn is GPT2Tokenizer. For fine-tuning BART in lightning_base we have:
self.tokenizer = AutoTokenizer.from_pretrained(
    self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
    cache_dir=cache_dir,
)
Can we add the list of special tokens here? If so, how?
Are you trying to add tokens to the vocab and give them new ids?
A specific example with what you expect the tokenizer to produce would be helpful.
I tried the following and it doesn't work as OP intended, afaict.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained(
    'facebook/bart-large',
    additional_special_tokens=["<startoflyrics>", 'dringus'],
)
encoded = tokenizer.encode_plus(' <startoflyrics> dringus')['input_ids']  # [0, 3, 3, 2]
tokenizer.decode(encoded)  # '<s><unk><unk></s>'
Yes @sshleifer, I want to add new tokens to the vocab and give them new ids. How do I go about doing it?
@patrickvonplaten @LysandreJik what is the canonical way to add new non-special tokens?
(1) Is there an easier way than making a new vocab and merges file?
(2) If not, is there an example of how to do that?
I am only familiar with the add_special_tokens functionality for new tokens that get the "special tokens" treatment.
For normal tokens, one can use add_tokens as far as I know.
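A minimal sketch of that route (the new token strings here are just placeholders; assume any BART checkpoint):
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# add_tokens registers plain (non-special) tokens and returns how many were actually new
num_added = tokenizer.add_tokens(["<startoflyrics>", "<endoflyrics>", "<nl>"])

# the embedding matrix needs rows for the new ids
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.encode("<startoflyrics> some lyrics <endoflyrics>"))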
self.tokenizer = AutoTokenizer.from_pretrained(
    self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
    cache_dir=cache_dir,
)
self.model = MODEL_MODES[mode].from_pretrained(
    self.hparams.model_name_or_path,
    from_tf=bool(".ckpt" in self.hparams.model_name_or_path),
    config=self.config,
    cache_dir=cache_dir,
)
self.tokenizer.add_tokens(['multi-sentence', ':snt1', ':snt2', ':snt3', ':snt4', ':snt5', ':snt5', ':snt6', ':snt7', ':snt8', ':snt9', ':root', ':ARG1', ':mod', ':op1', ':ARG0', ':ARG0-of', ':name', ':op2', ':ARG2', ':ARG1-of', ':purpose', ':prep-in', ':time', ':li', ':quant', ':unit', ':poss', ':ARG3', ':location', ':domain', ':part-of', ':manner', ':polarity', ':condition', ':ARG4', ':extent', ':time-of', ':location-of', ':op3', ':beneficiary', ':topic', ':degree', ':ARG2-of', ':example', ':extent-of', ':month', ':day', ':op4', ':ARG5', ':manner-of', ':concession', ':duration', ':path', ':mode', ':medium', ':ord', ':value', ':destination', ':source', ':direction', ':instrument-of', ':consist-of', ':dayperiod', ':frequency', ':year', ':quant-of', ':weekday', ':compared-to', ':prep-on', ':ARG3-of', ':degree-of', ':prep-as', ':instrument', ':op5', ':prep-from', ':prep-to', ':century', ':era', ':condition-of', ':op6', ':op7', ':concession-of', ':polite', ':age', ':prep-with', ':decade', ':poss-of', ':prep-without', ':prep-in-addition-to', ':accompanier', ':ord-of', ':direction-of', ':prep-against', ':prep-at', ':subevent-of', ':snt10', ':snt11', ':duration-of', ':prep-for', ':source-of', ':frequency-of', ':topic-of', ':season', ':path-of', ':op8', ':op9', ':prep-among', ':prep-on-behalf-of', ':subevent', ':part', ':ARG4-of', ':beneficiary-of', ':scale', ':example-of', ':prep-by', ':range', ':purpose-of', ':destination-of', ':op10', ':op1-of', ':name-of', ':medium-of', ':prep-along-with', ':conj-as-if', ':timezone', ':prep-under', ':accompanier-of', ':age-of', ':op11', ':op12', ':op13', ':op14', ':op15', ':prep-amid', ':prep-toward', ':prep-out-of', ':prep-into', ':domain-of', ':ARG7', ':quarter', ':ARG5-of', ':op16', ':op17', ':op18', ':op19', ':op20', ':ARG8', ':ARG9', ':calendar', ':year2', ':ARG6', ':subset-of', ':prep-with-of'])
self.model.resize_token_embeddings(len(self.tokenizer))
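(The resize_token_embeddings call matters: without it the model's embedding matrix has no rows for the newly added token ids, and you would hit an index error as soon as one of the new tokens appears in a batch.)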
This worked for me
Yes, the add_tokens that @patrickvonplaten and @tuhinjubcse mentioned should get the job done.