Is it possible to add special tokens to the pre-trained BART model? My text uses <s> as a sequence separator between sentences, and I would like the encoder to handle it as a single token; otherwise the tokenizer will break it into pieces like <s or s> and learn those instead. Can this be done in the same way we did it for other tokenizers like GPT2Tokenizer?
tokenizer = GPT2Tokenizer.from_pretrained(
    args.out,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    additional_special_tokens=["<startoflyrics>", "<endoflyrics>", "<nl>"],
)
Hi!
Two points that might be helpful:
1. The add_special_tokens functionality should work the same as for RobertaTokenizer.
2. <s> is already the bos token, so I don't expect it to be broken up.
Let me know if that resolves your issue, thanks!
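A quick way to check is to tokenize a string that contains <s> and confirm it survives as a single piece (a minimal sketch; exact outputs may differ slightly across transformers versions):
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# "<s>" is BART's bos token, so it should not be split into sub-pieces
print(tokenizer.tokenize("first sentence <s> second sentence"))
print(tokenizer.convert_tokens_to_ids("<s>"))  # should equal tokenizer.bos_token_id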
Not sure how to go about doing this. @sshleifer, any code example?
I see BartTokenizer is essentially RobertaTokenizer, which in turn is GPT2Tokenizer. For fine-tuning BART in lightning_base we have:
self.tokenizer = AutoTokenizer.from_pretrained(
    self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
    cache_dir=cache_dir,
)
Can we add the list of special tokens here? If so, how?
Are you trying to add tokens to the vocab and give them new ids?
A specific example with what you expect the tokenizer to produce would be helpful.
I tried the following and it doesn't work as OP intended, afaict.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained(
    'facebook/bart-large',
    additional_special_tokens=["<startoflyrics>", 'dringus'],
)
encoded = tokenizer.encode_plus(' <startoflyrics> dringus')['input_ids']  # [0, 3, 3, 2]
tokenizer.decode(encoded)  # '<s><unk><unk></s>'
Yes @sshleifer, I want to add new tokens to the vocab and give them new ids. How do I go about doing it?
@patrickvonplaten @LysandreJik what is the canonical way to add new non-special tokens?
(1) Is there an easier way than making a new vocab and merges file?
(2) If not, is there an example of how to do that?
I am only familiar with the add_special_tokens functionality for new tokens that get the "special tokens" treatment.
For normal tokens, one can use add_tokens as far as I know.
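A minimal sketch of that route (the new token strings here are just placeholders; assume any BART checkpoint):
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# add_tokens registers plain (non-special) tokens and returns how many were actually new
num_added = tokenizer.add_tokens(["<startoflyrics>", "<endoflyrics>", "<nl>"])

# the embedding matrix needs rows for the new ids
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.encode("<startoflyrics> some lyrics <endoflyrics>"))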
self.tokenizer = AutoTokenizer.from_pretrained(
    self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
    cache_dir=cache_dir,
)
self.model = MODEL_MODES[mode].from_pretrained(
    self.hparams.model_name_or_path,
    from_tf=bool(".ckpt" in self.hparams.model_name_or_path),
    config=self.config,
    cache_dir=cache_dir,
)
self.tokenizer.add_tokens(['multi-sentence', ':snt1', ':snt2', ':snt3', ':snt4', ':snt5', ':snt5', ':snt6', ':snt7', ':snt8', ':snt9', ':root', ':ARG1', ':mod', ':op1', ':ARG0', ':ARG0-of', ':name', ':op2', ':ARG2', ':ARG1-of', ':purpose', ':prep-in', ':time', ':li', ':quant', ':unit', ':poss', ':ARG3', ':location', ':domain', ':part-of', ':manner', ':polarity', ':condition', ':ARG4', ':extent', ':time-of', ':location-of', ':op3', ':beneficiary', ':topic', ':degree', ':ARG2-of', ':example', ':extent-of', ':month', ':day', ':op4', ':ARG5', ':manner-of', ':concession', ':duration', ':path', ':mode', ':medium', ':ord', ':value', ':destination', ':source', ':direction', ':instrument-of', ':consist-of', ':dayperiod', ':frequency', ':year', ':quant-of', ':weekday', ':compared-to', ':prep-on', ':ARG3-of', ':degree-of', ':prep-as', ':instrument', ':op5', ':prep-from', ':prep-to', ':century', ':era', ':condition-of', ':op6', ':op7', ':concession-of', ':polite', ':age', ':prep-with', ':decade', ':poss-of', ':prep-without', ':prep-in-addition-to', ':accompanier', ':ord-of', ':direction-of', ':prep-against', ':prep-at', ':subevent-of', ':snt10', ':snt11', ':duration-of', ':prep-for', ':source-of', ':frequency-of', ':topic-of', ':season', ':path-of', ':op8', ':op9', ':prep-among', ':prep-on-behalf-of', ':subevent', ':part', ':ARG4-of', ':beneficiary-of', ':scale', ':example-of', ':prep-by', ':range', ':purpose-of', ':destination-of', ':op10', ':op1-of', ':name-of', ':medium-of', ':prep-along-with', ':conj-as-if', ':timezone', ':prep-under', ':accompanier-of', ':age-of', ':op11', ':op12', ':op13', ':op14', ':op15', ':prep-amid', ':prep-toward', ':prep-out-of', ':prep-into', ':domain-of', ':ARG7', ':quarter', ':ARG5-of', ':op16', ':op17', ':op18', ':op19', ':op20', ':ARG8', ':ARG9', ':calendar', ':year2', ':ARG6', ':subset-of', ':prep-with-of'])
self.model.resize_token_embeddings(len(self.tokenizer))
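(The resize_token_embeddings call matters: without it the model's embedding matrix has no rows for the newly added token ids, and you would hit an index error as soon as one of the new tokens appears in a batch.)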
This worked for me
Yes, the add_tokens that @patrickvonplaten and @tuhinjubcse mentioned should get the job done.