Is it necessary to add [CLS] and [SEP] tokens in the case of XLNet transformers?
Thanks!
*I used only the tokenizer.encode() function, even when a sample had several sentences, and I didn't set any special tokens. I think that was not the right way, was it? It was done for a classification task.
Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:
A [SEP] B [SEP][CLS]
You can either create those yourself or use the add_special_tokens flag of the encode function as follows:
tokenizer.encode(a, b, add_special_tokens=True)
which will return the correct list of tokens according to the tokenizer you used (which should be the XLNetTokenizer in your case).
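For a concrete sketch (the checkpoint name and the example sentences below are illustrative, not from this thread):

```python
# Minimal sketch: encoding a sentence pair for XLNet classification.
# "xlnet-base-cased" is the standard pretrained checkpoint; in older
# versions the package was named pytorch_transformers instead.
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

a = "The cat sat on the mat."
b = "It looked very comfortable."

ids = tokenizer.encode(a, b, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(ids))
# [... tokens of a ..., '<sep>', ... tokens of b ..., '<sep>', '<cls>']
```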
@LysandreJik how do you deal with more than two sentences? In the same way?
I'm not a dev of this lib, but I stumbled upon this while searching for something else, so I'll reply ;)
I think for more than 2 sentences you can use A [SEP] B [SEP] C [SEP] [CLS] for the encoding, and then specify token_type_ids as explained in the documentation to tell the model which token belongs to which segment.
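A minimal sketch of that idea (the sentence list and variable names are just illustrative):

```python
# Sketch: building A <sep> B <sep> C <sep> <cls> by hand, with one
# token_type_id per segment (XLNet uses lowercase <sep>/<cls> tokens).
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

sents = ["First sentence.", "Second sentence.", "Third sentence."]

input_ids, token_type_ids = [], []
for seg, sent in enumerate(sents):
    ids = tokenizer.encode(sent, add_special_tokens=False)
    input_ids += ids + [tokenizer.sep_token_id]   # each segment ends in <sep>
    token_type_ids += [seg] * (len(ids) + 1)      # <sep> keeps its segment id
input_ids.append(tokenizer.cls_token_id)          # XLNet puts <cls> last
token_type_ids.append(len(sents))                 # distinct id for <cls>
```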
regarding token_type_ids:
@LysandreJik wrote here about two sentences,
https://github.com/huggingface/pytorch-transformers/issues/1208#issuecomment-528515647
If I recall correctly the XLNet model has 0 for the first sequence token_type_ids, 1 for the second sequence, and 2 for the last (cls) token.
What should be done for the third, fourth, fifth, ... sentences? 0 and 1 alternating?
I think you can put 0 for the first sentence, 1 for the second, 2 for the third, etc., but the actual indices do not matter because the encoding is relative (see the XLNet paper, section 2.5); the only important thing is that tokens from the same sentence have the same token_type_ids. XLNet was made this way in order to handle an arbitrary number of sentences at fine-tuning time. At least that is the way I understand it.
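To illustrate that last point with made-up ids (only the grouping matters, not the values):

```python
# Two token_type_ids assignments that XLNet treats identically, because
# its relative segment encoding only checks whether two tokens share an id.
ids_a = [0, 0, 0, 1, 1, 1, 2, 2, 3]  # sentences A, B, C, then <cls>
ids_b = [7, 7, 7, 0, 0, 0, 9, 9, 4]  # same grouping, different labels
```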
Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:
@LysandreJik
you are speaking about sentence pairs; what should be done with several sentences? Could you please give some advice?
Thanks a lot!
so have you got the answer about how to deal with more than two sentences?
@sloth2012
so have you got the answer about how to deal with more than two sentences?
No, but I did it this way: [SEP] A.B.C [CLS].
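For reference, a sketch of that single-segment workaround (note that with add_special_tokens=True the tokenizer actually places the special tokens at the end, i.e. A.B.C <sep> <cls>; the sentences below are placeholders):

```python
# Sketch: treating all sentences as one segment and letting the
# tokenizer append the special tokens.
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text = "First sentence. Second sentence. Third sentence."
ids = tokenizer.encode(text, add_special_tokens=True)  # ... <sep> <cls>
```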