Is it necessary to add [CLS] and [SEP] tokens in the case of XLNet transformers?
Thanks!
*I used only the tokenizer.encode() function, even when a sample had several sentences, and I didn't set any special tokens. I think that was not the right way, was it? It was done for a classification task.
Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:
A [SEP] B [SEP][CLS]
You can either create those yourself or use the add_special_tokens flag of the encode function as follows:
tokenizer.encode(a, b, add_special_tokens=True)
which will return the correct list of tokens according to the tokenizer you used (which should be the XLNetTokenizer in your case).
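For a concrete sketch (the checkpoint name and the example sentences below are illustrative, not from this thread):

```python
# Minimal sketch: encoding a sentence pair for XLNet classification.
# "xlnet-base-cased" is the standard pretrained checkpoint; in older
# versions the package was named pytorch_transformers instead.
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

a = "The cat sat on the mat."
b = "It looked very comfortable."

ids = tokenizer.encode(a, b, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(ids))
# [... tokens of a ..., '<sep>', ... tokens of b ..., '<sep>', '<cls>']
```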
@LysandreJik how do you deal with more than two sentences? In the same way?
I'm not a dev of this lib, but I stumbled upon this while searching for something else, so I'll reply ;)
I think for more than 2 sentences you can use A [SEP] B [SEP] C [SEP] [CLS] for the encoding, and then specify token_type_ids as explained in the documentation to tell the model which token belongs to which segment.
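A minimal sketch of that idea (the sentence list and variable names are just illustrative):

```python
# Sketch: building A <sep> B <sep> C <sep> <cls> by hand, with one
# token_type_id per segment (XLNet uses lowercase <sep>/<cls> tokens).
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

sents = ["First sentence.", "Second sentence.", "Third sentence."]

input_ids, token_type_ids = [], []
for seg, sent in enumerate(sents):
    ids = tokenizer.encode(sent, add_special_tokens=False)
    input_ids += ids + [tokenizer.sep_token_id]   # each segment ends in <sep>
    token_type_ids += [seg] * (len(ids) + 1)      # <sep> keeps its segment id
input_ids.append(tokenizer.cls_token_id)          # XLNet puts <cls> last
token_type_ids.append(len(sents))                 # distinct id for <cls>
```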
regarding token_type_ids:
@LysandreJik wrote here about two sentences,
https://github.com/huggingface/pytorch-transformers/issues/1208#issuecomment-528515647
If I recall correctly the XLNet model has 0 for the first sequence token_type_ids, 1 for the second sequence, and 2 for the last (cls) token.
What should be done for the third, fourth, fifth, ... sentences? 0 and 1 alternating?
I think you can put 0 for the first sentence, 1 for the second, 2 for the third, etc., but the actual indices do not matter because the encoding is relative (see the XLNet paper, section 2.5); the only important thing is that tokens from the same sentence have the same token_type_ids. XLNet was made this way in order to handle an arbitrary number of sentences at fine-tuning time. At least that is the way I understand it.
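To illustrate that last point with made-up ids (only the grouping matters, not the values):

```python
# Two token_type_ids assignments that XLNet treats identically, because
# its relative segment encoding only checks whether two tokens share an id.
ids_a = [0, 0, 0, 1, 1, 1, 2, 2, 3]  # sentences A, B, C, then <cls>
ids_b = [7, 7, 7, 0, 0, 0, 9, 9, 4]  # same grouping, different labels
```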
Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:
@LysandreJik
you are speaking about sentence pairs; what should be done with several sentences? Could you please give some advice?
Thanks a lot!
so have you got the answer about how to deal with more than two sentences?
@sloth2012
so have you got the answer about how to deal with more than two sentences?
No, but I did it this way: [SEP] A.B.C [CLS].
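For reference, a sketch of that single-segment workaround (note that with add_special_tokens=True the tokenizer actually places the special tokens at the end, i.e. A.B.C <sep> <cls>; the sentences below are placeholders):

```python
# Sketch: treating all sentences as one segment and letting the
# tokenizer append the special tokens.
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text = "First sentence. Second sentence. Third sentence."
ids = tokenizer.encode(text, add_special_tokens=True)  # ... <sep> <cls>
```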