Transformers: Special tokens / XLNet

Created on 10 Sep 2019 · 9 comments · Source: huggingface/transformers

Is it necessary to add [CLS] and [SEP] tokens in the case of XLNet transformers?

Thanks!

*I used only the tokenizer.encode() function, even when a sample had several sentences, and I didn't set any special tokens. I think that was not the right way, was it? It was done for a classification task.

wontfix

All 9 comments

Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:

A [SEP] B [SEP] [CLS]

You can either create those yourself or use the add_special_tokens flag of the encode function, as follows:

tokenizer.encode(a, b, add_special_tokens=True)

which will return the correct list of token ids according to the tokenizer you used (which should be XLNetTokenizer in your case).
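As a minimal sketch of that call, assuming the xlnet-base-cased checkpoint (the example sentences are placeholders, and at the time of this issue the import came from pytorch_transformers rather than transformers):

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

a = "This is the first sentence."
b = "This is the second one."

# for XLNet, encode() appends <sep> after each segment and <cls> at the very end
ids = tokenizer.encode(a, b, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(ids))
# -> ['▁This', ..., '<sep>', '▁This', ..., '<sep>', '<cls>']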

@LysandreJik how should one deal with more than two sentences? In the same way?

I'm not a dev of this lib, but I stumbled upon this while searching for something else, so I'll reply ;)
I think that for more than two sentences you can use A [SEP] B [SEP] C [SEP] [CLS] as the encoding, and then specify token_type_ids as explained there to tell the model which token belongs to which segment.
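encode() only takes a single pair, so a rough hand-rolled sketch of that layout could look like this (the sentence strings are illustrative; sep_token_id and cls_token_id are standard tokenizer attributes, and giving <cls> its own, final segment id follows the convention quoted below):

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

sentences = ["First sentence.", "Second one.", "And a third."]
input_ids, token_type_ids = [], []
for seg, text in enumerate(sentences):
    ids = tokenizer.encode(text, add_special_tokens=False)
    input_ids += ids + [tokenizer.sep_token_id]    # A <sep> B <sep> C <sep>
    token_type_ids += [seg] * (len(ids) + 1)       # a sentence and its <sep> share one id
input_ids += [tokenizer.cls_token_id]              # <cls> goes at the very end
token_type_ids += [len(sentences)]                 # <cls> gets its own segment id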

regarding token_type_ids:

@LysandreJik wrote here about two sentences:

https://github.com/huggingface/pytorch-transformers/issues/1208#issuecomment-528515647

If I recall correctly the XLNet model has 0 for the first sequence token_type_ids, 1 for the second sequence, and 2 for the last (cls) token.

What should be done for the third, fourth, fifth ... sentences? 0 and 1 alternating?

I think you can put 0 for the first sentence, 1 for the second, 2 for the third, etc., but the actual indices do not matter because the encoding is relative (see the XLNet paper, section 2.5); the only important thing is that tokens from the same sentence have the same token_type_ids. XLNet was made this way in order to handle an arbitrary number of sentences at fine-tuning time. At least that is the way I understand it.
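A tiny self-contained illustration of that point, with made-up segment ids: under a relative encoding the model only sees whether two positions belong to the same segment, so any relabeling that preserves those pairwise relations is equivalent.

# two labelings of eight tokens into three segments; only equality matters
a = [0, 0, 0, 1, 1, 2, 2, 2]
b = [5, 5, 5, 9, 9, 4, 4, 4]

def same_segment(seg_ids):
    # pairwise "do positions i and j share a segment?" matrix
    return [[x == y for y in seg_ids] for x in seg_ids]

assert same_segment(a) == same_segment(b)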

Hi, in the case of sequence classification, XLNet does indeed use special tokens. For sentence pairs, it looks like this:

@LysandreJik
you are speaking about sentence pairs; what should be done with several sentences? Could you please give some advice?

thanks a lot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

So, have you got the answer about how to deal with more than two sentences?

@sloth2012

So, have you got the answer about how to deal with more than two sentences?

no, but I did it this way: [SEP] A.B.C [CLS]
