Transformers: XLNet Embeddings

Created on 16 Jul 2019 · 21 Comments · Source: huggingface/transformers

How can I retrieve contextual word vectors for my dataset using XLNet ?
The usage and examples in the documentation do not include any guide to use XLNet.
Thanks.

All 21 comments

I'm currently finishing up the documentation, but in the meantime you can just use XLNetModel in place of BertModel in the BertModel usage example.

Thanks a lot, @thomwolf for the quick reply. I'll try it out.

@thomwolf, I tried the following snippet. The similarity score changes every time I run the cell. That is, the embeddings or the weights are changing every time. Is this related to dropout?

import torch
from numpy import dot
from numpy.linalg import norm
from pytorch_transformers import XLNetConfig, XLNetModel, XLNetTokenizer

config = XLNetConfig.from_pretrained('xlnet-large-cased')
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetModel(config)

input_ids = torch.tensor(tokenizer.encode("The apple juice is sour.")).unsqueeze(0)
input_ids_2 = torch.tensor(tokenizer.encode("The orange juice is sweet.")).unsqueeze(0)

outputs = model(input_ids)
outputs_2 = model(input_ids_2)
last_hidden_states = outputs[0]
last_hidden_states_2 = outputs_2[0]

# Vectors for "apple" and "orange" (token position 1 in each sentence)
apple = last_hidden_states[0][1]
orange = last_hidden_states_2[0][1]

x = apple.detach().numpy()
y = orange.detach().numpy()
cos_sim = dot(x, y) / (norm(x) * norm(y))
print(cos_sim)

For me the logits values change as well ... using exactly the same settings as in the example.

Have you found a way to fix that?

@Oxi84 put model.eval() before you make the predictions. This fixed the problem of changing weights for me.
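Putting the two fixes together, the earlier snippet becomes deterministic once the model is loaded with `from_pretrained` (instantiating `XLNetModel(config)` gives randomly initialised weights) and switched to eval mode to disable dropout. A minimal sketch, assuming `pytorch-transformers` is installed; the heavy part is kept inside a function because calling it downloads the large checkpoint:

```python
import numpy as np

def cosine(x, y):
    # Plain cosine similarity between two 1-D numpy vectors.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def compare_apple_orange():
    # Calling this downloads the xlnet-large-cased checkpoint (~1.4 GB).
    import torch
    from pytorch_transformers import XLNetModel, XLNetTokenizer

    tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
    model = XLNetModel.from_pretrained('xlnet-large-cased')  # pretrained weights, not a random init
    model.eval()  # disable dropout so repeated runs give identical outputs
    with torch.no_grad():
        ids_1 = torch.tensor(tokenizer.encode("The apple juice is sour.")).unsqueeze(0)
        ids_2 = torch.tensor(tokenizer.encode("The orange juice is sweet.")).unsqueeze(0)
        apple = model(ids_1)[0][0, 1].numpy()   # token at position 1 of sentence 1
        orange = model(ids_2)[0][0, 1].numpy()
    return cosine(apple, orange)

# print(compare_apple_orange())
```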

Thanks. For me it works when I call it like this:

 tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
 model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
 model.eval()

However, accuracy seems to be much lower than for BERT - with the code I wrote here: https://github.com/huggingface/pytorch-transformers/issues/846

Did you find that the accuracy is good or bad? I compared with BERT on a few examples of masked word prediction, and most of XLNet's highest-probability predicted words do not fit at all.

@kushalj001 hi, how can I get the sentence vector?
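One common way (not the only one) to turn token-level hidden states into a sentence vector is to mean-pool the last hidden states over all tokens. A hedged sketch, assuming `pytorch-transformers`; the model call is kept inside a function so nothing is downloaded until you invoke it:

```python
import numpy as np

def mean_pool(token_vectors):
    # token_vectors: (seq_len, hidden_size) array -> (hidden_size,) sentence vector
    return token_vectors.mean(axis=0)

def sentence_vector(text, model, tokenizer):
    # Average XLNet's last hidden states over all tokens of `text`.
    import torch
    model.eval()
    with torch.no_grad():
        ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
        last_hidden = model(ids)[0]        # shape: (1, seq_len, hidden_size)
    return mean_pool(last_hidden[0].numpy())

# Usage (downloads the checkpoint on first run):
# from pytorch_transformers import XLNetModel, XLNetTokenizer
# tok = XLNetTokenizer.from_pretrained('xlnet-base-cased')
# mdl = XLNetModel.from_pretrained('xlnet-base-cased')
# vec = sentence_vector("The apple juice is sour.", mdl, tok)
```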

Hi, so it seems that creating a model with a configuration is primarily the problem here:
model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
yields consistent outputs, but
config = XLNetConfig.from_pretrained("xlnet-large-cased")
model = XLNetModel(config)
does not at all.
My question is, how is it possible to set configuration states (like getting hidden states of the model). I have run the glue STS-B fine tuning code to customize the model which is stored at ./proc_data/sts-b-100, but when I load the model using code like this to get hidden states:

from pytorch_transformers import XLNetConfig, XLNetTokenizer, XLNetForSequenceClassification

config = XLNetConfig.from_pretrained('./proc_data/sts-b-110/')
config.output_hidden_states = True
tokenizer = XLNetTokenizer.from_pretrained('./proc_data/sts-b-110/')
model = XLNetForSequenceClassification(config)

I get results that vary wildly across runs.

Specifically, I would like to get the hidden states of each layer from the fine tuned model and correlate it to the actual text similarity. I was thinking I'd load the model with XLNetForSequenceClassification, get all the hidden states setting the configuration to output hidden states and do such a correlation. Is my approach incorrect?
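The approach itself seems fine; the likely culprit is that `XLNetForSequenceClassification(config)` builds a freshly initialised model, so the fine-tuned weights are never loaded and every run starts from a different random classifier. A hedged sketch of loading the fine-tuned checkpoint with `from_pretrained` instead (the path is the one used above; `config=` is passed so `output_hidden_states=True` takes effect):

```python
def last_layer(hidden_states):
    # With output_hidden_states=True the model returns a tuple of
    # (num_layers + 1) tensors; the last entry is the final layer.
    return hidden_states[-1]

def load_finetuned(path='./proc_data/sts-b-110/'):
    # Loads config, tokenizer and the fine-tuned *weights* from `path`.
    from pytorch_transformers import (XLNetConfig, XLNetTokenizer,
                                      XLNetForSequenceClassification)
    config = XLNetConfig.from_pretrained(path)
    config.output_hidden_states = True
    tokenizer = XLNetTokenizer.from_pretrained(path)
    # from_pretrained restores the fine-tuned weights;
    # XLNetForSequenceClassification(config) would re-initialise them randomly.
    model = XLNetForSequenceClassification.from_pretrained(path, config=config)
    model.eval()
    return model, tokenizer
```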

Looking at run_glue, it seems that actually outputs[1] is used for prediction? This is confusing because all the examples use [0] and the documentation is not very clear.
outputs = model(**inputs)
tmp_eval_loss, logits = outputs[:2]
From run_glue.py

Ok, I figured the logits and loss issue out - the issue is that for XLNetForSequenceClassification, the second index does in fact have logits while the first has loss.
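In other words, the output tuple of `XLNetForSequenceClassification` depends on the call: with `labels` it is `(loss, logits, ...)`, without them it is `(logits, ...)`. A tiny illustration of the unpacking, with strings standing in for the real tensors:

```python
# Dummy stand-ins for the tensors the model would actually return.
outputs_with_labels = ('loss', 'logits', 'mems')    # model(input_ids, labels=labels)
outputs_without_labels = ('logits', 'mems')         # model(input_ids)

loss, logits = outputs_with_labels[:2]   # run_glue.py style: loss first, logits second
logits_only = outputs_without_labels[0]  # inference style: logits come first
```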

@thomwolf @Oxi84 while calculating word-embeddings of a document, i.e multiple sentences, is it necessary to pass the document sentence-wise? For my dataset, I removed punctuation as a part of the pre-processing step. So now, my whole document goes into the model. Does this hurt the model's performance? Does it make a lot of difference in terms of capturing the context of words?
Thanks

It should improve accuracy if the text is longer, but for me BERT is still way better ... on 20-40 word texts.

> It should improve accuracy if the text is longer, but for me BERT is still way better ... on 20-40 word texts.

Yeah, even in my experiments, BERT simply outperforms XLNet. Still don't know why, though.
When you say "it should improve accuracy", you mean that feeding sentences (rather than whole documents) to compute word vectors would be better, right?

Did you manage to try the TensorFlow version of XLNet? There is a chance it might differ from the PyTorch version.

Maybe there is some bug, but it's unlikely, since the benchmark results with the XLNet PyTorch port are the same. But I guess this would be the first thing to recheck.

> Did you manage to try the TensorFlow version of XLNet? There is a chance it might differ from the PyTorch version.

Any simple way of doing this?

any updates regarding this issue?

@kushalj001 why remove the punctuation? Is it domain-specific, or to improve accuracy?

> @kushalj001 why remove the punctuation? Is it domain-specific, or to improve accuracy?

My dataset had a lot of random punctuation, i.e. misplaced single and double quotes.
But also, does punctuation add any valuable information to the text? Apart from the period (which can be used to break a large paragraph into sentences), does keeping other punctuation symbols make sense?
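For the pre-processing step itself, one plain-Python way to drop all punctuation except sentence-final periods is `str.translate` (a sketch independent of the model; which marks to keep is up to you):

```python
import string

def strip_punct(text, keep='.'):
    # Remove all ASCII punctuation except the characters in `keep`.
    drop = ''.join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans('', '', drop))

print(strip_punct('He said, "the juice is sour." Then he left!'))
# -> He said the juice is sour. Then he left
```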

I will close this issue, which dates back to before we had the clean documentation up here: https://huggingface.co/pytorch-transformers/

Please open a new issue with a clear explanation of your specific problem if you have related issues.
