Trying to explore the contextual side of Flair embeddings with a simple example:
```python
import torch
from flair.data import Sentence
from flair.embeddings import DocumentPoolEmbeddings, FlairEmbeddings

# your query
query = 'The capital of Washington'

# some texts
sentences = [
    'George Washington addressed his supporters',
    'Taking a flight to Washington tonight',
    'Arkansaw is a lovely state',
    'George Washington was a great president',
]

# first, declare how you want to embed
embeddings = DocumentPoolEmbeddings([FlairEmbeddings('news-forward'),
                                     FlairEmbeddings('news-backward')])

# embed the query
q = Sentence(query)
embeddings.embed(q)

# use cosine similarity
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for sentence in sentences:
    s = Sentence(sentence)
    embeddings.embed(s)
    prox = cos(q.embedding, s.embedding)
    print(query, ' - ', sentence, ' - ', prox)
```
Results:
```
The capital of Washington - George Washington addressed his supporters - 0.3869
The capital of Washington - Taking a flight to Washington tonight - 0.4389
The capital of Washington - Arkansaw is a lovely state - 0.3746
The capital of Washington - George Washington was a great president - 0.3629
```
I would've expected much higher scores on the geo-context sentences.
Am I doing something wrong?
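For what it's worth, here is what the pipeline above boils down to mathematically: `DocumentPoolEmbeddings` mean-pools the token embeddings into one document vector, and the score is the cosine of the two pooled vectors. A toy pure-Python sketch (made-up 2-d vectors, not real Flair embeddings) to make the computation concrete:

```python
from math import sqrt

def mean_pool(token_vectors):
    # average token embeddings into one document vector
    # (the default pooling in DocumentPoolEmbeddings)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# two toy "documents": doc_a pools to [0.5, 0.5], doc_b is [1.0, 1.0]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 1.0]]

print(cosine(mean_pool(doc_a), mean_pool(doc_b)))  # ~1.0: pooled vectors point the same way
```

Note that pooling can make very different token sets look similar once averaged, which may be part of why the absolute scores bunch together.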
Technically, the code looks good. Here are some other comparisons with BERT and ELMo:
| LM | Sentence | Similarity |
| ------------------------ | ------------------------------------------ | ----------- |
| BERT (bert-base-uncased) | George Washington addressed his supporters | 0.6652
| BERT (bert-base-uncased) | Taking a flight to Washington tonight | 0.6186
| BERT (bert-base-uncased) | Arkansaw is a lovely state | 0.5656
| BERT (bert-base-uncased) | George Washington was a great president | 0.6955
| BERT (bert-base-cased) | George Washington addressed his supporters | 0.8641
| BERT (bert-base-cased) | Taking a flight to Washington tonight | 0.8477
| BERT (bert-base-cased) | Arkansaw is a lovely state | 0.8385
| BERT (bert-base-cased) | George Washington was a great president | 0.8622
| BERT (bert-large-uncased)| George Washington addressed his supporters | 0.7823
| BERT (bert-large-uncased)| Taking a flight to Washington tonight | 0.7476
| BERT (bert-large-uncased)| Arkansaw is a lovely state | 0.7185
| BERT (bert-large-uncased)| George Washington was a great president | 0.8058
| BERT (bert-large-cased) | George Washington addressed his supporters | 0.8190
| BERT (bert-large-cased) | Taking a flight to Washington tonight | 0.7761
| BERT (bert-large-cased) | Arkansaw is a lovely state | 0.7934
| BERT (bert-large-cased) | George Washington was a great president | 0.8424
| ELMo | George Washington addressed his supporters | 0.3986
| ELMo | Taking a flight to Washington tonight | 0.4577
| ELMo | Arkansaw is a lovely state | 0.3902
| ELMo | George Washington was a great president | 0.3886
| GPT-1 | George Washington addressed his supporters | 0.8232
| GPT-1 | Taking a flight to Washington tonight | 0.8396
| GPT-1 | Arkansaw is a lovely state | 0.7307
| GPT-1 | George Washington was a great president | 0.8003
| Transformer-XL | George Washington addressed his supporters | 0.2481
| Transformer-XL | Taking a flight to Washington tonight | 0.1841
| Transformer-XL | Arkansaw is a lovely state | 0.3009
| Transformer-XL | George Washington was a great president | 0.2997
ELMo looks quite similar to the result with Flair Embeddings :)
Spelling Arkansas correctly may help the model realize that it's a geolocation.
Also, given the 4 sentences, Flair correctly ranks "Taking a flight to Washington tonight" as the most similar, so I don't see a problem. Maybe you'd just like the gap in similarity to be larger.
I'd like to see how TransformerXL and GPT-2 do on this, and maybe even word2vec / fasttext
@stefan-it - thanks for the table that's quite interesting.
@Hellisotherpeople Arkansaw is a town in Wisconsin. Would expect it to pick up on that as geolocation too.
Sorry just realised I said it's a lovely state in the example - I see how that's misleading.
As requested, I added the scores for GPT-1 and Transformer-XL
Thanks @stefan-it
Tried some other examples with Flair - these actually work well:
```
the bucket and mop are in the closet - he kicked the bucket - 0.5848
the bucket and mop are in the closet - i have yet to cross-off all the items on my bucket list - 0.5263
the bucket and mop are in the closet - the bucket was filled with water - 0.6970

he is currently resting at home - the dog sleeps in the kennel - 0.4730
he is currently resting at home - he lived in a beautiful mansion - 0.5347
he is currently resting at home - the home office issued penalties for late filing - 0.4030
he is currently resting at home - press the home button on your phone - 0.3302
```
Anyone have any further insight or ideas?
If not I'll close this out later on
Hello @eliehamouche @stefan-it thanks for sharing these results!
Another idea would be not to use the cosine of document vectors as a measure of similarity, but a measure that derives document similarity directly from the word embeddings. An example of this is the word mover's distance: like document pool embeddings, it need not be trained, so it can be used without supervision. We don't yet have it in Flair, but I think it's probably not difficult to implement and experiment with. It would be interesting to see how well the word mover's distance works with different types of contextualized word embeddings.
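To make the idea concrete, here is a toy pure-Python sketch of the *relaxed* word mover's distance (a cheap lower bound on the full optimal-transport WMD): each word vector in one document "travels" to its nearest word vector in the other, and those distances are averaged. The 2-d vectors are made up for illustration; nothing here is Flair API.

```python
from math import sqrt

def euclid(a, b):
    # Euclidean distance between two word vectors
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_wmd(doc_a, doc_b):
    # one-sided relaxed word mover's distance: each word in doc_a
    # moves to its nearest word in doc_b; average those distances.
    # A lower bound on the full (optimal-transport) WMD.
    return sum(min(euclid(v, w) for w in doc_b) for v in doc_a) / len(doc_a)

# identical toy documents have distance 0
doc_a = [[0.0, 1.0], [1.0, 0.0]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]
print(relaxed_wmd(doc_a, doc_b))  # → 0.0
```

Unlike pooled-vector cosine, this keeps per-word information, so a single well-matched word can't be washed out by averaging.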
Hey @alanakbik - sorry for the delay, I missed the notification.
That looks quite interesting actually; I'll do a quick comparison and report back.
@alanakbik @eliehamouche Hello, thank you for providing the Transformer-XL embedding, but I have a question: if I train my own Transformer-XL model, it seems it cannot be integrated as an embedding in Flair the way ELMo can, where I just provide the option_file and the weight_file.
@songtaoshi I will push a follow-up PR for passing custom models into the newly added embeddings very soon (I've also trained a few XLNet models) :)
@stefan-it Wow, great! Thanks for your reply. Really looking forward to the new PR.