Flair: Augmenting pre-trained word embeddings with domain specific data

Created on 22 Mar 2019 · 10 comments · Source: flairNLP/flair

I have a small corpus (about 25 million words) from a very specific domain that is not well represented by any of the large corpora used to train the word embeddings available in Flair.

My idea is to use one of the pre-trained models, say FlairEmbeddings('mix-forward'), and further train it on my smaller corpus such that I get word embeddings that are more relevant for my domain.

Any ideas or tips on how to go about doing that? The closest thing I found was part 9 of the tutorial series, but I am not sure I will get good results from such a small corpus, which is why I want to see whether I can transfer some of the knowledge from the pre-trained models into new text data from my domain.

I thought about training a new language model from scratch, augmenting my corpus with another large corpus of some sort, but I would like to avoid the longer training time if possible since so many pre-trained models are already available. Any help would be greatly appreciated.

Labels: question, wontfix

All 10 comments

Hello @MarcioPorto yes that is possible. You can load an existing model and then use the LanguageModelTrainer to "continue training" it on your specific corpus for a few epochs. You might need to play around with the learning rate a bit (try using a small learning rate) to get good results, but it should work. Also check ticket #121 for an example code snippet.
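
For instance, a fine-tuning run could look roughly like this (a sketch following the fine-tuning example in tutorial 9; the corpus path, output path and hyperparameters are placeholders you would need to adapt):

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# get the character language model behind the pre-trained 'mix-forward' embeddings
language_model = FlairEmbeddings('mix-forward').lm

# point a TextCorpus at your domain data (a train/ folder with splits plus valid.txt and test.txt),
# reusing the dictionary and direction of the loaded model
corpus = TextCorpus('path/to/your/domain/corpus',
                    language_model.dictionary,
                    language_model.is_forward_lm,
                    character_level=True)

# continue training with a small learning rate so the model is fine-tuned rather than overwritten
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_model_mix_domain',
              sequence_length=100,
              mini_batch_size=32,
              learning_rate=5,
              max_epochs=10)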

Hope this helps - please share your findings, we'd be curious to hear how well the approach works for you!

@MarcioPorto please share findings! See https://github.com/zalandoresearch/flair/issues/121#issuecomment-484959969

@aronszanto I actually ended up having to use Gensim helpers for this because I needed to use the subword features offered by fastText, which from my understanding are not yet a part of Flair (https://github.com/zalandoresearch/flair/issues/5). I'll make sure to give updates here if I ever use the method outlined in https://github.com/zalandoresearch/flair/issues/121.

@MarcioPorto @alanakbik @aronszanto

I have a similar issue, so I'll write here rather than creating a new issue.

I've used Flair news-forward and Bert base-uncased embeddings together to embed sentences in my data set by wrapping them in DocumentPoolEmbeddings.

At the end of my task, I'd like to save the weights of the DocumentPoolEmbeddings object using state_dict and load them in my test script to create embeddings on the fly, rather than re-initializing the DocumentPoolEmbeddings object.

I was able to save the weights, but stuck on how to load them properly.

So far I have this:

DocumentPoolEmbeddings.load_state_dict(state_dict = torch.load('Document_Embeddings_File', map_location = 'cpu'))

But that gives me an error saying that self is missing in the initialization call. If I try to initialize the DocumentPoolEmbeddings object first and then call load_state_dict, it breaks because DocumentPoolEmbeddings requires embeddings as a parameter in its initialization call.

Any help is much appreciated.

Thanks

Hello @amit8121 you can use the built-in functions from torch to save and load embeddings. Here's an example snippet:

import torch

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, BertEmbeddings, DocumentPoolEmbeddings

# initialize document embeddings
document_pool_embeddings = DocumentPoolEmbeddings(
    [FlairEmbeddings('news-forward'), BertEmbeddings()]
)

# save embeddings using torch.save
torch.save(document_pool_embeddings, 'docpool.pt')

# load embeddings using torch.load
loaded_embeddings: DocumentPoolEmbeddings = torch.load('docpool.pt')

# print loaded embeddings
print(loaded_embeddings)

# make example sentence
sentence = Sentence('I love Berlin')

# embed example sentence with loaded embeddings
loaded_embeddings.embed(sentence)

# print embeddings
for token in sentence:
    print(token.embedding[:10])
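
If you would rather save only the weights via state_dict, the DocumentPoolEmbeddings object has to be re-created with the same embeddings before loading. Roughly (a sketch, with placeholder file names):

import torch
from flair.embeddings import FlairEmbeddings, BertEmbeddings, DocumentPoolEmbeddings

# re-create the embedding stack exactly as it was when the weights were saved
document_pool_embeddings = DocumentPoolEmbeddings(
    [FlairEmbeddings('news-forward'), BertEmbeddings()]
)

# save only the weights
torch.save(document_pool_embeddings.state_dict(), 'docpool_state.pt')

# load them back into an identically constructed object
document_pool_embeddings.load_state_dict(
    torch.load('docpool_state.pt', map_location='cpu')
)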

@alanakbik

Thanks for the response. I'm familiar with the ways to save and load model weights with PyTorch. I initially tried the approach you mentioned, but it didn't work for me, so I then tried to load the weights with load_state_dict. The problem might be that I did not give the file any extension when I saved my weights, e.g. torch.load('Document_Embedding_Weights'). I will try saving the weights with the proper extension.

This works. Not saving the model with a .pt extension was the issue in my case.

Thanks!

@alanakbik

I have a slightly different and interesting issue with document embeddings. I'm not sure if it's entirely relevant to this thread or if it has significant meaning, so forgive my ignorance if not.

Per our previous conversations, I've used document embeddings on my own data set with Flair and BERT embeddings together. I chose 256 characters per chunk when initializing the Flair embeddings, and I got embeddings of dimension 4096.

When I checked the cosine similarity between two embeddings that belong to different labels, they were about 80% similar on average. I'm not entirely sure whether cosine similarity is a good measure of embedding similarity in such cases, but intuitively, embeddings of different sentences (containing different labels) should be different, right?

Ex: F.cosine_similarity(Actual_Embeddings[4].view(1,4096), Actual_Embeddings[5].view(1,4096))

P.S.: Actual_Embeddings is just a torch.Tensor of shape [no_of_samples, 4096]. A few of the target values: tensor([4, 2, 2, 6, 2, 5, 5, 2, 5, 5])

The label at index four is 2 and the label at index five is 5, but the cosine similarity between them is 0.7917.

So, to counter this issue, I used stacked LSTM layers to reduce the dimensionality to, say, 1024 and then used that representation to classify labels. It turns out the reduced representations are even more similar to each other than Flair's document embeddings.

Ex: F.cosine_similarity(Feature_Representations[4].view(1,4096), Feature_Representations[5].view(1,4096))
tensor([0.9972]).

I even validated this by building some linear layers and classifying the embeddings; the classification layer always predicts a single label, and I believe that's because the embeddings are nearly identical for every sentence. I understand this could be the fault of my classification layer or my dimensionality-reduction layer, and I can try to tweak those. But what bothers me is that the output of Flair's document embeddings also yielded very similar embeddings. Ultimately I want a multi-label text classifier; do I need to be worried about the similarity of the embeddings at all?

Or do you suggest I look into Flair's TextClassifier instead of this approach?

Any help is much appreciated. TIA!

Hello @amit8121 I think the problem is that cosine similarity is not very indicative of how useful an embedding will be for classification. A classifier can select which parts of the embedding it focuses on to make a classification decision, whereas cosine similarity uses the entire vector and weighs all dimensions equally. Especially with very large embedding vectors like yours, cosine similarity will probably not tell you much.

You could try training a TextClassifier with your data and experiment with different embedding combinations. As a baseline, you can use simple WordEmbeddings and either DocumentPoolEmbeddings or DocumentRNNEmbeddings to train the classifier. If you use DocumentPoolEmbeddings in your baseline, be sure to experiment with linear and non-linear fine tuning as explained here (the part about the fine_tune_mode).
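
For example, a pooled-embedding baseline could be set up roughly like this (a sketch; the corpus and output paths are placeholders, and ClassificationCorpus / fine_tune_mode assume a reasonably recent Flair version):

from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# FastText-formatted classification corpus (train.txt, dev.txt, test.txt)
corpus = ClassificationCorpus('path/to/classification/corpus')

# baseline: pooled GloVe word embeddings with non-linear fine-tuning of the pooled vector
document_embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove')],
    fine_tune_mode='nonlinear'
)

# multi-label classifier over the label dictionary derived from the corpus
classifier = TextClassifier(
    document_embeddings,
    label_dictionary=corpus.make_label_dictionary(),
    multi_label=True
)

# train the baseline, then compare other embedding combinations against it
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/baseline-classifier', max_epochs=10)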

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
