Can we get embeddings for a paragraph/document?
We can get embeddings for a single sentence or a list of sentences. But can we get embeddings for a large paragraph/document?
Do we have anything like bert-as-service? https://github.com/hanxiao/bert-as-service (Mapping a variable-length sentence to a fixed-length vector using BERT model)
Hi @leslyarun you can get embeddings for a paragraph by putting the entire paragraph into a Sentence object and then using one of the DocumentEmbeddings classes.
Our current document embeddings are derived from word embeddings.
The simplest ones are bag of word embeddings through the DocumentPoolEmbeddings class. These embeddings are meaningful as-is, i.e. with no task specific training:
```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

review = Sentence(
    'Happy Pizza Palace is probably the best pizza restaurant ever. '
    'I go there every week. '
    'I thoroughly recommend it.')

# pooled embeddings (require no task-specific training)
embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    mode='mean',
)

embeddings.embed(review)
print(review.embedding)
```
More sophisticated embeddings get trained on the downstream task. Here, an LSTM learns how to combine the word embeddings of a sentence or paragraph into an embedding specifically tailored to the downstream task. This means that without additional training, they are not meaningful.
```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings

review = Sentence(
    'Happy Pizza Palace is probably the best pizza restaurant ever. '
    'I go there every week. '
    'I thoroughly recommend it.')

# LSTM embeddings (require extra task-specific training)
embeddings = DocumentLSTMEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    hidden_size=512,
)

embeddings.embed(review)
print(review.embedding)
```
Instead of Flair embeddings, you can pass BERT embeddings to these DocumentEmbeddings classes if you want to try out other embeddings.
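For example, a minimal sketch of swapping BERT in (it mirrors the pooled example above; the default BertEmbeddings() is just one possible choice):

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

review = Sentence('Happy Pizza Palace is probably the best pizza restaurant ever.')

# pooled document embedding built from BERT token embeddings
embeddings = DocumentPoolEmbeddings([BertEmbeddings()], mode='mean')
embeddings.embed(review)
print(review.embedding)
```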
I think that, next to this word-based approach, we should also support @hanxiao 's "bert-as-service" way of directly getting a paragraph embedding - we will look into this and likely add it as a feature very soon!
@alanakbik Thanks for creating this wonderful library.
If I pass a big paragraph to Sentence(), one with more than 450 tokens, I get an error.
So right now can we only get embeddings for a small paragraph (fewer than 450 tokens)?
Example:
```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

doc = "A large paragraph with more than 450 tokens"
embeddings = DocumentPoolEmbeddings([BertEmbeddings()], mode='mean')
review = Sentence(doc)
embeddings.embed(review)  # error on this line
```
Error:
```
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
116 return F.embedding(
117 input, self.weight, self.padding_idx, self.max_norm,
--> 118 self.norm_type, self.scale_grad_by_freq, self.sparse)
119
120 def extra_repr(self):
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1452 # remove once script supports set_grad_enabled
1453 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1454 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1455
1456
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
```
@leslyarun yes, this looks like it's running out of memory, since the entire paragraph is sent through a pretty big BERT model. With the same paragraph, does it work with bert-as-service on your machine?
yup, it's working in bert-as-service @alanakbik
Cool! OK, we'll add it as a new DocumentEmbeddings class, probably already in 0.4.1!
Do we need to pass the paragraph as a list of sentences (strings), or can we pass it as a single string?
If you want to classify an entire paragraph, you always pass the paragraph in one Sentence object, even if it consists of multiple sentences.
So something like this:
```python
paragraph = Sentence('This is a sentence. This is another sentence. I love Berlin')
```
This way, the embedding you obtain with the DocumentEmbeddings is for the whole paragraph you wish to classify.
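As a minimal sketch of embedding such a multi-sentence paragraph (the GloVe/Flair stack just mirrors the earlier example and is one possible choice):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

paragraph = Sentence('This is a sentence. This is another sentence. I love Berlin')

document_embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    mode='mean',
)
document_embeddings.embed(paragraph)

# one fixed-length vector for the whole paragraph
print(paragraph.get_embedding())
```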
Many thanks for developing this wonderful library!
I recently encountered the same issue as in #392 when I was trying to use BertEmbeddings. Are we going to have a fix in the upcoming 0.4.1 release?
Isn't the recommended way to obtain document embeddings from BERT embeddings to use only the first token (i.e. '[CLS]'), rather than pooling all words?
@nraw
I believe the first token is the default behavior from the pooling_operation parameter in the BertEmbeddings class:
```python
class BertEmbeddings(TokenEmbeddings):
    def __init__(self,
                 bert_model_or_path: str = 'bert-base-uncased',
                 layers: str = '-1,-2,-3,-4',
                 pooling_operation: str = 'first'):
        """
        :param pooling_operation: how to get from token piece embeddings to token embedding.
            Either pool them and take the average ('mean') or use the first word piece
            embedding as the token embedding ('first').
        """
```
@Alexjmsherman Thanks for the answer, but it seems that what you're referring to isn't related to how to obtain document embeddings, but rather to how to obtain token embeddings from word piece embeddings.
Hello, thank you for this great tool! I am also encountering the same issue @leslyarun was encountering. I am using flair-master code. Do you have any suggestions? Thanks in advance!
Hey @alanakbik Thanks for that explanation on the two possibilities for getting embeddings for entire documents (Pooling vs. RNN). I have two additional questions regarding the topic.
What I want to do
I'm training a text classifier on PubMed abstracts. I have a training set of around 300 articles, but would then like to classify around 20 million abstracts.
Questions
Thanks in advance!
Hello @helmersl, yes, you can access the embeddings after you've trained a model. Let's say you've trained and saved a TextClassifier model. When you load it, you can access its embeddings like this:
```python
from flair.data import Sentence
from flair.models import TextClassifier

classifier = TextClassifier.load('your/model/file.pt')
# get the embeddings from the trained model
document_embeddings = classifier.document_embeddings
# embed a sentence as always
sentence = Sentence('The grass is green . And the sky is blue .')
document_embeddings.embed(sentence)
print(sentence.get_embedding())
```
Ah, awesome, thank you for clarifying, @alanakbik! About the second question: is there any faster/more efficient way than just doing, let's say, a list comprehension over the documents I want to embed?
With two GPUs, I would split the dataset and process each half independently on one GPU.
To maximize speed, you need to use mini-batching, i.e. split your dataset into lists of sentences (mini-batches) of size 32 or 64 and always call document_embeddings.embed(sentences) on these lists. This way, you utilize the GPU more effectively, since you're passing whole mini-batches through it. You'll want the mini-batch size as high as possible without reaching the point where you get CUDA out-of-memory errors (which occur when too many or too long documents are passed through the GPU at the same time). A good mini-batch size is 32 or 64, but it can potentially be higher depending on your dataset and GPU.
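A minimal sketch of what that could look like (the documents list is a placeholder, and document_embeddings is assumed to be an already-created DocumentEmbeddings object, e.g. the one loaded from the classifier above):

```python
from flair.data import Sentence

documents = ['first abstract text ...', 'second abstract text ...']  # placeholder raw strings
mini_batch_size = 32  # raise this until you start hitting CUDA out-of-memory errors

embedded = []
for i in range(0, len(documents), mini_batch_size):
    batch = [Sentence(text) for text in documents[i:i + mini_batch_size]]
    document_embeddings.embed(batch)  # embed the whole mini-batch in one pass through the GPU
    embedded.extend(batch)
```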
Thanks for the tips, @alanakbik. I'm unfortunately not exactly sure how to specify which GPU to use. I tried to import torch and run the code with
```python
with torch.cuda.device('cuda:1'):
    ...
```
but unfortunately, the code is still executed on the default GPU ('cuda:0'). Is there a possibility to specify the GPU to use in flair directly?
We have a global variable for this, namely flair.device. You can set flair.device = 'cuda:1' before calling the code, which will make everything run on this GPU.
Ah, I found out :-)
```python
flair.device = torch.device('cuda:1')
```
1. So, in the case of using "DocumentPoolEmbeddings([BertEmbeddings()])", is the vector for the paragraph the average of the word vectors obtained from BERT, the same way it is calculated with GloVe and other pre-trained models?
2. My other question is: would you recommend removing stop-words from the documents before creating Sentence objects out of them?
Thank you,
Ibrahim
@irhallac I would generally recommend not doing any preprocessing if you train the embeddings on a downstream task, since this way the model can learn by itself what information to use and what to discard. If you just want the representations without any further training, then removing stopwords is probably a good idea, though I haven't tried anything like this yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.