Can we get embeddings for a paragraph/document?
We can get embeddings for a single sentence or a list of sentences. But can we get embeddings for a large paragraph/document?
Do we have anything like bert-as-service? https://github.com/hanxiao/bert-as-service (Mapping a variable-length sentence to a fixed-length vector using BERT model)
Hi @leslyarun you can get embeddings for a paragraph by putting the entire paragraph into a Sentence object and then using one of the DocumentEmbeddings classes.
Our current document embeddings are derived from word embeddings.
The simplest ones are bag of word embeddings through the DocumentPoolEmbeddings class. These embeddings are meaningful as-is, i.e. with no task specific training:
```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

review = Sentence(
    'Happy Pizza Palace is probably the best pizza restaurant ever. '
    'I go there every week. '
    'I thoroughly recommend it.')

# pooled embeddings (require no task-specific training)
embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    mode='mean',
)

embeddings.embed(review)
print(review.embedding)
```
More sophisticated embeddings get trained on the downstream task. Here, an LSTM learns how to combine the word embeddings of a sentence or paragraph into an embedding specifically tailored to the downstream task. This means that without additional training, they are not meaningful.
```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings

review = Sentence(
    'Happy Pizza Palace is probably the best pizza restaurant ever. '
    'I go there every week. '
    'I thoroughly recommend it.')

# LSTM embeddings (require extra task-specific training)
embeddings = DocumentLSTMEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    hidden_size=512,
)

embeddings.embed(review)
print(review.embedding)
```
Instead of Flair embeddings, you can pass BERT embeddings to these DocumentEmbeddings classes if you want to try out other embeddings.
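For example, a minimal sketch of swapping BERT in (it mirrors the pooled example above; the default BertEmbeddings() is just one possible choice):

```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

review = Sentence('Happy Pizza Palace is probably the best pizza restaurant ever.')

# pooled document embedding built from BERT token embeddings
embeddings = DocumentPoolEmbeddings([BertEmbeddings()], mode='mean')
embeddings.embed(review)
print(review.embedding)
```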
I think that, next to this word-based approach, we should also support @hanxiao 's "bert-as-service" way of directly getting a paragraph embedding - we will look into this and likely add it as a feature very soon!
@alanakbik Thanks for creating this wonderful library.
If I pass a big paragraph to Sentence(), one with more than 450 tokens, I get an error.
So right now can we only get embeddings for a small paragraph (fewer than 450 tokens)?
Example:
```python
from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

doc = "A large paragraph with more than 450 tokens"
embeddings = DocumentPoolEmbeddings([BertEmbeddings()], mode='mean')
review = Sentence(doc)
embeddings.embed(review)  # error on this line
```
Error:
```
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
116 return F.embedding(
117 input, self.weight, self.padding_idx, self.max_norm,
--> 118 self.norm_type, self.scale_grad_by_freq, self.sparse)
119
120 def extra_repr(self):
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1452 # remove once script supports set_grad_enabled
1453 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1454 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1455
1456
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
```
@leslyarun yes, this looks like it's running out of memory, since the entire paragraph is sent through a pretty big BERT model. With the same paragraph, does it work with bert-as-service on your machine?
yup, it's working in bert-as-service @alanakbik
Cool! OK, we'll add it as a new DocumentEmbeddings class, probably already in 0.4.1!
Do we need to pass the paragraph as a list of sentences (strings), or can we pass it as a single string?
If you want to classify an entire paragraph, you always pass the paragraph in one Sentence object, even if it consists of multiple sentences.
So something like this:
```python
paragraph = Sentence('This is a sentence. This is another sentence. I love Berlin')
```
This way, the embedding you obtain with the DocumentEmbeddings is for the whole paragraph you wish to classify.
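As a minimal sketch of embedding such a multi-sentence paragraph (the GloVe/Flair stack just mirrors the earlier example and is one possible choice):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

paragraph = Sentence('This is a sentence. This is another sentence. I love Berlin')

document_embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')],
    mode='mean',
)
document_embeddings.embed(paragraph)

# one fixed-length vector for the whole paragraph
print(paragraph.get_embedding())
```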
Many thanks for developing this wonderful library!
I recently encountered the same issue as in #392 when I was trying to use BertEmbeddings. Are we going to have a fix in the upcoming 0.4.1 release?
Isn't the recommended way to obtain document embeddings from BERT embeddings to use only the first token (i.e. '[CLS]'), rather than pooling all words?
@nraw
I believe the first token is the default behavior from the pooling_operation parameter in the BertEmbeddings class:
```python
class BertEmbeddings(TokenEmbeddings):
    def __init__(self,
                 bert_model_or_path: str = 'bert-base-uncased',
                 layers: str = '-1,-2,-3,-4',
                 pooling_operation: str = 'first'):
        """
        :param pooling_operation: how to get from token piece embeddings to token embedding.
            Either pool them and take the average ('mean') or use the first word piece
            embedding as the token embedding ('first').
        """
```
@Alexjmsherman Thanks for the answer, but it seems that what you're referring to isn't related to how to obtain document embeddings, but rather to how to obtain token embeddings from word piece embeddings.
Hello, thank you for this great tool! I am also encountering the same issue @leslyarun was encountering. I am using flair-master code. Do you have any suggestions? Thanks in advance!
Hey @alanakbik Thanks for that explanation on the two possibilities for getting embeddings for entire documents (Pooling vs. RNN). I have two additional questions regarding the topic.
What I want to do
I'm training a text classifier on PubMed abstracts. I have a training set of around 300 articles, but would then like to classify around 20 million abstracts.
Questions
Thanks in advance!
Hello @helmersl, yes, you can access the embeddings after you've trained a model. Let's say you've trained and saved a TextClassifier model. When you load it, you can access its embeddings like this:
```python
from flair.data import Sentence
from flair.models import TextClassifier

classifier = TextClassifier.load('your/model/file.pt')
# get the embeddings from the trained model
document_embeddings = classifier.document_embeddings
# embed a sentence as always
sentence = Sentence('The grass is green . And the sky is blue .')
document_embeddings.embed(sentence)
print(sentence.get_embedding())
```
Ah, awesome, thank you for clarifying, @alanakbik! About the second question: is there any faster/more efficient way than just doing, let's say, a list comprehension over the documents I want to embed?
With two GPUs, I would split the dataset and process each half independently on one GPU.
To maximize speed, you need to use mini-batching, i.e. split your dataset into lists of sentences (mini-batches) of size 32 or 64 and always call document_embeddings.embed(sentences) on these lists. This way, you utilize the GPU more effectively, since you're passing whole mini-batches through it. You'll want the mini-batch size as high as possible without reaching the point where you get CUDA out-of-memory errors (which occur when too many or too long documents are passed through the GPU at the same time). A good mini-batch size is 32 or 64, but it can potentially be higher depending on your dataset and GPU.
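A minimal sketch of what that could look like (the documents list is a placeholder, and document_embeddings is assumed to be an already-created DocumentEmbeddings object, e.g. the one loaded from the classifier above):

```python
from flair.data import Sentence

documents = ['first abstract text ...', 'second abstract text ...']  # placeholder raw strings
mini_batch_size = 32  # raise this until you start hitting CUDA out-of-memory errors

embedded = []
for i in range(0, len(documents), mini_batch_size):
    batch = [Sentence(text) for text in documents[i:i + mini_batch_size]]
    document_embeddings.embed(batch)  # embed the whole mini-batch in one pass through the GPU
    embedded.extend(batch)
```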
Thanks for the tips, @alanakbik. I'm unfortunately not exactly sure how to specify which GPU to use. I tried to import torch and run the code with
```python
with torch.cuda.device('cuda:1'):
    ...
```
but unfortunately, the code is still executed on the default GPU ('cuda:0'). Is there a possibility to specify the GPU to use in flair directly?
We have a global variable for this, namely flair.device. You can set flair.device = 'cuda:1' before calling the code, which will make everything run on this GPU.
Ah, I found out :-)
```python
flair.device = torch.device('cuda:1')
```
1. So, in the case of using "DocumentPoolEmbeddings([BertEmbeddings()])", is the vector for the paragraph the average of the word vectors obtained from BERT, the same way it is calculated with GloVe and other pre-trained models?
2. My other question is: would you recommend removing stop-words from the documents before creating Sentence objects out of them?
Thank you,
Ibrahim
@irhallac I would generally recommend not doing any preprocessing if you train the embeddings on a downstream task, since this way the model can learn by itself what information to use and what to discard. If you just want the representations without any further training, then removing stopwords is probably a good idea, though I haven't tried anything like this yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.