Flair: How to generate embeddings for a list of sentences?

Created on 12 Feb 2019 · 7 comments · Source: flairNLP/flair

When I have a list of texts, and assuming each text in the list has the same length, the sentences.get_embedding() method gives me tensor([]). How can I turn a list of texts into word-level embeddings? If my input has shape [#utterances, #words], I would expect the output to have shape [#utterances, #words, #embedding dimensions]. How would I achieve this?

Thanks!

question

All 7 comments

Hello @limiao2 - if you use one of the TokenEmbeddings classes on a sentence or list of sentences, each Token (word) in each Sentence gets embedded. Then, you can access the embedding field of each Token to retrieve the word embedding.

Here is some example code:

import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# load word embeddings
embeddings = WordEmbeddings('glove')

# an example sentence
sentence = Sentence('I love Berlin')

# embed the sentence (embed() also accepts a list of sentences)
embeddings.embed(sentence)

# go through each token in sentence
for token in sentence:
    # print embedding of this Token
    print(token.embedding)
    # print shape of embedding of this Token
    print(token.embedding.shape)

# make one tensor of all word embeddings of a sentence
sentence_tensor = torch.cat([token.embedding.unsqueeze(0) for token in sentence], dim=0)

# print tensor shape
print(sentence_tensor.shape) 

This embeds the sentence with GloVe embeddings, then prints out the embedding of each word. Each word embedding in this case has shape [100], i.e. it is a 100-dimensional vector.

Then we use torch.cat to concatenate all embeddings of all words in the sentence into a tensor of shape [3, 100] (3 words, each 100 dimensions).

If you want a batch of more than one sentence, you can build the per-sentence tensors the same way and then combine them - but if the sentences have different lengths, you need to pad them to a common length first (see the sketch below).
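Here is a minimal sketch of that batching step, assuming sentences of unequal length. As noted above, embed() also accepts a list of Sentence objects, and torch.nn.utils.rnn.pad_sequence can zero-pad the per-sentence tensors into a single [#sentences, #max_words, #dims] batch tensor:

import torch
from torch.nn.utils.rnn import pad_sequence
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# load word embeddings
embeddings = WordEmbeddings('glove')

# a small batch of sentences with different lengths
sentences = [Sentence('I love Berlin'), Sentence('Berlin is a city in Germany')]

# embed() also accepts a list of sentences
embeddings.embed(sentences)

# one [#words, 100] tensor per sentence
tensors = [torch.stack([token.embedding for token in s]) for s in sentences]

# zero-pad shorter sentences at the end -> shape [2, 6, 100]
batch = pad_sequence(tensors, batch_first=True)
print(batch.shape)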

Hope this helps!

Thanks! But would the nested list operation be costly? Is there any method to speed up the batch operation? A lookup on a matrix should be fast.

I am not sure how big a cost factor it is - if you have some insights here, please share :) We currently have an ongoing effort to profile the framework in order to reduce CPU usage and increase GPU utilization, but are still working on identifying ways to make everything faster.

I think this was a legit question. Isn't there a better way to embed multiple sentences at once, and have them in a tensor with positions padded either at the end or at the beginning? Embedding one sentence at a time is very inefficient. The package is great; this would improve it even further :)

@alanakbik

I think this was a legit question. Isn't there a better way to embed multiple sentences at once, and have them in a tensor with positions padded either at the end or at the beginning? Embedding one sentence at a time is very inefficient. The package is great; this would improve it even further :)

Have you added this feature to Flair? If so, please provide sample code.

Is there a batch-processing option for generating embeddings?
I'm producing BERT embeddings for ~10M sentences, and it is going to take ~120 hours to process them all if I go one sentence at a time. I'm using one GPU. Can we speed up this process?
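One workaround, sketched below as an illustration (the batch size and the storage step are placeholders, not part of the thread): pass a whole mini-batch of Sentence objects to embed() so the GPU processes many sentences per forward pass. Note that BertEmbeddings is the class from Flair versions of this era; newer releases use TransformerWordEmbeddings instead.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings('bert-base-uncased')

texts = ['I love Berlin', 'Berlin is a city']  # placeholder for your ~10M raw strings
batch_size = 32  # tune to your GPU memory

for start in range(0, len(texts), batch_size):
    batch = [Sentence(t) for t in texts[start:start + batch_size]]
    # one forward pass embeds the whole mini-batch
    embeddings.embed(batch)
    # ... copy the token.embedding vectors to storage here ...
    for sentence in batch:
        sentence.clear_embeddings()  # free memory before the next batch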

/cc @alanakbik

@prabhakar267
Simply use PyTorch Transformers!
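For reference, here is a minimal sketch of batched embedding with the Hugging Face transformers library (the successor to pytorch-transformers); the model name and batch here are illustrative:

import torch
from transformers import AutoModel, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval().to(device)

texts = ['I love Berlin', 'Berlin is a city in Germany']  # placeholder batch

# tokenize and pad the whole batch in one call
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)

# one forward pass for the whole batch
with torch.no_grad():
    output = model(**encoded)

# subword-level embeddings: [batch, max_subword_tokens, 768]
print(output.last_hidden_state.shape)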

