I am running this code on Google Colab to get the embeddings of the documents in `long_desc` (`len(long_desc)` is 20k):
import torch
from tqdm import tqdm
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embeddings = TransformerDocumentEmbeddings('bert-base-uncased')
X = torch.empty(size=(len(long_desc), 768)).cuda()
i = 0
for text in tqdm(long_desc):
    if len(text.split(' ')) < 300:
        print(text)
        sentence = Sentence(text)
        embeddings.embed(sentence)
        X[i] = sentence.get_embedding()
        sentence.clear_embeddings()
    i += 1
But it gives RuntimeError: CUDA out of memory after 10-20 iterations of the for loop.
You could try putting X on CPU instead, since you'll likely have a lot more CPU memory than GPU memory. Also, if you don't intend to fine-tune the embeddings, instantiate with
embeddings = TransformerDocumentEmbeddings('bert-base-uncased', fine_tune=False)
and call
sentence.to('cpu')
after embedding the sentence, so that the vector is on CPU before you copy it to X.
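Putting those two suggestions together, a minimal sketch of the memory pattern (using a hypothetical `fake_embed` stand-in for the flair `embed`/`get_embedding` calls, so the shape of the fix is visible without a GPU or model download):

```python
import torch

# Hypothetical stand-in for embeddings.embed(...) + sentence.get_embedding():
# returns one 768-dim vector per document. In the real code this is the
# flair call shown above.
def fake_embed(text: str, dim: int = 768) -> torch.Tensor:
    return torch.full((dim,), float(len(text)))

docs = ["short text", "another document", "one more"]

# Pre-allocate the result matrix on CPU, not on the GPU: 20k x 768 floats
# is only ~60 MB of RAM, but on the GPU it competes with the model.
X = torch.empty((len(docs), 768))  # CPU tensor, no .cuda()

for i, text in enumerate(docs):
    emb = fake_embed(text)
    # Move each single embedding to CPU before storing it, so no reference
    # to GPU memory is kept alive across iterations.
    X[i] = emb.cpu()
```

With the real flair objects, `sentence.to('cpu')` plays the role of `emb.cpu()` here.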
@alanakbik Thank you, it's working now, but it has become a little slow (20k in 40 minutes). Is there any other suggestion to make it faster?
You could try mini-batching, i.e. always give a list of 2 or 4 sentences to the embedding method at the same time. Or more if you have enough GPU memory. Instantiate the embedding with the batch size you want to use:
embeddings = TransformerDocumentEmbeddings('bert-base-uncased', fine_tune=False, batch_size=2)
And always give it this number of sentences in a list at the same time:
embeddings.embed([sentence_1, sentence_2])
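To feed the sentences in groups like that, you need to chunk `long_desc` first. A small sketch of the chunking step (pure Python; `chunks` is a hypothetical helper, not part of flair):

```python
# Hypothetical helper: yields successive slices of `size` items.
def chunks(items, size):
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = ["doc one", "doc two", "doc three", "doc four", "doc five"]
batches = list(chunks(texts, 2))

# With flair this would become one forward pass per batch:
#   for batch in chunks(texts, 2):
#       sentences = [Sentence(t) for t in batch]
#       embeddings.embed(sentences)
```

The last batch may be smaller than `batch_size`, which flair handles fine.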
Am I doing it right? It has become slower than before with 2 sentences at a time.
embeddings = TransformerDocumentEmbeddings('bert-large-uncased', fine_tune=False, batch_size=2)
sentence_embeddings = []
for text in tqdm(long_desc):
    sents = [Sentence(x) for x in text]
    embeddings.embed(sents)
    for s in sents:
        s.to('cpu')
    sentence_embeddings.extend([s.get_embedding() for s in sents])
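One likely reason for the slowdown: `[Sentence(x) for x in text]` iterates over the *characters* of a single string, not over documents, so each "batch" is a long list of one-character sentences. A quick check of the iteration, plus the probable intent (batching across documents), in plain Python:

```python
text = "hello world"
pieces = [x for x in text]  # iterates characters, not documents
assert pieces[:5] == ['h', 'e', 'l', 'l', 'o']

# The likely intent is to group documents into batches instead, e.g.:
long_desc = ["first doc", "second doc", "third doc"]
batch_size = 2
batches = [long_desc[i:i + batch_size]
           for i in range(0, len(long_desc), batch_size)]
```

Each element of `batches` would then be turned into a list of `Sentence` objects and passed to `embeddings.embed(...)` in one call.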
Did you solve it?