Hey guys,
I tried to extract embeddings from a corpus using the GPU, but GPU utilisation sits at only around 7%, so I thought there might be a way to use all CPU cores to extract embeddings and save time.
I have a corpus of around 3 million tokens, and the embeddings are going to be saved in a Python dictionary. The following is how I tried to use flair to get embeddings in parallel via Python's Pool(), but it's not working. I think each process would need its own "StackedEmbeddings" object, and that requires a lot of memory.
import os
from multiprocessing import Pool, Manager

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings

def procces_sents(chunk):
    sentences = [Sentence(i) for i in chunk]
    stacked_embeddings.embed(sentences)
    for sent in sentences:
        for token in sent:
            if token.text in my_dict:  # look tokens up by their text
                pass
            else:
                pass  # store token.embedding here

stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward-fast'),
    PooledFlairEmbeddings('news-backward-fast'),
])

my_dict = Manager().dict()  # shared across worker processes
my_list = ["1.txt"]
n = 32                      # sentences per chunk
filepath = "/some/path/"

for filename in my_list:
    with open(os.path.join(filepath, filename), "r") as m:
        data = m.readlines()
    all = [x.strip() for x in data]
    all_chunks = [all[i * n:(i + 1) * n] for i in range((len(all) + n - 1) // n)]
    p = Pool()
    p.map(procces_sents, all_chunks)
    p.close()
    p.join()
I expect each core to take a 32-sentence chunk and process it.
"all" is the list of all sentences.
"all_chunks" is the list of chunks of "all" (each chunk containing 32 sentences).
Hello @ctrboundary, did you try to parallelize your task with Joblib?
from joblib import Parallel, delayed
from tqdm import tqdm_notebook
from multiprocessing import cpu_count

def parallelize(iterable, func):
    return Parallel(n_jobs=cpu_count() - 1, prefer="threads")(delayed(func)(i) for i in tqdm_notebook(iterable))
parallelize(all_chunks, procces_sents)
The prefer="threads" parameter should avoid having the stacked_embeddings object duplicated in memory.
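Here is a minimal, self-contained sketch of why prefer="threads" helps (no flair dependency): thread workers run in the same process, so they can share one heavyweight object and one result dict without pickling or duplication. `shared_dict` and `embed_chunk` are illustrative names, and "embedding" a token as its length stands in for the real embed call:

```python
from joblib import Parallel, delayed

shared_dict = {}  # a plain dict works: threads share the same memory

def embed_chunk(chunk):
    # Stand-in for stacked_embeddings.embed(...); here each token is
    # "embedded" as its character length.
    for token in chunk:
        if token not in shared_dict:
            shared_dict[token] = len(token)

chunks = [["hello", "world"], ["hello", "again"]]
Parallel(n_jobs=2, prefer="threads")(delayed(embed_chunk)(c) for c in chunks)
```

Note that because of the GIL, threads only give a real speedup when the embed call releases the GIL (as PyTorch ops largely do); for pure-Python work, processes would be needed.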
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.