I am using Flair to identify named entities and tag text, and I would like to understand the best approach to speeding this process up.
My text data is around 5 GB, with each sentence on a separate line. I am calling Flair with chunks of this data. Regardless of the chunk size, it is quite slow compared to spaCy NER. Using a GPU improves Flair's speed significantly, but it still lags well behind spaCy, which is at least 20 times faster.
This makes me believe that I might not be using Flair in the best possible way. Are there any best practices for increasing speed when using a GPU? Please advise.
Can you share your tagging script? Maybe it can be tweaked (though spaCy is generally faster).
Here it is:
from itertools import islice
import logging

import torch
import flair
from flair.models import SequenceTagger
from flair.tokenization import SegtokSentenceSplitter

logging.basicConfig(format='%(asctime)s :: %(message)s', level=logging.INFO)

ONTONOTES = True
flair.device = torch.device('cuda:0')  # or use 'cpu'

if not ONTONOTES:
    tagger = SequenceTagger.load('ner')
else:
    tagger = SequenceTagger.load('ner-ontonotes')

splitter = SegtokSentenceSplitter()

dir_in = 'sentences/'
dir_out = 'ner-annotated/'

def nerAnnotate(text, ontonotes=True):
    # rewrite Flair's inline BIOES markers into a compact pipe/underscore notation
    if not ontonotes:
        entities = ['LOC', 'PER', 'ORG', 'MISC']
    else:
        entities = [
            'PERSON', 'NORP', 'FAC', 'ORG',
            'GPE', 'LOC', 'PRODUCT', 'EVENT',
            'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE',
            'TIME', 'PERCENT', 'MONEY', 'QUANTITY',
            'ORDINAL', 'CARDINAL']
    for e in entities:
        text = text.replace(" <E-" + e + ">", "|" + e)
        text = text.replace(" <B-" + e + "> ", "_")
        text = text.replace(" <I-" + e + "> ", "_")
        text = text.replace(" <S-" + e + ">", "|" + e)
    return text

N = 1000
f_in = dir_in + 'doc_Sentences.txt'
f_out = dir_out + 'doc_Sentences_nerAnnotated.txt'
logging.info("processing file: " + f_in)

with open(f_in, "r") as p_f_in:
    with open(f_out, "w") as p_f_out:
        i = 1
        while True:
            # read the next N lines; an empty result means end of file
            lines_gen = list(islice(p_f_in, N))
            if not lines_gen:
                break
            text = '\n'.join([line.strip() for line in lines_gen if line.strip() != ''])
            sentences = splitter.split(text)
            tagger.predict(sentences)
            tagged = [sentence.to_tagged_string() for sentence in sentences]
            annotated = nerAnnotate('\n'.join(tagged))
            p_f_out.write(annotated + '\n')
            if (i * N) % 5000 == 0:
                logging.info("ner annotated " + str(i * N) + " lines")
            i = i + 1

logging.info("DONE!!!")
Thanks! Can you also tell me how your input sentence file is formatted?
Generally, it is always good to use mini-batching, i.e. put a list of 32 or 64 sentences through the predict method at once (as much as your GPU memory permits).
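A minimal sketch of that pattern, assuming pre-split input and placeholder sentence texts; note that predict also accepts a mini_batch_size argument, so even a longer list is chunked internally:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner-fast')  # placeholder model choice

# a small list of pre-built Sentence objects (placeholder texts)
batch = [Sentence(text) for text in [
    "George Washington went to Washington.",
    "Berlin is the capital of Germany.",
]]

# predict tags the sentences in place, batching them internally
tagger.predict(batch, mini_batch_size=32)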
@alanakbik - Thank you. In the input file, each line is a separate sentence (obtained using spaCy's sentencizer). As you can see, I am sending 1000 sentences at a time to predict, as my GPU's 12 GB of RAM can accommodate this.
In 2 days, Flair has only been able to process 500 MB of text. By comparison, spaCy NER processed over 50 GB of text in the same time.
Oha, yes that's slow. I notice that you first concatenate 1000 lines into one large string and then run the sentence splitter. But if your file already has the sentences split, could you not create a Sentence object for each line and put it in a list? Each time the list reaches a certain length, you can put it through the predict method.
Also, 1000 sentences seems like a lot, perhaps try a mini-batch size of 32?
Something like this:
from flair.data import Sentence

N = 32

with open(f_in, "r") as p_f_in:
    with open(f_out, "w") as p_f_out:
        sentences = []
        i = 1
        while True:
            # read line; an empty string means end of file
            line = p_f_in.readline()
            if not line:
                break
            # skip blank lines, make a sentence from the rest
            if not line.strip():
                continue
            sentences.append(Sentence(line.strip()))
            if len(sentences) == N:
                tagger.predict(sentences)
                tagged = [sentence.to_tagged_string() for sentence in sentences]
                annotated = nerAnnotate('\n'.join(tagged))
                p_f_out.write(annotated + '\n')
                sentences = []  # start a fresh mini-batch
                if (i * N) % 100 == 0:
                    logging.info("ner annotated " + str(i * N) + " lines")
                i = i + 1
        # flush a final partial batch, if any
        if sentences:
            tagger.predict(sentences)
            tagged = [sentence.to_tagged_string() for sentence in sentences]
            p_f_out.write(nerAnnotate('\n'.join(tagged)) + '\n')
Also, you could try the "ner-ontonotes-fast" and "ner-fast" models instead.
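A minimal sketch of the swap, assuming the model names above from Flair's model zoo; only the load call changes:

tagger = SequenceTagger.load('ner-ontonotes-fast')  # or 'ner-fast' for the CoNLL-03 tag set

These variants trade a small amount of accuracy for considerably faster inference.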
Thanks @alanakbik . I will try your suggestions and keep you posted.
@alanakbik - after switching to ner-ontonotes-fast and applying your code suggestion, there is a slight gain of about 5 seconds per 5000 lines processed. I will keep you posted on how long it takes for 500 MB of text.
Also, I wonder if the bottleneck is in the predict method or the string and regex operations afterwards. Could you comment out everything after tagger.predict(sentences) to see if it's the predict that is causing the slowdown?
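For instance, a minimal sketch for isolating the predict time, using the standard-library timer (not code from the thread):

import time

start = time.perf_counter()
tagger.predict(sentences)  # the call under suspicion
elapsed = time.perf_counter() - start
logging.info("predict on %d sentences took %.2fs" % (len(sentences), elapsed))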
Thanks, @alanakbik. The regex seems to be pretty fast. It appears to me that tagger.predict(sentences) is the bottleneck. BTW, the spaCy code is almost identical to this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.