I am using Flair to identify named entities and tag text, and I would like to understand the best approach to speeding this process up.
My text data is around 5 GB, with each sentence on a separate line. I am calling Flair with chunks of this data. Regardless of the chunk size, it is quite slow compared to spaCy NER. Using a GPU improves Flair's speed significantly, but it still lags well behind spaCy, which is at least 20 times faster.
This makes me believe that I might not be using Flair in the best possible way. Are there any best practices for increasing speed when using a GPU? Please advise.
Can you share your tagging script? Maybe it can be tweaked (though spaCy is generally faster).
Here it is:
from itertools import islice
import logging

import torch
import flair
from flair.models import SequenceTagger
from flair.tokenization import SegtokSentenceSplitter

logging.basicConfig(format='%(asctime)s :: %(message)s', level=logging.INFO)

ONTONOTES = True
flair.device = torch.device('cuda:0')  # or use 'cpu'

if not ONTONOTES:
    tagger = SequenceTagger.load('ner')
else:
    tagger = SequenceTagger.load('ner-ontonotes')

splitter = SegtokSentenceSplitter()

dir_in = 'sentences/'
dir_out = 'ner-annotated/'

def nerAnnotate(text, ontonotes=True):
    # rewrite Flair's inline BIOES markers into a compact pipe/underscore notation
    if not ontonotes:
        entities = ['LOC', 'PER', 'ORG', 'MISC']
    else:
        entities = [
            'PERSON', 'NORP', 'FAC', 'ORG',
            'GPE', 'LOC', 'PRODUCT', 'EVENT',
            'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE',
            'TIME', 'PERCENT', 'MONEY', 'QUANTITY',
            'ORDINAL', 'CARDINAL']
    for e in entities:
        text = text.replace(" <E-" + e + ">", "|" + e)
        text = text.replace(" <B-" + e + "> ", "_")
        text = text.replace(" <I-" + e + "> ", "_")
        text = text.replace(" <S-" + e + ">", "|" + e)
    return text

N = 1000
f_in = dir_in + 'doc_Sentences.txt'
f_out = dir_out + 'doc_Sentences_nerAnnotated.txt'
logging.info("processing file: " + f_in)

with open(f_in, "r") as p_f_in:
    with open(f_out, "w") as p_f_out:
        i = 1
        while True:
            # read the next N lines; an empty result means end of file
            lines_gen = list(islice(p_f_in, N))
            if not lines_gen:
                break
            text = '\n'.join([line.strip() for line in lines_gen if line.strip() != ''])
            sentences = splitter.split(text)
            tagger.predict(sentences)
            tagged = [sentence.to_tagged_string() for sentence in sentences]
            annotated = nerAnnotate('\n'.join(tagged))
            p_f_out.write(annotated + '\n')
            if (i * N) % 5000 == 0:
                logging.info("ner annotated " + str(i * N) + " lines")
            i = i + 1

logging.info("DONE!!!")
Thanks! Can you also tell me how your input sentence file is formatted?
Generally, it is always good to use mini-batching, i.e. put a list of 32 or 64 sentences through the predict method at once (as much as your GPU memory permits).
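A minimal sketch of that pattern, assuming pre-split input and placeholder sentence texts; note that predict also accepts a mini_batch_size argument, so even a longer list is chunked internally:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner-fast')  # placeholder model choice

# a small list of pre-built Sentence objects (placeholder texts)
batch = [Sentence(text) for text in [
    "George Washington went to Washington.",
    "Berlin is the capital of Germany.",
]]

# predict tags the sentences in place, batching them internally
tagger.predict(batch, mini_batch_size=32)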
@alanakbik - Thank you. In the input file, each line is a separate sentence (obtained using spaCy's sentencizer). As you can see, I am sending 1000 sentences at a time to predict, as my GPU's 12 GB of RAM can accommodate this.
In 2 days, Flair has only been able to process 500 MB of text. By comparison, spaCy NER processed over 50 GB of text in the same time.
Oha, yes that's slow. I notice that you first concatenate 1000 lines into one large string and then run the sentence splitter. But if your file already has the sentences split, could you not create a Sentence object for each line and put it in a list? Each time the list reaches a certain length, you can put it through the predict method.
Also, 1000 sentences seems like a lot, perhaps try a mini-batch size of 32?
Something like this:
from flair.data import Sentence

N = 32

with open(f_in, "r") as p_f_in:
    with open(f_out, "w") as p_f_out:
        sentences = []
        i = 1
        while True:
            # read line; an empty string means end of file
            line = p_f_in.readline()
            if not line:
                break
            # skip blank lines, make a sentence from the rest
            if not line.strip():
                continue
            sentences.append(Sentence(line.strip()))
            if len(sentences) == N:
                tagger.predict(sentences)
                tagged = [sentence.to_tagged_string() for sentence in sentences]
                annotated = nerAnnotate('\n'.join(tagged))
                p_f_out.write(annotated + '\n')
                sentences = []  # start a fresh mini-batch
                if (i * N) % 100 == 0:
                    logging.info("ner annotated " + str(i * N) + " lines")
                i = i + 1
        # flush a final partial batch, if any
        if sentences:
            tagger.predict(sentences)
            tagged = [sentence.to_tagged_string() for sentence in sentences]
            p_f_out.write(nerAnnotate('\n'.join(tagged)) + '\n')
Also, you could try the "ner-ontonotes-fast" and "ner-fast" models instead.
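A minimal sketch of the swap, assuming the model names above from Flair's model zoo; only the load call changes:

tagger = SequenceTagger.load('ner-ontonotes-fast')  # or 'ner-fast' for the CoNLL-03 tag set

These variants trade a small amount of accuracy for considerably faster inference.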
Thanks @alanakbik . I will try your suggestions and keep you posted.
@alanakbik - after switching to ner-ontonotes-fast and applying your code suggestion, there is a slight gain of about 5 seconds per 5000 lines processed. I will keep you posted on how long it takes for 500 MB of text.
Also, I wonder if the bottleneck is in the predict method or the string and regex operations afterwards. Could you comment out everything after tagger.predict(sentences) to see if it's the predict that is causing the slowdown?
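For instance, a minimal sketch for isolating the predict time, using the standard-library timer (not code from the thread):

import time

start = time.perf_counter()
tagger.predict(sentences)  # the call under suspicion
elapsed = time.perf_counter() - start
logging.info("predict on %d sentences took %.2fs" % (len(sentences), elapsed))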
Thanks, @alanakbik. The regex seems to be pretty fast. It appears to me that tagger.predict(sentences) is the bottleneck. BTW, the spaCy code is almost identical to this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.