Here's a minimal example to reproduce:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

model = SequenceTagger.load("ner-ontonotes-fast")

full_text = "\"In the 1960s and 1970s...\" Then came Thierry Mugler and Gianni Versace."
sentence = Sentence(full_text, use_tokenizer=True)
model.predict(sentence)

print(f"full text : {full_text}")
print(f"text length: {len(full_text)}")
print("tag\tstart\tend\tto_original_text()")
for entity in sentence.get_spans('ner'):
    print(f"{entity.tag}\t{entity.start_pos}\t{entity.end_pos}\t{entity.to_original_text()}")
```
Output:

```
$ python predict.py
full text : "In the 1960s and 1970s..." Then came Thierry Mugler and Gianni Versace.
text length: 72
tag     start   end     to_original_text()
DATE    8       13      1960s
DATE    18      23      1970s
PERSON  81      94      ThierryMugler
PERSON  97      110     GianniVersace
```
It seems the resulting spans have `start_pos` and `end_pos` indices larger than the actual text length. Note also that `to_original_text()` is eating the spaces, so I suppose the two problems are related.
Any ideas about what is causing the trouble?
Hi @JoanEspasa - thanks for pointing this out! But I cannot reproduce this error. What version of Flair are you using?
@tabergma just told me she can reproduce the error, so checking it out now!
Hi @JoanEspasa!
The issue lies in the tokenizer itself. The tokenizer returns the following tokens:

```
['"', 'In', 'the', '1960s', 'and', '1970s', '...', '.', 'Then', 'came', 'Thierry', 'Mugler', 'and', 'Gianni', 'Versace', '.']
```

The 8th token, `.`, is not in the original text; it was added by the tokenizer. We have observed this issue a couple of times already, but unfortunately we cannot do anything about it on our side.
Now the following happens: we want to get the start index of that `.` token by searching the text with `text.index('.')` from the current position, which returns 71, as that is the only remaining position in the text that actually matches `.`. This shifts the start positions of all following tokens, and since those positions are now greater than the text length, the algorithm no longer works as expected.
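The shift can be seen in isolation with a minimal sketch (standalone, not Flair's actual code): once the spurious `.` token is looked up in the original text, the search cursor jumps all the way to the sentence-final period.

```python
text = '"In the 1960s and 1970s..." Then came Thierry Mugler and Gianni Versace.'

# After matching the tokens up to and including '...', the search
# cursor sits at index 26 (just past the ellipsis).
cursor = 26

# The spurious '.' token emitted by the tokenizer is looked up in the
# original text -- the only remaining match is the final period.
start = text.index('.', cursor)
print(start)  # -> 71

# The cursor now sits past the end of the text (len(text) == 72), so
# every later token ('Then', 'came', 'Thierry', ...) is assigned an
# offset beyond the text length.
```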
I'll look into the issue. Maybe we can find a workaround.
Which version of segtok are you using? I had version 1.5.6 installed. I just updated to version 1.5.7 and the issue is solved.
@tabergma Indeed! I was using segtok-1.5.6. Upgrading to 1.5.7 fixes the issue :smile:
I did not realize that you were using an external tokenizer, I'm sorry. I should have read the documentation more carefully and reported the issue to them.
I think this happened to me because your setup.py pins `segtok==1.5.6` while requirements.txt pins `segtok==1.5.7`. Installing via pip (as I did) resolves dependencies from setup.py, not requirements.txt.
Thanks again for looking into this @tabergma and @alanakbik :beers:
PD: feel free to close the issue :)
Thanks for verifying! Will update setup.py to pin `segtok==1.5.7`.