Here's a minimal example to reproduce:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

model = SequenceTagger.load("ner-ontonotes-fast")

full_text = "\"In the 1960s and 1970s...\" Then came Thierry Mugler and Gianni Versace."
sentence = Sentence(full_text, use_tokenizer=True)
model.predict(sentence)

print(f"full text : {full_text}")
print(f"text length: {len(full_text)}")
print("tag\tstart\tend\tto_original_text()")
for entity in sentence.get_spans('ner'):
    print(f"{entity.tag}\t{entity.start_pos}\t{entity.end_pos}\t{entity.to_original_text()}")
```
Output:

```
$ python predict.py
full text : "In the 1960s and 1970s..." Then came Thierry Mugler and Gianni Versace.
text length: 72
tag     start   end     to_original_text()
DATE    8       13      1960s
DATE    18      23      1970s
PERSON  81      94      ThierryMugler
PERSON  97      110     GianniVersace
```
It seems the resulting spans have `start_pos` and `end_pos` indices larger than the actual text length. Note also that `to_original_text()` is eating the spaces, so I suppose the two problems are related.
Any ideas about what is causing the trouble?
Hi @JoanEspasa - thanks for pointing this out! But I cannot reproduce this error. What version of Flair are you using?
@tabergma just told me she can reproduce the error, so checking it out now!
Hi @JoanEspasa!
The issue lies in the tokenizer itself. The tokenizer returns the following tokens:

```
['"', 'In', 'the', '1960s', 'and', '1970s', '...', '.', 'Then', 'came', 'Thierry', 'Mugler', 'and', 'Gianni', 'Versace', '.']
```

The 8th token, `.`, is not in the original text; it was added by the tokenizer. We have observed this issue a couple of times already, but unfortunately we cannot do anything about it on our side.
Now the following happens: we want to get the start index of that `.` token by searching the text with `text.index('.')` from the current position, which returns 71, as that is the only remaining position in the text that actually matches `.`. This shifts the start positions of all following tokens, and since those positions are now greater than the text length, the algorithm no longer works as expected.
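The shift can be seen in isolation with a minimal sketch (standalone, not Flair's actual code): once the spurious `.` token is looked up in the original text, the search cursor jumps all the way to the sentence-final period.

```python
text = '"In the 1960s and 1970s..." Then came Thierry Mugler and Gianni Versace.'

# After matching the tokens up to and including '...', the search
# cursor sits at index 26 (just past the ellipsis).
cursor = 26

# The spurious '.' token emitted by the tokenizer is looked up in the
# original text -- the only remaining match is the final period.
start = text.index('.', cursor)
print(start)  # -> 71

# The cursor now sits past the end of the text (len(text) == 72), so
# every later token ('Then', 'came', 'Thierry', ...) is assigned an
# offset beyond the text length.
```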
I'll look into the issue. Maybe we can find a workaround.
Which version of segtok are you using? I had version 1.5.6 installed. I just updated to version 1.5.7 and the issue is solved.
@tabergma Indeed! I was using segtok-1.5.6. Upgrading to 1.5.7 fixes the issue :smile:
I did not realize that you were using an external tokenizer, I'm sorry. I should have read the documentation more carefully and reported the issue to them.
I think this happened to me because your setup.py pins `segtok==1.5.6` while requirements.txt pins `segtok==1.5.7`. Installing via pip (as I did) resolves dependencies from setup.py, not requirements.txt.
Thanks again for looking into this @tabergma and @alanakbik :beers:
PD: feel free to close the issue :)
Thanks for verifying! Will update setup.py to pin `segtok==1.5.7`.