Hello there :)
I ran the following code:
import spacy
text = """"This is about Alice
who visited Wonderland. Alice?"""
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])  # want NER only
doc = nlp(text)
NE = set()
for ent in doc.ents:
    print(ent.text, ent.label_)
    NE.add(ent.text)
print("The named entities are:", NE)
The result is:
Alice
PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Alice\n', 'Wonderland', 'Alice'}
As you can see, it includes "\n" as part of a found entity, and as a result treats the "Alice" before the line break as different from a regular "Alice".
This bug seems to me similar to #2870, which was claimed to be solved. So was it?
Hmm, I can't reproduce this. (Could it be related to Windows somehow?)
Testing the original snippet on Windows:
text = """"This is about Alice
who visited Wonderland. Alice?"""
gives me the correct
Alice PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Wonderland', 'Alice'}
But if I alter the string to be
text = """"This is about Alice\n
who visited Wonderland. Alice?"""
I get
Alice
PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Wonderland', 'Alice\n\n ', 'Alice'}
And if I do
text = """"This is about Alice\n who visited Wonderland. Alice?"""
it does strip it:
Alice PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Alice', 'Wonderland'}
So not exactly the same behaviour - but something seems to be going on indeed ...
Thank you both! So it works fine when not on Windows?
If I do text = """"This is about Alice\n who visited Wonderland. Alice?"""
it gives the correct output, otherwise it doesn't (I tried the options svlandeg tested and some more, e.g. inserting a space before the \n).
I can get whitespace within the entity with two newlines followed by a space:
text = """This is about Alice\n\n who visited Wonderland. Alice?"""
Very strange!
The option
text = """"This is about Alice\n\n
who visited Wonderland. Alice?"""
gives me (on Windows) the correct output, while with just one \n there (Alice\n) it gives "Alice\n\n".
I dug into the code a little and noticed that this has indeed been addressed before by these edits, but that the change was partly reverted shortly after, specifically to allow entities to end on whitespace again. So maybe this is not truly a bug but rather a model prediction error?
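If entities are allowed to end on whitespace, one possible workaround is to strip the entity text before collecting it. The following is only a minimal sketch: `unique_entity_texts` is a hypothetical helper (not part of spaCy), and it works on plain strings so that it can be tried without a model.

```python
def unique_entity_texts(entity_texts):
    """Return the set of entity strings with surrounding whitespace removed.

    Strings that are empty after stripping are dropped.
    """
    return {text.strip() for text in entity_texts if text.strip()}


# The entity strings reported earlier in this thread:
print(sorted(unique_entity_texts(["Alice\n", "Wonderland", "Alice"])))
# -> ['Alice', 'Wonderland']
```

With a spaCy Doc this could be called as `NE = unique_entity_texts(ent.text for ent in doc.ents)`. Note that it only cleans up the collected strings; the entity spans in the Doc still include the whitespace.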
Sorry for the late follow-up. We discussed this internally, and in the end I don't really think this is a bug. At some point (as linked in my previous post) we did try to prevent whitespace in entities, but this would have resulted in some unwanted side effects during training, which meant that the whitespace had to be allowed again for some edge cases to work properly.
Anyway, I think your best option here is to preprocess your texts and clean them up slightly if you can, before you run them through the spaCy nlp pipeline. Sorry we don't really have a more satisfying answer :(
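As a rough illustration of that kind of preprocessing, here is a minimal sketch that collapses every run of whitespace into a single space before the text reaches the pipeline. `normalize_whitespace` is a hypothetical helper name, and whether this is appropriate depends on your data: it changes character offsets, so entity positions will no longer line up with the original text.

```python
import re


def normalize_whitespace(text):
    """Collapse each run of spaces, tabs and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


text = """This is about Alice

 who visited Wonderland. Alice?"""
clean = normalize_whitespace(text)
print(clean)
# -> This is about Alice who visited Wonderland. Alice?

# The cleaned string would then be passed to the pipeline as usual,
# assuming `nlp` is the loaded pipeline from the original snippet:
# doc = nlp(clean)
```

This should leave no newlines for an entity to absorb, although the model could still in principle attach a single trailing space to an entity.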