spaCy: newline \n is captured as part of named entities?

Created on 11 Dec 2019 · 8 Comments · Source: explosion/spaCy

Hello there :)
I ran the following code:

import spacy
text = """"This is about Alice
who visited Wonderland. Alice?"""

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"]) #  want NER only
doc = nlp(text)
NE = set()
for ent in doc.ents:
    print(ent.text, ent.label_)
    NE.add(ent.text)
print("The named entities are:", NE)

The result is:

Alice
 PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Alice\n', 'Wonderland', 'Alice'}

As you can see, it includes "\n" as part of one of the entities found, and as a result treats the "Alice" before the line break as different from a regular "Alice".

This bug seems to me similar to #2870, which was claimed to be solved. So was it?

Your Environment

Operating System: Windows 7
Python Version Used: 3.7.3
spaCy Version Used: 2.2.3
Environment Information: ?

bug feat / ner

karzmei

All 8 comments

Hmm, I can't reproduce this. (Could it be related to Windows somehow?)

adrianeboyd on 11 Dec 2019

Testing the original snippet on Windows:

text = """"This is about Alice
who visited Wonderland. Alice?"""

gives me the correct

Alice PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Wonderland', 'Alice'}

But if I alter the string to be

text = """"This is about Alice\n
who visited Wonderland. Alice?"""

I get

Alice

     PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Wonderland', 'Alice\n\n    ', 'Alice'}

And if I do

text = """"This is about Alice\n who visited Wonderland. Alice?"""

it does strip it:

Alice PERSON
Wonderland GPE
Alice PERSON
The named entities are: {'Alice', 'Wonderland'}

So not exactly the same behaviour - but something seems to be going on indeed ...

svlandeg on 11 Dec 2019

Thank you both! So it works fine when not on Windows?

If I do text = """"This is about Alice\n who visited Wonderland. Alice?"""
it gives the correct output, otherwise it doesn't (I tried the options svlandeg tested and some more, e.g. inserting a space before the \n).

karzmei on 11 Dec 2019

I can get whitespace within the entity with two newlines followed by a space:

text = """This is about Alice\n\n who visited Wonderland. Alice?"""

Very strange!

adrianeboyd on 11 Dec 2019

The option

text = """"This is about Alice\n\n
who visited Wonderland. Alice?"""

gives me (on Windows) the correct output, while with just one \n there (Alice\n) it gives "Alice\n\n".

karzmei on 11 Dec 2019

I dug into the code a little and noticed that this has indeed been addressed before by these edits, but was partly reverted shortly after, specifically to allow entities to end on whitespace again. So maybe this is not truly a bug but rather a model prediction error?

svlandeg on 11 Dec 2019
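
If trailing whitespace in an entity is a model prediction rather than a bug, one possible way to keep it from splitting otherwise identical entities is to normalize the entity strings after the fact. This is a minimal sketch, assuming the entity texts have already been collected (for example from ent.text in the loop of the original snippet). The helper name normalize_entity_texts is hypothetical and not part of spaCy.

```python
def normalize_entity_texts(entity_texts):
    """Strip leading/trailing whitespace so 'Alice\\n' and 'Alice' are counted as one entity.

    Hypothetical helper for illustration; it only touches the collected strings,
    not the spaCy Doc or its entity spans.
    """
    return {text.strip() for text in entity_texts}


# Entity strings as reported in the original post.
reported = ["Alice\n", "Wonderland", "Alice"]
print(sorted(normalize_entity_texts(reported)))  # ['Alice', 'Wonderland']
```

In the original loop this would amount to calling NE.add(ent.text.strip()) instead of NE.add(ent.text).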

Sorry for the late follow-up. We discussed this internally, and in the end I don't really think this is a bug. At some point (as linked in my previous post) we did try to prevent whitespace in entities, but this would have resulted in some unwanted side-effects during training, which meant that the whitespace had to be allowed again for some edge cases to work properly.

Anyway, I think your best option here is to preprocess your texts and clean them up slightly if you can, before you run them through the spaCy nlp pipeline. Sorry we don't really have a more satisfying answer :(

svlandeg on 15 Apr 2020
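
The preprocessing suggested in the comment above might look like the following minimal sketch. It assumes that collapsing every run of whitespace (including newlines) into a single space is acceptable for your texts. collapse_whitespace is a hypothetical helper, not a spaCy function.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of spaces, tabs and newlines with a single space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "This is about Alice\n\n    who visited Wonderland. Alice?"
cleaned = collapse_whitespace(raw)
print(cleaned)  # This is about Alice who visited Wonderland. Alice?

# The cleaned string would then be passed to the pipeline, e.g. doc = nlp(cleaned).
```

Note that entity character offsets then refer to the cleaned text rather than the original, which matters if you need to map entities back to the raw input.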
