spaCy: NER tags pure whitespace as entities

Created on 12 Dec 2017 · 7 comments · Source: explosion/spaCy


It is easy to generate cases where the named entity recognizer will tag whitespace as part of the entity. That makes sense when the white space is within the span, but it doesn't make sense when the whitespace is the first or last token in the span, and it especially doesn't make sense if the span only consists of whitespace.

For example:

import en_core_web_sm

nlp = en_core_web_sm.load()

text = u'''
A U.S. regulator's preliminary investigation into the biggest oil pipeline spill this year has raised a red flag that could 
trigger an extensive and costly inspection of tens of thousands of miles of underground energy lines.
    '''

doc = nlp(text)

for ent in doc.ents:
    print("Ent Text: {} Ent_type_: {}".format(ent.text, ent.label_))

Generates the following:

Ent Text: 
 Ent_type_: GPE
Ent Text: U.S. Ent_type_: GPE
Ent Text: this year Ent_type_: DATE
Ent Text: 
 Ent_type_: GPE
Ent Text: tens of thousands of miles Ent_type_: QUANTITY

The first and fourth entities are pure whitespace (newlines), but they are tagged as GPE entities.

It's not terribly difficult to filter these whitespace entities out, but I don't think we should have to... thoughts?

Your Environment

  • Operating System: Linux
  • Python Version Used: 2.7
  • spaCy Version Used: 2.0
  • Environment Information:
Labels: feat / ner, perf / accuracy

All 7 comments

@Bri-Will This is a weakness in the model that should be fixed with data augmentation in future versions. In the meantime, you can add a post-process to the pipeline that unsets them:

def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

nlp.add_pipe(remove_whitespace_entities, after='ner')
doc = nlp(u'Hello\nNew York')
print(doc.ents)
# (New York,)

Yes, I've already figured out how to remove them... it just seems like I shouldn't have to. Thanks...

A related issue is that whitespace around an entity is often picked up, and this can be problematic for downstream processing. For example, in this text:

"  Decibel: "Metallica balances legacy, longevity, and longitude with Hardwired...

The (mis)tagged entity is '  Decibel', which includes two whitespace characters before the word. Is there a solution similar to the remove_whitespace_entities() function you defined above? This case is more subtle, because simply erasing the whitespace from the text throws off the computation of span offsets, etc.
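One way to handle this is to shrink each entity at the token level rather than editing the text, so all character offsets stay valid. A minimal sketch, assuming spaCy v2 as elsewhere in this thread (`trim_bounds` and `trim_entity_spans` are hypothetical names, not spaCy API):

```python
def trim_bounds(is_space, start, end):
    # Advance `start` past leading space tokens and pull `end` back past
    # trailing ones; `is_space` holds one boolean per token in the doc.
    while start < end and is_space[start]:
        start += 1
    while end > start and is_space[end - 1]:
        end -= 1
    return start, end

def trim_entity_spans(doc):
    # Rebuild each entity span on trimmed token boundaries. Because only
    # the span boundaries move, the underlying text and every character
    # offset remain untouched; whitespace-only spans are dropped entirely.
    from spacy.tokens import Span
    flags = [token.is_space for token in doc]
    trimmed = []
    for ent in doc.ents:
        start, end = trim_bounds(flags, ent.start, ent.end)
        if start < end:
            trimmed.append(Span(doc, start, end, label=ent.label))
    doc.ents = trimmed
    return doc
```

It would be registered the same way as the function above, e.g. `nlp.add_pipe(trim_entity_spans, after='ner')` in the v2 pipeline API.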

I have what I feel is a related (or the same) issue posted on Stack Overflow back on November 22 that has no answers. I noticed that \n is very commonly tagged as GPE. As above, post-processing can remove such entities in the meantime, but a longer-term fix would be best. I should add that I have trained my own model on my own data, and I still get GPE entities for \n.
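For what it's worth, the data augmentation mentioned above amounts to showing the model newlines that are not part of any entity. A rough sketch of that idea, assuming entities are given as (start, end, label) character offsets (`insert_newlines` is a hypothetical helper, not spaCy's actual implementation):

```python
def insert_newlines(text, entities, positions):
    # Insert a '\n' at each character index in `positions` (all of which
    # must fall outside every entity span) and shift the (start, end,
    # label) offsets so they still point at the same surface strings.
    positions = sorted(positions)
    chars = list(text)
    for p in reversed(positions):
        chars.insert(p, '\n')
    shifted = []
    for start, end, label in entities:
        shift = sum(1 for p in positions if p <= start)
        shifted.append((start + shift, end + shift, label))
    return ''.join(chars), shifted
```

Training on examples augmented this way would give the model evidence that '\n' belongs outside entities, rather than leaving it unseen.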

On a related note, a document consisting of only the string "\n\n" is tagged as PERSON. In a document with several of these, they are all tagged as PERSON. Here's a simple example:

Input:

import spacy

nlp = spacy.load('en_core_web_sm')
texts = [
    '\n\n',
    '\n\nI like cheese.\n\nSomething else.',
]

for text in texts:
    print(f'{repr(text)}:')
    doc = nlp(text)
    for ent in doc.ents:
        tup = (ent.text, ent.start_char, ent.end_char, ent.label_)
        print(f"    {repr(tup)}")

Output:

'\n\n':
    ('\n\n', 0, 2, 'PERSON')
'\n\nI like cheese.\n\nSomething else.':
    ('\n\n', 0, 2, 'PERSON')
    ('\n\n', 16, 18, 'PERSON')

See #2870:

Next nightly release will feature a proper fix for this: 1e6725e

Thanks for your patience!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

