spaCy: NER tags pure whitespace as entities

Created on 12 Dec 2017 · 7 comments · Source: explosion/spaCy


It is easy to generate cases where the named entity recognizer will tag whitespace as part of the entity. That makes sense when the white space is within the span, but it doesn't make sense when the whitespace is the first or last token in the span, and it especially doesn't make sense if the span only consists of whitespace.

For example:

import en_core_web_sm

nlp = en_core_web_sm.load()

text = u'''
A U.S. regulator's preliminary investigation into the biggest oil pipeline spill this year has raised a red flag that could 
trigger an extensive and costly inspection of tens of thousands of miles of underground energy lines.
    '''

doc = nlp(text)

for ent in doc.ents:
    print("Ent Text: {} Ent_type_: {}".format(ent.text, ent.label_))

Generates the following:

Ent Text: 
 Ent_type_: GPE
Ent Text: U.S. Ent_type_: GPE
Ent Text: this year Ent_type_: DATE
Ent Text: 
 Ent_type_: GPE
Ent Text: tens of thousands of miles Ent_type_: QUANTITY

The first and fourth entities are pure whitespace (newlines), but they are tagged as GPE entities.

It's not terribly difficult to filter these whitespace entities out, but I don't think we should have to... thoughts?

Your Environment

  • Operating System: Linux
  • Python Version Used: 2.7
  • spaCy Version Used: 2.0
  • Environment Information:
Labels: feat / ner, perf / accuracy

All 7 comments

@Bri-Will This is a weakness in the model that should be fixed with data augmentation in future versions. In the meantime, you can add a post-process to the pipeline that unsets them:

def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

nlp.add_pipe(remove_whitespace_entities, after='ner')
doc = nlp(u'Hello\nNew York')
print(doc.ents)
# (New York,)

Yes, I've already figured out how to remove them... it just seems like I shouldn't have to. Thanks...

A related issue is that whitespace around an entity is often picked up, and this can be problematic for downstream processing. For example, in this text:

"  Decibel: "Metallica balances legacy, longevity, and longitude with Hardwired...

The (mis)tagged entity is '  Decibel', which includes two whitespace characters before the word. Is there a solution similar to the remove_whitespace_entities() function you defined above? This case is more subtle, because simply erasing the whitespace from the text throws off the computation of span offsets, etc.
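One way to handle this is to shrink each entity at the token level rather than editing the text, so all character offsets stay valid. A minimal sketch, assuming spaCy v2 as elsewhere in this thread (`trim_bounds` and `trim_entity_spans` are hypothetical names, not spaCy API):

```python
def trim_bounds(is_space, start, end):
    # Advance `start` past leading space tokens and pull `end` back past
    # trailing ones; `is_space` holds one boolean per token in the doc.
    while start < end and is_space[start]:
        start += 1
    while end > start and is_space[end - 1]:
        end -= 1
    return start, end

def trim_entity_spans(doc):
    # Rebuild each entity span on trimmed token boundaries. Because only
    # the span boundaries move, the underlying text and every character
    # offset remain untouched; whitespace-only spans are dropped entirely.
    from spacy.tokens import Span
    flags = [token.is_space for token in doc]
    trimmed = []
    for ent in doc.ents:
        start, end = trim_bounds(flags, ent.start, ent.end)
        if start < end:
            trimmed.append(Span(doc, start, end, label=ent.label))
    doc.ents = trimmed
    return doc
```

It would be registered the same way as the function above, e.g. `nlp.add_pipe(trim_entity_spans, after='ner')` in the v2 pipeline API.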

I have what I feel is a related (or the same) issue posted on Stack Overflow back on November 22 that has no answers. I noticed that \n is very commonly tagged as GPE. As above, post-processing can remove such entities in the meantime, but a longer-term fix would be best. I should add that I have trained my own model on my own data, and I still get GPE entities for \n.
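For what it's worth, the data augmentation mentioned above amounts to showing the model newlines that are not part of any entity. A rough sketch of that idea, assuming entities are given as (start, end, label) character offsets (`insert_newlines` is a hypothetical helper, not spaCy's actual implementation):

```python
def insert_newlines(text, entities, positions):
    # Insert a '\n' at each character index in `positions` (all of which
    # must fall outside every entity span) and shift the (start, end,
    # label) offsets so they still point at the same surface strings.
    positions = sorted(positions)
    chars = list(text)
    for p in reversed(positions):
        chars.insert(p, '\n')
    shifted = []
    for start, end, label in entities:
        shift = sum(1 for p in positions if p <= start)
        shifted.append((start + shift, end + shift, label))
    return ''.join(chars), shifted
```

Training on examples augmented this way would give the model evidence that '\n' belongs outside entities, rather than leaving it unseen.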

On a related note, a document consisting of only the string "\n\n" is tagged as PERSON. In a document with several of these, they are all tagged as PERSON. Here's a simple example:

Input:

import spacy

nlp = spacy.load('en_core_web_sm')
texts = [
    '\n\n',
    '\n\nI like cheese.\n\nSomething else.',
]

for text in texts:
    print(f'{repr(text)}:')
    doc = nlp(text)
    for ent in doc.ents:
        tup = (ent.text, ent.start_char, ent.end_char, ent.label_)
        print(f"    {repr(tup)}")

Output:

'\n\n':
    ('\n\n', 0, 2, 'PERSON')
'\n\nI like cheese.\n\nSomething else.':
    ('\n\n', 0, 2, 'PERSON')
    ('\n\n', 16, 18, 'PERSON')

See #2870:

Next nightly release will feature a proper fix for this: 1e6725e

Thanks for your patience!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

