spaCy: newline \n captured in the entity parser?

Created on 22 Oct 2018 · 6 Comments · Source: explosion/spaCy

Hello there!

I am more and more excited by spaCy, but I found some weird behavior in this example:

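import spacy

# Assumption: the post does not say which model was loaded; any English model with an NER component would do
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'''This is some crazy test where I dont need an Apple
               Watch to make things bug''')

# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)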

(u'Apple', 45, 50, u'ORG')
(u'\n           Watch', 50, 67, u'PERSON')

Why is the newline even included in the entity here? In other examples, the newline itself was tagged as a GPE entity, such as:


(u'\n', 1237, 1238, u'GPE')
(u'\n', 1293, 1294, u'GPE')

I am confused here. Is that expected?
Thanks!

Labels: bug, feat / ner, models

randomgambit

All 6 comments

Am I supposed to strip the newlines from the text before processing? I guess that would affect the sentence boundaries, though?

randomgambit on 25 Oct 2018

We're supposed to have some heuristics during training to prevent this, sorry. The test data doesn't have newlines, so the regression crept back in without it showing up in the evaluation.

I've added an example showing how to hotfix the issue: https://github.com/explosion/spaCy/commit/5a4aeb96b72af43aba5a1d7f143214d4083d151f

The output isn't quite what you'd want for your example:

Before (Apple,                Watch)
After (Apple,)

We force the space token to be outside the entity, while ideally in your example we'd like it to be inside. You may get better results by pre-processing the newlines out. The textacy library has some useful pre-processing options like this.
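As a rough illustration of the pre-processing route mentioned above, here is a minimal sketch using only the standard library. The helper name collapse_whitespace is made up for this example and is not part of spaCy or textacy. It assumes you are happy to lose the original line breaks, which also means character offsets in the cleaned text will no longer line up with the raw input.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (newlines, tabs, spaces) with one space.

    Hypothetical helper for illustration only; not a spaCy or textacy function.
    """
    return re.sub(r"\s+", " ", text).strip()


raw = """This is some crazy test where I dont need an Apple
               Watch to make things bug"""

cleaned = collapse_whitespace(raw)
print(cleaned)
# The cleaned string would then be passed to the pipeline, e.g. doc = nlp(cleaned)
```

Because the newline and the indentation become a single space, "Apple" and "Watch" end up as adjacent tokens, which at least gives the model a chance to tag them as one entity. Whether it actually does depends on the model.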

honnibal on 28 Oct 2018

Hey

I want to add that I've just experienced the NER picking up whitespace as an entity. The result was quite catastrophic: I was removing the named entities, so I ended up with the whole document being one word. It was very easy to fix on my side by just popping whitespace from the result. But if you're fixing newlines, maybe also check for other types of whitespace, control characters, etc. I would think that any of these sorts of characters should be removed automatically as a rule.

Anyway, I found the issue quite fast, so thanks for a good library otherwise.
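The "popping whitespace from the result" workaround could look roughly like the sketch below. It is a guess at the idea, not the commenter's actual code. It works on plain (text, start_char, end_char, label) tuples shaped like the output in the original post, and the function name strip_entity_whitespace is hypothetical. It relies on Python's str.strip(), which handles whitespace but not control characters in general.

```python
def strip_entity_whitespace(entities):
    """Drop whitespace-only entities and trim whitespace from the rest.

    `entities` is an iterable of (text, start_char, end_char, label) tuples.
    Offsets are adjusted so they still point at the trimmed text.
    """
    cleaned = []
    for text, start, end, label in entities:
        stripped = text.strip()
        if not stripped:
            # Entity was only whitespace, e.g. '\n' tagged as GPE
            continue
        # Number of leading whitespace characters that were trimmed
        lead = len(text) - len(text.lstrip())
        new_start = start + lead
        cleaned.append((stripped, new_start, new_start + len(stripped), label))
    return cleaned


ents = [
    ("Apple", 45, 50, "ORG"),
    ("\n           Watch", 50, 67, "PERSON"),
    ("\n", 1237, 1238, "GPE"),
]

for ent in strip_entity_whitespace(ents):
    print(ent)
```

This only cleans up the spans; it does not correct the labels, so "Watch" would still come back tagged as PERSON.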