Spacy: newline \n captured in the entity parser?

Created on 22 Oct 2018 · 6 Comments · Source: explosion/spaCy

Hello there!

I am more and more excited by spaCy, but I found some weird behavior in this example:

doc = nlp(u'''This is some crazy test where I dont need an Apple
               Watch to make things bug''')    

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)    

(u'Apple', 45, 50, u'ORG')
(u'\n           Watch', 50, 67, u'PERSON')

Why is the newline even added to the output here? In other examples, I had the newline itself parsed as a GPE entity, such as:

(u'\n', 1237, 1238, u'GPE')
(u'\n', 1293, 1294, u'GPE')

I am confused here. Is that expected?
Thanks!

Labels: bug, feat / ner, models

All 6 comments

Am I supposed to get rid of the newlines in the text before processing? But that would affect the sentence boundaries I guess?

We're supposed to have some heuristics during training to prevent this, sorry. The test data doesn't have newlines, so the regression crept back in without it showing up in the evaluation.

I've added an example showing how to hotfix the issue: https://github.com/explosion/spaCy/commit/5a4aeb96b72af43aba5a1d7f143214d4083d151f

The output isn't exactly what you'd want on your example:

Before (Apple,                Watch)
After (Apple,)

We force the space token to be outside the entity, while ideally in your example we'd like it to be inside. You may get better results by pre-processing the newlines out. The textacy library has some useful pre-processing options like this.
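As a rough illustration of the pre-processing suggestion above, here is a minimal sketch that collapses every run of whitespace into a single space before the text is passed to the pipeline. It uses only the standard library rather than textacy, and the helper name `collapse_whitespace` is made up for this example. Note that it changes character offsets relative to the original text, and it discards newlines that might otherwise signal paragraph or sentence breaks, so it may not suit every document.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace (spaces, tabs, newlines) with one space.

    Hypothetical helper for illustration; not part of spaCy or textacy.
    """
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "This is some crazy test where I dont need an Apple\n"
    "           Watch to make things bug"
)

# The cleaned string is what you would then hand to nlp(...).
cleaned = collapse_whitespace(raw)
print(cleaned)
```

With the newline and indentation gone, "Apple" and "Watch" are separated by a single space, so the entity recognizer no longer sees a whitespace token between them.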

Hey

Want to add that I've just experienced the NER picking up whitespace as an entity. The result was quite catastrophic since I was removing the named entities, so I ended up with the whole document being one word. Very easy to fix on my side by just popping whitespace from the result. But if you're fixing newlines, maybe also check for other types of whitespace, control characters etc. I would think that any of these sorts of characters should be removed automatically as a rule.

Anyway, I found the issue quite fast, so thanks for a good library otherwise.
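The "popping whitespace from the result" workaround described in this comment could look something like the sketch below. It assumes the entities have already been pulled out of the doc as `(text, start_char, end_char, label)` tuples, the same fields printed in the original post. The function name `clean_entities` is hypothetical. It drops entities that consist only of whitespace and trims leading or trailing whitespace from the rest, shifting the character offsets to match. It post-processes the output only and does not change what the model itself predicts.

```python
def clean_entities(ents):
    """Drop whitespace-only entities and trim whitespace from the others.

    `ents` is an iterable of (text, start_char, end_char, label) tuples.
    Hypothetical helper for illustration; not a spaCy API.
    """
    cleaned = []
    for text, start, end, label in ents:
        stripped = text.strip()
        if not stripped:
            # Entity was nothing but whitespace (e.g. a bare "\n"): skip it.
            continue
        # Number of leading whitespace characters removed by the trim.
        lead = len(text) - len(text.lstrip())
        new_start = start + lead
        cleaned.append((stripped, new_start, new_start + len(stripped), label))
    return cleaned


ents = [
    ("Apple", 45, 50, "ORG"),
    ("\n           Watch", 50, 67, "PERSON"),
    ("\n", 1237, 1238, "GPE"),
]

for ent in clean_entities(ents):
    print(ent)
```

On the tuples from the original post, this keeps "Apple" unchanged, narrows the second entity to "Watch" with adjusted offsets, and discards the bare newline.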

Watch out: the hotfix is messing up the dependency parser. The doc no longer has noun_chunks.

Next nightly release will feature a proper fix for this: https://github.com/explosion/spaCy/commit/1e6725e9b734862e61081a916baf440697b9971e

Thanks for your patience!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
