Spacy: NER Publication

Created on 1 Sep 2016 · 7 Comments · Source: explosion/spaCy

Dear @honnibal

Very nice work with spaCy. You've built an excellent foundation that I hope continues.

Could you link me to any publication that describes how the NER works in spaCy?

Thank you.

lababidi

Most helpful comment

Dear @honnibal,

I'm sorry to ask the same question again, but I'm not sure I get how the NER is trained.

Could you provide kind of a sketch of the architecture?
For instance, when you say greedy linear model, what part does it stand for?
Plus, which kind of neural net have you trained? LSTM?
And from what I understand, you do not rely on CRF at all, is that correct?

I'm sorry if my questions are a lil bit fuzzy, but I really would like to understand what's under the hood.

In advance, thanks.

antisrdy on 13 Jul 2017

👍4

All 7 comments

The NER model doesn't exactly match any published system, but it also wasn't quite novel enough for me to write it up as a publication.

Some details:

Trained on OntoNotes 5
Greedy linear model
Weights learned with averaged perceptron
Uses transition-based parsing machinery, with BILOU-based transition system
Uses dynamic oracle to learn-to-search
Does not use gazetteer features
Does not use document features
Processes whole documents (does not require sentence boundary detection pre-process)
Uses POS tag and Brown cluster features. I believe the use of these contributes to the current model's case fragility.
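
To make a few of the items above concrete, here is a minimal sketch, not spaCy's actual implementation. It shows BILOU encoding, a transition-validity check, an averaged perceptron, and greedy left-to-right decoding. All names (`spans_to_bilou`, `is_valid`, `AveragedPerceptron`, `features`, `greedy_tag`) and the tiny feature set are hypothetical, and for brevity training follows the gold history (a static oracle) rather than the dynamic oracle mentioned above.

```python
def spans_to_bilou(n_tokens, spans):
    """Encode (start, end, label) spans, end exclusive, as BILOU tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "U-" + label          # Unit: single-token entity
        else:
            tags[start] = "B-" + label          # Begin
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label          # In
            tags[end - 1] = "L-" + label        # Last
    return tags


def is_valid(prev, tag, last):
    """Transition constraint: an open entity must be continued or closed."""
    if prev[0] in "BI":
        same_label = tag[2:] == prev[2:]
        return same_label and (tag[0] == "L" or (tag[0] == "I" and not last))
    return tag[0] in "OU" or (tag[0] == "B" and not last)


class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = {}   # (feature, class) -> current weight
        self._totals = {}   # (feature, class) -> weight summed over steps
        self._stamps = {}   # (feature, class) -> step of last change
        self.steps = 0

    def scores(self, feats):
        return {c: sum(self.weights.get((f, c), 0.0) for f in feats)
                for c in self.classes}

    def update(self, truth, guess, feats):
        self.steps += 1
        if truth == guess:
            return
        for f in feats:
            for c, delta in ((truth, 1.0), (guess, -1.0)):
                key = (f, c)
                w = self.weights.get(key, 0.0)
                # Credit the old weight for every step it was in force.
                elapsed = self.steps - self._stamps.get(key, 0)
                self._totals[key] = self._totals.get(key, 0.0) + elapsed * w
                self._stamps[key] = self.steps
                self.weights[key] = w + delta

    def average(self):
        """Replace each weight by its average over all update steps."""
        for key, w in self.weights.items():
            elapsed = self.steps - self._stamps.get(key, 0)
            total = self._totals.get(key, 0.0) + elapsed * w
            self.weights[key] = total / max(self.steps, 1)


def features(words, i, prev):
    # Hypothetical toy features; the real model uses far richer ones.
    w = words[i]
    return ["bias", "w=" + w.lower(), "title=" + str(w.istitle()), "prev=" + prev]


def best_valid(model, words, i, prev):
    scores = model.scores(features(words, i, prev))
    valid = [t for t in model.classes if is_valid(prev, t, i == len(words) - 1)]
    return max(valid, key=lambda t: scores[t])


def train(model, data, epochs=5):
    for _ in range(epochs):
        for words, gold in data:
            prev = "O"
            for i, truth in enumerate(gold):
                guess = best_valid(model, words, i, prev)
                model.update(truth, guess, features(words, i, prev))
                prev = truth   # static oracle: always follow the gold history
    model.average()


def greedy_tag(model, words):
    prev, out = "O", []
    for i in range(len(words)):
        prev = best_valid(model, words, i, prev)   # commit and never revisit
        out.append(prev)
    return out


words = "Ada Lovelace visited London".split()
gold = spans_to_bilou(len(words), [(0, 2, "PER"), (3, 4, "GPE")])
classes = ["O"] + [p + l for l in ("PER", "GPE") for p in ("B-", "I-", "L-", "U-")]
model = AveragedPerceptron(classes)
train(model, [(words, gold)])
print(gold)
print(greedy_tag(model, words))
```

Because invalid transitions are filtered out before the argmax, the greedy decoder can only produce well-formed BILOU sequences, even with untrained weights.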

syllog1sm on 8 Sep 2016

How much work would be involved for me to incorporate a gazetteer? What was your reasoning for not doing so?

fmailhot on 8 Sep 2016

I've always wanted a gazetteer. I just never got around to it.

Incorporating a per-word gazetteer is totally trivial: all you have to do is set a flag on the Lexeme objects, e.g.

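# Assumes `nlp` is a loaded spaCy pipeline and `person_names` is an
# iterable of strings; both are placeholders here.
from spacy.symbols import FLAG60 as IS_PERSON  # repurpose a spare flag slot

for word in person_names:
    lex = nlp.vocab[word]          # look up the Lexeme for this string
    lex.set_flag(IS_PERSON, True)  # mark the word as a known person name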

The machinery is in place for multi-word gazetteer matches as well, although the best option for dealing with this depends on the size of the list and the ambiguity of the entries. If you want rule-based matching, you could use the Matcher and PhraseMatcher classes. To allow ambiguity, I would suggest using the gazetteer to assign features, which the statistical model interprets.
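
As a rough illustration of the "gazetteer as features" idea, here is a minimal pure-Python sketch that does not use spaCy's API. The names `build_gazetteer` and `gazetteer_features`, and the `gaz=` feature strings, are hypothetical. Every matching entry contributes a feature, so ambiguous or overlapping entries are kept and it is left to the statistical model to weigh them.

```python
def build_gazetteer(entries):
    """Index (label, phrase) pairs by the lowercased first token."""
    index = {}
    for label, phrase in entries:
        toks = tuple(phrase.lower().split())
        if toks:
            index.setdefault(toks[0], []).append((toks, label))
    return index


def gazetteer_features(words, index):
    """Return one set of feature strings per token.

    Matches are not resolved against each other: overlapping or
    ambiguous entries all add their features.
    """
    feats = [set() for _ in words]
    lowered = [w.lower() for w in words]
    for i, w in enumerate(lowered):
        for toks, label in index.get(w, []):
            n = len(toks)
            if tuple(lowered[i:i + n]) != toks:
                continue
            for j in range(n):
                if n == 1:
                    pos = "U"
                elif j == 0:
                    pos = "B"
                elif j == n - 1:
                    pos = "L"
                else:
                    pos = "I"
                feats[i + j].add("gaz=%s-%s" % (pos, label))
    return feats


index = build_gazetteer([
    ("GPE", "New York"),
    ("ORG", "New York Times"),
    ("PER", "Paris"),
    ("GPE", "Paris"),
])
words = "She reads the New York Times in Paris".split()
for word, f in zip(words, gazetteer_features(words, index)):
    print(word, sorted(f))
```

Here "New" receives both a GPE and an ORG feature, and "Paris" both a PER and a GPE feature; a rule-based matcher would have to pick one, whereas a feature-based setup defers that choice to the learned weights.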

If you've collected a sizeable gazetteer already, would you be willing to share it with others?

syllog1sm on 8 Sep 2016

> To allow ambiguity, I would suggest using the gazetteer to assign features, which the statistical model interprets.

Is this something we can already do with spaCy, or is the infrastructure being put in place for the future?

savkov on 13 Sep 2016

+1. @honnibal would be nice if you could write a blog post about how it works.

wanasit on 6 Aug 2017
