Spacy: NER Publication

Created on 1 Sep 2016 · 7 Comments · Source: explosion/spaCy

Dear @honnibal

Very nice work with SpaCy. You've built an excellent foundation that I hope continues.

Could you link me to any publication that describes how the NER works in SpaCy?

Thank you.

usage

All 7 comments

The NER model doesn't exactly match any published system, but it also wasn't quite novel enough for me to write it up as a publication.

Some details:

  • Trained on OntoNotes 5
  • Greedy linear model
  • Weights learned with averaged perceptron
  • Uses transition-based parsing machinery, with BILOU-based transition system
  • Uses dynamic oracle to learn-to-search
  • Does not use gazetteer features
  • Does not use document features
  • Processes whole documents (does not require sentence boundary detection pre-process)
  • Uses POS tag and Brown cluster features. I believe the use of these contributes to the current model's case fragility.
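To make a few of the terms above concrete, here is a rough, self-contained sketch of what a greedy linear model over BILOU actions can look like. It is not spaCy's code: the function names, the toy features, and the hand-set weights are all assumptions for illustration. In the real model the weights are learned (averaged perceptron with a dynamic oracle, as listed above) and the features include POS tags and Brown clusters, none of which is shown here.

```python
# Illustrative sketch only, not spaCy's implementation. All names are hypothetical.

def spans_to_bilou(n_tokens, spans):
    """Encode (start, end, label) spans (end exclusive) as one BILOU tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "U-" + label        # Unit: single-token entity
        else:
            tags[start] = "B-" + label        # Begin
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label        # In
            tags[end - 1] = "L-" + label      # Last
    return tags


def is_valid(prev_tag, tag, is_last_token):
    """Return True if `tag` may follow `prev_tag` under BILOU constraints."""
    if prev_tag[0] in "BI":                   # an entity is currently open
        if is_last_token:
            return tag == "L-" + prev_tag[2:]  # it must be closed before the end
        return tag[0] in "IL" and tag[2:] == prev_tag[2:]
    if is_last_token:
        return tag[0] in "OU"                 # cannot open an entity we cannot close
    return tag[0] in "OBU"


def features(tokens, i, prev_tag):
    """Toy feature set (an assumption); real systems use much richer features."""
    word = tokens[i]
    return ["bias", "lower=" + word.lower(),
            "is_title=%s" % word.istitle(), "prev_tag=" + prev_tag]


def greedy_decode(tokens, weights, tags):
    """Left to right, take the best-scoring valid action and never revisit it."""
    prev, output = "O", []
    for i in range(len(tokens)):
        feats = features(tokens, i, prev)
        is_last = i == len(tokens) - 1
        candidates = [t for t in tags if is_valid(prev, t, is_last)]
        # Linear model: the score of a tag is the sum of its (feature, tag) weights.
        prev = max(candidates,
                   key=lambda t: sum(weights.get((f, t), 0.0) for f in feats))
        output.append(prev)
    return output


TAGS = ["O", "B-PERSON", "I-PERSON", "L-PERSON", "U-PERSON"]
# Hand-set weights purely for demonstration; a trained model would learn these.
WEIGHTS = {
    ("is_title=True", "B-PERSON"): 1.0,
    ("prev_tag=B-PERSON", "L-PERSON"): 1.0,
    ("is_title=False", "O"): 1.0,
    ("lower=spacy", "O"): 2.0,
}
TOKENS = ["Matthew", "Honnibal", "wrote", "spaCy"]

print(greedy_decode(TOKENS, WEIGHTS, TAGS))
# ['B-PERSON', 'L-PERSON', 'O', 'O']
```

"Greedy" here means each decision is final: there is no beam and no CRF-style global normalisation over the whole tag sequence. The validity check is what makes this a transition system rather than plain per-token classification.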

How much work would be involved for me to incorporate a gazetteer? What was your reasoning for not doing so?

I've always wanted a gazetteer. I just never got around to it.

Incorporating a per-word gazetteer is totally trivial: all you have to do is set a flag on the Lexeme objects, e.g.

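from spacy.symbols import FLAG60 as IS_PERSON

# Assumes `nlp` is an already-loaded spaCy pipeline and
# `person_names` is your own iterable of single-word gazetteer entries.
for word in person_names:
    lex = nlp.vocab[word]          # look up the Lexeme for this word
    lex.set_flag(IS_PERSON, True)  # mark it as a gazetteer hit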

The machinery is in place for multi-word gazetteer matches as well, although the best option for dealing with this depends on the size of the list and the ambiguity of the entries. If you want rule-based matching, you could use the Matcher and PhraseMatcher classes. To allow ambiguity, I would suggest using the gazetteer to assign features, which the statistical model interprets.
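As a rough illustration of the "gazetteer assigns features, model interprets them" idea, here is a minimal plain-Python sketch. It does not use spaCy's API; `gazetteer_features`, the `gaz:` feature naming, and the toy gazetteer are all hypothetical. The point is that overlapping and ambiguous matches are all kept as features rather than resolved by rule.

```python
# Illustrative sketch only; names and feature format are assumptions.

def gazetteer_features(tokens, gazetteer, max_len=4):
    """Return one set of feature strings per token.

    `gazetteer` maps a tuple of lower-cased words to a set of candidate labels.
    Every match covering a token contributes a feature, so ambiguity is kept.
    """
    feats = [set() for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for start in range(len(tokens)):
        # Try every phrase of up to `max_len` tokens beginning at `start`.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            labels = gazetteer.get(tuple(lowered[start:end]), ())
            for label in labels:
                for i in range(start, end):
                    # Record the token's position within the match, BILOU-style.
                    if end - start == 1:
                        pos = "U"
                    elif i == start:
                        pos = "B"
                    elif i == end - 1:
                        pos = "L"
                    else:
                        pos = "I"
                    feats[i].add("gaz:%s-%s" % (pos, label))
    return feats


# Toy gazetteer with overlapping and ambiguous entries.
GAZETTEER = {
    ("new", "york"): {"GPE"},
    ("new", "york", "times"): {"ORG"},
    ("york",): {"GPE", "PERSON"},
}

for token, token_feats in zip(["The", "New", "York", "Times"],
                              gazetteer_features(["The", "New", "York", "Times"], GAZETTEER)):
    print(token, sorted(token_feats))
```

A statistical model can then weigh, say, `gaz:I-ORG` against `gaz:L-GPE` on "York" using the surrounding context, which a hard rule-based match cannot do.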

If you've collected a sizeable gazetteer already, would you be willing to share it with others?

> To allow ambiguity, I would suggest using the gazetteer to assign features, which the statistical model interprets.

Is this something we can do with spaCy or is there infrastructure set in place for the future?

Dear @honnibal,

I'm sorry to ask the same question again, but I'm not sure I get how the NER is trained.

Could you provide kind of a sketch of the architecture?
For instance, when you say greedy linear model, what part does it stand for?
Plus, which kind of neural net have you trained? LSTM?
And from what I understand, you do not rely on CRF at all, is that correct?

I'm sorry if my questions are a lil bit fuzzy, but I really would like to understand what's under the hood.

In advance, thanks.

+1. @honnibal would be nice if you could write a blog post about how it works.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
