spaCy: Entity Recognition with Custom Vocab Flags

Created on 9 Mar 2020 · 7 Comments · Source: explosion/spaCy

I cannot find in the documentation for Entity Recognition whether or not custom vocab flags set by vocab.add_flag() are used in downstream parsing, tagging, or NER models. Does this flag affect downstream training?

Which page or section is this issue related to?

https://spacy.io/api/vocab
https://spacy.io/usage/training/#ner

enhancement feat / ner training

All 7 comments

No, custom flags aren't used by any of the core models. The main features included are ORTH, NORM, PREFIX, SUFFIX, SHAPE, plus the word vector if the model includes vectors.
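For illustration, the lexical attributes listed above can be approximated in plain Python. This is a rough sketch under stated assumptions, not spaCy's implementation: it assumes NORM is simply the lowercased text, PREFIX is the first character, SUFFIX is the last three characters, and SHAPE is a character-class pattern with long runs truncated. The `GENE_NAMES` set, the `is_gene_name` flag, and the helper names are hypothetical. The point is that a custom flag can be computed for each word without ever appearing among the columns a model reads.

```python
def word_shape(text):
    """Approximate word shape: letters become x/X, digits become d,
    and runs of the same shape character are capped at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


# Hypothetical custom flag, analogous to a getter passed to vocab.add_flag().
GENE_NAMES = {"BRCA1", "TP53"}


def is_gene_name(text):
    return text in GENE_NAMES


# Columns taken from the comment above; the custom flag is deliberately absent.
MODEL_COLUMNS = ["ORTH", "NORM", "PREFIX", "SUFFIX", "SHAPE"]


def lexical_attributes(text):
    """Compute every attribute for a word, including the custom flag."""
    return {
        "ORTH": text,
        "NORM": text.lower(),   # assumption: no normalization exceptions
        "PREFIX": text[:1],
        "SUFFIX": text[-3:],
        "SHAPE": word_shape(text),
        "IS_GENE_NAME": is_gene_name(text),
    }


def model_features(text):
    """Select only the columns the model would read."""
    attrs = lexical_attributes(text)
    return [attrs[col] for col in MODEL_COLUMNS]
```

Calling `lexical_attributes("BRCA1")` sets the hypothetical flag, but `model_features("BRCA1")` contains only the five listed columns, so the flag has no influence on anything built from that feature list.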

@soso-maitha not the question I'm asking. I'm looking to see what features go into the NER models, not how to set custom training entities.

@adrianeboyd thanks for the info! Are the full features documented anywhere that I missed? Also, would this imply that the results of initial models (parser and tagger) are not used in downstream training of NER models?

Thanks!

The tagger, parser, and NER models are all completely separate in spaCy v2, so you can easily train just an NER model.

For the current version (v2.2.3), the features used are defined as cols here (things are currently in transition, which is why the filename is a bit unexpected):

https://github.com/explosion/spaCy/blob/1d6aec805d5c03ad8a039466e98ed3a619e650c4/spacy/ml/_legacy_tok2vec.py

One of the problems with spaCy v2 is that it's hard to modify many of the model settings without digging into the code, and it's hard to make changes without breaking existing models, because many default values aren't saved with the models themselves.

This should improve in spaCy v3 with the rewrite of spaCy's ML library thinc (https://thinc.ai). In spaCy v3, it should get easier to define custom models through configuration files and function registries. You can see a preview of how this will look in spaCy v3 in the develop branch: https://github.com/explosion/spaCy/tree/5847be6022e615cdea55ca5a7856d203254e7ddf/spacy/ml/models
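The idea of combining configuration files with function registries can be sketched in plain Python. This is only a toy illustration under stated assumptions, not thinc's or spaCy's actual API: the `REGISTRY` dict, the `register` decorator, the `resolve` helper, the `@factory` key, and the name `toy_embed.v1` are all hypothetical.

```python
# Hypothetical registry: maps a string name to a factory function.
REGISTRY = {}


def register(name):
    """Decorator that stores a function in the registry under `name`."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("toy_embed.v1")
def make_embed_settings(attrs, width):
    """Stand-in for a model factory; returns plain settings, not a model."""
    return {"attrs": list(attrs), "width": width}


def resolve(config):
    """Look up the function named in the config and call it
    with the remaining keys as keyword arguments."""
    params = dict(config)
    factory = REGISTRY[params.pop("@factory")]
    return factory(**params)


# The feature list lives in the config, not in hard-coded defaults.
config = {
    "@factory": "toy_embed.v1",
    "attrs": ["NORM", "PREFIX", "SUFFIX", "SHAPE"],
    "width": 96,
}
settings = resolve(config)
```

Because the feature list is carried by the config rather than by defaults buried in the code, changing it in a design like this would not require editing library source, and the settings can be saved alongside the model.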

@adrianeboyd thanks for the info, that's exactly what I wanted to know. In the meantime, I'll find a workaround until v3. Some exciting-looking stuff in that branch.

Is it worth adding something to the train docs, maybe here, that denotes the features defined by cols in the _legacy_tok2vec.py file?

I think we want to make that cols definition adjustable in v3.0. We'll have to revisit a lot of the documentation around the ML models, so documenting the features will be part of that in due time :-)

Thanks @svlandeg ! You guys are great!

