Python version: 2.7.6
Platform: Linux-3.16.0-77-generic-x86_64-with-Ubuntu-14.04-trusty
spaCy version: 2.0.0a17
Models: en, en_core_web_sm, xx_ent_wiki_sm
During training of the model, is a character-level CNN (charCNN) used to capture morphological features from characters?
def train_ner(nlp, train_data, output_dir):
    random.seed(0)
    optimizer = nlp.begin_training(lambda: [])
    nlp.meta['name'] = 'CRIME_LOCATION'
    for itn in range(50):
        losses = {}
        for batch in minibatch(get_gold_parses(nlp.make_doc, train_data), size=3):
            docs, golds = zip(*batch)
            nlp.update(docs, golds, losses=losses, sgd=optimizer, drop=0.35)
        print("under learning")
    if not output_dir:
        return
Yes, spaCy's NER model (like its other models) uses subword features, although it doesn't use a character-based CNN to extract them. Instead, the word vectors are learned by concatenating embeddings of the NORM, PREFIX, SUFFIX and SHAPE lexical attributes. A hidden layer is then used to allow a non-linear combination of the information in these concatenated vectors. The function for this can be found in spacy._ml.Tok2Vec.
The best reference for this embedding strategy is currently the NER algorithm video: https://www.youtube.com/watch?v=sqDHBH9IjRU
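To make the embedding strategy described above more concrete, here is a minimal pure-Python sketch of the idea: derive NORM, PREFIX, SUFFIX and SHAPE for each word, look each one up in its own hashed embedding table, concatenate the four vectors, and pass the result through a hidden layer. This is not spaCy's actual code. The helper names (`word_shape`, `lexical_attrs`, `HashEmbed`, `tok2vec_sketch`), the table sizes, the CRC32 hashing, the ReLU non-linearity and the simplified shape function are all assumptions made for illustration, and the weights are random rather than trained.

```python
import random
import zlib


def word_shape(text, max_run=4):
    """Simplified word shape (an approximation, not spaCy's exact rule):
    letters become x/X, digits become d, and runs are capped at max_run."""
    shape = []
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        # Skip the symbol if the last max_run symbols are already identical to it.
        if shape[-max_run:] == [sym] * max_run:
            continue
        shape.append(sym)
    return "".join(shape)


def lexical_attrs(text):
    """Subword features: lowercase form, first character, last three characters, shape."""
    return {
        "NORM": text.lower(),
        "PREFIX": text[:1],
        "SUFFIX": text[-3:],
        "SHAPE": word_shape(text),
    }


class HashEmbed(object):
    """Hypothetical hashed embedding table: a string key is hashed to a row index."""

    def __init__(self, n_rows, width, seed):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(n_rows)]

    def __call__(self, key):
        row = zlib.crc32(key.encode("utf8")) % self.n_rows
        return self.table[row]


def hidden_layer(vector, weights, bias):
    """Dense layer with ReLU (the choice of non-linearity is an assumption)."""
    return [max(0.0, b + sum(w * x for w, x in zip(row, vector)))
            for row, b in zip(weights, bias)]


def tok2vec_sketch(words, width=8, seed=0):
    """Return one vector per word, mixing the four embedded attributes."""
    attrs = ("NORM", "PREFIX", "SUFFIX", "SHAPE")
    embeds = dict((name, HashEmbed(1000, width, seed + i))
                  for i, name in enumerate(attrs))
    rng = random.Random(seed + 100)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(width * len(attrs))]
               for _ in range(width)]
    bias = [0.0] * width
    vectors = []
    for word in words:
        feats = lexical_attrs(word)
        concat = []
        for name in attrs:
            concat.extend(embeds[name](feats[name]))  # concatenate the 4 embeddings
        vectors.append(hidden_layer(concat, weights, bias))
    return vectors
```

In this sketch, "walking" and "talking" share the same SUFFIX and SHAPE rows, so part of their input to the hidden layer is identical even though their NORM differs. That sharing is roughly how subword information can reach the model without a character-level CNN.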
To add to @honnibal's comment above, there's also a section in the API docs that describes the neural network model architecture in more detail: https://spacy.io/api/#nn-model