I was getting intermittent segmentation faults when training a new entity type, so I thought I'd update spaCy to see if that helped. Unfortunately, now I get a segfault every single time, and not during training but when adding entity types.
Following the spaCy examples/training/train_new_entity_type.py example with the existing model 'en', a segmentation fault occurs when adding a new entity label (ner.add_label(label)).
I've attached the segfault log.
Thanks for the report. Are you able to share the examples you used and/or the labels you're adding? And do you have a reproducible example? Segfaults like this are always tricky to debug, so the more specific examples we have, the better.
The minimal reproducible example is the train_new_entity_type.py example script with the 'en' model loaded, with no other changes. That script adds the 'ANIMAL' entity tag. Note that this particular error is only with the nightly build.
The intermittent segmentation faults I referenced happened with other data on the release build, but that issue has been reported before and is still open: #1969
@iperera do you get the segfault even when just running that example file? It ran fine for me on a Mac using Python 3.7.
Only when specifying an existing model to add to. If I start with a blank model, it runs fine for me.
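To make the distinction above concrete, here's a minimal sketch of the two code paths (blank pipeline vs. pretrained model). This is a hypothetical illustration, not code from the example script: `try_add_label` is a helper name I've made up, it assumes the spaCy 2.x nightly API (`create_pipe`/`add_pipe`), and it's guarded so it just reports an error if spaCy or the `en_core_web_sm` model isn't installed. On the affected nightly build, only the pretrained path segfaulted inside `ner.add_label()`.

```python
# Hypothetical helper contrasting the two paths described above.
# Assumes the spaCy 2.x API; degrades gracefully if spaCy or the
# model is unavailable (a real segfault would still kill the process).
def try_add_label(build_nlp, name):
    try:
        import spacy
        nlp = build_nlp(spacy)
        # Make sure an NER component exists, as in the example script
        if "ner" not in nlp.pipe_names:
            nlp.add_pipe(nlp.create_pipe("ner"))
        nlp.get_pipe("ner").add_label("ANIMAL")
        return name + ": add_label ok"
    except Exception as err:
        return name + ": " + repr(err)


print(try_add_label(lambda spacy: spacy.blank("en"), "blank"))
print(try_add_label(lambda spacy: spacy.load("en_core_web_sm"), "pretrained"))
```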
I also get a segmentation fault using the standard training code when I try to add a label to the NER with ner.add_label("FEATURE"). This is on the latest nightly build.
```python
import random
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding

# LABELS and TRAIN_DATA defined as in the standard training code


def main(model=None, new_model_name='animal', output_dir=None, n_iter=10):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    print(nlp.pipe_names)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe('ner')
    print("Adding labels")
    for label in LABELS:
        print(label)
        ner.add_label(label)  # <- Segfaults here
        print(label)
    print("Beginning training")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        # Note that 'begin_training' initializes the models, so it'll zero out
        # existing entity types.
        optimizer = nlp.entity.create_optimizer()
    # get names of other pipes to disable them during training
    print("Disabling pipes")
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(8., 64., 1.001))
            # print(f'Number of batches: {len(batches)}')
            for batch_num, batch in enumerate(batches):
                texts, annotations = zip(*batch)
                if batch_num % 1000 == 0:
                    print(f"Batch {batch_num}")
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
                           losses=losses)
            print('Losses', losses)
    # test the trained model
    test_text = 'Do you like horses?'
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == '__main__':
    main(model='en_core_web_md', new_model_name="feature",
         output_dir="./new_model", n_iter=1)
```
@nyejon Thanks for the example! I just tested it on the very latest state of develop and can confirm the segfault.
Here's the minimal reproducible version:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("FEATURE")
```
Fixed :tada: 160b55c5729f
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.