Spacy: ner.add_label to existing model causes segmentation fault: 11

Created on 17 Sep 2018 · 8 Comments · Source: explosion/spaCy

I was getting intermittent segmentation faults when training a new entity type, so I updated spaCy to see if that helped. Unfortunately, I now get a segfault every single time, not during training but when adding entity types.

How to reproduce the behaviour

Follow the spaCy/examples/training/train_new_entity_type.py example with the existing model 'en'. Segmentation fault occurs when adding a new entity label (ner.add_label(label)).

Your Environment

  • spaCy version: 2.1.0a1
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.0
  • Models: en

I've attached the segfault log.

segfault.txt

bug feat / ner

All 8 comments

Thanks for the report. Are you able to share the examples you used and/or the labels you're adding? And do you have a reproducible example? Segfaults like this are always tricky to debug, so the more specific examples we have, the better.

The minimal reproducible example is the train_new_entity_type.py example script with the 'en' model loaded, with no other changes. That script adds the 'ANIMAL' entity tag. Note that this particular error is only with the nightly build.

The intermittent segmentation faults I referenced happened with other data on the release build, but that issue has been mentioned in the past and is still open - #1969

@iperera do you get the segfault even when just running that example file? It ran fine for me, on a mac using Python 3.7.

Only when specifying an existing model to add to. If I start with a blank model, it runs fine for me.

I also get a segmentation fault using the standard training code when I try to add a label to the NER with ner.add_label("FEATURE").

This is on the latest nightly build

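import random
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding

# LABELS and TRAIN_DATA are module-level globals that this comment does not show.

def main(model=None, new_model_name='animal', output_dir=None, n_iter=10):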
    """Set up the pipeline and entity recognizer, and train the new entity."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy

    print(nlp.pipe_names)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe('ner')

    print("Adding labels")
    for label in LABELS:
        print(label)
        ner.add_label(label)   # <- Segfaults here
        print(label)

    print("Beginning training")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        # Note that 'begin_training' initializes the models, so it'll zero out
        # existing entity types.
        optimizer = nlp.entity.create_optimizer()

    # get names of other pipes to disable them during training
    print("Disabling pipes")
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(8., 64., 1.001))
            # print(f'Number of batches: {len(batches)}')
            for batch_num, batch in enumerate(batches):
                texts, annotations = zip(*batch)
                if batch_num % 1000 == 0:
                    print(f"Batch {batch_num}")
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
                           losses=losses)
            print('Losses', losses)

    # test the trained model
    test_text = 'Do you like horses?'
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

if __name__ == '__main__':
    main(model='en_core_web_md', new_model_name="feature", output_dir="./new_model", n_iter=1)

@nyejon Thanks for the example! I just tested it on the very latest state of develop and can confirm the segfault.

Here's the minimal reproducible version:

import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("FEATURE")
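To check a given build for this crash without killing the calling session, the snippet above can be run in a child interpreter with Python's standard faulthandler enabled, so that a Python-level traceback is written to stderr if the process segfaults. This is a minimal sketch, assuming a POSIX platform, where a negative return code means the child was killed by a signal. The names run_isolated and died_from_segfault are hypothetical helpers for illustration, not part of spaCy.

```python
import signal
import subprocess
import sys

# The reproduction from this thread, kept as a string so it can be
# executed in a separate interpreter process.
REPRO = '''
import spacy
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("FEATURE")
'''


def run_isolated(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    Hypothetical helper: returns (returncode, stderr) of the child.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


def died_from_segfault(returncode):
    # Assumption: POSIX semantics, where a return code of -N means the
    # child was terminated by signal N.
    return returncode == -signal.SIGSEGV


if __name__ == "__main__":
    rc, err = run_isolated(REPRO)
    print("segfault" if died_from_segfault(rc) else "exit code %d" % rc)
    print(err)
```

Running the reproduction in a subprocess keeps the parent process alive either way, which makes it easier to compare a blank model against a loaded one, or one spaCy build against another.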

Fixed :tada: 160b55c5729f

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
