Hi,
First, I would like to thank you for your great work.
I was wondering whether there is any way to add extra named entity types like 'animal' to the model.
I was looking through the documentation without any success. All I could find is a mention that you can add your own entity recogniser, and that it should accept a doc and label entities. I have also seen https://github.com/honnibal/spaCy/issues/144, but it does not provide any example of how to retrain the model or how to add your own model. I think it would be of much benefit if examples of how to train your own model and/or how to specify your own NER entities (with positive and negative examples) were added to the documentation.
Many thanks,
Jakub
Hey,
All the code for training is there, but the documentation is lacking, and you'll need a substantial amount of training data.
This is the training script that trains the tagger, parser and NER:
https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L82
I agree that there needs to be documentation for this. Sorry for the delay on getting that done.
Hi,
Many thanks for the reply. Will go through the script at the earliest opportunity.
Cheers,
Jakub
As of v0.100, it should be possible to train new classes over the top of the old model. I don't know whether this will actually be nice for accuracy. The API for GoldParse isn't so nice, but for now this should work:
import plac
from spacy.en import English
from spacy.gold import GoldParse


def main(out_loc):
    nlp = English(parser=False)  # Avoid loading the parser, for quick load times
    # Run the tokenizer and tagger (but not the entity recognizer)
    doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
    nlp.tagger(doc)

    nlp.entity.add_label('ANIMAL')  # <-- New in v0.100

    # Create a GoldParse object. This should have a better API...
    indices = tuple(range(len(doc)))
    words = [w.text for w in doc]
    tags = [w.tag_ for w in doc]
    heads = [0 for _ in doc]
    deps = ['' for _ in doc]
    # This is the only part we care about. We want BILOU format
    ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']
    # Create the GoldParse
    annot = GoldParse(doc, (indices, words, tags, heads, deps, ner))

    # Update the weights with the example.
    # Here we iterate until we get it entirely correct. In practice this is
    # probably a bad idea.
    # Note that we've added a class to the existing model here! We "resume"
    # training the previous model. Whether this is good or not I can't say,
    # you'll have to experiment.
    loss = nlp.entity.train(doc, annot)
    i = 0
    while loss != 0 and i < 1000:
        loss = nlp.entity.train(doc, annot)
        i += 1
    print("Used %d iterations" % i)

    nlp.entity(doc)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    nlp.entity.model.dump(out_loc)


if __name__ == '__main__':
    plac.call(main)
$ python examples/add_entity_type.py /tmp/animals.model
Used 2 iterations
(u'Lions', u'ANIMAL')
(u'tigers', u'ANIMAL')
(u'grizzly bears', u'ANIMAL')
Thanks for the discussion. I'm new both to Python and spaCy (and NLP in general), so apologies in advance if I've missed something obvious here, but I did notice that the example provided by @honnibal doesn't work with the latest version of spaCy running under Python 3.5.
1. The example has:
nlp.entity.train(doc, annot)
but that method is no longer available; the code should be
nlp.entity.update(doc, annot)
When I make that change, I get this error:
File "spacy/syntax/parser.pyx", line 247, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:7788)
File "spacy/syntax/ner.pyx", line 93, in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4782)
File "spacy/syntax/ner.pyx", line 112, in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5145)
TypeError: argument of type 'NoneType' is not iterable
which, from an examination of ner.pyx, looks as if the exception is being thrown here:
for i in range(self.n_moves):
I tried passing both the BILOU format and the entity-offset format into the GoldParse constructor (see the sketch at the end of this comment), with the exact same result.
2. There is also an example, train_ner, which offers an alternative way of training the Entity Recognizer. This worked for me except that, crucially, I was unable to modify it to accept my own Entity Type. Here are my relevant modifications (I had to take the code around 'loss' from the other example to make it work, sort of):
def train_ner(nlp, train_data, entity_types):
    ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        loss = nlp.entity.update(doc, gold)
        i = 0
        while loss != 0 and i < 1000:
            loss = nlp.entity.update(doc, gold)
            i += 1
    ner.model.end_training()
    return ner
and
...
nlp = English()
sty = 'DiseaseOrSyndrome'
nlp.entity.add_label(sty)
entity = 'Acute Peptic Ulcer'
train_data = [
    (
        'Acute peptic ulcer NOS',
        [(0, 18, sty)]
    ),
    # ... etc.
]
ner = train_ner(nlp, train_data, [sty])
...
and when feeding in a couple of sentences containing 'Acute Peptic Ulcer', the code:
for ent in parsedDoc.ents:
    print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
prints this to the console:
349 ORG acute peptic ulcer
349 ORG acute peptic ulcer
So why don't I see something like:
nnn DiseaseOrSyndrome acute peptic ulcer
nnn DiseaseOrSyndrome acute peptic ulcer
Again, I may be out of my league here, being new to both Python and spaCy, but any help would be much appreciated!
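For reference, here is what the two GoldParse constructions I refer to in point 1 look like side by side, using the animal sentence from the example above to keep it short. This is only a sketch (the character offsets are mine, and which of the two forms a given spaCy release accepts is exactly what seems to be at issue):

from spacy.en import English
from spacy.gold import GoldParse

nlp = English(parser=False)
doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
nlp.tagger(doc)
nlp.entity.add_label('ANIMAL')

# Form 1: token-level BILOU tags, as in the add_entity_type example above.
indices = tuple(range(len(doc)))
words = [w.text for w in doc]
tags = [w.tag_ for w in doc]
heads = [0 for _ in doc]
deps = ['' for _ in doc]
ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']
gold_bilou = GoldParse(doc, (indices, words, tags, heads, deps, ner))

# Form 2: character-offset entities, as in the train_ner example:
# (start_char, end_char, label) spans over the raw text.
gold_offsets = GoldParse(doc, entities=[(0, 5, 'ANIMAL'),
                                        (10, 16, 'ANIMAL'),
                                        (21, 34, 'ANIMAL')])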
Why do you think you can no longer use:
nlp.entity.train(doc, annot)
Unless I've missed something, this is still present in the most recent version of spaCy. I successfully added new entities and got my results back with a new instantiation of spaCy by using code very similar to @honnibal's.
The simplified code to load the saved model looks like this:
nlp = spacy.load('en')
# Train spaCy with custom data:
# add the custom label(s) again, then load the trained model
nlp.entity.add_label(entlabel)
nlp.entity.model.load('trainingfile.model')
My point is really that you need to add the label again when you re-instantiate spaCy. Simply loading the training file is not enough.
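Wrapped up as a helper, that order looks something like this (just a sketch; the model path and label list are placeholders, not code from my actual project):

import spacy

def load_custom_ner(model_path, custom_labels):
    # Re-instantiate the pipeline first...
    nlp = spacy.load('en')
    # ...then register every custom label again. Loading the persisted
    # weights on their own is not enough.
    for label in custom_labels:
        nlp.entity.add_label(label)
    # Only now load the previously trained entity model.
    nlp.entity.model.load(model_path)
    return nlp

nlp = load_custom_ner('/tmp/animals.model', ['ANIMAL'])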
Thanks for the feedback @DomHudson
nlp.entity.train(doc, annot)
The exact exception stack trace I get with that code, in @honnibal's example above (whether using BILOU or positional training data), is:
File "{redacted}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
    loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'
File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'_
I'll have to admit that my experience of Python is limited (I'm primarily a Java developer), so I may be missing something obvious, but a quick search of the spaCy repository reveals that there is no instance of a train(...) function. I pulled the master branch on 11/16/2016.
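In case it helps anyone else on a different release, a small shim like this (my own sketch, not taken from either example) calls whichever method the installed version actually exposes:

def update_entity_model(nlp, doc, gold):
    # Newer releases expose EntityRecognizer.update(); older ones expose
    # train(). Pick whichever is present rather than assuming one.
    if hasattr(nlp.entity, 'update'):
        return nlp.entity.update(doc, gold)
    if hasattr(nlp.entity, 'train'):
        return nlp.entity.train(doc, gold)
    raise AttributeError('this spaCy version exposes neither update() nor train()')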
So, for example, after training, I run this test inline, using this simple input text file:
The patient has an acute peptic ulcer.
There is no sign of an acute peptic ulcer.

The test code is:
if test_doc is not None:
    nlp = English()
    nlp.entity.add_label(sty)
    nlp.entity.model.load(str(model_dir / 'model'))
    with open(test_doc, 'r') as filein:
        test_doc_str = filein.read()
    parsedDoc = nlp(test_doc_str)
    for word in parsedDoc:
        print(word.text, word.tag_, word.ent_type_, word.ent_iob)
    print('\nResult of spaCy parse (named entities)')
    for ent in parsedDoc.ents:
        print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
    print('\nResult of spaCy parse (noun chunks)')
    for np in parsedDoc.noun_chunks:
        print(np)
and that yields:
The DT 2
patient NN 2
has VBZ 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2
SP 2
There EX 2
is VBZ 2
no DT 2
sign NN 2
of IN 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

Result of spaCy parse (named entities)
1510242 DiseaseOrSyndrome acute peptic ulcer
1510242 DiseaseOrSyndrome acute peptic ulcer

Result of spaCy parse (noun chunks)
The patient
an acute peptic ulcer
no sign
an acute peptic ulcer
Unfortunately, this is not consistent, even when loading the previously persisted model. I was wondering whether, somewhere along the way, I'm failing to explicitly associate the previously generated config.json with the previously trained and persisted model?
For now I'll have to assume that this is due to a lack of an adequate volume of training data. I will post further in a separate thread as soon as I have this stabilized.
Thanks again for your help.
@jewellcj - Looks like you were able to train a model to identify DiseaseOrSyndrome.
I am still not able to make it work.
Will you be able to share your training script? I tried the various things suggested above with no luck.
Thanks.
@BrijeshKaria - yes, we were able to train the model (in prototype/try-out code only), initially focusing on just one entity type. I guess the key piece of code was this:
def train_ner(nlp, train_data):
    for itn in range(10):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp(raw_text)
            nlp.tagger(doc)
            gold = GoldParse(doc, entities=entity_offsets)
            i = 0
            loss = nlp.entity.update(doc, gold)
            while loss != 0 and i < 1000:
                loss = nlp.entity.update(doc, gold)
                i += 1
            nlp.entity(doc)
    nlp.entity.model.end_training()
    return nlp
where we invoke the above function as follows:
...
sty = 'T047:DiseaseOrSyndrome'
nlp.entity.add_label(sty)
train_data = [
    (
        'Acute peptic ulcer NOS',
        [(0, 18, sty)]
    ),
    (
        'Acute peptic ulcer of duodenum',
        [(0, 18, sty)]
    ),
    # ... etc.
]
nlp = train_ner(nlp, train_data)
We sort of abandoned this, however, as to train a spaCy model effectively you need a very large gold-standard corpus.
Instead, we have been focusing on plain old entity matching: we have successfully used the spaCy Matcher with a gazetteer generated from our taxonomy, which allows us to use spaCy to normalize terms and improves the accuracy of our internal taxonomy search function.
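To give a flavour of that approach, here is a rough sketch using PhraseMatcher rather than the exact Matcher/gazetteer setup we ran in practice. The term list is a placeholder, and the PhraseMatcher.add() signature has changed between spaCy releases, so treat it purely as an illustration:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')

# Terms exported from the taxonomy (placeholder list for illustration).
taxonomy_terms = ['acute peptic ulcer', 'acute peptic ulcer of duodenum']

matcher = PhraseMatcher(nlp.vocab)
matcher.add('DiseaseOrSyndrome', [nlp(term) for term in taxonomy_terms])

doc = nlp('The patient has an acute peptic ulcer.')
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)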