Hi,
First, I would like to thank you for your great work.
I was wondering whether there is any way to add extra named entity types like 'animal' to the model.
I was looking through the documentation without any success. All I could find is a mention that you can add your own entity recogniser, and that it should accept a doc and label entities. I have also seen https://github.com/honnibal/spaCy/issues/144, but it does not provide any example of how to retrain the model or how to add your own model. I think it would be of much benefit if examples of how to train your own model and/or how to specify your own NER entities (with positive and negative examples) were added to the documentation.
Many thanks,
Jakub
Hey,
All the code for training is there, but the documentation is lacking, and you'll need a substantial amount of training data.
This is the training script that trains the tagger, parser and NER:
https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L82
I agree that there needs to be documentation for this. Sorry for the delay on getting that done.
Hi,
Many thanks for the reply. Will go through the script at the earliest opportunity.
Cheers,
Jakub
As of v0.100, it should be possible to train new classes over the top of the old model. I don't know whether this will actually be nice for accuracy. The API for GoldParse isn't so nice, but for now this should work:
import plac
from spacy.en import English
from spacy.gold import GoldParse


def main(out_loc):
    nlp = English(parser=False)  # Avoid loading the parser, for quick load times
    # Run the tokenizer and tagger (but not the entity recognizer)
    doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
    nlp.tagger(doc)

    nlp.entity.add_label('ANIMAL')  # <-- New in v0.100

    # Create a GoldParse object. This should have a better API...
    indices = tuple(range(len(doc)))
    words = [w.text for w in doc]
    tags = [w.tag_ for w in doc]
    heads = [0 for _ in doc]
    deps = ['' for _ in doc]
    # This is the only part we care about. We want BILOU format
    ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']
    # Create the GoldParse
    annot = GoldParse(doc, (indices, words, tags, heads, deps, ner))

    # Update the weights with the example.
    # Here we iterate until we get it entirely correct. In practice this is
    # probably a bad idea.
    # Note that we've added a class to the existing model here! We "resume"
    # training the previous model. Whether this is good or not I can't say,
    # you'll have to experiment.
    loss = nlp.entity.train(doc, annot)
    i = 0
    while loss != 0 and i < 1000:
        loss = nlp.entity.train(doc, annot)
        i += 1
    print("Used %d iterations" % i)

    nlp.entity(doc)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    nlp.entity.model.dump(out_loc)


if __name__ == '__main__':
    plac.call(main)
$ python examples/add_entity_type.py /tmp/animals.model
Used 2 iterations
(u'Lions', u'ANIMAL')
(u'tigers', u'ANIMAL')
(u'grizzly bears', u'ANIMAL')
Thanks for the discussion. I'm new both to Python and spaCy (and NLP in general), so apologies in advance if I've missed something obvious here, but I did notice that the example provided by @honnibal doesn't work with the latest version of spaCy running under Python 3.5.
1. The example has:
nlp.entity.train(doc, annot)
but that method is no longer available; the code should be
nlp.entity.update(doc, annot)
When I make that change, I get this error:
File "spacy/syntax/parser.pyx", line 247, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:7788)
File "spacy/syntax/ner.pyx", line 93, in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4782)
File "spacy/syntax/ner.pyx", line 112, in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5145)
TypeError: argument of type 'NoneType' is not iterable
which, from an examination of ner.pyx, looks as if the exception is being thrown here:
for i in range(self.n_moves):
I tried passing both the BILOU format and the entity-offset format into the GoldParse constructor (see the sketch at the end of this comment), with the exact same result.
2. There is also an example, train_ner, which offers an alternative way of training the Entity Recognizer. This worked for me except that, crucially, I was unable to modify it to accept my own Entity Type. Here are my relevant modifications (I had to take the code around 'loss' from the other example to make it work, sort of):
def train_ner(nlp, train_data, entity_types):
    ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        loss = nlp.entity.update(doc, gold)
        i = 0
        while loss != 0 and i < 1000:
            loss = nlp.entity.update(doc, gold)
            i += 1
    ner.model.end_training()
    return ner
and
...
nlp = English()
sty = 'DiseaseOrSyndrome'
nlp.entity.add_label(sty)
entity = 'Acute Peptic Ulcer'
train_data = [
    (
        'Acute peptic ulcer NOS',
        [(0, 18, sty)]
    ),
    # ... etc.
]
ner = train_ner(nlp, train_data, [sty])
...
and when feeding in a couple of sentences containing 'Acute Peptic Ulcer', the code:
for ent in parsedDoc.ents:
    print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
prints this to the console:
349 ORG acute peptic ulcer
349 ORG acute peptic ulcer
So why don't I see something like:
nnn DiseaseOrSyndrome acute peptic ulcer
nnn DiseaseOrSyndrome acute peptic ulcer
Again, I may be out of my league here, being new to both Python and spaCy, but any help would be much appreciated!
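For reference, here is what the two GoldParse constructions I refer to in point 1 look like side by side, using the animal sentence from the example above to keep it short. This is only a sketch (the character offsets are mine, and which of the two forms a given spaCy release accepts is exactly what seems to be at issue):

from spacy.en import English
from spacy.gold import GoldParse

nlp = English(parser=False)
doc = nlp.tokenizer(u'Lions and tigers and grizzly bears!')
nlp.tagger(doc)
nlp.entity.add_label('ANIMAL')

# Form 1: token-level BILOU tags, as in the add_entity_type example above.
indices = tuple(range(len(doc)))
words = [w.text for w in doc]
tags = [w.tag_ for w in doc]
heads = [0 for _ in doc]
deps = ['' for _ in doc]
ner = ['U-ANIMAL', 'O', 'U-ANIMAL', 'O', 'B-ANIMAL', 'L-ANIMAL', 'O']
gold_bilou = GoldParse(doc, (indices, words, tags, heads, deps, ner))

# Form 2: character-offset entities, as in the train_ner example:
# (start_char, end_char, label) spans over the raw text.
gold_offsets = GoldParse(doc, entities=[(0, 5, 'ANIMAL'),
                                        (10, 16, 'ANIMAL'),
                                        (21, 34, 'ANIMAL')])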
Why do you think you can no longer use:
nlp.entity.train(doc, annot)
Unless I've missed something, this is still present in the most recent version of spaCy. I successfully added new entities and got my results back with a new instantiation of spaCy by using code very similar to @honnibal's.
The simplified code to load the saved model looks like this:
nlp = spacy.load('en')
# Train spaCy with custom data:
# add the custom label(s) again, then load the trained model
nlp.entity.add_label(entlabel)
nlp.entity.model.load('trainingfile.model')
My point is really that you need to add the label again when you re-instantiate spaCy. Simply loading the training file is not enough.
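Wrapped up as a helper, that order looks something like this (just a sketch; the model path and label list are placeholders, not code from my actual project):

import spacy

def load_custom_ner(model_path, custom_labels):
    # Re-instantiate the pipeline first...
    nlp = spacy.load('en')
    # ...then register every custom label again. Loading the persisted
    # weights on their own is not enough.
    for label in custom_labels:
        nlp.entity.add_label(label)
    # Only now load the previously trained entity model.
    nlp.entity.model.load(model_path)
    return nlp

nlp = load_custom_ner('/tmp/animals.model', ['ANIMAL'])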
Thanks for the feedback @DomHudson
nlp.entity.train(doc, annot)
The exact exception stack trace I get with that code, in @honnibal's example above (whether using BILOU or positional training data), is:
File "{redacted}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
    loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'
File "{redacted}}/spacy/train_ner_from_taxonomy-with-Bilou.py", line 94, in main
loss = nlp.entity.train(doc, annot)
AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'train'_
I'll have to admit that my experience of Python is limited (I'm primarily a Java developer), so I may be missing something obvious, but a quick search of the spaCy repository reveals that there is no instance of a train(...) function. I pulled the master branch on 11/16/2016.
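In case it helps anyone else on a different release, a small shim like this (my own sketch, not taken from either example) calls whichever method the installed version actually exposes:

def update_entity_model(nlp, doc, gold):
    # Newer releases expose EntityRecognizer.update(); older ones expose
    # train(). Pick whichever is present rather than assuming one.
    if hasattr(nlp.entity, 'update'):
        return nlp.entity.update(doc, gold)
    if hasattr(nlp.entity, 'train'):
        return nlp.entity.train(doc, gold)
    raise AttributeError('this spaCy version exposes neither update() nor train()')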
So, for example, after training, I run this test inline, using this simple input text file:
The patient has an acute peptic ulcer.
There is no sign of an acute peptic ulcer.

The test code is:
if test_doc is not None:
    nlp = English()
    nlp.entity.add_label(sty)
    nlp.entity.model.load(str(model_dir / 'model'))
    with open(test_doc, 'r') as filein:
        test_doc_str = filein.read()
    parsedDoc = nlp(test_doc_str)
    for word in parsedDoc:
        print(word.text, word.tag_, word.ent_type_, word.ent_iob)
    print('\nResult of spaCy parse (named entities)')
    for ent in parsedDoc.ents:
        print(ent.label, ent.label_, ' '.join(t.orth_ for t in ent))
    print('\nResult of spaCy parse (noun chunks)')
    for np in parsedDoc.noun_chunks:
        print(np)
and that yields:
The DT 2
patient NN 2
has VBZ 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2
SP 2
There EX 2
is VBZ 2
no DT 2
sign NN 2
of IN 2
an DT 2
acute JJ DiseaseOrSyndrome 3
peptic JJ DiseaseOrSyndrome 1
ulcer NN DiseaseOrSyndrome 1
. . 2

Result of spaCy parse (named entities)
1510242 DiseaseOrSyndrome acute peptic ulcer
1510242 DiseaseOrSyndrome acute peptic ulcer

Result of spaCy parse (noun chunks)
The patient
an acute peptic ulcer
no sign
an acute peptic ulcer
Unfortunately, this is not consistent, even when loading the previously persisted model. I was wondering whether, somewhere along the way, I'm failing to explicitly associate the previously generated config.json with the previously trained and persisted model?
For now I'll have to assume that this is due to a lack of an adequate volume of training data. I will post further in a separate thread as soon as I have this stabilized.
Thanks again for your help.
@jewellcj - Looks like you were able to train a model to identify DiseaseOrSyndrome.
I am still not able to make it work.
Will you be able to share your training script? I tried the various things suggested above with no luck.
Thanks.
@BrijeshKaria - yes, we were able to train the model (in prototype/try-out code only), initially focusing on just one entity type. I guess the key piece of code was this:
def train_ner(nlp, train_data):
    for itn in range(10):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp(raw_text)
            nlp.tagger(doc)
            gold = GoldParse(doc, entities=entity_offsets)
            i = 0
            loss = nlp.entity.update(doc, gold)
            while loss != 0 and i < 1000:
                loss = nlp.entity.update(doc, gold)
                i += 1
            nlp.entity(doc)
    nlp.entity.model.end_training()
    return nlp
where we invoke the above function as follows:
...
sty = 'T047:DiseaseOrSyndrome'
nlp.entity.add_label(sty)
train_data = [
    (
        'Acute peptic ulcer NOS',
        [(0, 18, sty)]
    ),
    (
        'Acute peptic ulcer of duodenum',
        [(0, 18, sty)]
    ),
    # ... etc.
]
nlp = train_ner(nlp, train_data)
We sort of abandoned this, however, as to train a spaCy model effectively you need a very large gold-standard corpus.
Instead, we have been focusing on plain old entity matching: we have successfully used the spaCy Matcher with a gazetteer generated from our taxonomy, which allows us to use spaCy to normalize terms and improves the accuracy of our internal taxonomy search function.
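To give a flavour of that approach, here is a rough sketch using PhraseMatcher rather than the exact Matcher/gazetteer setup we ran in practice. The term list is a placeholder, and the PhraseMatcher.add() signature has changed between spaCy releases, so treat it purely as an illustration:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')

# Terms exported from the taxonomy (placeholder list for illustration).
taxonomy_terms = ['acute peptic ulcer', 'acute peptic ulcer of duodenum']

matcher = PhraseMatcher(nlp.vocab)
matcher.add('DiseaseOrSyndrome', [nlp(term) for term in taxonomy_terms])

doc = nlp('The patient has an acute peptic ulcer.')
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)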