I trained an NER model with some new entity types. My model's accuracy on a test set is shown below, with per-entity accuracy as well as total accuracy.
{
"ents_f": 84.65116279069768,
"ents_p": 95.28795811518324,
"ents_per_type": {
"label_1": {
"f": 39.47368421052631,
"p": 24.59016393442623,
"r": 100.0
},
"label_2": {
"f": 90.47619047619048,
"p": 90.47619047619048,
"r": 90.47619047619048
},
"label_3": {
"f": 80.00000000000001,
"p": 80.0,
"r": 80.0
},
"label_4": {
"f": 80.0,
"p": 72.72727272727273,
"r": 88.88888888888889
},
"label_5": {
"f": 14.14141414141414,
"p": 8.045977011494253,
"r": 58.333333333333336
},
"label_6": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_7": {
"f": 50.0,
"p": 50.0,
"r": 50.0
}
},
"ents_r": 76.15062761506276,
"las": 0.0,
"tags_acc": 0.0,
"token_acc": 100.0,
"uas": 0.0
}
However, if I evaluate the same model on a slightly different set, which contains the same docs as the previous test set but has a few more annotations for label_4 and label_5 that were missing from some docs in the previous test set, the model gives a 0 f-score for label_3, even though no label_3 annotations were changed in the new test set. The evaluation result on the new test set is:
{
"ents_f": 84.65346534653466,
"ents_p": 95.53072625698324,
"ents_per_type": {
"label_1": {
"f": 32.35294117647059,
"p": 19.298245614035086,
"r": 100.0
},
"label_2": {
"f": 94.44444444444444,
"p": 94.44444444444444,
"r": 94.44444444444444
},
"label_3": {
"f": 0.0,
"p": 0.0,
"r": 0.0
},
"label_4": {
"f": 80.0,
"p": 72.72727272727273,
"r": 88.88888888888889
},
"label_5": {
"f": 14.432989690721651,
"p": 8.13953488372093,
"r": 63.63636363636363
},
"label_6": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_7": {
"f": 50.0,
"p": 50.0,
"r": 50.0
}
},
"ents_r": 76.0,
"las": 0.0,
"tags_acc": 0.0,
"token_acc": 100.0,
"uas": 0.0
}
Also, the model is able to identify label_3 in many docs of the new test set, even though the f-score for label_3 is zero.
Thanks in advance for the help.
It's difficult to tell what may be going on, since it can heavily depend on your data & training loop...
The first thing that comes to mind is that you may have too few samples for label_3 to get reliable performance measures.
What are the numbers of samples for each label, in both the training and test datasets, and in both the previous and current datasets?
I have a total of 95 sample docs, and each doc has multiple entities in it. Both datasets have the same 95 docs, and the entity counts in each dataset are:
label_1: 92 in both datasets
label_2: 54 in both datasets
label_3: 42 in both datasets
label_4: 108 in the previous data and 124 in the current data
label_5: 118 in the previous data and 121 in the current data
label_6: 7 in both datasets
label_7: 24 in both datasets
I tried one more case, where I made a copy of the previous data and then, in the last sample doc of the dataset, added 1 annotation for the entity 'label_4':
label_1: 92 in both datasets
label_2: 54 in both datasets
label_3: 42 in both datasets
label_4: 108 in the previous data and 109 in the new data
label_5: 118 in both datasets
label_6: 7 in both datasets
label_7: 24 in both datasets
But when I evaluate the model on both datasets, the results are as follows; the accuracy of other entity types is affected after adding 1 additional annotation for entity type label_4 (reducing f, p, and r for label_3 by 5 points).
Results on previous data:
{
"ents_f": 84.65116279069768,
"ents_p": 95.28795811518324,
"ents_per_type": {
"label_1": {
"f": 39.47368421052631,
"p": 24.59016393442623,
"r": 100.0
},
"label_2": {
"f": 90.47619047619048,
"p": 90.47619047619048,
"r": 90.47619047619048
},
"label_3": {
"f": 80.00000000000001,
"p": 80.0,
"r": 80.0
},
"label_4": {
"f": 80.0,
"p": 72.72727272727273,
"r": 88.88888888888889
},
"label_5": {
"f": 16.16161616161616,
"p": 9.195402298850574,
"r": 66.66666666666666
},
"label_6": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_7": {
"f": 50.0,
"p": 50.0,
"r": 50.0
}
},
"ents_r": 76.15062761506276,
"las": 0.0,
"tags_acc": 0.0,
"token_acc": 100.0,
"uas": 0.0
}
Results on new data:
{
"ents_f": 84.63356973995272,
"ents_p": 95.2127659574468,
"ents_per_type": {
"label_1": {
"f": 37.83783783783783,
"p": 23.333333333333332,
"r": 100.0
},
"label_2": {
"f": 90.0,
"p": 90.0,
"r": 90.0
},
"label_3": {
"f": 75.0,
"p": 75.0,
"r": 75.0
},
"label_4": {
"f": 80.0,
"p": 72.72727272727273,
"r": 88.88888888888889
},
"label_5": {
"f": 12.12121212121212,
"p": 6.896551724137931,
"r": 50.0
},
"label_6": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_7": {
"f": 50.0,
"p": 50.0,
"r": 50.0
}
},
"ents_r": 76.17021276595744,
"las": 0.0,
"tags_acc": 0.0,
"token_acc": 100.0,
"uas": 0.0
}
My evaluation code looks like:
```
import plac
import json
from pathlib import Path

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer


@plac.annotations(
    model=("Model name", "option", "m", str),
    test_data=("test data", "option", "test_data", Path),
)
def main(model=None, test_data=None):
    nlp = load_model(model)
    with open(str(test_data), "r") as f:
        TEST_DATA = json.load(f)
    results = evaluate(nlp, TEST_DATA)
    print(results)


def load_model(model_dir=None):
    print("Loading from", model_dir)
    nlp = spacy.load(str(model_dir))
    return nlp


def evaluate(ner_model, examples):
    # Score the model's predictions against the gold annotations
    # using spaCy's Scorer.
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot["entities"])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


if __name__ == "__main__":
    plac.call(main)
```
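For reference, this is roughly how I call evaluate() directly; the model path, text and offsets below are just placeholders, but the format matches my test JSON (a list of [text, {"entities": [[start, end, label], ...]}] pairs):
```
import spacy

# Placeholder example in the same (text, annotations) format as my test JSON.
TEST_DATA = [
    (
        "Your total amount is 100.00 and sub total amount is 100.00",
        {"entities": [(21, 27, "amount"), (52, 58, "amount")]},
    ),
]

nlp = spacy.load("/path/to/model")  # placeholder model directory
print(evaluate(nlp, TEST_DATA))
```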
Hmm, so do I understand it correctly that for all these results, you kept the training data AND the model exactly the same? You're only changing the test set?
Can the entities for different labels overlap with respect to their textual spans / offsets in text?
Yes, the training data and the model are exactly the same. I am only changing the test set. No, none of the entities overlap with respect to their offsets in the text in the test/train sets. The model I am using was trained as follows:
I also found another example where I found the per-entity accuracy quite confusing. I trained spaCy's blank 'en' model (and loaded FastText word embeddings before calling begin_training()). After training, I evaluated the model on the training set, which gives me a total f-score, precision and recall of 100, but some per-entity f-scores, precisions and recalls are < 100.
{
"ents_f": 100.0,
"ents_p": 100.0,
"ents_per_type": {
"label_1": {
"f": 73.68421052631578,
"p": 58.333333333333336,
"r": 100.0
},
"label_2": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_3": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_4": {
"f": 60.86956521739131,
"p": 43.75,
"r": 100.0
},
"label_5": {
"f": 50.0,
"p": 42.857142857142854,
"r": 60.0
},
"label_6": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_7": {
"f": 82.75862068965517,
"p": 70.58823529411765,
"r": 100.0
},
"label_8": {
"f": 100.0,
"p": 100.0,
"r": 100.0
},
"label_9": {
"f": 80.0,
"p": 66.66666666666666,
"r": 100.0
}
},
"ents_r": 100.0,
"las": 0.0,
"tags_acc": 0.0,
"token_acc": 100.0,
"uas": 0.0
}
Ok - that certainly looks weird. Will look into this!
Hello, this is the code that performs the computation of the per-entity scores:
https://github.com/explosion/spaCy/blob/master/spacy/scorer.py#L163
Maybe the problem lies in the fact that the training is being resumed? Is it possible that some information about the entity counts is being forgotten?
I think there is a bug in scorer.py at https://github.com/explosion/spaCy/blob/master/spacy/scorer.py#L174-L180.
If the same entity label occurs multiple times in a given doc, this part of the code adds tuple entries to current_ent in such a way that it adds duplicate entries for some annotations. For example, if we have gold data as follows:
[
"Your total amount is 100.00 and sub total amount is 100.00",
{
"entities": [
[21,27,"amount"],
[52, 58, "amount"]
]
}
]
and let's say the model was able to predict both of the amount values correctly. Then:
In the 1st iteration of the loop, the values will be:
current_ent = {'amount': {(('amount', 5, 5),)}}
gold_ent = {'amount': {(('amount', 5, 5), ('amount', 11, 11))}}
In the 2nd iteration of the loop, the values will be:
current_ent = {'amount': {(('amount', 5, 5),), (('amount', 5, 5), ('amount', 11, 11))}}
gold_ent = {'amount': {(('amount', 5, 5), ('amount', 11, 11))}}
Due to the above calculations of current_ent and gold_ent, the value for false positives (self.fp) comes out wrong.
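To make the effect concrete, here is the set arithmetic on those two values (roughly what score_set() then does with them, as far as I can tell):
```
# The two sets that end up being compared for 'amount' after the loop
# (values taken from the trace above; the offsets are dummy values).
cand = {
    (("amount", 5, 5),),                     # added in the 1st iteration
    (("amount", 5, 5), ("amount", 11, 11)),  # added in the 2nd iteration
}
gold = {
    (("amount", 5, 5), ("amount", 11, 11)),
}

tp = len(cand & gold)  # 1
fp = len(cand - gold)  # 1 -> a spurious false positive, even though both
                       #      'amount' predictions were actually correct
fn = len(gold - cand)  # 0

print(tp / (tp + fp))  # 0.5 -> precision for 'amount' drops to 50
```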
I think this is what the code at https://github.com/explosion/spaCy/blob/master/spacy/scorer.py#L174-L180 is supposed to look like:
cand_ents.add((ent.label_, first, last))
for x in cand_ents:
    if x[0] == ent.label_:
        current_ent[ent.label_].add(x)
for x in gold_ents:
    if x[0] == ent.label_:
        current_gold[ent.label_].add(x)
Note: values like 5 or 11 in the above examples are just dummy values.
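To illustrate the difference, with the flat (label, start, end) sets that this version builds, the same example is scored cleanly (again just the set arithmetic, with the same dummy values):
```
# With flat (label, start, end) tuples per label, the candidate and gold
# sets line up and no spurious false positive is counted.
cand = {("amount", 5, 5), ("amount", 11, 11)}
gold = {("amount", 5, 5), ("amount", 11, 11)}

tp = len(cand & gold)  # 2
fp = len(cand - gold)  # 0
fn = len(gold - cand)  # 0

print(tp / (tp + fp))  # 1.0 -> precision 100 for 'amount', as expected
```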
Do you think you could try and write that up & test it in a pull request? That would be perfect :-)
Okay, I will do that.
Hi @FallakAsad @svlandeg
May I know if this is a bug in the entity Scorer?
@FallakAsad Referring to this comment of yours -
I also found 1 another example where I found per entity accuracy quite confusing.
I trained a spacy's blank model 'en' (and loaded FastText word embeddings before calling begin_training()).
After training, I evaluated the model on training set which gives me accuracy where
total f-score, precision and recall is 100, but some per entity f-score, precision and recall is < 100.
If the overall model score is 100, each of the entity scores should also be 100, right?
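Just to spell out the arithmetic behind that expectation: since every predicted entity has exactly one label, the overall tp/fp counts should simply be sums of the per-type counts, so an overall precision of 100 leaves no room for any per-type precision below 100. A small sketch with made-up counts:
```
# Made-up per-type counts, just to illustrate the aggregation.
per_type = {"label_1": {"tp": 12, "fp": 0}, "label_4": {"tp": 7, "fp": 0}}

total_tp = sum(c["tp"] for c in per_type.values())
total_fp = sum(c["fp"] for c in per_type.values())

# Overall precision of 100 means total_fp == 0; since each per-type fp is
# non-negative, every per-type fp must then be 0 as well, so per-type
# precision should also be 100 wherever that label is predicted at all.
print(100 * total_tp / (total_tp + total_fp))  # 100.0
```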
I trained an NER model using spaCy on 6 custom entities with 5000 samples and later tested it on 500 samples. The results I got didn't make sense to me. Here is the result:
{"uas": 0.0, "las": 0.0, "ents_p": 93.62838106164233, "ents_r": 93.95728476332452, "ents_f":93.79254457050243,
"ents_per_type": {
"ENTITY1": {"p": 6.467595956926736, "r": 54.51002227171492, "f": 11.563219748420247},
"ENTITY2": {"p": 6.272470243289469, "r": 49.219391947411665, "f": 11.126934984520123},
"ENTITY3": {"p": 18.741109530583213, "r": 85.02742820264602, "f": 30.712745497989392},
"ENTITY4": {"p": 13.413228854574788, "r": 70.58823529411765, "f": 22.54284884283916},
"ENTITY5": {"p": 19.481765834932823, "r": 82.85714285714286, "f": 31.546231546231546},
"ENTITY6": {"p": 24.822695035460992, "r": 64.02439024390245, "f": 35.77512776831346}},
"tags_acc": 0.0, "token_acc": 100.0}
According to the above result, the overall F-score of my model (ents_f) is 93.79; however, when I check the individual entity scores, the F-score is quite low for all of them. The same is true for precision as well. How is this possible?
Here is the Stack Overflow question I posted for this issue, in case you want more info.