spaCy: Entity extracted at evaluation doesn't show up when using the imported model

Created on 24 Aug 2020 · 17 comments · Source: explosion/spaCy

Hi,

Trained a custom NER model on our own labelled dataset on financial risk. Did pretraining as well, and the model finished with a score of 97.48.

Training pipeline: ner
Starting with blank model 'en'
2511 training docs
267 evaluation docs

============================== Vocab & Vectors ==============================
ℹ 101601 total words in the data (12951 unique)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 1 new label, 0 existing labels
0 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

When evaluating the model on the dev set, entities got picked up just fine, but there is one entity, US Treasury Department’s Office of Foreign Assets Control (at least one I've noticed), that doesn't show up in the same sentences when importing and testing the best-model in a notebook:

(Three screenshots from 24 Aug 2020 attached.)

Then I ran a test on every single sentence (150) containing the missing entity; 3 returned it partially as Department’s Office of Foreign Assets Control, but nothing more.

There are quite a few other entities, such as Department of Justice (130), Department of State (70), and US Department of the Treasury (40), which contain similar wording; could these potentially conflict with the missing entity US Treasury Department’s Office of Foreign Assets Control? However, this still wouldn't explain why the entity is present in the evaluation sample but missing in production.

By the way, there's a permutation of the missing entity, US Treasury’s Office of Foreign Assets Control, which pops up perfectly in any tested sentence, which puzzles me even more.

Using the latest version of spaCy.
Thanks.

Labels: feat/ner, feat/serialize, resolved, training

All 17 comments

I could imagine the NER model having trouble with longer entities, where it might think that a subset of tokens is an entity instead. However, you should definitely get the same results on the same sentence, for the same model before and after IO.

Just to double check for my understanding: you trained a model, applied it to some dev set for evaluation, got 97.48%, then you stored the same model and applied it again to the same set? And you're saying that the predictions on the same sentences vary between the first and second time?

When you test it the first time on the dev set, are you actually using the best-model and not another model (like the latest one)? Could you share some of the code showing that the same sentence gets different predictions before and after IO, and perhaps also the code/log for writing and reading the model to/from file?

Sorry for the many questions, just trying to get to the bottom of this ;-)

Yes, I have prepared the training data to mirror the format of the example set: ner-token-per-line.json and split the data into train.json and dev.json (exactly as in the documentation).

Then used the command line interface to start training:
python3 -m spacy train en ./model train.json dev.json -ne 5 -n 100 -p ner

After this I used the evaluate command to check the results:
python3 -m spacy evaluate ./model/model-best dev.json -dp ./displacy

details:

Time      0.58 s
Words     10575 
Words/s   18355 
TOK       100.00
POS       0.00  
UAS       0.00  
LAS       0.00  
NER P     96.57 
NER R     98.73 
NER F     97.64 
Textcat   0.00  

When viewing the displaCy visualizations of the samples the entity US Treasury Department’s Office of Foreign Assets Control is highlighted.

As a next step I do nlp = spacy.load("model/model-best") to load the model in a blank notebook and pick random documents from the database which were used to train the model.

However the entity US Treasury Department’s Office of Foreign Assets Control doesn't get extracted in any of the documents, not even in the copy-pasted sentence from the displaCy visualization.

Code from the notebook:

import spacy
import pandas as pd

nlp = spacy.load("model/model-best")

response = ...  # API request for the stories (omitted)
stories = pd.DataFrame(response.json()['stories'])

text = "The US Treasury Department’s Office of Foreign Assets Control (OFAC) designated on 14 October two Turkish ministries and three senior Turkish government officials in response to Turkey’s military action in Syria, pursuant to the US President’s Executive Order of 14 October."
doc = nlp(text or bodytext)  # bodytext: presumably the document text taken from the stories DataFrame

for ent in doc.ents:
    if ent.label_ == 'ORG':
        print('entity: ' + ent.text, ent.label_)

print('\n' + doc.text)
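
As a quick cross-check, the loaded model's own predictions can also be rendered with displacy and compared against the evaluate -dp output. This is a minimal sketch, assuming the model path and the sentence from above; displacy.render with style="ent" is standard spaCy API, while the output file name predicted_entities.html is made up for illustration:

import spacy
from spacy import displacy

# Render the loaded model's entity predictions as a standalone HTML page,
# so whatever gets highlighted is guaranteed to come from model-best itself.
nlp = spacy.load("model/model-best")
doc = nlp("The US Treasury Department’s Office of Foreign Assets Control (OFAC) designated on 14 October two Turkish ministries and three senior Turkish government officials in response to Turkey’s military action in Syria, pursuant to the US President’s Executive Order of 14 October.")
html = displacy.render(doc, style="ent", page=True)  # returns the HTML markup as a string
with open("predicted_entities.html", "w", encoding="utf8") as f:
    f.write(html)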

Here's the dev.json file which is a fraction (0.2%) of the training data, in case it might help. If you run a search for Department’s it will jump to the I-ORG tag so you can verify the BILUO syntax:

dev.json.zip

“When viewing the displaCy visualizations of the samples the entity US Treasury Department’s Office of Foreign Assets Control is highlighted.”

How are you running this visualization? Are you visualizing instances from the dev set, or predictions on the dev set?

I just run python3 -m spacy evaluate ./model/model-best dev.json -dp ./displacy, then navigate to the ./displacy folder and open entities.html to see the entities highlighted. The dev.json is the same one I used during training.

The evaluate command, when run with -dp, actually shows examples from the gold annotations in the dev set - it doesn't show the model's predictions on the dev set.

This explains everything. Finally I can stop debugging :D

I think I would have also expected this code to visualize predictions, not gold annotations?

I was about to say the exact same thing! Reading through the description of evaluate and executing it, I would also expect it to visualize predictions, as opposed to displaying samples of the gold annotations.

Thanks.

Hi @vedtam: Sorry, I think I was wrong (I deleted my last comment to avoid further confusion, but I think the damage was already done ;-)). It does in fact visualise predictions. But it makes no sense to me that the same model from file would generate one prediction when running the visualization and another when you call it from code: the model is being loaded in the same way from file, and I can't imagine what else changed.

Such a pity, I thought I'd figured out this black box. I've spent a good day trying to find the answer to this behaviour and shared all the details around it, but no matter how I train the model or twist the training set, the result is the same: same model, different results. I'll take a fresh look at it in the morning; hopefully the culprit is on my end.

Thanks!

Can you share the model you trained and which exhibits this behaviour? (you can send it over email privately if you prefer)

An email would be nice, and thanks for taking the time for looking into it. :)

You can find my contact info here: https://explosion.ai/about

Hi, just wanted to know if you got my email. Thanks.

Yes, I received it, thanks!

Sorry for the delay. Is this still an issue on your end? I can try to have a look next week.

Actually it is, and so far I haven't found an answer. It would be great to see why the displaCy visualisation after evaluation picks up the above entity but it's missing from the production results.

Hi @vedtam, I think the culprit here is a difference between the tokenization in your training data and the tokenizer included in your final model. When we're reading in your dev data, we're using the tokens as defined in the data:

              {
                "orth": "On",
                "tag": "-",
                "ner": "O"
              },
              {
                "orth": "30",
                "tag": "-",
                "ner": "O"
              },
              {
                "orth": "October,",
                "tag": "-",
                "ner": "O"
              },
              {
                "orth": "the",
                "tag": "-",
                "ner": "O"
              },
              {
                "orth": "US",
                "tag": "-",
                "ner": "B-ORG"
              },
              {
                "orth": "Treasury",
                "tag": "-",
                "ner": "I-ORG"
              },
              {
                "orth": "Department's",
                "tag": "-",
                "ner": "I-ORG"
              },
              {
                "orth": "Office",
                "tag": "-",
                "ner": "I-ORG"
              },

And this "gold" tokenization will be used when you're running the training loop and the evaluation. However, when you just copy-paste the sentence as a whole and run that through the model, its default tokenizer will be used. In spaCy, punctuation symbols such as "," will be usually split off from the words, so for instance October, would become two tokens instead of one. Similarly, the default tokenizer will split up Department’s into Department and ’s. This also means that the NER receives differently tokenized texts in comparison to how it was trained, which results in some errors as you've experienced.

You can verify all this by running your model on the full sentence first, which returns no entities, and then running it after explicitly creating the tokens in a similar fashion to how the tokenization works in the training data. In the latter case, it'll find the entities:

    nlp = spacy.load("model-best")

    text = "On 30 October, the US Treasury Department’s Office of Foreign Assets Control ORG (OFAC) announced that it has quit."
    doc = nlp(text)

    print("ents", doc.ents)
    print([token.text for token in doc])
    print("******************")

    words = ["On", "30", "October,", "the", "US", "Treasury", "Department’s", "Office", "of", "Foreign", "Assets", "Control", "(OFAC)", "announced", "that", "it", "has", "quit."]
    doc = Doc(nlp.vocab, words=words)
    for name, pipe in nlp.pipeline:
        doc = pipe(doc)

    print("ents", doc.ents)
    print([token.text for token in doc])

This outputs:

ents ()
['On', '30', 'October', ',', 'the', 'US', 'Treasury', 'Department', '’s', 'Office', 'of', 'Foreign', 'Assets', 'Control', 'ORG', '(', 'OFAC', ')', 'announced', 'that', 'it', 'has', 'quit', '.']


ents (US Treasury Department’s Office of Foreign Assets Control,)
['On', '30', 'October,', 'the', 'US', 'Treasury', 'Department’s', 'Office', 'of', 'Foreign', 'Assets', 'Control', '(OFAC)', 'announced', 'that', 'it', 'has', 'quit.']

So, long story short, it makes sense that you're experiencing issues with specific entity strings, as their tokenization may be different in training vs. prediction. To avoid this, it would be better to ensure the tokenization of your training data matches how the final model will tokenize its texts. For more on customizing the tokenizer, see https://spacy.io/usage/linguistic-features#tokenization. Hope this helps!
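
One way to keep the two aligned is sketched below; it is not from the original thread and assumes the annotations are available as character offsets. It uses spaCy v2's biluo_tags_from_offsets to derive per-token tags from the model's own tokenization, so a '-' tag in the output would flag a span that doesn't line up with the token boundaries:

    import spacy
    from spacy.gold import biluo_tags_from_offsets  # spaCy v2.x

    # Tokenize the raw text with the same blank 'en' pipeline used for training,
    # then derive per-token BILUO tags from character-offset annotations.
    nlp = spacy.blank("en")

    text = "On 30 October, the US Treasury Department’s Office of Foreign Assets Control (OFAC) announced sanctions."
    entity = "US Treasury Department’s Office of Foreign Assets Control"
    start = text.index(entity)
    entities = [(start, start + len(entity), "ORG")]

    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, entities)  # '-' marks a span misaligned with the tokens
    for token, tag in zip(doc, tags):
        print(token.text, tag)

Writing the training JSON from tokens produced this way means the NER model sees the same token boundaries at training time as it will at prediction time.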

Hi @svlandeg, I appreciate your help so much! Yes, this makes sense and will help us prepare our data in a way that's consistent from training to production. Can't wait to dive in and do some refactoring.

Thanks again!
