Hello,
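I am using the bert-base-cased model to predict named entities for a bunch of sentences (around 29 900). I am facing 3 main issues: entities that are wrongly grouped (for example a group whose word starts with '##'), [UNK] appearing in the word field, and word pieces missing from the predicted entities.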
Model I am using (Bert, XLNet ...): Bert (dbmdz/bert-large-cased-finetuned-conll03-english)
Language I am using the model on (English, Chinese ...): English
I didn't find the official example for this so I made my own script with the TokenClassificationPipeline :
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import TokenClassificationPipeline
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
nlp_not_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=False
)
nlp_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=True
)
seq1 = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
seq2 = "In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification ."
seq3 = "Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % "\
"on a reported basis and 10 . 4 % on a like - for - like basis ."
seq4 = "To prepare as best as possible the decisions falling under its responsibilities , Faurecia Ć¢ā¬ā¢ s Board of"\
" Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation"\
" Committee ."
sequences = [seq1, seq2, seq3, seq4]
for i, seq in enumerate(sequences):
    ngrouped, grouped = nlp_not_grouped(seq), nlp_grouped(seq)
    print(f"===================== sentence n°{i+1}")
    print("---Sentence---")
    print(seq)
    print("---Not grouped entities---")
    for ngent in ngrouped:
        print(ngent)
    print("---Grouped entities---")
    for gent in grouped:
        print(gent)
I have about 29 900 sentences. For each sentence I want to predict all the named entities in it and then locate them in the sentence. Once I have an entity, I use a regex to find it in the original sentence (before the tokenization step) like this :
start, stop = re.search(re.escape(ent['word']), sent).span()
Here ent['word'] is the text of an entity found in a sentence, for instance "London" for the sentence (sent) "London is really a great city". I do this with the grouped entities, but since some of them contain errors, many are discarded: re.search() finds no match, so the call to .span() raises an exception (which I catch).
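To make the lookup step concrete, here is a minimal sketch of locating an entity without relying on the exception. The helper name locate_entity is hypothetical (it is not part of transformers); it only wraps the re.search() call shown above and returns None when the entity text does not occur in the sentence.

```python
import re


def locate_entity(word, sentence):
    """Return (start, stop) of `word` in `sentence`, or None when it is absent."""
    match = re.search(re.escape(word), sentence)
    # re.search() returns None when nothing matches; calling .span() on
    # that None is what raises in the one-liner above.
    return match.span() if match else None


print(locate_entity("London", "London is really a great city"))
print(locate_entity("##S16", "the acquisition of ISO / TS16949 certification ."))
```

A word such as '##S16' can never be found this way, because the '##' prefix only exists in the tokenizer output and not in the original sentence.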
Steps to reproduce the behavior:
You just have to run my script to predict the entities for the four sentences. Here is what I get :
===================== sentence n°1
---Sentence---
Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore veryclose to the Manhattan Bridge.
---Not grouped entities---
{'word': 'Hu', 'score': 0.9995108246803284, 'entity': 'I-ORG', 'index': 1}
{'word': '##gging', 'score': 0.989597499370575, 'entity': 'I-ORG', 'index': 2}
{'word': 'Face', 'score': 0.9979704022407532, 'entity': 'I-ORG', 'index': 3}
{'word': 'Inc', 'score': 0.9993758797645569, 'entity': 'I-ORG', 'index': 4}
{'word': 'New', 'score': 0.9993405938148499, 'entity': 'I-LOC', 'index': 11}
{'word': 'York', 'score': 0.9991927742958069, 'entity': 'I-LOC', 'index': 12}
{'word': 'City', 'score': 0.9993411302566528, 'entity': 'I-LOC', 'index': 13}
{'word': 'D', 'score': 0.986336350440979, 'entity': 'I-LOC', 'index': 19}
{'word': '##UM', 'score': 0.9396238923072815, 'entity': 'I-LOC', 'index': 20}
{'word': '##BO', 'score': 0.9121386408805847, 'entity': 'I-LOC', 'index': 21}
{'word': 'Manhattan', 'score': 0.9839190244674683, 'entity': 'I-LOC', 'index': 29}
{'word': 'Bridge', 'score': 0.9924242496490479, 'entity': 'I-LOC', 'index': 30}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9966136515140533, 'word': 'Hugging Face Inc'}
{'entity_group': 'I-LOC', 'score': 0.9992914994557699, 'word': 'New York City'}
{'entity_group': 'I-LOC', 'score': 0.9460329612096151, 'word': 'DUMBO'}
{'entity_group': 'I-LOC', 'score': 0.9881716370582581, 'word': 'Manhattan Bridge'}
===================== sentence n°2
---Sentence---
In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification .
---Not grouped entities---
{'word': 'B', 'score': 0.9997261762619019, 'entity': 'I-ORG', 'index': 5}
{'word': '##la', 'score': 0.997683048248291, 'entity': 'I-ORG', 'index': 6}
{'word': '##bla', 'score': 0.99888014793396, 'entity': 'I-ORG', 'index': 7}
{'word': 'Group', 'score': 0.9992784261703491, 'entity': 'I-ORG', 'index': 8}
{'word': 'ISO', 'score': 0.9711909890174866, 'entity': 'I-MISC', 'index': 14}
{'word': 'T', 'score': 0.6591967344284058, 'entity': 'I-ORG', 'index': 16}
{'word': '##S', 'score': 0.658642053604126, 'entity': 'I-MISC', 'index': 17}
{'word': '##16', 'score': 0.5059574842453003, 'entity': 'I-MISC', 'index': 18}
{'word': '##9', 'score': 0.5067382454872131, 'entity': 'I-MISC', 'index': 21}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9988919496536255, 'word': 'Blabla Group'}
{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO'}
{'entity_group': 'I-ORG', 'score': 0.6591967344284058, 'word': 'T'}
{'entity_group': 'I-MISC', 'score': 0.5822997689247131, 'word': '##S16'}
===================== sentence n°3
---Sentence---
Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % on a reported basis and 10 . 4 % on a like - for - like basis .
---Not grouped entities---
{'word': 'PS', 'score': 0.9970256686210632, 'entity': 'I-ORG', 'index': 5}
{'word': '##A', 'score': 0.9927457571029663, 'entity': 'I-ORG', 'index': 6}
{'word': 'P', 'score': 0.9980151653289795, 'entity': 'I-ORG', 'index': 7}
{'word': '##eu', 'score': 0.9897757768630981, 'entity': 'I-ORG', 'index': 8}
{'word': '##ge', 'score': 0.996147871017456, 'entity': 'I-ORG', 'index': 9}
{'word': '##ot', 'score': 0.9928787350654602, 'entity': 'I-ORG', 'index': 10}
{'word': '[UNK]', 'score': 0.5744695067405701, 'entity': 'I-ORG', 'index': 11}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot [UNK]'}
===================== sentence n°4
---Sentence---
To prepare as best as possible the decisions falling under its responsibilities , Faurecia Ć¢ā¬ā¢ s Board of Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation Committee .
---Not grouped entities---
{'word': 'F', 'score': 0.9983997941017151, 'entity': 'I-ORG', 'index': 14}
{'word': '##au', 'score': 0.9473735690116882, 'entity': 'I-ORG', 'index': 15}
{'word': '##re', 'score': 0.9604568481445312, 'entity': 'I-ORG', 'index': 16}
{'word': '##cia', 'score': 0.992807149887085, 'entity': 'I-ORG', 'index': 17}
{'word': 'Board', 'score': 0.8452167510986328, 'entity': 'I-ORG', 'index': 20}
{'word': 'of', 'score': 0.5921975374221802, 'entity': 'I-ORG', 'index': 21}
{'word': 'Directors', 'score': 0.6778028607368469, 'entity': 'I-ORG', 'index': 22}
{'word': 'Audi', 'score': 0.9764850735664368, 'entity': 'I-ORG', 'index': 30}
{'word': '##t', 'score': 0.9692177772521973, 'entity': 'I-ORG', 'index': 31}
{'word': 'Committee', 'score': 0.9959701299667358, 'entity': 'I-ORG', 'index': 32}
{'word': 'Strategy', 'score': 0.9705951809883118, 'entity': 'I-ORG', 'index': 35}
{'word': 'Committee', 'score': 0.994032621383667, 'entity': 'I-ORG', 'index': 36}
{'word': 'A', 'score': 0.9764854907989502, 'entity': 'I-ORG', 'index': 39}
{'word': '##oint', 'score': 0.7803319692611694, 'entity': 'I-ORG', 'index': 41}
{'word': '##ments', 'score': 0.7828453779220581, 'entity': 'I-ORG', 'index': 42}
{'word': 'and', 'score': 0.9625542163848877, 'entity': 'I-ORG', 'index': 43}
{'word': 'Co', 'score': 0.9904180765151978, 'entity': 'I-ORG', 'index': 44}
{'word': '##mp', 'score': 0.9140805602073669, 'entity': 'I-ORG', 'index': 45}
{'word': '##ens', 'score': 0.8661588430404663, 'entity': 'I-ORG', 'index': 46}
{'word': '##ation', 'score': 0.9150537252426147, 'entity': 'I-ORG', 'index': 47}
{'word': 'Committee', 'score': 0.9888517260551453, 'entity': 'I-ORG', 'index': 48}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9747593402862549, 'word': 'Faurecia'}
{'entity_group': 'I-ORG', 'score': 0.7050723830858866, 'word': 'Board of Directors'}
{'entity_group': 'I-ORG', 'score': 0.9805576602617899, 'word': 'Audit Committee'}
{'entity_group': 'I-ORG', 'score': 0.9823139011859894, 'word': 'Strategy Committee'}
{'entity_group': 'I-ORG', 'score': 0.9764854907989502, 'word': 'A'}
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': '##ointments and Compensation Committee'}
For the first sentence (seq1) everything is fine. It's the example of the NER section under Usage section of the documentation : https://huggingface.co/transformers/usage.html#named-entity-recognition
With the other sentences we can see one example of each problem :
{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO'}
{'entity_group': 'I-ORG', 'score': 0.6591967344284058, 'word': 'T'}
{'entity_group': 'I-MISC', 'score': 0.5822997689247131, 'word': '##S16'}
In seq 2, there is '##S16' as a word. Obviously, it should have been grouped with the preceding entity to form 'TS16', or maybe even 'ISO / TS16949', like this :
{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO / TS16949'}
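As an illustration of the grouping I would expect, here is a rough post-processing sketch that glues any group whose word starts with '##' onto the previous group. The function merge_subword_groups is hypothetical and not part of the pipeline; keeping the first group's label and the lower of the two scores is my own assumption, not something the library does.

```python
def merge_subword_groups(groups):
    """Glue groups whose word starts with '##' onto the previous group."""
    merged = []
    for group in groups:
        if group["word"].startswith("##") and merged:
            prev = merged[-1]
            # Drop the '##' marker and append the piece to the previous word.
            prev["word"] += group["word"][2:]
            # Assumption: keep the first group's label and the lower score.
            prev["score"] = min(prev["score"], group["score"])
        else:
            # Copy so the caller's dictionaries are left untouched.
            merged.append(dict(group))
    return merged


groups = [
    {"entity_group": "I-MISC", "score": 0.9711909890174866, "word": "ISO"},
    {"entity_group": "I-ORG", "score": 0.6591967344284058, "word": "T"},
    {"entity_group": "I-MISC", "score": 0.5822997689247131, "word": "##S16"},
]
for group in merge_subword_groups(groups):
    print(group)
```

This only yields 'TS16': the remaining pieces of 'TS16949' are not all present in the pipeline output, so no post-processing of the groups alone can recover them.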
In seq 3, there is [UNK] in the word field :
{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot [UNK]'}
Probably because of the badly encoded CitroĆĀ«n, which stands for Citroën, the entity found is 'PSA Peugeot [UNK]'. In this case it would be better to just return 'PSA Peugeot' if the last token is identified as [UNK] :
{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot'}
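A minimal sketch of that clean-up, applied to a grouped entity after the pipeline has run. The helper strip_unk is hypothetical; it leaves the score unchanged, which is an assumption on my side (the score could also be recomputed without the [UNK] token).

```python
def strip_unk(group, unk_token="[UNK]"):
    """Return a copy of `group` with unknown tokens removed from its word."""
    words = [w for w in group["word"].split() if w != unk_token]
    # Assumption: the score is kept as is, not recomputed.
    return {**group, "word": " ".join(words)}


group = {"entity_group": "I-ORG", "score": 0.934436925819942, "word": "PSA Peugeot [UNK]"}
print(strip_unk(group))
```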
For the last sentence we can see that 'Appointments and Compensation Committee' has been split into :
{'entity_group': 'I-ORG', 'score': 0.9764854907989502, 'word': 'A'}
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': '##ointments and Compensation Committee'}
instead of :
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': 'Appointments and Compensation Committee'}
The entity is not well grouped, but more importantly the 'pp' is missing, so even if we decided to merge the two groups we wouldn't get the real entity. This problem was first raised here : #4816. I actually encountered this problem while trying to fix the first one : I noticed that some grouped entities, like this one, are missing some syllables. The pipeline with grouped_entities=False has already lost the 'pp' :
{'word': 'A', 'score': 0.9764854907989502, 'entity': 'I-ORG', 'index': 39}
{'word': '##oint', 'score': 0.7803319692611694, 'entity': 'I-ORG', 'index': 41}
{'word': '##ments', 'score': 0.7828453779220581, 'entity': 'I-ORG', 'index': 42}
It seems the way the pipeline blends the tokens is not right, because when I predict the label for each token with the code example from the documentation, I get this :
[('[CLS]', 'O'), ('To', 'O'), ('prepare', 'O'), ('as', 'O'), ('best', 'O'), ('as', 'I-ORG'), ('possible', 'I-ORG'), ('the', 'I-ORG'), ('decisions', 'I-ORG'), ('falling', 'I-ORG'), ('under', 'I-ORG'), ('its', 'I-ORG'), ('responsibilities', 'O'), (',', 'O'), ('F', 'O'), ('##au', 'O'), ('##re', 'O'), ('##cia', 'O'), ('[UNK]', 'O'), ('s', 'O'), ('Board', 'O'), ('of', 'O'), ('Directors', 'O'), ('has', 'O'), ('set', 'O'), ('up', 'O'), ('three', 'O'), ('committees', 'O'), (':', 'O'), ('c', 'O'), ('Audi', 'O'), ('##t', 'O'), ('Committee', 'O'), (';', 'O'), ('c', 'O'), ('Strategy', 'O'), ('Committee', 'O'), (';', 'O'), ('c', 'O'), ('A', 'O'), ('##pp', 'O'), ('##oint', 'O'), ('##ments', 'O'), ('and', 'O'), ('Co', 'O'), ('##mp', 'O'), ('##ens', 'O'), ('##ation', 'O'), ('Committee', 'O'), ('.', 'O'), ('[SEP]', 'O')]
There are those tokens :
('A', 'O'), ('##pp', 'O'), ('##oint', 'O'), ('##ments', 'O') for 'Appointments'
transformers version: 2.11.0
EDIT : Typos
@Nighthyst thanks for summarizing these issues as I also ran into them.
I was digging on this last weekend and I think maybe this could help:
https://github.com/huggingface/tokenizers/pull/200
Provide some more mappings on the Encoding in order to easily identify words after tokenization.
It also exposes a method encode_tokenized on the BaseTokenizer to allow skipping the usual Normalizer and PreTokenizer.
This is especially useful for NER like datasets, where the pre-tokenization has already been done, and we want to attribute labels to pre-tokenized words.
Thanks for bringing this up. I can work on this on a separate PR after merging the PR that resolves the prior issue #4816.
Some interesting findings:
Using a fast tokenizer solves the [UNK] issue. Using one of your provided examples:
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
nlp = TokenClassificationPipeline(model=model,
                                  tokenizer=tokenizer,
                                  grouped_entities=False)
t="Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % on a reported basis and 10 . 4 % on a like - for - like basis ."
nlp(t)
[{'word': 'PS', 'score': 0.9961145520210266, 'entity': 'I-ORG', 'index': 5},
{'word': '##A', 'score': 0.9905584454536438, 'entity': 'I-ORG', 'index': 6},
{'word': 'P', 'score': 0.997616708278656, 'entity': 'I-ORG', 'index': 7},
{'word': '##eu', 'score': 0.9741767644882202, 'entity': 'I-ORG', 'index': 8},
{'word': '##ge', 'score': 0.9928027391433716, 'entity': 'I-ORG', 'index': 9},
{'word': '##ot', 'score': 0.9900722503662109, 'entity': 'I-ORG', 'index': 10},
{'word': 'C', 'score': 0.9574489593505859, 'entity': 'I-ORG', 'index': 11},
{'word': '##it', 'score': 0.824583113193512, 'entity': 'I-ORG', 'index': 12},
{'word': '##ro', 'score': 0.7597800493240356, 'entity': 'I-ORG', 'index': 13},
{'word': '##A', 'score': 0.953075647354126, 'entity': 'I-ORG', 'index': 14},
{'word': 'Ā«', 'score': 0.6135829091072083, 'entity': 'I-ORG', 'index': 15}]
@Nighthyst @dav009 Can you guys check if the above issues still persist after the recent PR merged (#4987)?
Hello @enzoampil,
I updated transformers with master, with the command:
pip install --upgrade git+https://github.com/huggingface/transformers.git
Then I tried your tests and mine:
from transformers import pipeline
NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))
t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
nlp_ner(t)
I have the expected output :
[{'entity_group': 'B-PER',
'score': 0.9710702555520194,
'word': 'Consuelo Araújo Noguera'},
{'entity_group': 'B-PER',
'score': 0.9997273534536362,
'word': 'Andrés Pastrana'},
{'entity_group': 'B-ORG', 'score': 0.8589079678058624, 'word': 'Farc'}]
And for your other test :
nlp = pipeline('ner', grouped_entities=False)
nlp("Enzo works at the the UN")
Output :
[{'word': 'En', 'score': 0.9968166351318359, 'entity': 'I-PER', 'index': 1},
{'word': '##zo', 'score': 0.9957635998725891, 'entity': 'I-PER', 'index': 2},
{'word': 'UN', 'score': 0.9986497163772583, 'entity': 'I-ORG', 'index': 7}]
And,
nlp2 = pipeline('ner', grouped_entities=True)
nlp2("Enzo works at the the UN")
Output :
[{'entity_group': 'I-PER', 'score': 0.9962901175022125, 'word': 'Enzo'},
{'entity_group': 'I-ORG', 'score': 0.9986497163772583, 'word': 'UN'}]
However with my test :
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import TokenClassificationPipeline
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
nlp_not_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=False
)
nlp_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=True
)
seq1 = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
seq2 = "In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification ."
seq3 = "Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % "\
"on a reported basis and 10 . 4 % on a like - for - like basis ."
seq4 = "To prepare as best as possible the decisions falling under its responsibilities , Faurecia Ć¢ā¬ā¢ s Board of"\
" Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation"\
" Committee ."
sequences = [seq1, seq2, seq3, seq4]
for i, seq in enumerate(sequences):
    ngrouped, grouped = nlp_not_grouped(seq), nlp_grouped(seq)
    print(f"===================== sentence n°{i+1}")
    print("---Sentence---")
    print(seq)
    print("---Not grouped entities---")
    for ngent in ngrouped:
        print(ngent)
    print("---Grouped entities---")
    for gent in grouped:
        print(gent)
I have this :
===================== sentence n°1
---Sentence---
Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore veryclose to the Manhattan Bridge.
---Not grouped entities---
{'word': 'Hu', 'score': 0.9995108246803284, 'entity': 'I-ORG', 'index': 1}
{'word': '##gging', 'score': 0.989597499370575, 'entity': 'I-ORG', 'index': 2}
{'word': 'Face', 'score': 0.9979704022407532, 'entity': 'I-ORG', 'index': 3}
{'word': 'Inc', 'score': 0.9993758797645569, 'entity': 'I-ORG', 'index': 4}
{'word': 'New', 'score': 0.9993405938148499, 'entity': 'I-LOC', 'index': 11}
{'word': 'York', 'score': 0.9991927742958069, 'entity': 'I-LOC', 'index': 12}
{'word': 'City', 'score': 0.9993411302566528, 'entity': 'I-LOC', 'index': 13}
{'word': 'D', 'score': 0.986336350440979, 'entity': 'I-LOC', 'index': 19}
{'word': '##UM', 'score': 0.9396238923072815, 'entity': 'I-LOC', 'index': 20}
{'word': '##BO', 'score': 0.9121386408805847, 'entity': 'I-LOC', 'index': 21}
{'word': 'Manhattan', 'score': 0.9839190244674683, 'entity': 'I-LOC', 'index': 29}
{'word': 'Bridge', 'score': 0.9924242496490479, 'entity': 'I-LOC', 'index': 30}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9966136515140533, 'word': 'Hugging Face Inc'}
{'entity_group': 'I-LOC', 'score': 0.9992914994557699, 'word': 'New York City'}
{'entity_group': 'I-LOC', 'score': 0.9460329612096151, 'word': 'DUMBO'}
{'entity_group': 'I-LOC', 'score': 0.9881716370582581, 'word': 'Manhattan Bridge'}
===================== sentence n°2
---Sentence---
In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification .
---Not grouped entities---
{'word': 'B', 'score': 0.9997261762619019, 'entity': 'I-ORG', 'index': 5}
{'word': '##la', 'score': 0.997683048248291, 'entity': 'I-ORG', 'index': 6}
{'word': '##bla', 'score': 0.99888014793396, 'entity': 'I-ORG', 'index': 7}
{'word': 'Group', 'score': 0.9992784261703491, 'entity': 'I-ORG', 'index': 8}
{'word': 'ISO', 'score': 0.9711909890174866, 'entity': 'I-MISC', 'index': 14}
{'word': 'T', 'score': 0.6591967344284058, 'entity': 'I-ORG', 'index': 16}
{'word': '##S', 'score': 0.658642053604126, 'entity': 'I-MISC', 'index': 17}
{'word': '##16', 'score': 0.5059574842453003, 'entity': 'I-MISC', 'index': 18}
{'word': '##9', 'score': 0.5067382454872131, 'entity': 'I-MISC', 'index': 21}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9988919496536255, 'word': 'Blabla Group'}
{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO'}
{'entity_group': 'I-ORG', 'score': 0.6591967344284058, 'word': 'T'}
{'entity_group': 'I-MISC', 'score': 0.5822997689247131, 'word': '##S16'}
{'entity_group': 'I-MISC', 'score': 0.5067382454872131, 'word': '##9'}
===================== sentence n°3
---Sentence---
Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % on a reported basis and 10 . 4 % on a like - for - like basis .
---Not grouped entities---
{'word': 'PS', 'score': 0.9970256686210632, 'entity': 'I-ORG', 'index': 5}
{'word': '##A', 'score': 0.9927457571029663, 'entity': 'I-ORG', 'index': 6}
{'word': 'P', 'score': 0.9980151653289795, 'entity': 'I-ORG', 'index': 7}
{'word': '##eu', 'score': 0.9897757768630981, 'entity': 'I-ORG', 'index': 8}
{'word': '##ge', 'score': 0.996147871017456, 'entity': 'I-ORG', 'index': 9}
{'word': '##ot', 'score': 0.9928787350654602, 'entity': 'I-ORG', 'index': 10}
{'word': '[UNK]', 'score': 0.5744695067405701, 'entity': 'I-ORG', 'index': 11}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot [UNK]'}
===================== sentence n°4
---Sentence---
To prepare as best as possible the decisions falling under its responsibilities , Faurecia Ć¢ā¬ā¢ s Board of Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation Committee .
---Not grouped entities---
{'word': 'F', 'score': 0.9983997941017151, 'entity': 'I-ORG', 'index': 14}
{'word': '##au', 'score': 0.9473735690116882, 'entity': 'I-ORG', 'index': 15}
{'word': '##re', 'score': 0.9604568481445312, 'entity': 'I-ORG', 'index': 16}
{'word': '##cia', 'score': 0.992807149887085, 'entity': 'I-ORG', 'index': 17}
{'word': 'Board', 'score': 0.8452167510986328, 'entity': 'I-ORG', 'index': 20}
{'word': 'of', 'score': 0.5921975374221802, 'entity': 'I-ORG', 'index': 21}
{'word': 'Directors', 'score': 0.6778028607368469, 'entity': 'I-ORG', 'index': 22}
{'word': 'Audi', 'score': 0.9764850735664368, 'entity': 'I-ORG', 'index': 30}
{'word': '##t', 'score': 0.9692177772521973, 'entity': 'I-ORG', 'index': 31}
{'word': 'Committee', 'score': 0.9959701299667358, 'entity': 'I-ORG', 'index': 32}
{'word': 'Strategy', 'score': 0.9705951809883118, 'entity': 'I-ORG', 'index': 35}
{'word': 'Committee', 'score': 0.994032621383667, 'entity': 'I-ORG', 'index': 36}
{'word': 'A', 'score': 0.9764854907989502, 'entity': 'I-ORG', 'index': 39}
{'word': '##oint', 'score': 0.7803319692611694, 'entity': 'I-ORG', 'index': 41}
{'word': '##ments', 'score': 0.7828453779220581, 'entity': 'I-ORG', 'index': 42}
{'word': 'and', 'score': 0.9625542163848877, 'entity': 'I-ORG', 'index': 43}
{'word': 'Co', 'score': 0.9904180765151978, 'entity': 'I-ORG', 'index': 44}
{'word': '##mp', 'score': 0.9140805602073669, 'entity': 'I-ORG', 'index': 45}
{'word': '##ens', 'score': 0.8661588430404663, 'entity': 'I-ORG', 'index': 46}
{'word': '##ation', 'score': 0.9150537252426147, 'entity': 'I-ORG', 'index': 47}
{'word': 'Committee', 'score': 0.9888517260551453, 'entity': 'I-ORG', 'index': 48}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9747593402862549, 'word': 'Faurecia'}
{'entity_group': 'I-ORG', 'score': 0.7050723830858866, 'word': 'Board of Directors'}
{'entity_group': 'I-ORG', 'score': 0.9805576602617899, 'word': 'Audit Committee'}
{'entity_group': 'I-ORG', 'score': 0.9823139011859894, 'word': 'Strategy Committee'}
{'entity_group': 'I-ORG', 'score': 0.9764854907989502, 'word': 'A'}
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': '##ointments and Compensation Committee'}
It seems like the problem is still here for sentence n°4 : the last group should be "Appointments and Compensation Committee". For sentence n°2, the entity should be "TS16949", as MISC or ORG; instead the model predicts the T as ORG and the other part as MISC. Even if both parts don't have the same entity tag, I think the MISC part should at least have been in one group, "S16949".
Also, @dav009's "trick" to solve the [UNK] issue no longer seems to work :
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
nlp = TokenClassificationPipeline(model=model,
                                  tokenizer=tokenizer,
                                  grouped_entities=False)
t="Product sales to the PSA Peugeot CitroĆĀ«n group totaled â⬠1 , 893 . 6 million in 2012 , down 8 . 1 % on a reported basis and 10 . 4 % on a like - for - like basis ."
nlp(t)
Output :
[{'word': 'PS', 'score': 0.9970256686210632, 'entity': 'I-ORG', 'index': 5},
{'word': '##A', 'score': 0.9927457571029663, 'entity': 'I-ORG', 'index': 6},
{'word': 'P', 'score': 0.9980151653289795, 'entity': 'I-ORG', 'index': 7},
{'word': '##eu', 'score': 0.9897757768630981, 'entity': 'I-ORG', 'index': 8},
{'word': '##ge', 'score': 0.996147871017456, 'entity': 'I-ORG', 'index': 9},
{'word': '##ot', 'score': 0.9928787350654602, 'entity': 'I-ORG', 'index': 10},
{'word': '[UNK]',
'score': 0.5744695067405701,
'entity': 'I-ORG',
'index': 11}]
The [UNK] token is back
For sentence 4, this is because the ##pp in "Appointments" is not being tagged as an entity. This will require a separate PR that assumes that all the word pieces attached to a tagged entity token should also be tagged with the same entity, whether or not they were tagged.
A similar situation is happening in sentence 2. The clue is in the value for "index". You'll notice that the tokens aren't contiguous and so aren't being grouped together. This implies that some middle word pieces aren't being tagged as entities.
For the [UNK] issue, this "might" be because that word piece token was out of vocabulary and so gets converted to [UNK] at the decoding step.
Since this happens before entity grouping, I think it is safe to say this is unrelated to entity grouping and is related to how the raw NER forward pass is handled.
Perhaps we can separate this from the above issue? Both will require separate PRs to address.
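To illustrate the word-piece assumption described above, here is a rough sketch that spreads a tag over every piece of a word as soon as one piece is tagged. It is not the pipeline's code: the function propagate_word_labels is hypothetical, it works on plain (token, label) pairs, and using the first non-"O" label of the word is an assumption. The labels in the example are illustrative, mirroring the case where '##pp' was left untagged.

```python
def propagate_word_labels(tagged_tokens):
    """Spread an entity label over all word pieces of the same word.

    tagged_tokens is a list of (token, label) pairs, where "O" means
    that the token was not tagged as an entity.
    """
    words = []  # each word is a list of (token, label) pieces
    for token, label in tagged_tokens:
        if token.startswith("##") and words:
            words[-1].append((token, label))
        else:
            words.append([(token, label)])

    result = []
    for pieces in words:
        tagged = [label for _, label in pieces if label != "O"]
        # Assumption: the first tagged piece decides the label of the word.
        word_label = tagged[0] if tagged else "O"
        result.extend((token, word_label) for token, _ in pieces)
    return result


tokens = [("c", "O"), ("A", "I-ORG"), ("##pp", "O"),
          ("##oint", "I-ORG"), ("##ments", "I-ORG"), ("and", "I-ORG")]
print(propagate_word_labels(tokens))
```

With every piece of the word carrying the same tag, the indices become contiguous again and the existing grouping step has nothing left to split.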
Actually you're right, it seems that sentences n°2 and n°4 are showing a different issue : if the indices are not contiguous (because a part is missing in the prediction : "pp" for n°4 and "94" for n°2) then the grouping fails. It's indeed a different issue.
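The index clue can be checked mechanically. The sketch below is a hypothetical helper (find_index_gaps is not part of transformers) that scans the ungrouped pipeline output and reports every place where a '##' piece does not directly follow the previous returned token, which is exactly where a word piece was dropped.

```python
def find_index_gaps(entities):
    """Return (previous_index, current_index) pairs where a '##' piece
    does not directly follow the previous returned token."""
    gaps = []
    for prev, cur in zip(entities, entities[1:]):
        if cur["word"].startswith("##") and cur["index"] != prev["index"] + 1:
            gaps.append((prev["index"], cur["index"]))
    return gaps


# Values copied from the ungrouped output for sentence n°4.
entities = [
    {"word": "A", "score": 0.9764854907989502, "entity": "I-ORG", "index": 39},
    {"word": "##oint", "score": 0.7803319692611694, "entity": "I-ORG", "index": 41},
    {"word": "##ments", "score": 0.7828453779220581, "entity": "I-ORG", "index": 42},
]
print(find_index_gaps(entities))  # [(39, 41)]
```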
For sentence 4, this is because the ##pp in "Appointments" is not being tagged as an entity. This will require a separate PR that assumes that all the word pieces attached to a tagged entity token should also be tagged with the same entity, whether or not they were tagged.
Although I agree that it could be solved in a later PR, shouldn't this more 'holistic' view be preferable (and be the default)? If one token in a word is 'missed' but the other four are tagged as an entity (e.g. PER-PER-O-PER-PER), the whole word is one entity (and not two separate entities). We 'know' what a word is at the word level; the model doesn't.
@HHoofs agree that this should be the default. If the "word-level" implementation is submitted as a PR, this should not be the default behaviour and should be explicitly set.
I agree with that; what I meant, however, was the following case: Italy
Let's say that this consists of three subtokens: _It, a, ly
If the first and last tokens are assigned Country and the middle one None, it would now result in a split output (if I understand correctly).
I would suggest that the outputs of all three subtokens are averaged and then the highest-scoring class is selected.
In pseudo-code, I would suggest the following (order):
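```
import numpy as np

def word_level_labels(tokens, num_labels):
    # Assumption: each token is a dict with 'word' and 'scores', where
    # 'scores' holds one score per entity class (length num_labels),
    # and word pieces that continue a word start with '##'.
    word_scores = []
    for token in tokens:
        # the first token should always start a 'new word'
        if not word_scores or not token["word"].startswith("##"):
            word_scores.append(np.zeros(num_labels))
        word_scores[-1] += np.asarray(token["scores"])
    # now you have the summed entity scores for each separate word;
    # summing and averaging give the same argmax
    return [int(scores.argmax()) for scores in word_scores]
```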
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.