What is the expected result of the NE segmentation for ORG entities with apostrophe or apostrophe "s"?
E.g. "Toyota's headquarters is not here." --> Named entity = Toyota or Toyota's
I found that Flair often includes the apostrophe s in the named entity text and often even confuses country + apostrophe s with an ORG entity, e.g. "China's".
I used the tokenizer that Flair uses. Could this also be caused by non-standard apostrophes, such as ’? E.g. the named entity text for GE’s was GE’s, not just GE. But then Russia's (with a straight apostrophe) led to the same result: the whole string Russia's was detected as one ORG entity, rather than just Russia tagged as a country.
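One possible pre-processing step (my own suggestion, not something Flair does) is to normalize typographic apostrophes to the straight ASCII apostrophe before tokenization, so that contraction splitting behaves the same for GE's and GE’s. A minimal sketch:

```python
# Hypothetical pre-processing: map curly/typographic apostrophes
# (U+2019, U+02BC) to the ASCII apostrophe so that downstream
# contraction splitting sees a uniform character.

def normalize_apostrophes(text: str) -> str:
    """Replace typographic apostrophes with the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u02bc", "'")

print(normalize_apostrophes("GE\u2019s revenue"))  # → GE's revenue
```

This only helps if the tokenizer actually splits on the straight apostrophe; as discussed below, that depends on which segtok functions are called.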
Hi @pwichmann,
that's a good question! I looked at the CoNLL-2003 dataset for English and found some examples:
Germany NNP I-NP I-LOC
's POS B-NP O
representative NN I-NP O
to TO I-PP O
the DT I-NP O
European NNP I-NP I-ORG
Union NNP I-NP I-ORG
's POS B-NP O
veterinary JJ I-NP O
committee NN I-NP O
so the 's is split off as a separate token, and that token gets the O (outside) tag.
Let's take this input sentence as an example:
s = Sentence("Germany's weather.", use_tokenizer=True)
This will tokenize the sentence into four tokens:
Sentence: "Germany 's weather ." - 4 Tokens
Your example sentence will be split into the following tokens:
Sentence: "Toyota 's headquarters is not here ." - 7 Tokens
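As a rough illustration of this splitting (a toy regex sketch, not Flair's or segtok's actual implementation), a contraction-aware tokenizer separates the possessive marker and punctuation like this:

```python
import re

# Illustrative only: split off a possessive "'s", keep word runs
# together, and treat remaining punctuation as single tokens.
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")

def toy_tokenize(text: str) -> list:
    """Toy tokenizer mimicking the splitting shown above."""
    return TOKEN_RE.findall(text)

print(toy_tokenize("Toyota's headquarters is not here."))
# → ['Toyota', "'s", 'headquarters', 'is', 'not', 'here', '.']
```

The real tokenizer handles many more cases (other contractions, hyphenation, unicode), but the principle is the same: the possessive marker becomes its own token and can then receive the O tag.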
:)
Interesting. I used segtok.tokenizer to tokenise my text before feeding it into Flair. I do this to make sure I get the token positions right and that Flair does not internally alter my tokens without me seeing the tokenised sentence. And this tokenizer behaves differently: it does not split off the apostrophe or apostrophe s.
Do you get the same result if you use:
from segtok.tokenizer import word_tokenizer
print(word_tokenizer("Germany's weather."))
I certainly don't. I only get three tokens. Germany's is one token.
I had read that Flair uses the segtok one internally (https://github.com/zalandoresearch/flair/issues/394).
This is curious. It also causes massive headaches at my end, because the apostrophes indicate possessives that I need for my relation extraction. If the apostrophe or apostrophe s becomes part of the named entity, it becomes invisible to my relation classifier.
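Until the tokenization is sorted out, one hedged workaround (my own post-processing idea, not part of Flair) is to strip a trailing possessive marker from the predicted entity text, so the possessive stays recoverable for downstream relation extraction:

```python
# Hypothetical clean-up step: strip a trailing possessive marker
# ("'s", "’s", or a bare trailing apostrophe) from an entity's
# surface text. This is a downstream fix, not a Flair feature.

POSSESSIVE_SUFFIXES = ("'s", "\u2019s", "'", "\u2019")

def strip_possessive(entity_text: str) -> str:
    """Return the entity text without a trailing possessive marker."""
    for suffix in POSSESSIVE_SUFFIXES:
        if entity_text.endswith(suffix):
            return entity_text[: -len(suffix)]
    return entity_text

print(strip_possessive("Toyota's"))  # → Toyota
print(strip_possessive("GE\u2019s"))  # → GE
```

Note this only repairs the entity string; it does not fix entity *type* confusions like China's being tagged ORG instead of LOC.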
Hello @pwichmann, you can get the same results by calling segtok the same way we do. Specifically, we don't only use the word_tokenizer function; we also use split_contractions to split off the apostrophe parts and split_single to split sentences. Here's an example script:
# your example sentence
example_text = "Germany's weather."
# option 1: only use word_tokenizer
from segtok.tokenizer import word_tokenizer
print(word_tokenizer(example_text))
# option 2: use split_single to detect sentences, then use both split_contractions and word_tokenizer
from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions
tokens = []
sentences = split_single(example_text)
for sentence in sentences:
    contractions = split_contractions(word_tokenizer(sentence))
    tokens.extend(contractions)
print(tokens)
Thank you so much!