The input:
I sleep about 16 hours per day. I fall asleep accidentally. I'm tired, I have headache and I have problems with concentration. I don't eat anything and than I'm eating a lot.
…is split into:
I fall asleep accidentally.
I'm tired,
I have headache
and I have problems with concentration.
This is surprising to me, especially since processing the second sentence alone (input: “I'm tired, I have headache and I have problems with concentration.”) yields the whole input as a single sentence.
Is sentence segmentation performed by a trained statistical model?
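For reference, a minimal sketch to reproduce and inspect the reported segmentation (assuming spaCy 2.x and the small English model `en_core_web_sm`; the exact splits depend on the model version):

```python
# Minimal reproduction sketch; assumes spaCy 2.x and en_core_web_sm.
import spacy

nlp = spacy.load('en_core_web_sm')

text = ("I sleep about 16 hours per day. I fall asleep accidentally. "
        "I'm tired, I have headache and I have problems with concentration. "
        "I don't eat anything and than I'm eating a lot.")

doc = nlp(text)

# Sentence boundaries produced by the default segmentation.
for sent in doc.sents:
    print(repr(sent.text))

# The same information token by token via the is_sent_start attribute.
for tok in doc:
    print(tok.text, tok.is_sent_start)
```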
The German model is also very eager to split sentences, even within simple noun phrases.
Is overriding the sent_start flags before running the parser a safe work-around? Or is this likely to result in some sentences that have more than one disjoint dependency graph attached?
I can confirm this: overriding the tok.is_sent_start flag is a simple work-around that prevents some obvious false positives in sentence boundary detection, and it also seems to improve the parse results (when a sentence split is prohibited, the parser is forced to make a different decision, which often happens to be the correct one).
Even a simple heuristic that prevents sentence splits between two word-like tokens (e.g. those starting with a letter) improves the performance of the German model:
```python
def _is_wordlike(tok):
    return tok.orth_ and tok.orth_[0].isalpha()

def sentence_division_suppresor(doc):
    """Spacy pipeline component that prohibits sentence segmentation between two tokens
    that start with a letter. Useful for taming overzealous sentence segmentation
    in the German model, possibly others as well."""
    for i, tok in enumerate(doc[:-1]):
        if _is_wordlike(tok) and _is_wordlike(doc[i + 1]):
            doc[i + 1].is_sent_start = False
    return doc

nlp.add_pipe(sentence_division_suppresor, name='sent_fix', before='parser')
```
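A hedged usage sketch for the component above (assuming spaCy 2.x and the small German model `de_core_news_sm`; the example sentence is just illustrative):

```python
# Usage sketch: compare segmentation with and without the work-around.
# Assumes spaCy 2.x, de_core_news_sm, and that sentence_division_suppresor
# from the snippet above is already defined in scope.
import spacy

text = "Der Blutdruck ist zu hoch und der Cholesterinspiegel auch."

# Default pipeline: parser-based sentence segmentation.
nlp_plain = spacy.load('de_core_news_sm')
print([sent.text for sent in nlp_plain(text).sents])

# Same pipeline with the suppressor inserted before the parser.
nlp_fixed = spacy.load('de_core_news_sm')
nlp_fixed.add_pipe(sentence_division_suppresor, name='sent_fix', before='parser')
print([sent.text for sent in nlp_fixed(text).sents])
```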
I have a similar problem with spaCy 2.0.7 (also mentioned in https://github.com/explosion/spaCy/issues/93):
```python
def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")
# ['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']
```
We've now added a sentence segmentation section to the docs that explains the different options for customising spaCy's default segmentation strategy:
https://spacy.io/usage/linguistic-features#sbd
We're also hoping that the new models for v2.1.x will do slightly better at this (i.e. produce more accurate parsing results) 👍
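One of the options covered there is the rule-based sentencizer component, which splits on punctuation and does not depend on the parser at all. A minimal sketch (assuming spaCy 2.x; the example text is just illustrative):

```python
# Sketch: rule-based sentence segmentation via the 'sentencizer' component.
# Assumes spaCy 2.x; no statistical model is needed for this approach.
import spacy

nlp = spacy.blank('en')                       # blank pipeline, no parser
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # punctuation-based splitting

doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```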