The input:
I sleep about 16 hours per day. I fall asleep accidentally. I'm tired, I have headache and I have problems with concentration. I don't eat anything and than I'm eating a lot.
…is split into:
I fall asleep accidentally.
I'm tired,
I have headache
and I have problems with concentration.
This is surprising to me, especially since processing the second sentence alone (input: “I'm tired, I have headache and I have problems with concentration.”) yields the whole input as a single sentence.
Is sentence segmentation performed by a trained statistical model?
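For reference, a minimal sketch to reproduce and inspect the reported segmentation (assuming spaCy 2.x and the small English model `en_core_web_sm`; the exact splits depend on the model version):

```python
# Minimal reproduction sketch; assumes spaCy 2.x and en_core_web_sm.
import spacy

nlp = spacy.load('en_core_web_sm')

text = ("I sleep about 16 hours per day. I fall asleep accidentally. "
        "I'm tired, I have headache and I have problems with concentration. "
        "I don't eat anything and than I'm eating a lot.")

doc = nlp(text)

# Sentence boundaries produced by the default segmentation.
for sent in doc.sents:
    print(repr(sent.text))

# The same information token by token via the is_sent_start attribute.
for tok in doc:
    print(tok.text, tok.is_sent_start)
```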
The German model is also very eager to split sentences, even within simple noun phrases.
Is overriding the sent_start flags before running the parser a safe work-around? Or is this likely to result in some sentences that have more than one disjoint dependency graph attached?
I can confirm this: overriding the tok.is_sent_start flag is a simple work-around that prevents some obvious false positives in sentence boundary detection, and it also seems to improve the parse results (when a sentence split is prohibited, the parser is forced to make a different decision, which often happens to be the correct one).
Even a simple heuristic that prevents sentence splits between two word-like tokens (e.g. those starting with a letter) improves the performance of the German model:
```python
def _is_wordlike(tok):
    return tok.orth_ and tok.orth_[0].isalpha()

def sentence_division_suppresor(doc):
    """Spacy pipeline component that prohibits sentence segmentation between two tokens
    that start with a letter. Useful for taming overzealous sentence segmentation
    in the German model, possibly others as well."""
    for i, tok in enumerate(doc[:-1]):
        if _is_wordlike(tok) and _is_wordlike(doc[i + 1]):
            doc[i + 1].is_sent_start = False
    return doc

nlp.add_pipe(sentence_division_suppresor, name='sent_fix', before='parser')
```
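A hedged usage sketch for the component above (assuming spaCy 2.x and the small German model `de_core_news_sm`; the example sentence is just illustrative):

```python
# Usage sketch: compare segmentation with and without the work-around.
# Assumes spaCy 2.x, de_core_news_sm, and that sentence_division_suppresor
# from the snippet above is already defined in scope.
import spacy

text = "Der Blutdruck ist zu hoch und der Cholesterinspiegel auch."

# Default pipeline: parser-based sentence segmentation.
nlp_plain = spacy.load('de_core_news_sm')
print([sent.text for sent in nlp_plain(text).sents])

# Same pipeline with the suppressor inserted before the parser.
nlp_fixed = spacy.load('de_core_news_sm')
nlp_fixed.add_pipe(sentence_division_suppresor, name='sent_fix', before='parser')
print([sent.text for sent in nlp_fixed(text).sents])
```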
I have a similar problem with spaCy 2.0.7 (also mentioned in https://github.com/explosion/spaCy/issues/93):
```python
def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")
# ['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']
```
We've now added a sentence segmentation section to the docs that explains the different options for customising spaCy's default segmentation strategy:
https://spacy.io/usage/linguistic-features#sbd
We're also hoping that the new models for v2.1.x will do slightly better at this (i.e. produce more accurate parsing results) 👍
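One of the options covered there is the rule-based sentencizer component, which splits on punctuation and does not depend on the parser at all. A minimal sketch (assuming spaCy 2.x; the example text is just illustrative):

```python
# Sketch: rule-based sentence segmentation via the 'sentencizer' component.
# Assumes spaCy 2.x; no statistical model is needed for this approach.
import spacy

nlp = spacy.blank('en')                       # blank pipeline, no parser
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # punctuation-based splitting

doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```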