Spacy: Issue with custom SBD

Created on 27 Aug 2019 · 9 comments · Source: explosion/spaCy

Hey everyone.

First of all, I'd like to thank the spaCy creators for such an amazing tool. Unfortunately, either I'm doing something wrong or I've found a bug.

I've got the following text:

My dark eyes had the look of held back tears. "It’s okay," I said

What I'm trying to achieve is for the sentence split to happen after '.' and not after '"'. Here's the code I use to try to solve this, based on this manual page (https://spacy.io/usage/linguistic-features#sbd-custom) and some Googling:

import spacy


def custom_sentencizer(doc):
    for t in doc:
        # Heuristic for an opening quote: a space before the '"',
        # but no space after it.
        if (t.i > 0 and t.i + 1 < len(doc) and t.text == '"'
                and not t.whitespace_ and t.nbor(-1).whitespace_):
            # Start the sentence at the quote, not at the token after it.
            doc[t.i].is_sent_start = True
            doc[t.nbor(1).i].is_sent_start = False

    return doc


nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger'])
nlp.max_length = 1500000
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe(custom_sentencizer, after="sentencizer")



text = "My dark eyes had the look of held back tears. \"It’s okay,\" I said"
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)

Any help would be greatly appreciated.

About my environment:

  • Ubuntu 18.04
  • Python 3.7
  • spacy==2.1.1
Labels: feat / doc, feat / pipeline, usage

All 9 comments

Hmm, your example seems to work for me. I get the output:

My dark eyes had the look of held back tears.
"It’s okay," I said

I'm not sure what changed, but try upgrading to the newest version of spacy (2.1.8)?

It works for me with the newer version, thank you!

As a side question: are there any docs or advice on the different ways to do SBD? The documentation advises setting the token's "is_sent_start" property, while SentenceSegmenter (https://github.com/explosion/spaCy/blob/414f5270b330e35d3cb16bbdae433f0debb3d310/spacy/pipeline.pyx#L40) yields spans instead.

I would follow the structure/approach of the newer Sentencizer instead when creating a custom approach:

https://github.com/explosion/spaCy/blob/b91425f80394b1f0494fd280458bb70b92b19236/spacy/pipeline/pipes.pyx#L1357

(I would really like to develop an SBD option that's somewhere in between the sentencizer (dumb but very fast) and the parser (much better but very slow), but there's nothing concrete to share yet.)

Yes, definitely make it a pipeline component that takes a Doc and sets the is_sent_start property on the tokens.

@ines sorry for the confusion, but you liked the advice by @adrianeboyd and then gave exactly the opposite advice :). So should I implement a custom SentenceSegmenter strategy, or set the is_sent_start property in a pipeline component?

No, what Ines described is how the Sentencizer works. It's best for integration with the rest of spaCy if your custom approach is a pipeline component that sets is_sent_start.

Thank you!

Just in case someone sees this issue later: eventually I decided to use syntok (https://pypi.org/project/syntok/) with spaCy. Here's the code:

import spacy
from spacy.pipeline import SentenceSegmenter
import syntok.segmenter as segmenter


def custom_sentencizer(doc):
    # SentenceSegmenter strategy: a generator that yields one Span per
    # sentence. syntok works on raw text, so we align its character
    # offsets with spaCy's token offsets (token.idx).
    start = 0  # index of the first token of the current sentence
    t_i = 0    # index of the token currently being inspected
    for paragraph in segmenter.analyze(doc.text):
        for sentence in paragraph:
            # Advance to the token at which this sentence starts.
            while t_i < len(doc) and sentence[0].offset >= doc[t_i].idx:
                if sentence[0].offset <= doc[t_i].idx + len(doc[t_i].text) and start < t_i:
                    # Everything before this token belongs to the
                    # previous sentence.
                    yield doc[start:t_i]
                    start = t_i

                t_i += 1
    if start < len(doc):
        # Yield whatever is left as the final sentence.
        yield doc[start:len(doc)]

nlp = spacy.load('en', disable=['parser', 'tagger'])
nlp.add_pipe(SentenceSegmenter({}, strategy=custom_sentencizer))
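The alignment idea in the loop above can be illustrated without spaCy or syntok: given the character offsets at which tokens start and the offsets at which a segmenter says sentences start, find the token index where each sentence begins (a standalone sketch; all names are illustrative):

```python
def align_sentence_starts(token_offsets, sentence_offsets):
    # For each sentence start (a character offset), advance through the
    # token offsets until we reach the first token at or past it.
    starts = []
    t_i = 0
    for s_off in sentence_offsets:
        while t_i < len(token_offsets) and token_offsets[t_i] < s_off:
            t_i += 1
        starts.append(t_i)
    return starts


# Token start offsets for: 'Hello world. Bye.'
#                           0     6    11 13 16
token_offsets = [0, 6, 11, 13, 16]
# Sentences start at characters 0 and 13.
print(align_sentence_starts(token_offsets, [0, 13]))  # -> [0, 3]
```

The generator above does the same thing in a single pass, yielding a Span each time it crosses one of these token-index boundaries.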

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
