Spacy: Issue with custom SBD

Created on 27 Aug 2019 · 9 comments · Source: explosion/spaCy

Hey everyone.

First of all, I'd like to thank the spaCy creators for such an amazing tool. Unfortunately, either I'm doing something wrong or I've found a bug.

I've got the following text:

My dark eyes had the look of held back tears. "It’s okay," I said

What I'm trying to achieve is for the sentence split to happen after '.' and not after '"'. Here's the code I use to try to solve this, based on this manual page (https://spacy.io/usage/linguistic-features#sbd-custom) and some Googling:

import spacy


def custom_sentencizer(doc):
    for t in doc:
        # Heuristic for an opening quote: a space before the '"',
        # but no space after it.
        if (t.i > 0 and t.i + 1 < len(doc) and t.text == '"'
                and not t.whitespace_ and t.nbor(-1).whitespace_):
            # Start the sentence at the quote, not at the token after it.
            doc[t.i].is_sent_start = True
            doc[t.nbor(1).i].is_sent_start = False

    return doc


nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger'])
nlp.max_length = 1500000
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe(custom_sentencizer, after="sentencizer")



text = "My dark eyes had the look of held back tears. \"It’s okay,\" I said"
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)

Any help would be greatly appreciated.

About my environment:

  • Ubuntu 18.04
  • Python 3.7
  • spacy==2.1.1
Labels: feat / doc, feat / pipeline, usage

All 9 comments

Hmm, your example seems to work for me. I get the output:

My dark eyes had the look of held back tears.
"It’s okay," I said

I'm not sure what changed, but try upgrading to the newest version of spacy (2.1.8)?

It works for me with the newer version, thank you!

As a side question: are there any docs or advice on the different ways to do SBD? The documentation advises setting the token's "is_sent_start" property, while SentenceSegmenter (https://github.com/explosion/spaCy/blob/414f5270b330e35d3cb16bbdae433f0debb3d310/spacy/pipeline.pyx#L40) yields spans instead.

I would follow the structure/approach of the newer Sentencizer instead when creating a custom approach:

https://github.com/explosion/spaCy/blob/b91425f80394b1f0494fd280458bb70b92b19236/spacy/pipeline/pipes.pyx#L1357

(I would really like to develop an SBD option that's somewhere in between the sentencizer (dumb but very fast) and the parser (much better but very slow), but there's nothing concrete to share yet.)

Yes, definitely make it a pipeline component that takes a Doc and sets the is_sent_start property on the tokens.

@ines sorry for the confusion, but you liked the advice by @adrianeboyd and then gave exactly the opposite advice :). So should I implement a custom SentenceSegmenter strategy, or set the is_sent_start property in a pipeline component?

No, what Ines described is how the Sentencizer works. It's best for integration with the rest of spaCy if your custom approach is a pipeline component that sets is_sent_start.

Thank you!

Just in case someone sees this issue later: eventually I decided to use syntok (https://pypi.org/project/syntok/) with spaCy. Here's the code:

import spacy
from spacy.pipeline import SentenceSegmenter
import syntok.segmenter as segmenter


def custom_sentencizer(doc):
    # SentenceSegmenter strategy: a generator that yields one Span per
    # sentence. syntok works on raw text, so we align its character
    # offsets with spaCy's token offsets (token.idx).
    start = 0  # index of the first token of the current sentence
    t_i = 0    # index of the token currently being inspected
    for paragraph in segmenter.analyze(doc.text):
        for sentence in paragraph:
            # Advance to the token at which this sentence starts.
            while t_i < len(doc) and sentence[0].offset >= doc[t_i].idx:
                if sentence[0].offset <= doc[t_i].idx + len(doc[t_i].text) and start < t_i:
                    # Everything before this token belongs to the
                    # previous sentence.
                    yield doc[start:t_i]
                    start = t_i

                t_i += 1
    if start < len(doc):
        # Yield whatever is left as the final sentence.
        yield doc[start:len(doc)]

nlp = spacy.load('en', disable=['parser', 'tagger'])
nlp.add_pipe(SentenceSegmenter({}, strategy=custom_sentencizer))
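The alignment idea in the loop above can be illustrated without spaCy or syntok: given the character offsets at which tokens start and the offsets at which a segmenter says sentences start, find the token index where each sentence begins (a standalone sketch; all names are illustrative):

```python
def align_sentence_starts(token_offsets, sentence_offsets):
    # For each sentence start (a character offset), advance through the
    # token offsets until we reach the first token at or past it.
    starts = []
    t_i = 0
    for s_off in sentence_offsets:
        while t_i < len(token_offsets) and token_offsets[t_i] < s_off:
            t_i += 1
        starts.append(t_i)
    return starts


# Token start offsets for: 'Hello world. Bye.'
#                           0     6    11 13 16
token_offsets = [0, 6, 11, 13, 16]
# Sentences start at characters 0 and 13.
print(align_sentence_starts(token_offsets, [0, 13]))  # -> [0, 3]
```

The generator above does the same thing in a single pass, yielding a Span each time it crosses one of these token-index boundaries.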

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
