Not only is spacy-nightly not obeying `is_sent_start`, it is also producing bad sentence segmentation.
```python
import spacy

text = 'When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.'

def sbd_component(doc):
    doc[0].is_sent_start = True
    # start=1 so that i indexes the current token and i - 1 its predecessor
    for i, token in enumerate(doc[1:], start=1):
        # define a sentence start after a space token
        if doc[i - 1].is_space:
            doc[i].is_sent_start = True
        else:
            doc[i].is_sent_start = False
    return doc

# Bad built-in sbd
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)

# spaCy isn't honoring is_sent_start and produces the same poor sbd
nlp.add_pipe(sbd_component, before='parser')  # insert before the parser
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)
```
I expected:

```
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
```
But I get this instead:

```
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
```
When I run this on the Binder example (https://spacy.io/usage/processing-pipelines#component-example1), it works as expected.
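For reference, the rule the component above is meant to implement can be sketched without spaCy at all, using `(text, is_space)` tuples as stand-ins for tokens (the names here are hypothetical helpers for illustration, not spaCy API):

```python
def sentence_starts(tokens):
    """Mark a sentence start on the first token and on every token
    that immediately follows a whitespace-only token.

    tokens: list of (text, is_space) tuples standing in for a Doc.
    """
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True
    for i in range(1, len(tokens)):
        _text, is_space = tokens[i - 1]
        if is_space:
            starts[i] = True
    return starts

toks = [("Hello", False), ("world", False), ("\n", True), ("Bye", False)]
print(sentence_starts(toks))  # [True, False, False, True]
```

This is what the `is_sent_start` flags should look like before the parser runs; the bug report is that the parser then overrides them.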
Thanks, I've been chasing this bug for a while on `develop`. I think it's occurring in `set_children_from_heads`: the parse and the sentence boundaries actually disagree here.
The issue arises when we have non-projective dependencies (a.k.a. crossing brackets). The parser is constrained to produce only projective trees, but there's a pre- and post-processing trick that lets it predict non-projective analyses. After deprojectivisation, we run the `set_children_from_heads` routine, which was written under the assumption that the parse is projective; that assumption no longer holds, causing the error.
Thanks again for the test case. Fixed now! This had held up the experiments on the Universal Dependencies corpus, as there are many more non-projective parses there.