spaCy: parser not obeying is_sent_start == False (regression)

Created on 19 Sep 2018 · 4 Comments · Source: explosion/spaCy

Not only does the spaCy nightly ignore is_sent_start, it also produces a bad sentence segmentation.

How to reproduce the behaviour

import spacy

text = 'When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.'


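def sbd_component(doc):
    # the first token always starts a sentence
    doc[0].is_sent_start = True
    # start=1 keeps i aligned with the token's index in doc, so doc[0] is
    # not overwritten and the last token is not skipped
    for i, token in enumerate(doc[1:], start=1):
        # define sentence start after space token
        if doc[i-1].is_space:
            doc[i].is_sent_start = True
        else:
            # explicitly mark every other token as not starting a sentence
            doc[i].is_sent_start = False
    return doc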

# Built-in sentence segmentation: bad result
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for i, sent in enumerate(doc.sents, start=1):
    print(i, sent.text)

# spaCy isn't honoring is_sent_start and produces the same poor segmentation
nlp.add_pipe(sbd_component, before='parser')  # insert before the parser
doc = nlp(text)
for i, sent in enumerate(doc.sents, start=1):
    print(i, sent.text)

I expected:
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.

But I get this instead:
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us

When I run this on the binder (https://spacy.io/usage/processing-pipelines#component-example1) it works as expected.

Your Environment

  • spaCy version: 2.1.0a1
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.0
  • Models: en_core_web_md, es_core_news_md, en_core_web_lg, en_core_web_sm
bug feat / parser 🌙 nightly

All 4 comments

Thanks, I've been chasing this bug for a while on develop. I think it's occurring in set_children_from_heads. The parse and sentence boundaries actually disagree here.
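
To make "disagree" concrete, here is a minimal sketch on plain Python lists rather than spaCy objects. The function name and the list-based representation are assumptions for illustration, not part of spaCy: heads[i] is the index of token i's head (a root points to itself), and sent_starts[i] says whether token i is marked as a sentence start. Any arc whose two ends land in different sentences means the parse and the boundaries are inconsistent.

```python
def arcs_crossing_boundaries(heads, sent_starts):
    """Return (child, head) index pairs whose ends lie in different sentences.

    Hypothetical helper for illustration only; it is not a spaCy API.
    """
    sent_ids = []
    current = -1
    for i, starts in enumerate(sent_starts):
        # Token 0 always opens the first sentence.
        if starts or i == 0:
            current += 1
        sent_ids.append(current)
    return [(i, h) for i, h in enumerate(heads) if sent_ids[i] != sent_ids[h]]


# Three tokens, all attached to token 1, but token 1 is marked as a new
# sentence start: token 0 is cut off from its own head.
print(arcs_crossing_boundaries([1, 1, 1], [True, True, False]))
```

An empty result means every token shares a sentence with its head, which is what the expected output in the report requires.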

The issue arises when we have non-projective dependencies (aka crossing brackets). The parser is constrained to produce only projective trees, but there's a pre- and post-processing trick to make the parser predict non-projective analyses. After deprojectivisation, we run the set_children_from_heads routine, which was written with the assumption that the parse is projective. That assumption no longer holds at that point, which causes the error.
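
For readers unfamiliar with the term, a parse is non-projective when two dependency arcs cross. The following is a minimal sketch of that check on a plain list of head indices; it assumes a root points to itself, only tests for crossing arcs, and is not spaCy's own implementation (the function name is made up for this example).

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the index of token i's head; a root points to itself.
    Illustrative helper only, not a spaCy API.
    """
    # Store each arc as (left end, right end), skipping root self-loops.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for a_start, a_end in arcs:
        for b_start, b_end in arcs:
            # Arcs cross when one starts strictly inside the other
            # and ends strictly outside it.
            if a_start < b_start < a_end < b_end:
                return False
    return True


print(is_projective([1, 1, 1]))     # arcs 0-1 and 1-2 only touch
print(is_projective([2, 3, 2, 2]))  # arcs 0-2 and 1-3 cross
```

Code that walks a tree assuming every subtree covers a contiguous span of tokens is only safe when this check passes, which is the assumption described above.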

Thanks again for the test case. Fixed now! This had held up the experiments on the universal dependencies corpus, as there are many more non-projective parses there.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
