Not only is spacy-nightly not obeying `is_sent_start`, it is also producing bad sentence segmentation.
```python
import spacy

text = 'When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.'

def sbd_component(doc):
    doc[0].is_sent_start = True
    # start=1 so that i indexes the current token and i - 1 its predecessor
    for i, token in enumerate(doc[1:], start=1):
        # define a sentence start after a space token
        if doc[i - 1].is_space:
            doc[i].is_sent_start = True
        else:
            doc[i].is_sent_start = False
    return doc

# Bad built-in sbd
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)

# spaCy isn't honoring is_sent_start and produces the same poor sbd
nlp.add_pipe(sbd_component, before='parser')  # insert before the parser
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)
```
I expected:

```
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
```
But I get this instead:

```
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
```
When I run this on the Binder example (https://spacy.io/usage/processing-pipelines#component-example1), it works as expected.
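For reference, the rule the component above is meant to implement can be sketched without spaCy at all, using `(text, is_space)` tuples as stand-ins for tokens (the names here are hypothetical helpers for illustration, not spaCy API):

```python
def sentence_starts(tokens):
    """Mark a sentence start on the first token and on every token
    that immediately follows a whitespace-only token.

    tokens: list of (text, is_space) tuples standing in for a Doc.
    """
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True
    for i in range(1, len(tokens)):
        _text, is_space = tokens[i - 1]
        if is_space:
            starts[i] = True
    return starts

toks = [("Hello", False), ("world", False), ("\n", True), ("Bye", False)]
print(sentence_starts(toks))  # [True, False, False, True]
```

This is what the `is_sent_start` flags should look like before the parser runs; the bug report is that the parser then overrides them.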
Thanks, I've been chasing this bug for a while on `develop`. I think it's occurring in `set_children_from_heads`: the parse and the sentence boundaries actually disagree here.
The issue arises when we have non-projective dependencies (a.k.a. crossing brackets). The parser is constrained to produce only projective trees, but there's a pre- and post-processing trick that lets it predict non-projective analyses. After deprojectivisation, we run the `set_children_from_heads` routine, which was written under the assumption that the parse is projective; that assumption no longer holds, causing the error.
Thanks again for the test case. Fixed now! This had held up the experiments on the Universal Dependencies corpus, as there are many more non-projective parses there.