I am playing around with building new Docs in a pipeline to remove sentences (https://stackoverflow.com/questions/58368208/filtering-out-sentences-between-sentence-segmentation-and-other-spacy-components?noredirect=1#comment103103941_58368208).
I noticed that I could not make SENT_START work with from_array: I expect to get the same sentences I had previously, but I end up with one sentence per token.
import spacy
from spacy.tokens import Doc
from spacy import attrs
import numpy as np
ATTRIBUTES_TO_RESTORE = [attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP, attrs.DEP, attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA, attrs.SENT_START]
def copy_doc(doc: Doc) -> Doc:
    return Doc(
        doc.vocab,
        words=list(map(str, doc)),
        spaces=list(map(lambda token: token.whitespace_ != '', doc))
    ).from_array(ATTRIBUTES_TO_RESTORE, doc.to_array(ATTRIBUTES_TO_RESTORE))
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a short sentence. Too short. This is a really really long sentence with a lot of junk.")
doc_copy = copy_doc(doc)
for attr in ATTRIBUTES_TO_RESTORE:
    print(attrs.NAMES[attr])
    # Expect no errors; raises for attrs.SENT_START
    np.testing.assert_array_equal(doc.to_array([attr]), doc_copy.to_array([attr]))
EDIT:
After some fiddling, it seems like I should not be using attrs.SENT_START but attrs.HEAD in my case, since I'm using the dependency parse.
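Concretely, something like this (my own sketch, reusing copy_doc from the snippet above with HEAD/DEP instead of SENT_START) gives me back the original sentence boundaries:

# HEAD + DEP instead of SENT_START: the sentence boundaries are
# recomputed from the restored dependency parse.
ATTRIBUTES_TO_RESTORE = [
    attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP,
    attrs.HEAD, attrs.DEP,
    attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA,
]
doc_copy = copy_doc(doc)
print([sent.text for sent in doc_copy.sents])  # same 3 sentences as in doc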
It also seems like this should probably have raised an error message:
https://github.com/explosion/spaCy/blob/f2d224756b95e6351b4dbff3367a6f823156c010/spacy/tokens/doc.pyx#L785-L786
So:

- ATTRIBUTES_TO_RESTORE = [attrs.SENT_START] works if I don't have HEAD and/or DEP
- ATTRIBUTES_TO_RESTORE = [attrs.HEAD, attrs.DEP] works if I don't have SENT_START

I guess that the previous snippet should be replaced by something like:
if SENT_START in attrs and (HEAD in attrs or DEP in attrs):
    raise ValueError(Errors.E032)
Should I open a PR?
Hi @Wirg, thanks for the report and the detailed analysis! I think you're right: your original case should have thrown an error because SENT_START shouldn't be used if the document is parsed. It would be good to make this function more robust and the error message more transparent. If you create the PR - feel free to include your original case as a regression test in https://github.com/explosion/spaCy/tree/master/spacy/tests/regression
If you haven't seen it already, check out Span.as_doc() (https://github.com/explosion/spaCy/blob/258eb9e064acf085828a327f0ab2a3295057b9c0/spacy/tokens/span.pyx#L203-L242), which shows how to handle a lot of the related issues (SENT_START vs. HEAD/DEP, adjusting dependency arcs, etc.).
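For example, copying a whole parsed doc can look something like this (just a rough sketch, not code from spaCy itself):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a short sentence. Too short.")

# Span.as_doc() builds a standalone Doc from the span's tokens and
# annotations, taking care of the SENT_START vs. HEAD/DEP details.
doc_copy = doc[:].as_doc()
print([sent.text for sent in doc_copy.sents])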
Just coming back to this...
I think that the original check (for HEAD) is correct because DEP alone doesn't provide any sentence boundaries, so the conflict is only between SENT_START and HEAD.
The problem is that DEP alone sets self.is_parsed and then set_children_from_heads() sees lots of 0 heads and modifies all the sentence boundaries. I think is_parsed should just be determined from HEAD and not DEP.
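A rough sketch of what that looks like with the snippet from the original report (illustrative only, reusing doc and attrs from above; it just demonstrates the behaviour described in this comment):

# Restoring DEP without HEAD: the doc gets marked as parsed, but every
# head is still 0 (self-attached), so the recomputed sentence boundaries
# no longer match the original doc.
dep_only = Doc(
    doc.vocab,
    words=[t.text for t in doc],
    spaces=[t.whitespace_ != '' for t in doc]
).from_array([attrs.DEP], doc.to_array([attrs.DEP]))
print(dep_only.is_parsed)                                # True, even though no heads were set
print(len(list(dep_only.sents)), len(list(doc.sents)))   # boundaries differ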