I am playing around with building new Docs in a pipeline to remove sentences (https://stackoverflow.com/questions/58368208/filtering-out-sentences-between-sentence-segmentation-and-other-spacy-components?noredirect=1#comment103103941_58368208).
I noticed that I could not make SENT_START work with from_array: I expect to get the same sentences I had previously, but I end up with one sentence per token.
import spacy
from spacy.tokens import Doc
from spacy import attrs
import numpy as np
ATTRIBUTES_TO_RESTORE = [attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP, attrs.DEP, attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA, attrs.SENT_START]
def copy_doc(doc: Doc) -> Doc:
    return Doc(
        doc.vocab,
        words=list(map(str, doc)),
        spaces=list(map(lambda token: token.whitespace_ != '', doc))
    ).from_array(ATTRIBUTES_TO_RESTORE, doc.to_array(ATTRIBUTES_TO_RESTORE))
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a short sentence. Too short. This is a really really long sentence with a lot of junk.")
doc_copy = copy_doc(doc)
for attr in ATTRIBUTES_TO_RESTORE:
    print(attrs.NAMES[attr])
    # Expect no errors; raises for attrs.SENT_START
    np.testing.assert_array_equal(doc.to_array([attr]), doc_copy.to_array([attr]))
EDIT:
After some fiddling, it seems like I should not be using attrs.SENT_START but attrs.HEAD in my case, since I'm using the dependency parse.
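Concretely, something like this (my own sketch, reusing copy_doc from the snippet above with HEAD/DEP instead of SENT_START) gives me back the original sentence boundaries:

# HEAD + DEP instead of SENT_START: the sentence boundaries are
# recomputed from the restored dependency parse.
ATTRIBUTES_TO_RESTORE = [
    attrs.ORTH, attrs.LEMMA, attrs.SHAPE, attrs.IS_STOP,
    attrs.HEAD, attrs.DEP,
    attrs.LOWER, attrs.POS, attrs.TAG, attrs.IS_ALPHA,
]
doc_copy = copy_doc(doc)
print([sent.text for sent in doc_copy.sents])  # same 3 sentences as in doc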
It also seems like this should probably have raised an error message:
https://github.com/explosion/spaCy/blob/f2d224756b95e6351b4dbff3367a6f823156c010/spacy/tokens/doc.pyx#L785-L786
So:

- ATTRIBUTES_TO_RESTORE = [attrs.SENT_START] works if I don't have HEAD and/or DEP
- ATTRIBUTES_TO_RESTORE = [attrs.HEAD, attrs.DEP] works if I don't have SENT_START

I guess that the previous snippet should be replaced by something like:
if SENT_START in attrs and (HEAD in attrs or DEP in attrs):
    raise ValueError(Errors.E032)
Should I open a PR?
Hi @Wirg, thanks for the report and the detailed analysis! I think you're right: your original case should have thrown an error because SENT_START shouldn't be used if the document is parsed. It would be good to make this function more robust and the error message more transparent. If you create the PR - feel free to include your original case as a regression test in https://github.com/explosion/spaCy/tree/master/spacy/tests/regression
If you haven't seen it already, check out Span.as_doc() (https://github.com/explosion/spaCy/blob/258eb9e064acf085828a327f0ab2a3295057b9c0/spacy/tokens/span.pyx#L203-L242), which shows how to handle a lot of the related issues (SENT_START vs. HEAD/DEP, adjusting dependency arcs, etc.).
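For example, copying a whole parsed doc can look something like this (just a rough sketch, not code from spaCy itself):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a short sentence. Too short.")

# Span.as_doc() builds a standalone Doc from the span's tokens and
# annotations, taking care of the SENT_START vs. HEAD/DEP details.
doc_copy = doc[:].as_doc()
print([sent.text for sent in doc_copy.sents])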
Just coming back to this...
I think that the original check (for HEAD) is correct because DEP alone doesn't provide any sentence boundaries, so the conflict is only between SENT_START and HEAD.
The problem is that DEP alone sets self.is_parsed and then set_children_from_heads() sees lots of 0 heads and modifies all the sentence boundaries. I think is_parsed should just be determined from HEAD and not DEP.
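A rough sketch of what that looks like with the snippet from the original report (illustrative only, reusing doc and attrs from above; it just demonstrates the behaviour described in this comment):

# Restoring DEP without HEAD: the doc gets marked as parsed, but every
# head is still 0 (self-attached), so the recomputed sentence boundaries
# no longer match the original doc.
dep_only = Doc(
    doc.vocab,
    words=[t.text for t in doc],
    spaces=[t.whitespace_ != '' for t in doc]
).from_array([attrs.DEP], doc.to_array([attrs.DEP]))
print(dep_only.is_parsed)                                # True, even though no heads were set
print(len(list(dep_only.sents)), len(list(doc.sents)))   # boundaries differ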