spaCy: Merging spans with space causes token attribute errors

Created on 22 Feb 2019 · 4 Comments · Source: explosion/spaCy

🍒 Issue: When a span consisting of a non-space token followed by a space token is merged, the attribute values (specifically pos, tag, and dep) are unexpected: they deviate from the following description of Span.merge(): "By default, attributes are inherited from the syntactic root token of the span" (from https://spacy.io/api/span).

The implementation goes as follows:

🍒 Relevant parts of the code:

[...]

matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': '+'}, {'IS_SPACE': True}]
matcher.add('ExtraSpace', None, pattern)

[...]

matches = matcher(doc)
# e.g., [(17638521375924138847, 5, 7), (17638521375924138847, 10, 12)] 

[...]

with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        print(f'start, end: {start}, {end}')
        matched_span = doc[start: end]
        print(f'matched_span: {matched_span}')
        retokenizer.merge(matched_span)

I added a component to the nlp pipeline that merges every span matching pattern (see code above). This has an unexpected consequence for the attributes of the merged spans. Compare the output before and after the component is enabled:

🍒 When the component is disabled:

In [150]: text
Out[150]: 'She has  a studio she works at across the street from the mental  hospital.'

# Using the en_coref_md model. 

In [152]: docbefore = nlp(text, disable=['space_remover'])

# Note tokens 2 and 14 (pos is `SPACE`). 

In [153]: [(t.i, t, t.pos_, t.tag_, t.dep_, t.head) for t in docbefore]
Out[153]:
[(0, She, 'PRON', 'PRP', 'nsubj', has),
 (1, has, 'VERB', 'VBZ', 'ROOT', has), ◀️
 (2,  , 'SPACE', '', '', has), ◀️ 
 (3, a, 'DET', 'DT', 'det', studio),
 (4, studio, 'NOUN', 'NN', 'dobj', has),
 (5, she, 'PRON', 'PRP', 'nsubj', works),
 (6, works, 'VERB', 'VBZ', 'relcl', studio),
 (7, at, 'ADP', 'IN', 'prep', works),
 (8, across, 'ADP', 'IN', 'prep', works),
 (9, the, 'DET', 'DT', 'det', street),
 (10, street, 'NOUN', 'NN', 'pobj', across),
 (11, from, 'ADP', 'IN', 'prep', works),
 (12, the, 'DET', 'DT', 'det', hospital),
 (13, mental, 'ADJ', 'JJ', 'amod', hospital), ◀️
 (14,  , 'SPACE', '', '', mental), ◀️ 
 (15, hospital, 'NOUN', 'NN', 'pobj', from),
 (16, ., 'PUNCT', '.', 'punct', has)]

🍒 Now, when the component is enabled:

In [154]: docafter = nlp(text)
start, end: 1, 3
matched_span: has
start, end: 13, 15
matched_span: mental

In [155]: [(t.i, t, t.pos_, t.tag_, t.dep_, t.head) for t in docafter]
Out[155]:
[(0, She, 'PRON', 'PRP', 'nsubj', has  ),
 (1, has  , 'X', 'FW', 'ROOT', has  ), ◀️ 
 (2, a, 'DET', 'DT', 'det', studio),
 (3, studio, 'NOUN', 'NN', 'dobj', has  ),
 (4, she, 'PRON', 'PRP', 'nsubj', works),
 (5, works, 'VERB', 'VBZ', 'relcl', studio),
 (6, at, 'ADP', 'IN', 'prep', works),
 (7, across, 'ADP', 'IN', 'prep', works),
 (8, the, 'DET', 'DT', 'det', street),
 (9, street, 'NOUN', 'NN', 'pobj', across),
 (10, from, 'ADP', 'IN', 'prep', works),
 (11, the, 'DET', 'DT', 'det', hospital),
 (12, mental  , 'PROPN', 'NNP', 'compound', hospital), ◀️ 
 (13, hospital, 'NOUN', 'NN', 'pobj', from),
 (14, ., 'PUNCT', '.', 'punct', has  )]

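Note that the attributes of the merged spans (now tokens 1 and 12) are not what you would expect (cf. tokens 1 and 13 in Out[153] above): 'has ' is now X/FW instead of VERB/VBZ, and 'mental ' is now PROPN/NNP/compound instead of ADJ/JJ/amod.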

Is this expected? Or is it a bug?

As a workaround, I also tried converting the Doc to an array along the lines discussed here, but that seems to have ramifications for tokenization we don't want, so that's a less preferred solution.

bug feat / doc


dzenilee

All 4 comments

Well...I agree that it's unexpected, but it does match up with the definition of "root" that's implemented. The Span root is implemented as the word with the shortest path to the root of the sentence. If the space is attached to the following word in the dependency parse, and this makes it attach with a shorter path to the root of the sentence, then you'll get the behaviour you're seeing. Maybe we should have an extra penalty for spaces in this calculation?
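
To make that definition concrete, here is a minimal sketch in plain Python. It is not spaCy's actual implementation: the names depth, span_root, is_space, and space_penalty are hypothetical, and the head indices are copied from Out[153] in the report above. It picks the token in a span with the fewest arcs to the sentence root, with an optional extra cost for space tokens along the lines of the penalty suggested here.

```python
# Toy model of "span root = token with the shortest path to the sentence root".
# Illustration only: this is not spaCy's code, and all names are hypothetical.

# Head index of each token, copied from Out[153] (the sentence root heads itself).
HEADS = [1, 1, 1, 4, 1, 6, 4, 6, 6, 10, 8, 6, 15, 15, 13, 11, 1]
IS_SPACE = [i in (2, 14) for i in range(len(HEADS))]


def depth(heads, i):
    """Count the arcs from token i up to the sentence root."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps


def span_root(heads, start, end, is_space=None, space_penalty=0):
    """Return the index in [start, end) with the lowest cost.

    Cost is the depth of the token, plus space_penalty if the token is a
    space. Ties go to the earliest token in the span.
    """
    def cost(i):
        penalty = space_penalty if is_space and is_space[i] else 0
        return depth(heads, i) + penalty

    return min(range(start, end), key=cost)


print(span_root(HEADS, 1, 3))    # 1  ('has')
print(span_root(HEADS, 13, 15))  # 13 ('mental')

# Hypothetical parse in which the space (token 14) attaches to 'from'
# (token 11), giving it a shorter path to the sentence root than 'mental'.
alt = list(HEADS)
alt[14] = 11
print(span_root(alt, 13, 15))                             # 14 (the space)
print(span_root(alt, 13, 15, IS_SPACE, space_penalty=2))  # 13 ('mental')
```

With the heads exactly as printed in Out[153], this definition selects the non-space token in both matched spans. The space only wins when it is attached higher in the tree than its neighbour, which is the case a penalty would guard against.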

honnibal on 23 Feb 2019


"Maybe we should have an extra penalty for spaces in this calculation?"

Yes, this sounds like something we can try.

dzenilee on 1 Mar 2019

I thought that this was possibly fixed by #4219, but after looking at the details I don't think so.

I would have expected the behavior modified in #4219 (using the first token rather than the root) to work correctly here. I don't actually see how the original example even got attributes like 'X'/'FW' for 'has ', which don't come from any token in the span. That's pretty weird, as if there was some accidental incrementing/decrementing of the tag symbol.

I don't think it's due to the space being the root somehow, either, since then you'd expect something more like:

1 has  SPACE _SP ROOT has

So, all I can say is that it seems to work now?

I modified the pattern to match any token:

pattern = [{}, {'IS_SPACE': True}]

Before:

0 She PRON PRP nsubj has
1 has VERB VBZ ROOT has
2   SPACE _SP  has
3 a DET DT det studio
4 studio NOUN NN dobj has
5 she PRON PRP nsubj works
6 works VERB VBZ relcl studio
7 at ADP IN prep works
8 across ADP IN prep works
9 the DET DT det street
10 street NOUN NN pobj across
11 from ADP IN prep works
12 the DET DT det hospital
13 mental ADJ JJ amod hospital
14   SPACE _SP  mental
15 hospital NOUN NN pobj from
16 . PUNCT . punct has

And now after:

0 She PRON PRP nsubj has  
1 has   VERB VBZ ROOT has  
2 a DET DT det studio
3 studio NOUN NN dobj has  
4 she PRON PRP nsubj works
5 works VERB VBZ relcl studio
6 at ADP IN prep works
7 across ADP IN prep works
8 the DET DT det street
9 street NOUN NN pobj across
10 from ADP IN prep works
11 the DET DT det hospital
12 mental   ADJ JJ amod hospital
13 hospital NOUN NN pobj from
14 . PUNCT . punct has

If I change the pattern to:

pattern = [{'IS_SPACE': True}, {}]

Edited (mixed up my virtual envs, whoops): now, with #4219, the attributes of the root are chosen rather than those of the space (the first token):

0 She PRON PRP nsubj has
1 has VERB VBZ ROOT has
2  a DET DT det studio
3 studio NOUN NN dobj has
4 she PRON PRP nsubj works
5 works VERB VBZ relcl studio
6 at ADP IN prep works
7 across ADP IN prep works
8 the DET DT det street
9 street NOUN NN pobj across
10 from ADP IN prep works
11 the DET DT det  hospital
12 mental ADJ JJ amod  hospital
13  hospital NOUN NN pobj from
14 . PUNCT . punct has