spaCy: Noun phrase merge is failing

Created on 13 May 2016 · 15 comments · Source: explosion/spaCy

This is now failing:

>>> doc = nlp('The cat sat on the mat')
>>> for np in doc.noun_chunks:
        np.merge(np.root.tag_, np.text, np.root.ent_type_)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-409-f6294d1a1cf8> in <module>()
      1 doc = nlp('The cat sat on the mat')
----> 2 for np in doc.noun_chunks:
      3     np.merge(np.root.tag_, np.text, np.root.ent_type_)

/Users/yaser/miniconda3/envs/spacy/lib/python3.5/site-packages/spacy/tokens/doc.pyx in noun_chunks (spacy/tokens/doc.cpp:7745)()

/Users/yaser/miniconda3/envs/spacy/lib/python3.5/site-packages/spacy/syntax/iterators.pyx in english_noun_chunks (spacy/syntax/iterators.cpp:1559)()

/Users/yaser/miniconda3/envs/spacy/lib/python3.5/site-packages/spacy/tokens/doc.pyx in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4853)()

IndexError: list index out of range
Labels: bug

All 15 comments

Thanks. Are you on v0.101.0?

I am still on 0.100.7.

Could you try 0.101.0?

Thanks, guys, for the amazing work. I'm seeing the same issue while running v0.101.0, though. Hope you can fix it soon ;)

Ah, this was dumb, sorry --- I didn't have time to really look at this; now that I have, it's obvious there's a problem. Actually, I'm not sure how the code was working _before_. I think there was always a bug here.

Please work around this for now by doing for np in list(doc.noun_chunks) (see the snippet below). The problem is that we're changing the tokenisation out from underneath the iterator we're yielding from, and this is what causes the error.

I think this is always going to be hard to get right, and I'm going to change the noun chunks code to accumulate the spans before it yields them.
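
Concretely, the workaround is just the original loop with the generator exhausted up front:

doc = nlp('The cat sat on the mat')
# list() consumes the noun_chunks iterator before any merge
# changes the tokenisation underneath it.
for np in list(doc.noun_chunks):
    np.merge(np.root.tag_, np.text, np.root.ent_type_)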

Sorry for taking so long to get back to you, but your tip works. Thanks a lot!

This is still failing for me on 0.101.0. The workaround with for np in list(doc.noun_chunks) works.

Fixed on the master branch, and the fix will be released in 1.0.

Hi, I'm not sure whether this is a related issue, but I'm using the merge functionality to merge tokens from certain noun phrases. I've set up two rules to capture tokens that match "unemployment" and "unemployment rate". In a sentence like "Brazil's unemployment rate is 13%." I get two matches: [unemployment] and [unemployment, rate]. I pretty much copied the code below from the lightning tour:

import logging

logger = logging.getLogger(__name__)


def merge_phrases(matcher, doc, match_idx, matches):
    """
    Merge a phrase: ['unemployment', 'rate'] => ['unemployment rate'].
    Be careful: token indices will be changed.
    To avoid problems, merge all the phrases once we're called on the last match.

    :param matcher: spacy.matcher.Matcher: token matcher
    :param doc: spacy.tokens.Doc: parsed doc
    :param match_idx: int: index of the current match in `matches`
    :param matches: list: of (ent_id, label, start, end) tuples
    :return: None
    """
    if match_idx != len(matches) - 1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    logger.info("Got Spans %s", spans)
    for ent_id, label, span in spans:
        logger.info("Merging ent_id %s label %s span %s", ent_id, label, span)
        span.merge(label=label, tag='NNP' if label else span.root.tag_)
    logger.info("Merged Spans %s", spans)

Now, if I leave the logger.info("Merged Spans %s", spans) call after the for loop, then for the sentence above I get something like:

--- Logging error ---
Traceback (most recent call last):
  File "/Users/thiago/miniconda2/envs/factutils/lib/python3.5/logging/__init__.py", line 980, in emit
    msg = self.format(record)
  File "/Users/thiago/miniconda2/envs/factutils/lib/python3.5/logging/__init__.py", line 830, in format
    return fmt.format(record)
  File "/Users/thiago/miniconda2/envs/factutils/lib/python3.5/logging/__init__.py", line 567, in format
    record.message = record.getMessage()
  File "/Users/thiago/miniconda2/envs/factutils/lib/python3.5/logging/__init__.py", line 330, in getMessage
    msg = msg % self.args
  File "spacy/tokens/span.pyx", line 72, in spacy.tokens.span.Span.__repr__ (spacy/tokens/span.cpp:4024)
  File "spacy/tokens/span.pyx", line 191, in spacy.tokens.span.Span.text.__get__ (spacy/tokens/span.cpp:6574)
  File "spacy/tokens/span.pyx", line 198, in spacy.tokens.span.Span.text_with_ws.__get__ (spacy/tokens/span.cpp:6727)
  File "spacy/tokens/span.pyx", line 87, in __iter__ (spacy/tokens/span.cpp:4472)
  File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5060)
IndexError: Error calculating span: Can't find end

Note that if I comment out the last logging statement, no exception is thrown (there's more stack trace if needed). The funny thing is that the first logging statement gives me 2017-01-29 20:28:56,240 | INFO | text.py-0043 | Got Spans [(1510408, 0, unemployment), (1510409, 0, unemployment rate)], which are exactly the tokens I want to merge. I'm raising this here because, in the example I posted, the result of matcher.match becomes [unemployment rate] and [unemployment rate, is], whereas I expected only the former.

Hey, does anyone still have a problem with this? I am running spaCy 1.8 and still run into this error, with and without the workaround of casting to a list.

@patcollis34 The safer approach I took was to leave the tokens unmerged, so the matcher just returns a list of matches, and then to write a function that removes overlapping matches contained in bigger matches (see the sketch below).
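
A minimal sketch of that post-filtering approach (the helper name is mine, and it assumes the (ent_id, label, start, end) match tuples from the snippet above):

def remove_contained_matches(matches):
    """Drop matches that are strictly contained in a longer match.

    Hypothetical helper, not spaCy API. Assumes `matches` is a list of
    (ent_id, label, start, end) tuples as returned by the matcher.
    """
    result = []
    for ent_id, label, start, end in matches:
        contained = any(
            other_start <= start and end <= other_end
            and (other_end - other_start) > (end - start)
            for _, _, other_start, other_end in matches
        )
        if not contained:
            result.append((ent_id, label, start, end))
    return result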

This issue should not have been closed, because it is still present in the spaCy 2.0 alpha. Merging tokens (compounds, entities, matches, etc.) often results in IndexError: Error calculating span: Can't find start.

In my experience this is caused by multiple merge calls in one loop.

For example, I could call this:

for np in doc.noun_chunks:
    np.merge(tag=np.root.tag_, lemma=np.lemma_, ent_type=np.root.ent_type_)

Now, say in the same loop I wanted to call a different merge: if the POS tags weren't all of the same type, I pull out the mismatched tags and merge only the segments that have a homogeneous structure. My code would look something like this:

segments = []

for np in doc.noun_chunks:
    _tags = []

    current_segment = []
    for idx, _part in enumerate(np):
        _tags.append(_part.tag_)

        if _part.tag_ in ['NNP', 'NNS', 'NN']:
            current_segment.append(_part)
        elif len(current_segment) > 1:
            segments.append(current_segment)
            current_segment = []

        if idx == (len(np) - 1) and len(current_segment) > 1:
            segments.append(current_segment)

    if len(set(_tags)) == 1:
        np.merge(tag=np.root.tag_, lemma=np.lemma_, ent_type=np.root.ent_type_)
    elif len(segments) > 0:
        for _segment in segments:
            if len(_segment) > 1:
                doc.merge(start_idx=_segment[0].idx,
                          end_idx=_segment[-1].idx + len(_segment[-1]),
                          tag=_segment[0].tag_,
                          lemma=' '.join([token.text for token in _segment]),
                          ent_type=_segment[0].ent_type_)

This will fail. It will probably succeed in some cases, solely because nothing happened to be merged in the wrong place :) But it will fail eventually, since you are modifying the state of the actual document whilst iterating over it.

It's safer to extract one of the merges and do the merge outside of the loop.

e.g.

segments = []

for np in doc.noun_chunks:
    _tags = []

    current_segment = []
    for idx, _part in enumerate(np):
        _tags.append(_part.tag_)

        if (_part.tag_ in AspectBasedSentimentAnalyser.T_TYPE
                or _part.lemma_ in ['south', 'north', 'east', 'west']):
            current_segment.append(_part)
        elif len(current_segment) > 1:
            segments.append(current_segment)
            current_segment = []

        if idx == (len(np) - 1) and len(current_segment) > 1:
            segments.append(current_segment)

    if len(set(_tags)) == 1:
        np.merge(tag=np.root.tag_, lemma=np.lemma_, ent_type=np.root.ent_type_)

if len(segments) > 0:
    for _segment in segments:
        if len(_segment) > 1:
            doc.merge(start_idx=_segment[0].idx,
                      end_idx=_segment[-1].idx + len(_segment[-1]),
                      tag=_segment[0].tag_,
                      lemma=' '.join([token.text for token in _segment]),
                      ent_type=_segment[0].ent_type_)
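
As an aside for anyone reading this on a newer version: from spaCy v2.1 the merge methods are deprecated in favour of the retokenizer, which queues all merges and applies them when the context manager exits, so this whole class of problem goes away. A rough sketch of the simple noun-chunk case, assuming spaCy v2.1+:

# Sketch assuming spaCy v2.1+: merges are queued and applied on
# context exit, so iterating doc.noun_chunks directly is safe.
with doc.retokenize() as retokenizer:
    for np in doc.noun_chunks:
        retokenizer.merge(np, attrs={"TAG": np.root.tag_,
                                     "LEMMA": np.lemma_,
                                     "ENT_TYPE": np.root.ent_type_})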

Try this merge file:
merge_file.txt

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
