Spacy: Segmentation fault when using span.as_doc() method

Created on 3 May 2019 · 7 comments · Source: explosion/spaCy

How to reproduce the behaviour


I am trying to parse a number of reddit comments, dumped from pushshift.io. A file with a sample of these comments (about 130,000) can be found here: comment_sample_100k.csv.

The following script is a heavily shortened version of the one I'm using to do the parsing: spacy_failure.py. There are two custom components in the pipeline: the tokenizer and the sentence boundary setter. The script builds the custom Language object, reads lines in csv format from stdin, parses them, and throws away the result. Running the script looks like:

$ python spacy_failure.py < comment_sample_100k.csv

More than half the time, this command will eventually segfault. It's nondeterministic - sometimes it crashes in 30 seconds, other times it runs for 10 minutes, and sometimes it finishes successfully.

This is about as short as I could make the script while reproducing the error. Lots of seemingly unrelated lines will prevent the error from occurring. Certain small details that prevent the error are marked with comments. Certain other lines that aren't critical to the error but seem to increase the crash rate are also marked. Occasionally, instead of a segfault, the CSV reader will be corrupted and raise an exception.

My best guess is the issue is related to resource cleanup of the object returned by sent.as_doc().to_bytes(). Changing where the Sentence object gets garbage collected seems to change the outcome.

I'm using spaCy 2.1.0, but I've replicated it on 2.1.3 as well.

Info about spaCy

  • spaCy version: 2.1.0
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.7.2
  • Models: en
Labels: bug, feat / doc

All 7 comments

Thanks for the report.

Could you try avoiding the span.as_doc() call? I've had trouble with this before, and it doesn't work the way it was originally intended. Originally I wanted it to be a zero-copy operation, but that didn't work. I'm very suspicious that this could be where the bug is, as it's a quite untested method that had bugs previously.

You might find the serialization code here useful: https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py. This lets you serialize a collection of Doc objects in an efficient format. It should then let you reconstruct your sentences, given the Doc objects.
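For reference, in recent spaCy releases the serialization code linked above is exposed as `spacy.tokens.DocBin`. Here is a minimal sketch of the round trip, assuming spaCy >= 2.2 (the sample texts and the blank pipeline are placeholders, not from the original report):

```python
import spacy
from spacy.tokens import DocBin  # public wrapper around tokens/_serialize.py

nlp = spacy.blank("en")
docs = [nlp(t) for t in ["First comment.", "Second comment."]]

# Pack many Docs into a single efficient binary blob...
doc_bin = DocBin(attrs=["ORTH"])
for doc in docs:
    doc_bin.add(doc)
data = doc_bin.to_bytes()

# ...and reconstruct them later against the same vocab.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```

Serializing whole Docs this way sidesteps `span.as_doc()` entirely, since the sentences can be recovered as spans of the reconstructed Docs.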

Thanks for the response! In my own testing the code ran fine without the as_doc call, so I'm glad to hear that's the most likely source of the issue. I can work around calling that method.

I can confirm there is something going on with the as_doc method :/ The problem seems to be in the Span.to_array method.

I used to do model(span.as_doc()) to apply a model like NER or TextCategorizer on a span, but it seems to be breaking now. What would be the best way to apply a model on a span?
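One possible workaround, until as_doc() is fixed, is to re-tokenize the span's text into a fresh Doc with nlp.make_doc and run the individual pipeline component on that. A minimal sketch (the text and span boundaries are hypothetical, and a blank pipeline stands in for a trained model):

```python
import spacy

nlp = spacy.blank("en")  # placeholder for a trained pipeline

doc = nlp("Alice met Bob in Paris last week.")
span = doc[0:4]  # an arbitrary span, e.g. a sentence or entity

# Build a fresh Doc from the span's text instead of calling span.as_doc(),
# which this thread reports can segfault when the parser is in the pipeline.
span_doc = nlp.make_doc(span.text)

# With a trained model you could then run a single component on it, e.g.:
#   ner = nlp.get_pipe("ner")
#   span_doc = ner(span_doc)
```

The trade-off is that the fresh Doc loses any annotations already set on the original (heads, entities, etc.), so this only works when the component being applied doesn't depend on them.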

Also running into this issue. Similar to @thomasopsomer, we're trying to run a Matcher on each entity in doc.ents to do some extra processing.

Did some more investigation here. It seems that the issue only shows itself when the DependencyParser component is enabled in the pipeline. With it disabled, we aren't able to reproduce the segfault.

EDIT: Did some more digging. Made a replica of Span.as_doc(), removed HEAD from the list of attrs, and the segfaults stop happening!

@dpraul : yep, you're right. I tried fixing this in https://github.com/explosion/spaCy/pull/3969: instead of simply removing all head information from the span (doc), we can just keep the ones that refer to tokens inside the span.
So I am hopeful that this PR would fix your issue (and others above), too.
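To illustrate the idea behind that fix (this is a hypothetical pure-Python sketch of the logic, not the actual code from PR #3969): when a span is copied out to a new Doc, a token's head may point at a token outside the span, which leaves a dangling reference. Instead of dropping all head information, heads that stay inside the span can be kept and re-indexed, while tokens whose head falls outside are re-rooted onto themselves:

```python
def clip_heads(head_positions, start, end):
    """head_positions[i] is the absolute index of token i's head.
    Return span-relative heads for tokens in [start, end); a token whose
    head lies outside the span becomes its own head (a local root)."""
    clipped = []
    for i in range(start, end):
        head = head_positions[i]
        if start <= head < end:
            clipped.append(head - start)  # head stays inside the span
        else:
            clipped.append(i - start)     # re-root: point token at itself
    return clipped

# Tokens 2..4 of a 7-token doc; token 3's head (index 0) is outside the span,
# so it becomes the local root, while the other heads are re-indexed.
heads = [0, 0, 3, 0, 4, 4, 5]
print(clip_heads(heads, 2, 5))
```

This keeps the intra-span dependency structure intact while removing the out-of-bounds references that appear to trigger the segfault.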

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

