"Unhandled exception at 0x00007FFC12181517 (token.cp36-win_amd64.pyd) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000402603FF8)."
Iterating through a sentence causes a crash when PyCharm's debugger attempts to break after the first word (first word->second word->crash). Attached is the memory dump from Python after it crashed. If the dump with the heap would be useful, I can send it but it is over 2 GB.
import spacy
nlp = spacy.load('en_core_web_lg')
sentence = "Ashley graded Bob's term paper after he completed his assignment and she finished her group project with June."
text = nlp(sentence)
for token in text:
    print(token.text, token.ent_id, token.lemma_, token.pos_, token.tag_, token.dep_)  # break here
@cmckain Do you have any insight about the likely implications of this?
I don't regularly use debuggers, and I've never used PyCharm, so I'm not sure whether this points towards a deeper issue. If the crash is exposing a memory error in spaCy (e.g. a use after free or out-of-bounds access), obviously we're very interested in that! But if it's just that we hit some tight stack-size limit in this tool that we don't hit in regular execution, I don't think that's a problem we'd work on.
I have another problem with the non-English models of spaCy 2.0 and PyCharm's debugger.
With the English model, I run and debug my project as usual.
With other models (I tried Portuguese and Spanish) I can run the project normally but when I try to use the debugger it halts on the line where I load the model:
nlp = spacy.load('pt')
or even:
nlp = spacy.load('pt', disable=['parser', 'tagger', 'ner'])
For some days I thought it was a PyCharm problem, but after reinstalling and even downgrading PyCharm I found that the source of the problem is the loading of the spaCy model. By chance, I left the debugger stopped on that line and discovered that it resumes and continues to the next line after 10-15 minutes!
Is there any difference in the load process of these models that can explain this incredible delay?
It can't be a memory problem since the models' size is similar and I have enough memory (I run the project with the three models loaded without any problem).
This is really puzzling me...
Any tip would be welcome!
spaCy version    2.0.3
Platform         Windows-10-10.0.17046-SP0
Python version   3.6.3
Models           en, es, pt

TYPE      NAME              MODEL             VERSION
package   pt-core-news-sm   pt_core_news_sm   2.0.0
package   es-core-news-sm   es_core_news_sm   2.0.0
package   en-core-web-sm    en_core_web_sm    2.0.0
link      en                en_core_web_sm    2.0.0
link      es                es_core_news_sm   2.0.0
link      pt                pt_core_news_sm   2.0.0
Although I haven't stepped through the assembly line-by-line yet, I would infer that the debugger is somehow forcing the program into an endless loop or, perhaps, some recursion that calls the same function over and over again. I first noticed this problem when a deprecation warning kept printing until the program crashed. Oddly enough, the debugger works the first time you open a spaCy data structure, but the second attempt (whether on a child structure or on something else) causes a crash. Perhaps the debugger is holding a lock on some of the data, and the program keeps failing to get it back and crashes as a result?
I can confirm @ruiEnca's problem. Using the above code, here are my timed results for each model, listed as debugger on, then debugger off:
en_core_web_lg: 0:00:18.781000, 0:00:26.921000
en_core_web_sm: 0:00:03.859000, 0:00:04.188000
es_core_news_sm: 0:03:12.625000, 0:00:03.812000
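For anyone who wants to reproduce timings like these, here is a minimal sketch of one way to measure a load call and report whether a debugger appears to be attached. The helper names `timed` and `debugger_attached` and the stand-in `fake_load` are hypothetical, not part of spaCy, and checking `sys.gettrace()` is only a best-effort heuristic (it assumes the debugger installs a trace function).

```python
import sys
import time
from datetime import timedelta


def debugger_attached():
    """Best-effort check: many Python debuggers install a trace function."""
    return sys.gettrace() is not None


def timed(loader):
    """Call loader() and return (result, elapsed), with elapsed as a timedelta."""
    start = time.perf_counter()
    result = loader()
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return result, elapsed


def fake_load():
    # Hypothetical stand-in so the sketch runs without spaCy installed.
    # For a real measurement, pass something like: lambda: spacy.load("es")
    time.sleep(0.01)
    return "model"


model, elapsed = timed(fake_load)
print("debugger attached:", debugger_attached())
print("load time:", elapsed)
```

Running the same script once normally and once under the debugger gives the two numbers to compare.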
I think I figured it out, @honnibal. The file "spaCy/spacy/tokens/token.pyx" sets up the 'get' and 'set' functions of the property "sent_start". If the word is literally the beginning of the sentence, it returns false; if not, it returns the value of "sent_start", which calls the 'get' of "sent_start", which returns the value of "sent_start", and so on. The stack overflow happens because the function never stops calling itself for any word after the first. For most people this isn't a problem, because they don't need to call a deprecated property, but the debugger does (since it lists all possible properties). My previous observation that the crash only occurred on the second listing was because I always tested the first word first and the second word second; when I removed the logic and had "return self.sent_start" always run, the program failed regardless of which word I chose to debug. My temporary solution was to change line 356 to "return True", although I'm not sure what it returned in the past. I would recommend removing that property sooner rather than later. For @ruiEnca's issue, I would guess that the debugger is forcing some extra code to run that isn't normally run during startup, as the debugger tries to load everything it can.
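To illustrate the mechanism described above, here is a pure-Python sketch. `BuggyToken` and `read_all_attributes` are hypothetical stand-ins, not spaCy code. In pure Python the runaway getter raises a catchable `RecursionError`; in a compiled extension module the same pattern can exhaust the native stack instead, which would be consistent with the crash reported in this thread.

```python
class BuggyToken:
    """Hypothetical stand-in for a token whose getter refers to itself."""

    def __init__(self, index):
        self.index = index

    @property
    def sent_start(self):
        if self.index == 0:
            # First word: returns without touching the property again.
            return False
        # Bug: this re-enters the same getter instead of reading stored data.
        return self.sent_start


def read_all_attributes(obj):
    """Mimic a debugger's variable view by reading every public attribute."""
    values, failures = {}, {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            values[name] = getattr(obj, name)
        except RecursionError as exc:
            failures[name] = type(exc).__name__
    return values, failures


first_values, first_failures = read_all_attributes(BuggyToken(0))
second_values, second_failures = read_all_attributes(BuggyToken(1))
print(first_failures)   # {}
print(second_failures)  # {'sent_start': 'RecursionError'}
```

Code that never reads `sent_start` is unaffected, which is why ordinary runs work while an attribute-listing debugger triggers the failure.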
@honnibal @cmckain I found the place where the debugger stops for some minutes while loading a non-English model. It is in the import_file function of compat.py in line 119:
spec.loader.exec_module(module)
This function is called from load_model and load_model_from_link in util.py.
With spacy.load('en') everything is ok. With other languages (I tried pt and es) it halts.
I can't spot the difference between loading the English model and the others.
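To narrow this down outside spaCy, a minimal sketch of the standard `importlib` recipe for loading a module from a file path, with the `exec_module` step timed, might look like the following. This assumes `import_file` follows roughly this pattern; the helper name `import_file_timed` and the temporary `demo_model.py` file are hypothetical.

```python
import importlib.util
import os
import tempfile
import time


def import_file_timed(name, path):
    """Load a module from a file path and time the exec_module step."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs the module's top-level code
    elapsed = time.perf_counter() - start
    return module, elapsed


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a model package's __init__.py.
    path = os.path.join(tmp, "demo_model.py")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("LANG = 'pt'\n")
    module, elapsed = import_file_timed("demo_model", path)

print(module.LANG, f"{elapsed:.6f}s")
```

Pointing the same helper at each model package's `__init__.py`, with and without the debugger, should show whether the time is spent executing the module's top-level code.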
spaCy version    2.0.4
Location         C:\Anaconda\lib\site-packages\spacy-2.0.4-py3.6-win-amd64.egg\spacy
Platform         Windows-10-10.0.17046-SP0
Python version   3.6.3
Models           en, en_core_web_lg, en_core_web_md, es, pt

TYPE      NAME              MODEL             VERSION
package   pt-core-news-sm   pt_core_news_sm   2.0.0
package   es-core-news-sm   es_core_news_sm   2.0.0
package   en-core-web-sm    en_core_web_sm    2.0.0
package   en-core-web-md    en_core_web_md    2.0.0
package   en-core-web-lg    en_core_web_lg    2.0.0
link      en                en_core_web_sm    2.0.0
link      en_core_web_lg    en_core_web_lg    2.0.0
link      en_core_web_md    en_core_web_md    2.0.0
link      es                es_core_news_sm   2.0.0
link      pt                pt_core_news_sm   2.0.0
@cmckain Thanks!! Was on holidays for most of December, so just getting back to this now. I've fixed the infinite loop -- I meant to write self.c.sent_start, but wrote self.sent_start...
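As a rough pure-Python analogue of that one-word fix, the getter reads from a backing data object rather than from the property itself. `TokenData` and `Token` here are hypothetical stand-ins, not the actual spaCy classes.

```python
class TokenData:
    """Hypothetical stand-in for the underlying struct that stores token state."""

    def __init__(self, sent_start):
        self.sent_start = sent_start


class Token:
    """Hypothetical wrapper whose property reads from the backing data."""

    def __init__(self, data):
        self.c = data

    @property
    def sent_start(self):
        # Reads the stored field; does not call this getter again.
        return self.c.sent_start


token = Token(TokenData(sent_start=True))
print(token.sent_start)  # True
```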