Trying to use export_phrases function on a phrases model.
Instead getting Runtime error
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
documents = ["I am interested in machine learning projects",
"machine learning projects can be useful sometimes",
"I love working on machine learning projects",
"interested does not mean the same thing as likes",
"i am interested in blockchain"]
sentence_stream = [doc.split(" ") for doc in documents]
bigrams = Phrases(sentence_stream, min_count=2, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
trigrams = Phrases(bigrams[sentence_stream], min_count=2, threshold=1)
trigrams.export_phrases()
RuntimeError Traceback (most recent call last)
<ipython-input-190-0f7e41471301> in <module>
----> 1 trigrams.export_phrases()
~\Anaconda3\lib\site-packages\gensim\models\phrases.py in export_phrases(self)
716 """
717 result, source_vocab = {}, self.vocab
--> 718 for token in source_vocab:
719 unigrams = token.split(self.delimiter)
720 if len(unigrams) < 2:
RuntimeError: dictionary changed size during iteration
Please provide the output of:
Windows-10-10.0.17763-SP0
Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.19.5
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 0
OK, I can see the issue. Phrases employs defaultdict, which modifies its content on (unsuccessful) access: self.vocab[not_in_vocab] changes self.vocab.
The proper fix will be to get rid of this unwanted mutation 鈥撀爀ither by replacing vocab[x] by vocab.get(x, 0) in all const functions, or (better, easier to reason about) by replacing the entire defaultdict by a plain dict.
@thalishsajeed are you up for it?
@piskvorky Yep, I vote for replacing defaultdict. If you agree with that, I'll make a PR with the change. (after checking for any other unwanted issues)
Yes, please go ahead.
You can re-use #3030 , no need to start a new PR.
@piskvorky Is it okay to replace instances of the following code pattern
self.vocab[word] + = 1
with
self.vocab[word] = vocab.get(word, 0) + 1
I don't know of any way to do updates in-place in a regular dict.
There's also this line - which needs to be modified for similar reasons and making sure generator isn't called twice if you use .get method.
Still comfortable moving to dict right? Also I read some SO posts that defaultdict is more performant that dict, wondering if that is still the case and needs to be considered for this change.
Yes, that's the way to do it!
And yes, I'd expect defaultdict to be (slightly) more performant than dict. But having correct, readable code is more important.
If we ever optimize Phrases, it'll be by translating its code to Cython / C, for a proper 10x+ boost. Not chasing a few percent here and there with defaultdict.
@piskvorky Thanks! Well, I'm done with the changes, can you point me to how I can run the unit test suite locally?
Hmm. @mpenkov will python setup.py test work, for local testing? I don't see that on the Developer page.
I usually do pytest gensim.
You may need to do something like pip install -e .[test] to get the dependencies installed first.