Gensim: Runtime error in phrases.py

Created on 20 Jan 2021  路  8Comments  路  Source: RaRe-Technologies/gensim

Problem description

Trying to use export_phrases function on a phrases model.
Instead getting Runtime error

Steps/code/corpus to reproduce

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

documents = ["I am interested in machine learning projects", 
             "machine learning projects can be useful sometimes",
            "I love working on machine learning projects",
            "interested does not mean the same thing as likes",
            "i am interested in blockchain"]

sentence_stream = [doc.split(" ") for doc in documents]
bigrams = Phrases(sentence_stream, min_count=2, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
trigrams = Phrases(bigrams[sentence_stream], min_count=2, threshold=1)
trigrams.export_phrases()
RuntimeError                              Traceback (most recent call last)
<ipython-input-190-0f7e41471301> in <module>
----> 1 trigrams.export_phrases()

~\Anaconda3\lib\site-packages\gensim\models\phrases.py in export_phrases(self)
    716         """
    717         result, source_vocab = {}, self.vocab
--> 718         for token in source_vocab:
    719             unigrams = token.split(self.delimiter)
    720             if len(unigrams) < 2:

RuntimeError: dictionary changed size during iteration

Versions

Please provide the output of:

Windows-10-10.0.17763-SP0
Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.19.5
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 0
bug

All 8 comments

OK, I can see the issue. Phrases employs defaultdict, which modifies its content on (unsuccessful) access: self.vocab[not_in_vocab] changes self.vocab.

The proper fix will be to get rid of this unwanted mutation 鈥撀爀ither by replacing vocab[x] by vocab.get(x, 0) in all const functions, or (better, easier to reason about) by replacing the entire defaultdict by a plain dict.

@thalishsajeed are you up for it?

@piskvorky Yep, I vote for replacing defaultdict. If you agree with that, I'll make a PR with the change. (after checking for any other unwanted issues)

Yes, please go ahead.

You can re-use #3030 , no need to start a new PR.

@piskvorky Is it okay to replace instances of the following code pattern

self.vocab[word] + = 1

with

self.vocab[word] = vocab.get(word, 0) + 1

I don't know of any way to do updates in-place in a regular dict.

There's also this line - which needs to be modified for similar reasons and making sure generator isn't called twice if you use .get method.

https://github.com/RaRe-Technologies/gensim/blob/a21d9cc768598640f38e4bd03d368f8712a9aa77/gensim/models/phrases.py#L596

Still comfortable moving to dict right? Also I read some SO posts that defaultdict is more performant that dict, wondering if that is still the case and needs to be considered for this change.

Yes, that's the way to do it!

And yes, I'd expect defaultdict to be (slightly) more performant than dict. But having correct, readable code is more important.

If we ever optimize Phrases, it'll be by translating its code to Cython / C, for a proper 10x+ boost. Not chasing a few percent here and there with defaultdict.

@piskvorky Thanks! Well, I'm done with the changes, can you point me to how I can run the unit test suite locally?

Hmm. @mpenkov will python setup.py test work, for local testing? I don't see that on the Developer page.

I usually do pytest gensim.

You may need to do something like pip install -e .[test] to get the dependencies installed first.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Laubeee picture Laubeee  路  3Comments

jeradf picture jeradf  路  4Comments

volj1 picture volj1  路  4Comments

Jianqiang picture Jianqiang  路  3Comments

ahmedbhabbas picture ahmedbhabbas  路  4Comments