Gensim: Bug in Phrases.export_phrases()

Created on 16 Jul 2016 · 4Comments · Source: RaRe-Technologies/gensim

A small bug in bigram.export_phrases(sentences) causes it to return a maximum of one bigram per sentence.
For example:

from gensim.models import Phrases

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey','human','interface']]

bigram = Phrases(sentences, min_count=1, threshold=2)
for phrase, score in bigram.export_phrases(sentences):
    print(u'{0}   {1}'.format(phrase, score))

Returns:

>>> human interface   3.44444444444
response time   7.75
response time   7.75
graph minors   5.16666666667
graph minors   5.16666666667

The last sentence has two bigrams, but only the first is returned. This happens because this line
https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/phrases.py#L216
prevents further iteration within a sentence.
If it's commented out, it correctly returns:

>>> human interface   3.44444444444
response time   7.75
response time   7.75
graph minors   5.16666666667
graph minors   5.16666666667
human interface   3.44444444444

bug difficulty easy

Source

jeradf

Most helpful comment

As reported on discussion list (https://groups.google.com/forum/#!topic/gensim/N0nMD95N6Iw), the fix as applied wasn't really effective. I think my suggestion is still valid: the line just needs to be at the end of the loop, outside any conditionals. (Specifically, de-indented two levels.) The test-for-two-returned-phrases that was added didn't really probe the behavior of export_phrases(), but __getitem__() (which AFAICT didn't have the problem, so the added test wouldn't have failed even before the applied-change to export_phrases().)

Reopening. Best approach will be to make a valid failing test first, then try the de-indentation as a fix.

gojomo on 15 May 2017

👍2

All 4 comments

Good find!

I believe the intent of last_bigram is to prevent a word from being in two overlapping bigrams; the bug may be that last_bigram is not toggled back to False at the end of each non-bigram iteration.

So perhaps best fix is to add continue after the toggle-on, and at bottom of (non-continued/non-bigrammed) loop block, add a last_bigram = False line.

gojomo on 16 Jul 2016

Hi everyone!
Today I worked on this. I am sending a pull request.