A small bug in bigram.export_phrases(sentences) causes it to return a maximum of one bigram per sentence.
For example:
from gensim.models import Phrases
sentences = [
['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey','human','interface']]
bigram = Phrases(sentences, min_count=1, threshold=2)
for phrase, score in bigram.export_phrases(sentences):
print(u'{0} {1}'.format(phrase, score))
Returns:
>>> human interface 3.44444444444
response time 7.75
response time 7.75
graph minors 5.16666666667
graph minors 5.16666666667
The last sentence has two bigrams, but only the first is returned. This happens because this line
https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/phrases.py#L216
prevents further iteration within a sentence.
If it's commented out, it correctly returns:
>>> human interface 3.44444444444
response time 7.75
response time 7.75
graph minors 5.16666666667
graph minors 5.16666666667
human interface 3.44444444444
Good find!
I believe the intent of last_bigram is to prevent a word from being in two overlapping bigrams; the bug may be that last_bigram is not toggled back to False at the end of each non-bigram iteration.
So perhaps best fix is to add continue after the toggle-on, and at bottom of (non-continued/non-bigrammed) loop block, add a last_bigram = False line.
Hi everyone!
Today I worked on this. I am sending a pull request.
As reported on discussion list (https://groups.google.com/forum/#!topic/gensim/N0nMD95N6Iw), the fix as applied wasn't really effective. I think my suggestion is still valid: the line just needs to be at the end of the loop, outside any conditionals. (Specifically, de-indented two levels.) The test-for-two-returned-phrases that was added didn't really probe the behavior of export_phrases(), but __getitem__() (which AFAICT didn't have the problem, so the added test wouldn't have failed even before the applied-change to export_phrases().)
Reopening. Best approach will be to make a valid failing test first, then try the de-indentation as a fix.
Fixed in #1362
Most helpful comment
As reported on discussion list (https://groups.google.com/forum/#!topic/gensim/N0nMD95N6Iw), the fix as applied wasn't really effective. I think my suggestion is still valid: the line just needs to be at the end of the loop, outside any conditionals. (Specifically, de-indented two levels.) The test-for-two-returned-phrases that was added didn't really probe the behavior of
export_phrases(), but__getitem__()(which AFAICT didn't have the problem, so the added test wouldn't have failed even before the applied-change toexport_phrases().)Reopening. Best approach will be to make a valid failing test first, then try the de-indentation as a fix.