Gensim: Dictionary.filter_extremes does not work properly

Created on 30 May 2019 · 6 comments · Source: RaRe-Technologies/gensim

Problem description

What are you trying to achieve? What is the expected result? What are you seeing instead?

I am using this function to filter out low-frequency and high-frequency words. However, the resulting dictionary size does not match my expectation.
The list of (word frequency, word count) pairs for the corpus is:
[(1, 1441563), (2, 211515), (3, 77050), (4, 38364), ...]

id2word = gensim.corpora.Dictionary(texts, prune_at=2e6)
id2word.filter_extremes(no_below=4, no_above=0.5, keep_n=None)

The (word frequency, word count) pairs of the removed words are:
[(1, 1441563), (2, 211515), (3, 77050), (4, 9)]
I don't understand why 9 words that appear 4 times in the corpus are filtered out.

Steps/code/corpus to reproduce

Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").

import gensim, json


with open(bug_data_file, 'r', encoding='utf-8') as r:
    unique_texts = json.load(r)


def stat_list(item_list):
    dic = dict()
    for item in item_list:
        dic[item] = dic.get(item, 0) + 1
    return dic


def merge_count(to_dic, from_dic):
    for key, val in from_dic.items():
        to_dic[key] = to_dic.get(key, 0) + val


def stat_freq_count(vocab_freq):
    # map frequency -> number of words with that frequency
    return stat_list(vocab_freq.values())


def stat_vocabulary_freq(texts):
    # compute the word frequencies to facilitate vocabulary filtering
    vocab_freq = dict()
    for doc in texts:
        merge_count(vocab_freq, stat_list(doc))
    return vocab_freq


def _create_corpus_using(texts, id2word):
    # create corpus using the given dictionary
    # useful for creating corpus for subset of texts
    corpus = [id2word.doc2bow(text) for text in texts]
    filtered_corpus = list(item for item in corpus if item)
    print('raw text doc count:', len(texts),
          'filtered:', len(filtered_corpus),
          'ratio %.2f' % (len(filtered_corpus) / len(texts)))
    return filtered_corpus, id2word


def create_corpus(texts, filter_limit):
    # Create Dictionary
    # the vocabulary size should be below the default keep_n=100,000
    id2word = gensim.corpora.Dictionary(texts, prune_at=2e6)
    total = len(id2word)
    print('raw token count:', total)
    if filter_limit:
        no_below, no_above = filter_limit
        id2word.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
        print('filter vocabulary with (no_below, no_above)', filter_limit)
        print('keep token count:', len(id2word),
              'keep ratio %.2f' % (len(id2word) / total))
    # create corpus
    # Term Document Frequency
    return _create_corpus_using(texts, id2word)

### filter corpus
no_below = 4
no_above = 0.5
filtered_corpus, id2word = create_corpus(unique_texts, (no_below, no_above))

### before filtering
vocab_freq = stat_vocabulary_freq(unique_texts)
freq_count = stat_freq_count(vocab_freq)
sorted_fc = sorted(freq_count.items(), key=lambda x:x[0])
print('list of (word frequency, word count)')
print(sorted_fc)

### after filtering
print('expected vocabulary size:', sum(c for _, c in sorted_fc[no_below - 1:]))

chosen_vocab = set(id2word.values())
filtered_vf = dict()
for w, freq in vocab_freq.items():
    if w not in chosen_vocab:
        filtered_vf[w] = freq
sorted_fvf = sorted(stat_freq_count(filtered_vf).items(), key=lambda x:x[0])
print('list of (word frequency, word count)')
print(sorted_fvf)
filtered_w = list(v for v, f in filtered_vf.items() if f == 4)
print('filtered out word', filtered_w)

Key messages in the output:

list of (word frequency, word count)
[(1, 1441563), (2, 211515), (3, 77050), (4, 38364), (5, 22585), ...]

raw token count: 1857642
filter vocabulary with (no_below, no_above) (4, 0.5)
keep token count: 127505 keep ratio 0.07

expected vocabulary size: 127514
list of (word frequency, word count)
[(1, 1441563), (2, 211515), (3, 77050), (4, 9)]

The resulting token count is 127505; however, I expected it to be 127514 based on no_below=4 and no_above=0.5. The 9 missing words each appear exactly 4 times in the corpus.

The JSON file for the corpus is here.

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)

Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
Python 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 0

All 6 comments

Thank you for your report.

Could you please reduce your example to the bare minimum required to reproduce your problem? Is all the code you posted really necessary?

Hi, most of the code is there to show the statistics. The minimal code involving gensim is:

import gensim, json
with open(bug_data_file, 'r', encoding='utf-8') as r:
    unique_texts = json.load(r)
# expect
dic = dict()
for doc in unique_texts:
    for item in doc:
        dic[item] = dic.get(item, 0) + 1
expect = 0
for k, v in dic.items():
    if v >= 4:
        expect += 1
# actual
id2word = gensim.corpora.Dictionary(unique_texts, prune_at=2e6)
id2word.filter_extremes(no_below=4, no_above=0.5, keep_n=None)
print('expect', expect, 'actual', len(id2word))

but this alone tells nothing about which words are missing.

Could this be related to the fact that filter_extremes works with document frequencies ("in how many documents does a word appear?"), whereas your code seems to calculate corpus frequencies ("how many times does a word appear in a corpus?"). The two are not identical.
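The distinction is easy to demonstrate with the standard library alone (a minimal sketch, independent of gensim, with made-up toy documents): a token that occurs twice in one document and once in two others has a corpus frequency of 4 but a document frequency of 3.

```python
from collections import Counter
from itertools import chain

# "w" occurs twice in the first doc and once in two more docs
docs = [["w", "w", "x"], ["w", "x"], ["w"], ["x"]]

# corpus frequency: total number of occurrences across all documents
corpus_freq = Counter(chain.from_iterable(docs))
# document frequency: number of documents containing the token at least once
doc_freq = Counter(chain.from_iterable(set(d) for d in docs))

print(corpus_freq["w"], doc_freq["w"])  # 4 3
```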

Yes, but a simple observation is that all of the missing words appear 4 times in the corpus. I don't see how they could fail to reach the document-frequency limit; the total number of documents is above 2000.

I checked your corpus and the 9 removed words with corpus frequency 4 all have a document frequency of 3:

import json, itertools, collections, gensim

corpus = json.load(open('bug_filter_extremes_data.json'))
corpus_freqs = collections.Counter(itertools.chain.from_iterable(corpus))
doc_freqs = collections.Counter(itertools.chain.from_iterable(set(doc) for doc in corpus))
d = gensim.corpora.Dictionary(corpus)
d.filter_extremes(no_below=4, no_above=0.5, keep_n=None)
missing = [token for token in corpus_freqs if corpus_freqs[token] == 4 and token not in d.token2id]
[(token, corpus_freqs[token], doc_freqs[token]) for token in missing]

[(u'valimutlu', 4, 3),
 (u'indiegala', 4, 3),
 (u'esperanzagomez', 4, 3),
 (u'ginodemi', 4, 3),
 (u'ca_dem', 4, 3),
 (u'lolesports', 4, 3),
 (u'holgermu', 4, 3),
 (u'socialmedia4def', 4, 3),
 (u'djchuckie', 4, 3)]

This means each of these words appears once in two documents and twice in a third, for a total corpus count of 4 but a document count of 3, and thus falls below your filter_extremes(no_below=4) threshold.
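Consequently, any pre-computed expectation has to be based on document frequencies. Here is a minimal sketch (with made-up toy documents) of the keep-predicate that filter_extremes effectively applies when keep_n=None: keep a token iff no_below <= df and df / num_docs <= no_above.

```python
from collections import Counter
from itertools import chain

# toy corpus for illustration only
docs = [["a", "b", "c"], ["a", "c"], ["a", "c"], ["c"], ["c"], ["d"]]
no_below, no_above = 2, 0.5

num_docs = len(docs)
# document frequency: number of documents containing each token
doc_freq = Counter(chain.from_iterable(set(d) for d in docs))
kept = sorted(t for t, df in doc_freq.items()
              if df >= no_below and df / num_docs <= no_above)
print(kept)  # ['a']
```

Here 'a' appears in 3 of 6 documents (exactly at the 0.5 cap), 'b' and 'd' in only one each (too rare), and 'c' in five (too common), so only 'a' survives.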

Closing this as "works as expected, no bug".

I see. Thank you.

