What are you trying to achieve? What is the expected result? What are you seeing instead?
I am using this function to filter out low-frequency and high-frequency words. However, the resulting dictionary size does not match my expectation.
The list of (word frequency, word count) pairs for the corpus is:
[(1, 1441563), (2, 211515), (3, 77050), (4, 38364), ...]
id2word = gensim.corpora.Dictionary(texts, prune_at=2e6)
id2word.filter_extremes(no_below=4, no_above=0.5, keep_n=None)
The (word frequency, word count) list for the removed words is:
[(1, 1441563), (2, 211515), (3, 77050), (4, 9)]
I don't understand why 9 words that appear 4 times in the corpus are being filtered out.
Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").
import gensim, json
with open(bug_data_file, 'r', encoding='utf-8') as r:
    unique_texts = json.load(r)
def stat_list(item_list):
    dic = dict()
    for item in item_list:
        dic[item] = dic.get(item, 0) + 1
    return dic

def merge_count(to_dic, from_dic):
    for key, val in from_dic.items():
        to_dic[key] = to_dic.get(key, 0) + val

def stat_freq_count(vocab_freq):
    # return freq : # of words with this frequency
    return stat_list(vocab_freq.values())

def stat_vocabulary_freq(texts):
    # compute the word frequency to facilitate vocabulary filtering
    vocab_freq = dict()
    for doc in texts:
        merge_count(vocab_freq, stat_list(doc))
    return vocab_freq
def _create_corpus_using(texts, id2word):
    # create a corpus using the given dictionary
    # useful for creating a corpus for a subset of texts
    corpus = [id2word.doc2bow(text) for text in texts]
    filtered_corpus = [item for item in corpus if item]
    print('raw text doc count:', len(texts),
          'filtered:', len(filtered_corpus),
          'ratio %.2f' % (len(filtered_corpus) / len(texts)))
    return filtered_corpus, id2word

def create_corpus(texts, filter_limit):
    # Create Dictionary
    # the vocabulary size should be below the default keep_n=100k
    id2word = gensim.corpora.Dictionary(texts, prune_at=2e6)
    total = len(id2word)
    print('raw token count:', total)
    if filter_limit:
        no_below, no_above = filter_limit
        id2word.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
        print('filter vocabulary with (no_below, no_above)', filter_limit)
        print('keep token count:', len(id2word),
              'keep ratio %.2f' % (len(id2word) / total))
    # create corpus
    # Term Document Frequency
    return _create_corpus_using(texts, id2word)
### filter corpus
no_below = 4
no_above = 0.5
filtered_corpus, id2word = create_corpus(unique_texts, (no_below, no_above))
### before filtering
vocab_freq = stat_vocabulary_freq(unique_texts)
freq_count = stat_freq_count(vocab_freq)
sorted_fc = sorted(freq_count.items(), key=lambda x:x[0])
print('list of (word frequency, word count)')
print(sorted_fc)
### after filtering
print('expected vocabulary size:', sum(c for _, c in sorted_fc[no_below - 1:]))
chosen_vocab = set(id2word.values())
filtered_vf = dict()
for w, freq in vocab_freq.items():
    if w not in chosen_vocab:
        filtered_vf[w] = freq
sorted_fvf = sorted(stat_freq_count(filtered_vf).items(), key=lambda x:x[0])
print('list of (word frequency, word count)')
print(sorted_fvf)
filtered_w = list(v for v, f in filtered_vf.items() if f == 4)
print('filtered out word', filtered_w)
key message in the output
list of (word frequency, word count)
[(1, 1441563), (2, 211515), (3, 77050), (4, 38364), (5, 22585), ...]
raw token count: 1857642
filter vocabulary with (no_below, no_above) (4, 0.5)
keep token count: 127505 keep ratio 0.07
expected vocabulary size: 127514
list of (word frequency, word count)
[(1, 1441563), (2, 211515), (3, 77050), (4, 9)]
The resulting token count is 127505; however, I expected this number to be 127514 based on no_below=4 and no_above=0.5. The 9 missing words each appear 4 times in the corpus.
The JSON file for the corpus is here
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
Python 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 0
Thank you for your report.
Could you please reduce your example to the bare minimum required to reproduce your problem? Is all the code you posted really necessary?
Hi, most of the code is used to show the statistics. The minimal code related to gensim is:
import gensim, json
with open(bug_data_file, 'r', encoding='utf-8') as r:
    unique_texts = json.load(r)

# expected
dic = dict()
for doc in unique_texts:
    for item in doc:
        dic[item] = dic.get(item, 0) + 1
expect = 0
for k, v in dic.items():
    if v >= 4:
        expect += 1

# actual
id2word = gensim.corpora.Dictionary(unique_texts, prune_at=2e6)
id2word.filter_extremes(no_below=4, no_above=0.5, keep_n=None)
print('expect', expect, 'actual', len(id2word))
but this tells us nothing about which words are missing.
Could this be related to the fact that filter_extremes works with document frequencies ("in how many documents does a word appear?"), whereas your code seems to calculate corpus frequencies ("how many times does a word appear in a corpus?"). The two are not identical.
Yes, but a simple observation is that all the missing words appear 4 times in the corpus. I don't see how they could fall short of the document frequency limit. The total number of documents is above 2000.
I checked your corpus and the 9 removed words with corpus frequency 4 all have a document frequency of 3:
import json, itertools, collections, gensim
corpus = json.load(open('bug_filter_extremes_data.json'))
corpus_freqs = collections.Counter(itertools.chain.from_iterable(corpus))
doc_freqs = collections.Counter(itertools.chain.from_iterable(set(doc) for doc in corpus))
d = gensim.corpora.Dictionary(corpus)
d.filter_extremes(no_below=4, no_above=0.5, keep_n=None)
missing = [token for token in corpus_freqs if corpus_freqs[token] == 4 and token not in d.token2id]
[(token, corpus_freqs[token], doc_freqs[token]) for token in missing]
[(u'valimutlu', 4, 3),
(u'indiegala', 4, 3),
(u'esperanzagomez', 4, 3),
(u'ginodemi', 4, 3),
(u'ca_dem', 4, 3),
(u'lolesports', 4, 3),
(u'holgermu', 4, 3),
(u'socialmedia4def', 4, 3),
(u'djchuckie', 4, 3)]
This means each of these words appears once in two documents and twice in one document, for a total corpus frequency of 4 but a document frequency of 3, and thus falls below your filter_extremes(no_below=4) threshold.
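To make this concrete, here is a self-contained sketch with a toy corpus (plain Python on made-up data; the keep rule is a paraphrase of what filter_extremes documents, not gensim's actual code) showing how a word with corpus frequency 4 can still have document frequency 3 and be dropped:

```python
import collections
import itertools

# Toy corpus: 'rare' appears twice in doc 0 and once each in docs 1 and 2,
# so its corpus frequency is 4 while its document frequency is only 3.
docs = [
    ['rare', 'rare', 'common'],
    ['rare', 'common'],
    ['rare', 'common'],
    ['common'],
    ['filler'], ['filler'], ['filler'], ['filler'], ['filler'],
]

# corpus frequency: total occurrences across all documents
corpus_freq = collections.Counter(itertools.chain.from_iterable(docs))
# document frequency: number of documents containing the word at least once
doc_freq = collections.Counter(itertools.chain.from_iterable(set(d) for d in docs))

print(corpus_freq['rare'], doc_freq['rare'])  # 4 3

# filter_extremes keeps a token only if its *document* frequency df
# satisfies no_below <= df and df / num_docs <= no_above
no_below, no_above = 4, 0.5
kept = {tok for tok, df in doc_freq.items()
        if df >= no_below and df / len(docs) <= no_above}
print(kept)  # {'common'}: 'rare' misses no_below, 'filler' exceeds no_above
```

With no_below=4, 'rare' is dropped despite its corpus frequency of 4, which is exactly the situation in your data.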
Closing this as "works as expected, no bug".
I see. Thank you.