Expected gensim.models.keyedvectors WordEmbeddingsKeyedVectors property similarity_matrix to respect the reordering of terms dictated by a supplied tfidf model. This does not seem to be the case.
This will be an unconventional bug report because I did not find this bug "in the wild". In adapting the code for another purpose, I noticed some logic that might be problematic. I will walk through the relevant code below.
```python
for row_number, w1_index in enumerate(word_indices):
    ...
    w1 = dictionary[w1_index]
    ...
    # Traverse upper triangle columns.
    if matrix_order <= nonzero_limit + 1:  # Traverse all columns.
        columns = (
            (w2_index, self.similarity(w1, dictionary[w2_index]))
            for w2_index in range(w1_index + 1, matrix_order)
            if w1_index != w2_index and dictionary[w2_index] in self.vocab)
    ...
    for w2_index, similarity in columns:
        # Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
        if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
            element = similarity**exponent
            matrix[w1_index, w2_index] = element
            matrix_nonzero[w1_index] += 1
            matrix[w2_index, w1_index] = element
            matrix_nonzero[w2_index] += 1
```
A red flag for me was that while we consider the `word_indices` mapping for the row index, we never consider it for the column index. Even in the case of rows, notice that we fill `matrix` at the end without regard for the reordering that was supposed to have occurred.
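To make the distinction concrete, here is a toy sketch (with a hypothetical three-term dictionary) of the difference between a row number and a dictionary index under a reordering:

```python
# Hypothetical setup: dictionary order is ['abc', 'lab', 'bad'] (ids 0, 1, 2),
# while the tfidf-weighted order is ['lab', 'bad', 'abc'].
dictionary = {0: 'abc', 1: 'lab', 2: 'bad'}
word_indices = [1, 2, 0]  # dictionary id of each row, most important term first

pairs = [(row_number, w1_index)
         for row_number, w1_index in enumerate(word_indices)]
# row_number is the position in the reordered matrix; w1_index is the position
# in the original dictionary. The snippet above indexes `matrix` with
# w1_index/w2_index, so entries land in dictionary order even though the rows
# are *visited* in tfidf order.
print(pairs)  # [(0, 1), (1, 2), (2, 0)]
```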
The logic can be adapted to respect the reordering as follows:

```python
for row_number, w1_index in enumerate(word_indices):
    ...
    # Traverse upper triangle columns.
    # TODO: determine community preference for pylev.levenshtein or distance.levenshtein
    import pylev
    columns = []
    for col_number in range(row_number + 1, matrix_order):
        if row_number != col_number:
            w2_index = word_indices[col_number]
            w1 = dictionary[w1_index]
            w2 = dictionary[w2_index]
            similarity = pylev.levenshtein(w1, w2) / \
                float(max(len(w1), len(w2)))
            columns.append((col_number, similarity))
    for col_number, similarity in columns:
        element = alpha * (1 - similarity) ** beta
        matrix[row_number, col_number] = element
        matrix[col_number, row_number] = element
```
Consider the mini corpus:
```python
import gensim

mini_texts = [['abc'], ['lab', 'abc'], ['bad', 'abc']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
```
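The weights can also be worked out by hand; a minimal sketch, assuming gensim's default global weight idf(t) = log2(N / df(t)):

```python
import math

mini_texts = [['abc'], ['lab', 'abc'], ['bad', 'abc']]
num_docs = len(mini_texts)

# Document frequency: in how many documents each term appears.
df = {}
for doc in mini_texts:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# Assumed default global weight: idf(t) = log2(N / df(t)).
idf = {term: math.log(float(num_docs) / count, 2) for term, count in df.items()}
# 'abc' appears in all three documents, so idf['abc'] == 0, while 'lab' and
# 'bad' each appear in one document, so their idf == log2(3).
```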
However, since 'abc' appears in every document, its tfidf coefficient will be zero. The ordering with respect to tfidf weight will therefore be: 'lab', 'bad', 'abc'. Because the Levenshtein similarity scores here are fairly symmetric, some reorderings would leave the matrix unchanged. This example, however, is contrived so that the tfidf reordering breaks the symmetry of the original Levenshtein similarity scores:
```
matrix([[1.        , 0.00740741, 0.00740741],
        [0.00740741, 1.        , 0.        ],
        [0.00740741, 0.        , 1.        ]], dtype=float32)
```
We see that the revised logic correctly returns the reordered scores, whereas the original logic returned the scores in dictionary order even when the tfidf model was supplied to the function. A test case for the similarity_matrix function would need to be constructed so that a reordering of terms would lead to a definitive reordering of the relevant scores in the resulting matrix.
While I have not taken the time to prove definitively that there is a bug in the original context, a unit test should be written to cover the case of a supplied tfidf reordering the terms in similarity_matrix. This is a high-risk piece of logic. If there is a bug, and a supplied tfidf isn't being incorporated properly, eventually SoftCosineSimilarity will return nonsense scores when fed documents transformed by the same tfidf model.
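A sketch of how such a test could be set up, using a standalone Levenshtein implementation in place of pylev; the `alpha` and `beta` values are assumptions chosen to be consistent with the matrix shown above, and `build` is a hypothetical helper, not a gensim API:

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (cs != ct)))   # substitution
        previous = current
    return previous[-1]

def levsim(w1, w2, alpha=1.8, beta=5.0):
    # Normalized Levenshtein similarity, as in the adapted snippet above.
    distance = levenshtein(w1, w2) / float(max(len(w1), len(w2)))
    return alpha * (1 - distance) ** beta

def build(order):
    # Dense symmetric similarity matrix for terms taken in the given order.
    n = len(order)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levsim(order[i], order[j])
    return matrix

dictionary_order = ['abc', 'lab', 'bad']
tfidf_order = ['lab', 'bad', 'abc']

# The corpus is contrived so that reordering the terms changes the matrix;
# a unit test can then assert which of the two layouts the function returns.
assert build(dictionary_order) != build(tfidf_order)
```

With these parameters, `build(tfidf_order)` reproduces the reordered matrix shown above, so an equality check against either layout would make the intended behavior unambiguous.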
similarity_matrix:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L440
existing unit tests:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L30
See also related: #1961
```
Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:46:44) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.14.1')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)
```
Thank you for the report. So far, this does not seem to be a bug to me. Several things to note:
- If `matrix_order <= nonzero_limit + 1`, we will process all columns. In this scenario, the order of columns does not matter – we are filling everything.
- If the matrix is larger (`matrix_order > nonzero_limit + 1`), then we go down the other branch, which only considers the columns corresponding to the terms closest to `w1` in the embedding space.

Even the documentation specifies that “[…] The rows of the term similarity matrix will be build in an increasing order of importance of terms […]”. The order in which columns are processed is not affected.
It is entirely possible I am misunderstanding the purpose of the option of providing a tfidf model to this function. Let me offer a test case (this time in context) that highlights what I believe its purpose to be.
```python
import gensim

mini_texts = [['ab'], ['abc', 'ab'], ['bcd', 'ab']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
w2v_cbow = gensim.models.Word2Vec(mini_texts, sg=1, min_count=1)

w2v_cbow.wv.similarity_matrix(mini_dict, threshold=0.0, exponent=1.0).todense()
Out[365]:
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)

w2v_cbow.wv.similarity_matrix(mini_dict, tfidf=mini_tfidf, threshold=0.0, exponent=1.0).todense()
Out[366]:
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)
```
My understanding is that a document vector, in this case, is a 3-dimensional vector whose elements are ordered according to the term order ['ab', 'abc', 'bcd']. Any document vector in the tfidf-transformed corpus would then be represented by a 3-dimensional vector of tfidf coefficients ordered according to the new tfidf-weighted term order ['abc', 'bcd', 'ab'].
With respect to the default ordering I would expect:
| | 'ab' | 'abc' | 'bcd' |
|-------|------|-------|-------|
| 'ab' | 1 | # | 0 |
| 'abc' | # | 1 | 0 |
| 'bcd' | 0 | 0 | 1 |
With respect to the tfidf ordering I would expect:
| | 'abc' | 'bcd' | 'ab' |
|-------|-------|-------|------|
| 'abc' | 1 | 0 | # |
| 'bcd' | 0 | 1 | 0 |
| 'ab' | # | 0 | 1 |
The current function returns the same matrix both times. Won't this cause term mismatches when computing soft cosine similarity with tfidf transformed documents?
Thank you for your patience. I see now I had a misconception about the gensim tfidf transformation reordering features. I am not sure where that originated, but I am very glad to have it corrected!
Just to make sure I understand, the tfidf parameter here is not meant to restructure the output in any way, it is supplied as a means to improve performance?
That is correct. When the matrix is empty, we can take a row and just fill all columns corresponding to the closest terms in the embedding space. When the matrix is already half-filled, some columns in a row (that do not necessarily correspond to the closest terms) will already be pre-filled (due to the symmetric `matrix[w2_index, w1_index] = element` assignment) and some columns corresponding to the closest terms will not be fillable (due to the `if … and matrix_nonzero[w2_index] <= nonzero_limit:` constraint).
The role of the tfidf parameter is to process the rows that correspond to important terms before the matrix has filled up; experiments suggest this improves performance compared to an arbitrary row processing order. Other algorithms for pruning the matrix (such as disregarding symmetry and just taking nonzero_limit closest terms in each row) are possible, but unimplemented. Not pruning the matrix is also an option with small datasets.
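The effect of the processing order can be illustrated with a toy model of the pruning; this is a simplified sketch, not the actual gensim implementation, and the similarity values and the `prune` helper are hypothetical:

```python
def prune(sim, order, nonzero_limit):
    # Rows are processed in `order`; each term may accumulate at most
    # `nonzero_limit` off-diagonal nonzeros, and every assignment is mirrored,
    # so rows processed early "claim" their closest columns first.
    n = len(sim)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    nonzero = [0] * n
    for i in order:
        # Candidate columns, closest terms first.
        for j in sorted((k for k in range(n) if k != i),
                        key=lambda k: -sim[i][k]):
            if matrix[i][j] or nonzero[i] >= nonzero_limit or nonzero[j] >= nonzero_limit:
                continue
            matrix[i][j] = matrix[j][i] = sim[i][j]
            nonzero[i] += 1
            nonzero[j] += 1
    return matrix

# Hypothetical symmetric similarities between three terms.
sim = [[1.0, 0.9, 0.8],
       [0.9, 1.0, 0.7],
       [0.8, 0.7, 1.0]]

first = prune(sim, [0, 1, 2], nonzero_limit=1)  # term 0 processed first
last = prune(sim, [2, 1, 0], nonzero_limit=1)   # term 0 processed last
```

With `nonzero_limit=1`, processing term 0 first pairs it with its closest neighbour (similarity 0.9), whereas processing it last leaves it only the 0.8 entry that row 2 mirrored in, which is why processing important rows early improves the quality of the pruned matrix.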
However, the documentation should be reworded, since the rows are processed in _decreasing_, not increasing, order of importance. I will make a commit that clarifies the documentation and closes this issue.